Results 1 - 20 of 41
1.
Sci Adv ; 10(16): eadg2488, 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38640235

ABSTRACT

Humans learn concepts both from labeled supervision and by unsupervised observation of patterns; machines are being taught to mimic this by training on large annotated datasets, a method quite different from the human pathway, in which a few examples with no supervision suffice to induce an unfamiliar relational concept. We introduce a computational model designed to emulate human inductive reasoning on abstract reasoning tasks, such as those in IQ tests, using a minimax entropy approach. This method combines identifying the most effective constraints on data via minimum entropy with determining the best combination of them via maximum entropy. Our model, which applies this unsupervised technique, induces concepts from just one instance, reaching human-level performance on Raven's Progressive Matrices (RPM), Machine Number Sense (MNS), and Odd-One-Out (O3) tasks. These results demonstrate the potential of minimax entropy learning for enabling machines to learn relational concepts efficiently with minimal input.
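
A compact toy illustration of this alternation may help. The sketch below runs a minimax-entropy loop on 2x2 binary patterns with pairwise co-activation features; the feature family, the faked observed sample, and the step sizes are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample space: all 2x2 binary patterns (16 states), flattened to length 4.
states = np.array([[(i >> k) & 1 for k in range(4)] for i in range(16)], float)

# Candidate constraints: pairwise co-activation statistics phi_j(x) = x_a * x_b.
pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]
Phi = np.array([[s[a] * s[b] for a, b in pairs] for s in states])  # (16, 6)

# Stand-in "observed" data; the paper induces concepts from a single instance.
x_obs = states[rng.integers(0, 16, size=200)]
target = np.array([[x[a] * x[b] for a, b in pairs] for x in x_obs]).mean(0)

p = np.full(16, 1 / 16)  # start at maximum entropy (uniform)
chosen = []
for _ in range(3):
    # Minimum entropy: select the constraint the current model violates most.
    gap = np.abs(Phi.T @ p - target)
    gap[chosen] = -np.inf
    chosen.append(int(np.argmax(gap)))
    # Maximum entropy: refit weights for all chosen constraints (gradient ascent).
    w = np.zeros(len(chosen))
    for _ in range(500):
        logits = Phi[:, chosen] @ w
        p = np.exp(logits - logits.max()); p /= p.sum()
        w += 0.5 * (target[chosen] - Phi[:, chosen].T @ p)

print("selected constraints:", [pairs[j] for j in chosen])
```

Each round first picks the statistic the current model gets most wrong (minimum entropy), then refits the maximum-entropy distribution subject to every constraint selected so far.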

2.
Sci Robot ; 7(68): eabm4183, 2022 07 13.
Article in English | MEDLINE | ID: mdl-35857532

ABSTRACT

A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users' intents from instructions and feedback, and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communication in the context of value alignment, in which collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users' values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems.
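
As a rough sketch of the listener role — inferring a user's hidden goal weights from in situ feedback — here is a minimal Bayesian update over candidate weight vectors. The Boltzmann accept/reject feedback model, the two-goal setup, and all numbers are illustrative assumptions, not the paper's cooperative communication model.

```python
import numpy as np

# Hypotheses about the user's relative goal importance (two goals, weights sum to 1).
w_grid = np.linspace(0, 1, 51)
hypotheses = np.stack([w_grid, 1 - w_grid], axis=1)
posterior = np.full(len(hypotheses), 1 / len(hypotheses))

def update(posterior, plan_scores, accepted, beta=5.0):
    """One Bayesian update from a single accept/reject signal on a proposed plan."""
    u = hypotheses @ plan_scores                         # plan utility per hypothesis
    p_accept = 1 / (1 + np.exp(-beta * (u - u.mean())))  # Boltzmann feedback model
    like = p_accept if accepted else 1 - p_accept
    post = posterior * like
    return post / post.sum()

# Example: the user accepts a plan serving goal 1, rejects one serving goal 2.
posterior = update(posterior, np.array([0.9, 0.2]), accepted=True)
posterior = update(posterior, np.array([0.1, 0.9]), accepted=False)
print("inferred goal weights:", hypotheses[np.argmax(posterior)])
```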


Subjects
Robotics; Artificial Intelligence; Communication; Feedback; Humans; Man-Machine Systems
3.
iScience ; 25(1): 103581, 2022 Jan 21.
Article in English | MEDLINE | ID: mdl-35036861

ABSTRACT

We propose CX-ToM, short for counterfactual explanations with theory of mind, a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN). In contrast to current XAI methods that generate explanations as a single-shot response, we pose explanation as an iterative communication process, i.e., a dialogue between the machine and the human user. More concretely, our CX-ToM framework generates a sequence of explanations in a dialogue by mediating the differences between the minds of the machine and the human user. To do this, we use Theory of Mind (ToM), which helps us explicitly model the human's intention, the machine's mind as inferred by the human, and the human's mind as inferred by the machine. Moreover, most state-of-the-art XAI frameworks provide attention-based (heat map) explanations. In our work, we show that these attention-based explanations are not sufficient for increasing human trust in the underlying CNN model. In CX-ToM, we instead use counterfactual explanations called fault-lines, defined as follows: given an input image I for which a CNN classification model M predicts class c_pred, a fault-line identifies the minimal semantic-level features (e.g., stripes on a zebra), referred to as explainable concepts, that need to be added to or deleted from I to alter M's classification of I to another specified class c_alt. Extensive experiments verify our hypotheses, demonstrating that CX-ToM significantly outperforms state-of-the-art XAI models.
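
A fault-line search can be caricatured as a greedy loop over concept toggles. In this sketch the "CNN" is a toy linear scorer over named concepts; the concepts, weights, and greedy policy are illustrative assumptions, not the paper's model.

```python
import numpy as np

concepts = ["stripes", "mane", "hooves", "trunk", "tusks"]
classes = ["horse", "zebra"]
W = np.array([[-1.0, 2.0, 1.5, 0.0, 0.0],    # horse weights over concepts
              [ 3.0, 0.5, 1.0, 0.0, 0.0]])   # zebra weights over concepts
x = np.array([0, 1, 1, 0, 0], float)          # horse-like input: mane + hooves

def predict(x):
    return int(np.argmax(W @ x))

def fault_line(x, c_alt, max_edits=3):
    """Greedy concept additions/deletions that flip the prediction to c_alt.
    Greedy search may not find the truly minimal set; it is an approximation."""
    edits, x = [], x.copy()
    while len(edits) < max_edits and predict(x) != c_alt:
        margins = []
        for j in range(len(x)):
            y = x.copy(); y[j] = 1 - y[j]     # toggle one explainable concept
            margins.append((W[c_alt] - W[predict(x)]) @ y)
        j = int(np.argmax(margins))
        edits.append(("add" if x[j] == 0 else "delete", concepts[j]))
        x[j] = 1 - x[j]
    return edits

print(fault_line(x, c_alt=classes.index("zebra")))
# e.g. [('add', 'stripes')] -- "add stripes to turn the horse into a zebra"
```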

4.
IEEE Trans Pattern Anal Mach Intell ; 44(3): 1162-1179, 2022 Mar.
Article in English | MEDLINE | ID: mdl-32749961

ABSTRACT

We present a deformable generator model that disentangles the appearance and geometric information of both image and video data in a purely unsupervised manner. The appearance generator network models appearance-related information, including color, illumination, identity, or category, while the geometric generator performs geometric warping, such as rotation and stretching, by generating a deformation field that warps the generated appearance into the final image or video sequence. The two generators take independent latent vectors as input, disentangling the appearance and geometric information in images or video sequences. For video data, a nonlinear transition model is introduced to both the appearance and geometric generators to capture dynamics over time. The proposed scheme is general and can be easily integrated into different generative models. An extensive set of qualitative and quantitative experiments shows that the appearance and geometric information can be well disentangled, and that the learned geometric generator can be conveniently transferred to other image datasets that share similar structural regularity, facilitating knowledge transfer tasks.
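
A minimal PyTorch sketch of the two-stream design follows; the MLP generators, image size, and flow scaling are placeholders for the paper's deeper ConvNets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableGenerator(nn.Module):
    def __init__(self, z_app=64, z_geo=64, size=32):
        super().__init__()
        self.size = size
        self.appearance = nn.Sequential(            # z_app -> RGB image
            nn.Linear(z_app, 256), nn.ReLU(),
            nn.Linear(256, 3 * size * size), nn.Tanh())
        self.geometry = nn.Sequential(              # z_geo -> 2D displacement field
            nn.Linear(z_geo, 256), nn.ReLU(),
            nn.Linear(256, 2 * size * size), nn.Tanh())

    def forward(self, za, zg):
        b = za.shape[0]
        img = self.appearance(za).view(b, 3, self.size, self.size)
        flow = 0.1 * self.geometry(zg).view(b, self.size, self.size, 2)
        # Identity sampling grid in [-1, 1]^2, displaced by the generated flow.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, self.size), torch.linspace(-1, 1, self.size),
            indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1) + flow
        return F.grid_sample(img, grid, align_corners=True)  # warp appearance

g = DeformableGenerator()
out = g(torch.randn(4, 64), torch.randn(4, 64))    # (4, 3, 32, 32)
```

Because the two latent vectors feed separate streams, swapping one while holding the other fixed changes appearance without geometry, or vice versa.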

5.
IEEE Trans Pattern Anal Mach Intell ; 44(8): 3957-3973, 2022 08.
Article in English | MEDLINE | ID: mdl-33769930

ABSTRACT

This paper studies the problem of learning the conditional distribution of a high-dimensional output given an input, where the output and input may belong to two different domains, e.g., the output is a photo image and the input is a sketch image. We solve this problem by cooperatively training a fast-thinking initializer and a slow-thinking solver. The initializer generates the output directly by a non-linear transformation of the input together with a noise vector that accounts for latent variability in the output. The slow-thinking solver learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously, by sampling from the conditional energy-based model. We propose to learn the two models jointly: the fast-thinking initializer serves to initialize the sampling of the slow-thinking solver, and the solver refines the initial output by an iterative algorithm. The solver learns from the difference between the refined output and the observed output, while the initializer learns from how the solver refines its initial output. We demonstrate the effectiveness of the proposed method on various conditional learning tasks, e.g., class-to-image generation, image-to-image translation, and image recovery. The advantage of our method over GAN-based methods is that it is equipped with a slow-thinking process that refines the solution guided by a learned objective function.
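
The following is a schematic of one training step under this scheme, with tiny MLPs standing in for both networks and vectors standing in for images; all dimensions, step sizes, and optimizers are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Placeholders: 16-dim inputs x, 16-dim outputs y, 8-dim noise z.
energy = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 1))
initializer = nn.Sequential(nn.Linear(24, 128), nn.ReLU(), nn.Linear(128, 16))
opt_e = torch.optim.Adam(energy.parameters(), lr=1e-4)
opt_i = torch.optim.Adam(initializer.parameters(), lr=1e-4)

def refine(x, y0, steps=30, eps=0.02):
    """Slow thinking: Langevin sampling of the conditional EBM from the fast guess."""
    y = y0.detach().requires_grad_(True)
    for _ in range(steps):
        g, = torch.autograd.grad(energy(torch.cat([x, y], 1)).sum(), y)
        y = (y - 0.5 * eps**2 * g + eps * torch.randn_like(y)
             ).detach().requires_grad_(True)
    return y.detach()

def train_step(x, y_obs):
    z = torch.randn(x.shape[0], 8)
    y0 = initializer(torch.cat([x, z], 1))    # fast thinking: direct mapping
    y_ref = refine(x, y0)                     # slow thinking: iterative refinement
    # Solver learns from the difference between observed and refined outputs.
    loss_e = (energy(torch.cat([x, y_obs], 1)).mean()
              - energy(torch.cat([x, y_ref], 1)).mean())
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()
    # Initializer learns from how the solver refined its guess (same noise z).
    loss_i = ((initializer(torch.cat([x, z], 1)) - y_ref) ** 2).mean()
    opt_i.zero_grad(); loss_i.backward(); opt_i.step()
```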


Subjects
Algorithms
6.
IEEE Trans Pattern Anal Mach Intell ; 44(5): 2468-2484, 2022 May.
Article in English | MEDLINE | ID: mdl-33320811

ABSTRACT

3D data, which contains rich geometric information about objects and scenes, is valuable for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly important to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a deep 3D energy-based model to represent volumetric shapes. The maximum likelihood training of the model follows an "analysis by synthesis" scheme. The benefits of the proposed model are six-fold: first, unlike GANs and VAEs, the model training does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by Markov chain Monte Carlo (MCMC); third, the conditional model can be applied to 3D object recovery and super-resolution; fourth, the model can serve as a building block in a multi-grid modeling and sampling framework for high-resolution 3D shape synthesis; fifth, the model can be used to train a 3D generator via MCMC teaching; sixth, the model, trained without supervision, provides a powerful feature extractor for 3D data, useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and is useful for a wide variety of 3D shape analysis tasks.

7.
IEEE Trans Pattern Anal Mach Intell ; 44(6): 2827-2840, 2022 06.
Article in English | MEDLINE | ID: mdl-33400648

ABSTRACT

This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images. Considering the intrinsic complexity and structural nature of the task, we introduce a cascaded parsing network (CP-HOI) for multi-stage, structured HOI understanding. At each cascade stage, an instance detection module progressively refines HOI proposals and feeds them into a structured interaction reasoning module. Each of the two modules is also connected to its predecessor in the previous stage, enabling efficient cross-stage information propagation. The structured interaction reasoning module is built upon a graph parsing neural network (GPNN), which models potential HOI structures as graphs and mines rich context for comprehensive relation understanding. In particular, GPNN infers a parse graph that i) interprets meaningful HOI structures through a learnable adjacency matrix, and ii) predicts action (edge) labels. Within an end-to-end, message-passing framework, GPNN blends learning and inference, iteratively parsing HOI structures and reasoning over HOI representations (i.e., instance and relation features). Going beyond relation detection at the bounding-box level, we make our framework flexible enough to perform fine-grained pixel-wise relation segmentation, which offers a new perspective on relation modeling. A preliminary version of our CP-HOI model won first place in the ICCV 2019 Person in Context Challenge, on both relation detection and segmentation. In addition, CP-HOI shows promising results on two popular HOI recognition benchmarks, i.e., V-COCO and HICO-DET.
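
The graph-parsing core — a learnable soft adjacency plus message passing — might be sketched as below; the single-layer GRU update and the dimensions are simplifications, not the full CP-HOI architecture.

```python
import torch
import torch.nn as nn

class GraphParser(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.edge_score = nn.Linear(2 * d, 1)   # learnable adjacency over node pairs
        self.message = nn.Linear(d, d)
        self.update = nn.GRUCell(d, d)

    def forward(self, nodes):                   # nodes: (n, d) instance features
        n, d = nodes.shape
        pairs = torch.cat([nodes.unsqueeze(1).expand(n, n, d),
                           nodes.unsqueeze(0).expand(n, n, d)], dim=-1)
        adj = torch.sigmoid(self.edge_score(pairs)).squeeze(-1)  # (n, n) parse graph
        msgs = adj @ self.message(nodes)        # aggregate soft-neighbor messages
        return self.update(msgs, nodes), adj    # refined features + inferred structure

feats, adjacency = GraphParser()(torch.randn(5, 256))
```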


Subjects
Algorithms; Neural Networks, Computer; Humans; Learning; Visual Perception
8.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6327-6344, 2022 10.
Article in English | MEDLINE | ID: mdl-34106844

ABSTRACT

In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation from a monocular RGB image. Our model takes an estimated 2D pose as input and learns a generalized 2D-to-3D mapping function that lifts it to a 3D pose. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of bi-directional RNNs (BRNNs) on top that explicitly incorporates knowledge about human body configuration (i.e., kinematics, symmetry, motor coordination). The model thus enforces high-level constraints over human poses. For learning, we develop a data augmentation algorithm that further improves robustness to appearance variations and cross-view generalization. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol for the cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods struggle under this setting, while our method handles these challenges well.
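
As a skeletal sketch of the lifting step, a single bidirectional GRU over the joint sequence stands in here for the paper's hierarchy of BRNNs and its explicit kinematic, symmetry, and coordination constraints; sizes are illustrative.

```python
import torch
import torch.nn as nn

class PoseLifter(nn.Module):
    def __init__(self, n_joints=17, hidden=128):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())  # per-joint 2D features
        self.brnn = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, 3)                        # 3D joint position

    def forward(self, pose2d):               # (batch, n_joints, 2)
        h, _ = self.brnn(self.base(pose2d))  # joints read as a sequence, both directions
        return self.head(h)                  # (batch, n_joints, 3)

pose3d = PoseLifter()(torch.randn(8, 17, 2))
```

The bidirectional pass lets each joint's 3D estimate draw on body-wide context rather than its 2D coordinates alone.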


Subjects
Algorithms; Posture; Biomechanical Phenomena; Humans
9.
IEEE Trans Pattern Anal Mach Intell ; 44(7): 3508-3522, 2022 07.
Article in English | MEDLINE | ID: mdl-33513100

ABSTRACT

Modeling the human structure is central to human parsing, which extracts pixel-wise semantic information from images. We start by analyzing three types of inference processes over the hierarchical structure of human bodies: direct inference (directly predicting human semantic parts from image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). We then formulate the problem as a compositional neural information fusion (CNIF) framework, which assembles the information from the three inference processes in a conditional manner, i.e., considering the confidence of the sources. Based on CNIF, we further present a part-relation-aware human parser (PRHP), which precisely describes three kinds of human part relations, i.e., decomposition, composition, and dependency, with three distinct relation networks. Expressive relation information can be captured by constraining the parameters of the relation networks to satisfy the specific geometric characteristics of the different relations. By combining generic message-passing networks with their edge-typed, convolutional counterparts, PRHP performs iterative reasoning over the human body hierarchy. With these efforts, PRHP provides a more general and powerful form of CNIF, and lays the foundation for reasoning over more sophisticated and flexible human relation patterns. Experiments on five datasets demonstrate that our two human parsers outperform the state of the art in all cases.
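
A toy version of the conditional fusion — combining the three inference sources with confidence-derived weights — is sketched below; the softmax gate is a stand-in assumption for the paper's fusion module, not its actual design.

```python
import torch
import torch.nn as nn

class ConditionalFusion(nn.Module):
    def __init__(self, n_classes=20):
        super().__init__()
        self.gate = nn.Linear(3 * n_classes, 3)  # one confidence score per source

    def forward(self, direct, bottom_up, top_down):    # each: (batch, n_classes) logits
        stacked = torch.stack([direct, bottom_up, top_down], dim=1)
        w = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)  # (batch, 3) weights
        return (w.unsqueeze(-1) * stacked).sum(dim=1)             # fused logits

fused = ConditionalFusion()(torch.randn(4, 20), torch.randn(4, 20), torch.randn(4, 20))
```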


Subjects
Algorithms; Semantics; Humans; Software
10.
Cogn Psychol ; 128: 101398, 2021 08.
Article in English | MEDLINE | ID: mdl-34217107

ABSTRACT

One of the great feats of human perception is the generation of quick impressions of both physical and social events based on sparse displays of motion trajectories. Here we aim to provide a unified theory that captures the interconnections between perception of physical and social events. A simulation-based approach is used to generate a variety of animations depicting rich behavioral patterns. Human experiments used these animations to reveal that perception of dynamic stimuli undergoes a gradual transition from physical to social events. A learning-based computational framework is proposed to account for human judgments. The model learns to identify latent forces by inferring a family of potential functions capturing physical laws, and value functions describing the goals of agents. The model projects new animations into a sociophysical space with two psychological dimensions: an intuitive sense of whether physical laws are violated, and an impression of whether an agent possesses intentions to perform goal-directed actions. This derived sociophysical space predicts a meaningful partition between physical and social events, as well as a gradual transition from physical to social perception. The space also predicts human judgments of whether individual objects are lifeless objects in motion, or human agents performing goal-directed actions. These results demonstrate that a theoretical unification based on physical potential functions and goal-related values can account for the human ability to form an immediate impression of physical and social events. This ability provides an important pathway from perception to higher cognition.
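
The two psychological dimensions can be caricatured as two simple functionals of a trajectory; both scoring rules below are deliberately crude stand-ins for the paper's learned potential and value functions.

```python
import numpy as np

def physical_violation(traj, dt=0.1):
    """Residual after explaining accelerations with a constant force field."""
    vel = np.diff(traj, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    force = acc.mean(axis=0)                      # best constant-force fit
    return np.linalg.norm(acc - force, axis=1).mean()

def intentionality(traj, goal):
    """Fraction of steps that reduce distance to a putative goal."""
    d = np.linalg.norm(traj - goal, axis=1)
    return float((np.diff(d) < 0).mean())

t = np.linspace(0, 1, 50)[:, None]
ballistic = np.hstack([t, t - 0.5 * 9.8 * t**2])  # physics-like: a parabola
print(physical_violation(ballistic))              # near zero: physics explains it
print(intentionality(ballistic, goal=np.array([0.0, 1.0])))  # low: no goal pursuit
```

Low scores on both axes read as an inanimate object obeying physics; high intentionality with physical violations reads as a goal-directed agent.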


Subjects
Cognition; Judgment; Humans; Intention; Motivation; Social Perception
11.
IEEE Trans Pattern Anal Mach Intell ; 43(2): 516-531, 2021 Feb.
Article in English | MEDLINE | ID: mdl-31425020

ABSTRACT

Video sequences contain rich dynamic patterns, such as dynamic texture patterns that exhibit stationarity in the temporal domain, and action patterns that are non-stationary in either spatial or temporal domain. We show that an energy-based spatial-temporal generative ConvNet can be used to model and synthesize dynamic patterns. The model defines a probability distribution on the video sequence, and the log probability is defined by a spatial-temporal ConvNet that consists of multiple layers of spatial-temporal filters to capture spatial-temporal patterns of different scales. The model can be learned from the training video sequences by an "analysis by synthesis" learning algorithm that iterates the following two steps. Step 1 synthesizes video sequences from the currently learned model. Step 2 then updates the model parameters based on the difference between the synthesized video sequences and the observed training sequences. We show that the learning algorithm can synthesize realistic dynamic patterns. We also show that it is possible to learn the model from incomplete training sequences with either occluded pixels or missing frames, so that model learning and pattern completion can be accomplished simultaneously.
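
The two-step loop admits a compact sketch. Here a small MLP scorer f(x) stands in for the spatial-temporal ConvNet, sequences are flattened to vectors, and all hyperparameters are illustrative; the model is p(x) proportional to exp(f(x)).

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

def synthesize(n, steps=60, eps=0.02):
    """Step 1: sample from the current model by Langevin dynamics (ascend f)."""
    x = torch.randn(n, 64, requires_grad=True)
    for _ in range(steps):
        g, = torch.autograd.grad(net(x).sum(), x)
        x = (x + 0.5 * eps**2 * g + eps * torch.randn_like(x)
             ).detach().requires_grad_(True)
    return x.detach()

def train_step(data):
    """Step 2: update parameters from the synthesized-vs-observed difference."""
    x_syn = synthesize(data.shape[0])
    loss = net(x_syn).mean() - net(data).mean()  # approximate ML gradient
    opt.zero_grad(); loss.backward(); opt.step()

for _ in range(10):
    train_step(torch.randn(32, 64) + 2.0)        # toy "observed" data
```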

12.
IEEE Trans Pattern Anal Mach Intell ; 43(8): 2538-2554, 2021 08.
Article in English | MEDLINE | ID: mdl-32142420

ABSTRACT

Detection, parsing, and future prediction on sequence data (e.g., videos) require algorithms that capture the non-Markovian and compositional properties of high-level semantics. Context-free grammars are natural choices for capturing such properties, but traditional grammar parsers (e.g., the Earley parser) only take symbolic sentences as input. In this paper, we generalize the Earley parser to parse sequence data that is neither segmented nor labeled. Given the output of an arbitrary probabilistic classifier, this generalized Earley parser finds the optimal segmentation and labels in the language defined by the input grammar. Based on the parsing results, it makes top-down future predictions. The proposed method is generic, principled, and widely applicable. Experimental results clearly show the benefit of our method for both human activity parsing and prediction on three video datasets.
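
For orientation, here is a bare-bones classical Earley recognizer — the symbolic algorithm being generalized. The paper's parser, roughly speaking, replaces the scan step's exact terminal match with prefix probabilities from a frame-level classifier; the activity-like toy grammar below is an assumption for illustration.

```python
GRAMMAR = {"S": [["reach", "grasp", "S"], ["reach", "grasp"]]}
TERMINALS = {"reach", "grasp"}

def add(chart_set, item):
    if item in chart_set:
        return False
    chart_set.add(item)
    return True

def earley(tokens, start="S"):
    # A state is (head, body, dot position, origin index).
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in GRAMMAR[start]:
        chart[0].add((start, tuple(body), 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:                                  # predict + complete to fixpoint
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body) and body[dot] not in TERMINALS:      # predict
                    for b in GRAMMAR[body[dot]]:
                        changed |= add(chart[i], (body[dot], tuple(b), 0, i))
                elif dot == len(body):                                  # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            changed |= add(chart[i], (h2, b2, d2 + 1, o2))
        if i < len(tokens):                                             # scan
            for head, body, dot, origin in chart[i]:
                if dot < len(body) and body[dot] == tokens[i]:
                    chart[i + 1].add((head, body, dot + 1, origin))
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])

print(earley(["reach", "grasp", "reach", "grasp"]))  # True
```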


Subjects
Algorithms; Software; Human Activities; Humans; Semantics
13.
IEEE Trans Pattern Anal Mach Intell ; 43(10): 3416-3431, 2021 10.
Article in English | MEDLINE | ID: mdl-32224452

ABSTRACT

This paper proposes a generic method to learn interpretable convolutional filters in a deep convolutional neural network (CNN) for object classification, where each interpretable filter encodes the features of a specific object part. Our method does not require additional annotations of object parts or textures for supervision; instead, we use the same training data as traditional CNNs. During learning, our method automatically assigns each interpretable filter in a high conv-layer to an object part of a certain category. Such explicit knowledge representations in the conv-layers of a CNN help people clarify the logic encoded in the CNN, i.e., what patterns the CNN extracts from an input image and uses for prediction. We have tested our method on benchmark CNNs with various architectures to demonstrate its broad applicability. Experiments show that our interpretable filters are much more semantically meaningful than traditional filters.

14.
IEEE Trans Pattern Anal Mach Intell ; 43(11): 3863-3877, 2021 Nov.
Article in English | MEDLINE | ID: mdl-32386138

ABSTRACT

This paper introduces an explanatory graph representation to reveal the object parts encoded inside the convolutional layers of a CNN. Given a pre-trained CNN, each filter in a conv-layer usually represents a mixture of object parts. We develop a simple yet effective method to learn an explanatory graph, which automatically disentangles object parts from each filter without any part annotations. Specifically, given the feature map of a filter, we mine neural activations from the feature map that correspond to different object parts. The explanatory graph is constructed to organize each mined part as a graph node. Each edge connects two nodes whose corresponding object parts usually co-activate and keep a stable spatial relationship. Experiments show that each graph node consistently represents the same object part across different images, which boosts the transferability of CNN features. Transferring the object-part features in the explanatory graph to the task of part localization, our method significantly outperformed other approaches.

15.
IEEE Trans Pattern Anal Mach Intell ; 43(11): 3949-3963, 2021 11.
Article in English | MEDLINE | ID: mdl-32396071

ABSTRACT

In this paper, we present a method to mine object-part patterns from the conv-layers of a pre-trained convolutional neural network (CNN). The mined object-part patterns are organized in an And-Or graph (AOG). This interpretable AOG representation consists of a four-layer semantic hierarchy, i.e., semantic parts, part templates, latent patterns, and neural units. The AOG associates each object part with certain neural units in the feature maps of conv-layers, and is constructed with very few annotations of object parts (e.g., 3-20). We develop a question-answering (QA) method that uses active human-computer communication to mine patterns from a pre-trained CNN, in order to explain the features in its conv-layers incrementally. During the learning process, our QA method uses the current AOG for part localization and actively identifies objects whose feature maps cannot be explained by the AOG. Our method then asks people to annotate parts on the unexplained objects, and uses the answers to discover CNN patterns corresponding to the newly labeled parts. In this way, our method gradually grows new branches and refines existing branches of the AOG to semanticize CNN representations. In experiments, our method exhibited high learning efficiency: it used about 1/6 to 1/3 of the part annotations for training, but achieved similar or better part-localization performance than fast R-CNN methods.

16.
IEEE Trans Pattern Anal Mach Intell ; 42(1): 27-45, 2020 Jan.
Article in English | MEDLINE | ID: mdl-30387724

ABSTRACT

This paper studies the cooperative training of two generative models for image modeling and synthesis. Both models are parametrized by convolutional neural networks (ConvNets). The first model is a deep energy-based model, whose energy function is defined by a bottom-up ConvNet, which maps the observed image to the energy. We call it the descriptor network. The second model is a generator network, which is a non-linear version of factor analysis. It is defined by a top-down ConvNet, which maps the latent factors to the observed image. The maximum likelihood learning algorithms of both models involve MCMC sampling such as Langevin dynamics. We observe that the two learning algorithms can be seamlessly interwoven into a cooperative learning algorithm that can train both models simultaneously. Specifically, within each iteration of the cooperative learning algorithm, the generator model generates initial synthesized examples to initialize a finite-step MCMC that samples and trains the energy-based descriptor model. After that, the generator model learns from how the MCMC changes its synthesized examples. That is, the descriptor model teaches the generator model by MCMC, so that the generator model accumulates the MCMC transitions and reproduces them by direct ancestral sampling. We call this scheme MCMC teaching. We show that the cooperative algorithm can learn highly realistic generative models.
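
One cooperative iteration can be sketched as follows, with tiny MLPs and flattened vectors in place of the two ConvNets and images; hyperparameters are illustrative. The generator regresses toward the revised samples under the same latent z, which is the "MCMC teaching" signal.

```python
import torch
import torch.nn as nn

descriptor = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 64))
opt_d = torch.optim.Adam(descriptor.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)

def langevin(x, steps=30, eps=0.02):
    """Finite-step Langevin dynamics under the descriptor's score."""
    x = x.detach().requires_grad_(True)
    for _ in range(steps):
        g, = torch.autograd.grad(descriptor(x).sum(), x)
        x = (x + 0.5 * eps**2 * g + eps * torch.randn_like(x)
             ).detach().requires_grad_(True)
    return x.detach()

def coop_step(data):
    z = torch.randn(data.shape[0], 16)
    x0 = generator(z)                  # generator initializes the MCMC chain
    x1 = langevin(x0)                  # descriptor revises the synthesis
    # Descriptor learns from revised samples vs. observed data.
    loss_d = descriptor(x1).mean() - descriptor(data).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator absorbs the MCMC transition: same z, regress toward x1.
    loss_g = ((generator(z) - x1) ** 2).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```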

17.
Sci Robot ; 4(37)2019 Dec 18.
Article in English | MEDLINE | ID: mdl-33137717

ABSTRACT

The ability to provide comprehensive explanations of chosen actions is a hallmark of intelligence. Lack of this ability impedes the general acceptance of AI and robot systems in critical tasks. This paper examines what forms of explanations best foster human trust in machines and proposes a framework in which explanations are generated from both functional and mechanistic perspectives. The robot system learns from human demonstrations to open medicine bottles using (i) an embodied haptic prediction model to extract knowledge from sensory feedback, (ii) a stochastic grammar model induced to capture the compositional structure of a multistep task, and (iii) an improved Earley parsing algorithm to jointly leverage both the haptic and grammar models. The robot system not only shows the ability to learn from human demonstrators but also succeeds in opening new, unseen bottles. Using different forms of explanations generated by the robot system, we conducted a psychological experiment to examine what forms of explanations best foster human trust in the robot. We found that comprehensive and real-time visualizations of the robot's internal decisions were more effective in promoting human trust than explanations based on summary text descriptions. In addition, forms of explanation that are best suited to foster trust do not necessarily correspond to the model components contributing to the best task performance. This divergence shows a need for the robotics community to integrate model components to enhance both task execution and human trust in machines.

18.
IEEE Trans Pattern Anal Mach Intell ; 40(7): 1639-1652, 2018 07.
Article in English | MEDLINE | ID: mdl-28727549

ABSTRACT

This paper presents a method for localizing functional objects and predicting human intents and trajectories in surveillance videos of public spaces, with no supervision in training. People in public spaces are expected to intentionally take the shortest paths (subject to obstacles) toward certain objects (e.g., a vending machine, picnic table, or dumpster) where they can satisfy certain needs (e.g., quenching thirst). Since these objects are typically very small or heavily occluded, they cannot be inferred from their visual appearance, but only indirectly through their influence on people's trajectories. We therefore call them "dark matter", by analogy to cosmology, since their presence can only be observed as attractive or repulsive "fields" in the public space. A person in the scene is modeled as an intelligent agent engaged in one of the "fields", selected depending on his/her intent. An agent's trajectory is derived from agent-based Lagrangian mechanics. Agents can change their intents mid-motion and thus alter their trajectories. For evaluation, we compiled and annotated a new dataset. The results demonstrate the effectiveness of our method in predicting human intents and trajectories, and in localizing and discovering distinct types of "dark matter" in wide public spaces.
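
In the forward direction, this physics reads as gradient descent on a sum of attractive and repulsive potentials. A hand-set toy simulation is below; the paper solves the harder inverse problem of recovering the goals and fields from observed trajectories, and the Gaussian obstacle bump is purely an illustrative assumption.

```python
import numpy as np

def grad_U(p, goal, obstacles, k=2.0):
    """Gradient of U(p) = 0.5*|p-goal|^2 + sum_i k*exp(-|p-o_i|^2)."""
    g = p - goal                           # attractive "dark matter" field
    for o in obstacles:
        d = p - o
        g += -2.0 * k * d * np.exp(-d @ d)  # repulsive bump around the obstacle
    return g

def simulate(start, goal, obstacles, lr=0.05, steps=200):
    p, path = np.array(start, float), []
    for _ in range(steps):
        p = p - lr * grad_U(p, np.array(goal, float), obstacles)
        path.append(p.copy())
    return np.array(path)

path = simulate(start=[0, 0], goal=[5, 5], obstacles=[np.array([2.3, 2.7])])
print(path[-1])   # converges near the goal, deflected by the obstacle's field
```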


Subjects
Human Activities/classification; Image Processing, Computer-Assisted/methods; Intention; Pattern Recognition, Automated/methods; Video Recording/methods; Cluster Analysis; Databases, Factual; Humans
19.
IEEE Trans Pattern Anal Mach Intell ; 40(7): 1555-1569, 2018 07.
Article in English | MEDLINE | ID: mdl-28749346

ABSTRACT

This paper presents an attribute and-or grammar (A-AOG) model for jointly inferring human body pose and human attributes in a parse graph, with attributes augmented to nodes in the hierarchical representation. In contrast to other popular methods in the current literature that train separate classifiers for poses and individual attributes, our method explicitly represents the decomposition and articulation of body parts and accounts for the correlations between poses and attributes. The A-AOG model is an amalgamation of three traditional grammar formulations: (i) a phrase structure grammar representing the hierarchical decomposition of the human body from whole to parts; (ii) a dependency grammar modeling the geometric articulation by a kinematic graph of the body pose; and (iii) an attribute grammar accounting for the compatibility relations between different parts in the hierarchy, so that their appearances follow a consistent style. The parse graph outputs human detection, pose estimation, and attribute prediction simultaneously, in a form that is intuitive and interpretable. We conduct experiments on two tasks across two datasets, and the results demonstrate the advantage of joint modeling over computing poses and attributes independently. Furthermore, our model outperforms existing methods on both pose estimation and attribute prediction.


Subjects
Image Processing, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Posture/physiology; Adult; Aged; Child; Databases, Factual; Female; Humans; Male; Models, Theoretical; Terminology as Topic
20.
IEEE Trans Pattern Anal Mach Intell ; 40(3): 710-725, 2018 03.
Article in English | MEDLINE | ID: mdl-28368817

ABSTRACT

In this paper, we present an attribute grammar for solving two coupled tasks: i) parsing a 2D image into semantic regions, and ii) recovering the 3D scene structures of all regions. The proposed grammar consists of a set of production rules, each describing a kind of spatial relation between planar surfaces in 3D scenes. These production rules are used to decompose an input image into a hierarchical parse graph representation, where each graph node indicates a planar surface or a composite surface. Unlike other stochastic image grammars, the proposed grammar augments each graph node with a set of attribute variables depicting scene-level global geometry, e.g., camera focal length, or local geometry, e.g., surface normals and contact lines between surfaces. These geometric attributes impose constraints between a node and its offspring in the parse graph. Under a probabilistic framework, we develop a Markov chain Monte Carlo method to construct a parse graph that jointly optimizes 2D image recognition and 3D scene reconstruction. We evaluated our method on both public benchmarks and newly collected datasets. Experiments demonstrate that the proposed method achieves state-of-the-art scene reconstruction from a single image.
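
The inference loop is standard Metropolis-Hastings over local parse-graph edits (splitting or merging regions, adjusting attributes). A generic skeleton is sketched below; the posterior and proposal are abstract placeholders, and the one-dimensional toy usage is purely illustrative.

```python
import math
import random

def metropolis_hastings(init, log_post, propose, iters=10_000):
    """Generic MH: propose a local edit, accept with the usual log-ratio test."""
    state, lp = init, log_post(init)
    for _ in range(iters):
        cand, log_q_ratio = propose(state)   # log q(state|cand) - log q(cand|state)
        lp_cand = log_post(cand)
        log_alpha = lp_cand - lp + log_q_ratio
        if log_alpha >= 0 or random.random() < math.exp(log_alpha):
            state, lp = cand, lp_cand
    return state

# Toy usage: posterior over a single "number of surfaces" attribute.
def log_post(n):
    return -abs(n - 4) * 1.5 if 1 <= n <= 10 else -math.inf

def propose(n):
    return n + random.choice([-1, 1]), 0.0   # symmetric proposal

print(metropolis_hastings(1, log_post, propose))  # concentrates near 4
```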
