Pesquisa | Biblioteca Virtual em Saúde

1.

Online estimation of objective function for continuous-time deterministic systems.

Asl, Hamed Jabbari; Uchibe, Eiji.

Neural Netw ; 172: 106116, 2024 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-38242009

RESUMO

We developed two online data-driven methods for estimating an objective function in continuous-time linear and nonlinear deterministic systems. The primary focus addressed the challenge posed by unknown input dynamics (control mapping function) in the expert system, a critical element for an online solution of the problem. Our methods leverage both the learner's and expert's data for effective problem-solving. The first approach, which is model-free, estimates the expert's policy and integrates it into the learner agent to approximate the objective function associated with the optimal policy. The second approach estimates the input dynamics from the learner's data and combines it with the expert's input-state observations to tackle the objective function estimation problem. Compared to other methods for deterministic systems that rely on both the learner's and expert's data, our approaches offer reduced complexity by eliminating the need to estimate an optimal policy after each objective function update. We conduct a convergence analysis of the estimation techniques using Lyapunov-based methods. Numerical experiments validate the effectiveness of our developed methods.

Assuntos

Redes Neurais de Computação

2.

Deep learning, reinforcement learning, and world models.

Matsuo, Yutaka; LeCun, Yann; Sahani, Maneesh; Precup, Doina; Silver, David; Sugiyama, Masashi; Uchibe, Eiji; Morimoto, Jun.

Neural Netw ; 152: 267-275, 2022 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-35569196

RESUMO

Deep learning (DL) and reinforcement learning (RL) methods seem to be a part of indispensable factors to achieve human-level or super-human AI systems. On the other hand, both DL and RL have strong connections with our brain functions and with neuroscientific findings. In this review, we summarize talks and discussions in the "Deep Learning and Reinforcement Learning" session of the symposium, International Symposium on Artificial Intelligence and Brain Science. In this session, we discussed whether we can achieve comprehensive understanding of human intelligence based on the recent advances of deep learning and reinforcement learning algorithms. Speakers contributed to provide talks about their recent studies that can be key technologies to achieve human-level intelligence.

Assuntos

Inteligência Artificial , Aprendizado Profundo , Algoritmos , Humanos , Reforço Psicológico

3.

Parallel and hierarchical neural mechanisms for adaptive and predictive behavioral control.

Macpherson, Tom; Matsumoto, Masayuki; Gomi, Hiroaki; Morimoto, Jun; Uchibe, Eiji; Hikida, Takatoshi.

Neural Netw ; 144: 507-521, 2021 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-34601363

RESUMO

Our brain can be recognized as a network of largely hierarchically organized neural circuits that operate to control specific functions, but when acting in parallel, enable the performance of complex and simultaneous behaviors. Indeed, many of our daily actions require concurrent information processing in sensorimotor, associative, and limbic circuits that are dynamically and hierarchically modulated by sensory information and previous learning. This organization of information processing in biological organisms has served as a major inspiration for artificial intelligence and has helped to create in silico systems capable of matching or even outperforming humans in several specific tasks, including visual recognition and strategy-based games. However, the development of human-like robots that are able to move as quickly as humans and respond flexibly in various situations remains a major challenge and indicates an area where further use of parallel and hierarchical architectures may hold promise. In this article we review several important neural and behavioral mechanisms organizing hierarchical and predictive processing for the acquisition and realization of flexible behavioral control. Then, inspired by the organizational features of brain circuits, we introduce a multi-timescale parallel and hierarchical learning framework for the realization of versatile and agile movement in humanoid robots.

Assuntos

Inteligência Artificial , Robótica , Controle Comportamental , Simulação por Computador , Humanos , Aprendizagem

4.

Forward and inverse reinforcement learning sharing network weights and hyperparameters.

Uchibe, Eiji; Doya, Kenji.

Neural Netw ; 144: 138-153, 2021 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-34492548

RESUMO

This paper proposes model-free imitation learning named Entropy-Regularized Imitation Learning (ERIL) that minimizes the reverse Kullback-Leibler (KL) divergence. ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. An inverse RL step computes the log-ratio between two distributions by evaluating two binary discriminators. The first discriminator distinguishes the state generated by the forward RL step from the expert's state. The second discriminator, which is structured by the theory of entropy regularization, distinguishes the state-action-next-state tuples generated by the learner from the expert ones. One notable feature is that the second discriminator shares hyperparameters with the forward RL, which can be used to control the discriminator's ability. A forward RL step minimizes the reverse KL estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy. Our experimental results on MuJoCo-simulated environments and vision-based reaching tasks with a robotic arm show that ERIL is more sample-efficient than the baseline methods. We apply the method to human behaviors that perform a pole-balancing task and describe how the estimated reward functions show how every subject achieves her goal.

Assuntos

Aprendizagem , Reforço Psicológico , Entropia , Feminino , Humanos , Cadeias de Markov , Recompensa

5.

Modular deep reinforcement learning from reward and punishment for robot navigation.

Wang, Jiexin; Elfwing, Stefan; Uchibe, Eiji.

Neural Netw ; 135: 115-126, 2021 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-33383526

RESUMO

Modular Reinforcement Learning decomposes a monolithic task into several tasks with sub-goals and learns each one in parallel to solve the original problem. Such learning patterns can be traced in the brains of animals. Recent evidence in neuroscience shows that animals utilize separate systems for processing rewards and punishments, illuminating a different perspective for modularizing Reinforcement Learning tasks. MaxPain and its deep variant, Deep MaxPain, showed the advances of such dichotomy-based decomposing architecture over conventional Q-learning in terms of safety and learning efficiency. These two methods differ in policy derivation. MaxPain linearly unified the reward and punishment value functions and generated a joint policy based on unified values; Deep MaxPain tackled scaling problems in high-dimensional cases by linearly forming a joint policy from two sub-policies obtained from their value functions. However, the mixing weights in both methods were determined manually, causing inadequate use of the learned modules. In this work, we discuss the signal scaling of reward and punishment related to discounting factor Î³, and propose a weak constraint for signaling design. To further exploit the learning models, we propose a state-value dependent weighting scheme that automatically tunes the mixing weights: hard-max and softmax based on a case analysis of Boltzmann distribution. We focus on maze-solving navigation tasks and investigate how two metrics (pain-avoiding and goal-reaching) influence each other's behaviors during learning. We propose a sensor fusion network structure that utilizes lidar and images captured by a monocular camera instead of lidar-only and image-only sensing. Our results, both in the simulation of three types of mazes with different complexities and a real robot experiment of an L-maze on Turtlebot3 Waffle Pi, showed the improvements of our methods.

Assuntos

Aprendizado Profundo , Punição , Reforço Psicológico , Recompensa , Robótica/métodos , Animais , Simulação por Computador , Aprendizagem em Labirinto/fisiologia , Motivação/fisiologia

6.

Constrained Deep Q-Learning Gradually Approaching Ordinary Q-Learning.

Ohnishi, Shota; Uchibe, Eiji; Yamaguchi, Yotaro; Nakanishi, Kosuke; Yasui, Yuji; Ishii, Shin.

Front Neurorobot ; 13: 103, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-31920613

RESUMO

A deep Q network (DQN) (Mnih et al., 2013) is an extension of Q learning, which is a typical deep reinforcement learning method. In DQN, a Q function expresses all action values under all states, and it is approximated using a convolutional neural network. Using the approximated Q function, an optimal policy can be derived. In DQN, a target network, which calculates a target value and is updated by the Q function at regular intervals, is introduced to stabilize the learning process. A less frequent updates of the target network would result in a more stable learning process. However, because the target value is not propagated unless the target network is updated, DQN usually requires a large number of samples. In this study, we proposed Constrained DQN that uses the difference between the outputs of the Q function and the target network as a constraint on the target value. Constrained DQN updates parameters conservatively when the difference between the outputs of the Q function and the target network is large, and it updates them aggressively when this difference is small. In the proposed method, as learning progresses, the number of times that the constraints are activated decreases. Consequently, the update method gradually approaches conventional Q learning. We found that Constrained DQN converges with a smaller training dataset than in the case of DQN and that it is robust against changes in the update frequency of the target network and settings of a certain parameter of the optimizer. Although Constrained DQN alone does not show better performance in comparison to integrated approaches nor distributed methods, experimental results show that Constrained DQN can be used as an additional components to those methods.

7.

Cooperative and Competitive Reinforcement and Imitation Learning for a Mixture of Heterogeneous Learning Modules.

Uchibe, Eiji.

Front Neurorobot ; 12: 61, 2018.

Artigo em Inglês | MEDLINE | ID: mdl-30319389

RESUMO

This paper proposes Cooperative and competitive Reinforcement And Imitation Learning (CRAIL) for selecting an appropriate policy from a set of multiple heterogeneous modules and training all of them in parallel. Each learning module has its own network architecture and improves the policy based on an off-policy reinforcement learning algorithm and behavior cloning from samples collected by a behavior policy that is constructed by a combination of all the policies. Since the mixing weights are determined by the performance of the module, a better policy is automatically selected based on the learning progress. Experimental results on a benchmark control task show that CRAIL successfully achieves fast learning by allowing modules with complicated network structures to exploit task-relevant samples for training.

8.

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.

Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji.

Neural Netw ; 107: 3-11, 2018 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-29395652

RESUMO

In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, and softmax action selection can be competitive with DQN, without the need for a separate target network. We validate our proposed approach by, first, achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10 × 10 board, using TD(λ) learning and shallow dSiLU network agents, and, then, by outperforming DQN in the Atari 2600 domain by using a deep Sarsa(λ) agent with SiLU and dSiLU hidden units.

Assuntos

Aprendizado Profundo , Redes Neurais de Computação

9.

Adaptive Baseline Enhances EM-Based Policy Search: Validation in a View-Based Positioning Task of a Smartphone Balancer.

Wang, Jiexin; Uchibe, Eiji; Doya, Kenji.

Front Neurorobot ; 11: 1, 2017.

Artigo em Inglês | MEDLINE | ID: mdl-28167910

RESUMO

EM-based policy search methods estimate a lower bound of the expected return from the histories of episodes and iteratively update the policy parameters using the maximum of a lower bound of expected return, which makes gradient calculation and learning rate tuning unnecessary. Previous algorithms like Policy learning by Weighting Exploration with the Returns, Fitness Expectation Maximization, and EM-based Policy Hyperparameter Exploration implemented the mechanisms to discard useless low-return episodes either implicitly or using a fixed baseline determined by the experimenter. In this paper, we propose an adaptive baseline method to discard worse samples from the reward history and examine different baselines, including the mean, and multiples of SDs from the mean. The simulation results of benchmark tasks of pendulum swing up and cart-pole balancing, and standing up and balancing of a two-wheeled smartphone robot showed improved performances. We further implemented the adaptive baseline with mean in our two-wheeled smartphone robot hardware to test its performance in the standing up and balancing task, and a view-based approaching task. Our results showed that with adaptive baseline, the method outperformed the previous algorithms and achieved faster, and more precise behaviors at a higher successful rate.

10.

From free energy to expected energy: Improving energy-based value function approximation in reinforcement learning.

Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji.

Neural Netw ; 84: 17-27, 2016 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-27639720

RESUMO

Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state and action spaces. However, the FERL method does only really work well with binary, or close to binary, state input, where the number of active states is fewer than the number of non-active states. In the FERL method, the value function is approximated by the negative free energy of a restricted Boltzmann machine (RBM). In our earlier study, we demonstrated that the performance and the robustness of the FERL method can be improved by scaling the free energy by a constant that is related to the size of network. In this study, we propose that RBM function approximation can be further improved by approximating the value function by the negative expected energy (EERL), instead of the negative free energy, as well as being able to handle continuous state input. We validate our proposed method by demonstrating that EERL: (1) outperforms FERL, as well as standard neural network and linear function approximation, for three versions of a gridworld task with high-dimensional image state input; (2) achieves new state-of-the-art results in stochastic SZ-Tetris in both model-free and model-based learning settings; and (3) significantly outperforms FERL and standard neural network function approximation for a robot navigation task with raw and noisy RGB images as state input and a large number of actions.

Assuntos

Aprendizado de Máquina , Estimulação Luminosa/métodos , Reforço Psicológico , Humanos , Aprendizagem , Aprendizado de Máquina/tendências , Modelos Teóricos , Redes Neurais de Computação , Distribuição Aleatória

11.

Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task.

Kinjo, Ken; Uchibe, Eiji; Doya, Kenji.

Front Neurorobot ; 7: 7, 2013.

Artigo em Inglês | MEDLINE | ID: mdl-23576983

RESUMO

Linearly solvable Markov Decision Process (LMDP) is a class of optimal control problem in which the Bellman's equation can be converted into a linear equation by an exponential transformation of the state value function (Todorov, 2009b). In an LMDP, the optimal value function and the corresponding control policy are obtained by solving an eigenvalue problem in a discrete state space or an eigenfunction problem in a continuous state using the knowledge of the system dynamics and the action, state, and terminal cost functions. In this study, we evaluate the effectiveness of the LMDP framework in real robot control, in which the dynamics of the body and the environment have to be learned from experience. We first perform a simulation study of a pole swing-up task to evaluate the effect of the accuracy of the learned dynamics model on the derived the action policy. The result shows that a crude linear approximation of the non-linear dynamics can still allow solution of the task, despite with a higher total cost. We then perform real robot experiments of a battery-catching task using our Spring Dog mobile robot platform. The state is given by the position and the size of a battery in its camera view and two neck joint angles. The action is the velocities of two wheels, while the neck joints were controlled by a visual servo controller. We test linear and bilinear dynamic models in tasks with quadratic and Guassian state cost functions. In the quadratic cost task, the LMDP controller derived from a learned linear dynamics model performed equivalently with the optimal linear quadratic regulator (LQR). In the non-quadratic task, the LMDP controller with a linear dynamics model showed the best performance. The results demonstrate the usefulness of the LMDP framework in real robot control even when simple linear models are used for dynamics learning.

12.

Scaled free-energy based reinforcement learning for robust and efficient learning in high-dimensional state spaces.

Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji.

Front Neurorobot ; 7: 3, 2013.

Artigo em Inglês | MEDLINE | ID: mdl-23450126

RESUMO

Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state- and action spaces, which cannot be handled by standard function approximation methods. In this study, we propose a scaled version of free-energy based reinforcement learning to achieve more robust and more efficient learning performance. The action-value function is approximated by the negative free-energy of a restricted Boltzmann machine, divided by a constant scaling factor that is related to the size of the Boltzmann machine (the square root of the number of state nodes in this study). Our first task is a digit floor gridworld task, where the states are represented by images of handwritten digits from the MNIST data set. The purpose of the task is to investigate the proposed method's ability, through the extraction of task-relevant features in the hidden layer, to cluster images of the same digit and to cluster images of different digits that corresponds to states with the same optimal action. We also test the method's robustness with respect to different exploration schedules, i.e., different settings of the initial temperature and the temperature discount rate in softmax action selection. Our second task is a robot visual navigation task, where the robot can learn its position by the different colors of the lower part of four landmarks and it can infer the correct corner goal area by the color of the upper part of the landmarks. The state space consists of binarized camera images with, at most, nine different colors, which is equal to 6642 binary states. For both tasks, the learning performance is compared with standard FERL and with function approximation where the action-value function is approximated by a two-layered feedforward neural network.

13.

Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning.

Morimura, Tetsuro; Uchibe, Eiji; Yoshimoto, Junichiro; Peters, Jan; Doya, Kenji.

Neural Comput ; 22(2): 342-76, 2010 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-19842990

RESUMO

Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the average reward gradient with respect to the policy parameter. That term involves the derivative of the stationary state distribution that corresponds to the sensitivity of its distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate gamma for the value functions close to 1, these algorithms do not permit gamma to be set exactly at gamma = 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD) as a useful form of the derivative of the stationary state distribution through backward Markov chain formulation and a temporal difference learning framework. A new policy gradient (PG) framework with an LSD is also proposed, in which the average reward gradient can be estimated by setting gamma = 0, so it becomes unnecessary to learn the value functions. We also test the performance of the proposed algorithms using simple benchmark tasks and show that these can improve the performances of existing PG methods.

Assuntos

Algoritmos , Inteligência Artificial , Modelos Teóricos , Redes Neurais de Computação , Simulação por Computador , Conceitos Matemáticos , Reforço Psicológico , Recompensa

14.

Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

Uchibe, Eiji; Doya, Kenji.

Neural Netw ; 21(10): 1447-55, 2008 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-19013054

RESUMO

Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors.

Assuntos

Inteligência Artificial , Modelos Teóricos , Robótica , Algoritmos , Computadores , Fontes de Energia Elétrica , Retroalimentação , Reforço Psicológico , Software , Processos Estocásticos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA