Results 1 - 10 of 10
1.
Article in English | MEDLINE | ID: mdl-38231811

ABSTRACT

We focus on learning a zero-constraint-violation safe policy in model-free reinforcement learning (RL). Existing model-free RL studies mostly penalize dangerous actions after the fact, which means the agent must experience danger in order to learn from it; consequently, they cannot learn a zero-violation safe policy even after convergence. To handle this problem, we leverage safety-oriented energy functions to learn zero-constraint-violation safe policies and propose the safe set actor-critic (SSAC) algorithm. The energy function is designed to increase rapidly for potentially dangerous actions, which locates the safe set in the action space. Therefore, we can identify dangerous actions before taking them and achieve zero constraint violation. Our major contributions are twofold. First, we learn the energy function with data-driven methods, which removes the requirement of known dynamics. Second, we formulate a constrained RL problem to solve for zero-violation policies. We prove theoretically that our Lagrangian-based constrained RL solutions converge to the constrained optimal zero-violation policies. The proposed algorithm is evaluated on complex simulation environments and in a hardware-in-the-loop (HIL) experiment with a real autonomous-vehicle controller. Experimental results suggest that the converged policies in all environments achieve zero constraint violation and performance comparable to a model-based baseline.
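
As a rough, hedged sketch of the Lagrangian-based constrained update described above (not the authors' implementation: the gradient estimates, learning rates, and multiplier initialization are assumed placeholders), a single dual-ascent step might look as follows:

def lagrangian_step(reward_grad, constraint_grad, constraint_value, lam,
                    lr_policy=1e-3, lr_lambda=1e-2):
    """One dual-ascent step for a constrained RL objective.

    Maximizes E[return] subject to E[safety-energy increase] <= 0 via the
    Lagrangian L = return - lam * constraint; all gradients are assumed to
    be estimated elsewhere from sampled transitions.
    """
    # Ascent direction for the policy parameters under the Lagrangian.
    policy_step = lr_policy * (reward_grad - lam * constraint_grad)
    # Dual ascent: the multiplier grows while the constraint is violated.
    lam = max(0.0, lam + lr_lambda * constraint_value)
    return policy_step, lam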

2.
Article in English | MEDLINE | ID: mdl-38015685

ABSTRACT

Model error and external disturbance have been addressed separately by optimizing a definite H∞ performance index in standard linear H∞ control problems. However, handling both concurrently introduces uncertainty and nonconvexity into the H∞ performance, posing a major challenge for nonlinear problems. This article introduces an additional cost function in the augmented Hamilton-Jacobi-Isaacs (HJI) equation of zero-sum games to manage model error and external disturbance simultaneously in nonlinear robust performance problems. To satisfy the Hamilton-Jacobi inequality of nonlinear robust control theory under all considered model errors, the relationship between the additional cost function and the model uncertainty is revealed. A critic online learning algorithm, which applies Lyapunov stabilizing terms and historical states to reinforce training stability and achieve persistent learning, is proposed to approximate the solution of the augmented HJI equation. By constructing a joint Lyapunov candidate over the critic weights and the system state, both stability and convergence are proved by Lyapunov's second method. Theoretical results also show that introducing historical data reduces the ultimate bounds of the system state and the critic error. Three numerical examples demonstrate the effectiveness of the proposed method.
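
For orientation only, the standard (unaugmented) HJI equation of the underlying zero-sum game is sketched below in LaTeX; the additional cost function introduced in the article enters an augmented version of this equation whose exact form is given in the paper, not here. The dynamics f, g, k, the weights Q, R, and the attenuation level \gamma are generic symbols, not the article's notation.

0 = \min_{u}\max_{w}\Big[ x^{\top} Q x + u^{\top} R u - \gamma^{2} w^{\top} w
    + \nabla V(x)^{\top}\big( f(x) + g(x)\,u + k(x)\,w \big) \Big]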

3.
IEEE Trans Cybern ; PP, 2023 Oct 26.
Article in English | MEDLINE | ID: mdl-37883283

ABSTRACT

Significant advances have recently been made in studying the optimization landscape of policy gradient methods for optimal control of linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent, since the underlying state of the system may not be fully observed in many practical settings. This article analyzes the optimization landscape of policy gradient methods applied to static output feedback (SOF) control of discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L-smoothness, and an M-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel convergence results (with nearly dimension-free rates) to stationary points for three policy gradient methods: the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we prove that the vanilla policy gradient method exhibits linear convergence toward local minima when initialized near such minima. The article concludes with numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within reinforcement learning.
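
To make the SOF cost concrete, the hedged numerical sketch below evaluates J(K) = tr(P_K Sigma0) for a static output-feedback gain u = -K y by solving a discrete Lyapunov equation, and estimates the gradient by finite differences rather than the closed-form expressions analyzed in the article; the matrices, the initial-state covariance Sigma0, and the perturbation size are placeholders.

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def sof_cost(K, A, B, C, Q, R, Sigma0):
    """Quadratic cost J(K) = tr(P_K Sigma0) for u = -K y, y = C x."""
    A_K = A - B @ K @ C
    Q_K = Q + C.T @ K.T @ R @ K @ C
    # P_K solves P = A_K^T P A_K + Q_K (valid only for Schur-stable A_K).
    P_K = solve_discrete_lyapunov(A_K.T, Q_K)
    return np.trace(P_K @ Sigma0)

def finite_diff_gradient(K, *plant, eps=1e-6):
    """Entry-wise central-difference estimate of dJ/dK (a stand-in for the
    closed-form policy gradient studied in the paper)."""
    G = np.zeros_like(K)
    for i in range(K.shape[0]):
        for j in range(K.shape[1]):
            E = np.zeros_like(K)
            E[i, j] = eps
            G[i, j] = (sof_cost(K + E, *plant) - sof_cost(K - E, *plant)) / (2 * eps)
    return G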

4.
IEEE Trans Neural Netw Learn Syst ; 34(9): 5255-5267, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37015565

ABSTRACT

The Hamilton-Jacobi-Bellman (HJB) equation serves as the necessary and sufficient condition for the optimal solution of the continuous-time (CT) optimal control problem (OCP). Compared with the infinite-horizon HJB equation, solving the finite-horizon (FH) HJB equation has been a long-standing challenge, because the partial time derivative of the value function is involved as an additional unknown term. To address this problem, this study is the first to establish the link between the partial time derivative and the terminal-time utility function, which enables the use of the policy iteration (PI) technique to solve CT FH OCPs. Based on this key finding, an FH approximate dynamic programming (ADP) algorithm is proposed within an actor-critic framework. It is shown that the algorithm exhibits important convergence and optimality properties. Importantly, with the use of multilayer neural networks (NNs) in the actor-critic architecture, the algorithm is suitable for CT FH OCPs involving more general nonlinear and complex systems. Finally, the effectiveness of the proposed algorithm is demonstrated through simulations on both a linear quadratic regulator (LQR) problem and a nonlinear vehicle tracking problem.
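
For reference, the generic finite-horizon HJB equation containing the partial time derivative mentioned above is sketched in LaTeX below; l is a stage cost, f the dynamics, and \phi the terminal-time utility (generic symbols, not the article's notation), and the specific link the article establishes between \partial V/\partial t and \phi is stated only in the paper itself.

-\frac{\partial V(x,t)}{\partial t}
  = \min_{u}\Big[ l(x,u) + \nabla_{x} V(x,t)^{\top} f(x,u) \Big],
  \qquad V(x,T) = \phi\big(x(T)\big)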

5.
IEEE Trans Cybern ; 53(2): 859-873, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35439160

ABSTRACT

Decision and control are core functionalities of high-level automated vehicles. Current mainstream methods, such as functional decomposition and end-to-end reinforcement learning (RL), suffer from high time complexity or from poor interpretability and adaptability in real-world autonomous driving tasks. In this article, we present an interpretable and computationally efficient framework called integrated decision and control (IDC) for automated vehicles, which decomposes the driving task into static path planning and dynamic optimal tracking, structured hierarchically. First, the static path planning generates several candidate paths considering only static traffic elements. Then, the dynamic optimal tracking is designed to track the optimal path while considering dynamic obstacles. To that end, we formulate a constrained optimal control problem (OCP) for each candidate path, optimize them separately, and follow the one with the best tracking performance. To offload the heavy online computation, we propose a model-based RL algorithm that serves as an approximate constrained OCP solver. Specifically, the OCPs for all paths are combined into a single complete RL problem and solved offline in the form of value and policy networks, used for real-time online path selection and tracking, respectively. We verify our framework in both simulation and the real world. Results show that, compared with baseline methods, IDC achieves an order of magnitude higher online computing efficiency, as well as better driving performance, including traffic efficiency and safety. In addition, it exhibits strong interpretability and adaptability across different driving scenarios and tasks.
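
As a minimal sketch of the online stage described above (assuming hypothetical value_net and policy_net callables standing in for the offline-trained networks, and treating a lower value as a lower tracking cost), path selection and tracking could be composed as follows:

import numpy as np

def select_and_track(candidate_paths, state, value_net, policy_net):
    """Pick the candidate path with the best learned tracking value, then
    query the policy network for the control action on that path."""
    costs = [value_net(state, path) for path in candidate_paths]
    best = int(np.argmin(costs))   # lowest predicted tracking cost
    control = policy_net(state, candidate_paths[best])
    return best, control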

6.
Article in English | MEDLINE | ID: mdl-35635820

ABSTRACT

Safety is essential for reinforcement learning (RL) applied in the real world. Adding chance constraints (or probabilistic constraints) is a suitable way to enhance RL safety under uncertainty. Existing chance-constrained RL methods, such as penalty methods and Lagrangian methods, either exhibit periodic oscillations or learn an over-conservative or unsafe policy. In this article, we address these shortcomings by proposing a separated proportional-integral Lagrangian (SPIL) algorithm. We first review the constrained policy optimization process from a feedback control perspective, which regards the penalty weight as the control input and the safe probability as the control output. On this basis, the penalty method is formulated as a proportional controller and the Lagrangian method as an integral controller. We then unify them into a proportional-integral Lagrangian method that obtains the merits of both, with an integral separation technique that limits the integral value to a reasonable range. To accelerate training, the gradient of the safe probability is computed in a model-based manner. The convergence of the overall algorithm is analyzed. We demonstrate that our method reduces the oscillations and conservatism of the RL policy in a car-following simulation. To prove its practicality, we also apply our method to a real-world mobile robot navigation task, where our robot successfully avoids a moving obstacle with highly uncertain, even aggressive, behavior.
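
A minimal sketch of a proportional-integral multiplier update with a clamped (separated) integral term is given below; the gains, the clamp bound, and the exact separation rule used by SPIL are assumptions for illustration, not the authors' implementation.

def pi_lagrangian_multiplier(violation, integral, k_p=1.0, k_i=0.1, integral_max=1.0):
    """One PI update of the penalty weight for a chance constraint.

    `violation` is the gap between the required and the estimated safe
    probabilities; clamping the accumulated integral term is one simple way
    to keep it in a reasonable range and avoid wind-up.
    """
    integral = min(max(integral + violation, 0.0), integral_max)
    lam = max(0.0, k_p * violation + k_i * integral)
    return lam, integral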

7.
IEEE Trans Neural Netw Learn Syst ; 33(11): 6584-6598, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34101599

ABSTRACT

In reinforcement learning (RL), function approximation errors are known to lead easily to Q-value overestimation, greatly reducing policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, an off-policy RL method for continuous control, which improves policy performance by mitigating Q-value overestimation. We first show theoretically that learning a distribution over state-action returns can effectively mitigate Q-value overestimation, because it adaptively adjusts the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution while keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradients. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
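
As a hedged illustration of learning a continuous return distribution with bounded spread (not the exact DSAC update; the Gaussian parameterization, clamp bound, and target construction are assumptions), a distributional critic loss might be written as:

import torch

def distributional_critic_loss(mean, log_std, td_target, bound=10.0):
    """Negative log-likelihood of a Gaussian return distribution.

    Clamping log_std and the TD target keeps the learned variance and the
    return samples within a reasonable range, which is the rough idea behind
    avoiding exploding or vanishing gradients described above.
    """
    log_std = log_std.clamp(-bound, bound)
    dist = torch.distributions.Normal(mean, log_std.exp())
    target = td_target.detach().clamp(mean.detach() - bound, mean.detach() + bound)
    return -dist.log_prob(target).mean()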

8.
Sci Rep ; 11(1): 3996, 2021 02 17.
Article in English | MEDLINE | ID: mdl-33597565

ABSTRACT

Human reaction plays a key role in improving protection in emergent traffic situations involving motor vehicles. Understanding the underlying behavioural mechanisms can help automated vehicles combine active sensing systems for feature capture with passive devices for injury mitigation. The study aims to identify the distance-based safety boundary ("safety envelope") of vehicle-pedestrian conflicts via pedestrian active avoidance behaviour recorded in well-controlled, immersive virtual-reality-based emergent traffic scenarios. Via physiological signal measurement and kinematic reconstruction of the complete sequence, we identified the general perception-decision-action mechanism under a given external stimulus and the resulting level of natural harm-avoidance action. Using vision as the main information source, 70% of pedestrians managed to avoid the collision by adapting their walking speed and direction, consuming less overall "decision" time (0.17-0.24 s vs. 0.41 s) than in the collision cases; after that, pedestrians needed sufficient "execution" time (1.52-1.84 s) to take avoidance action. Safety envelopes were generated by combining the simultaneous interactions between the pedestrian and the vehicle. The present investigation of emergent reaction dynamics clears the way for realistic modelling of biomechanical behaviour and preliminarily demonstrates the feasibility of incorporating in vivo pedestrian behaviour into engineering design, which can facilitate improved, interactive on-board devices towards globally optimal safety.


Subjects
Accidents, Traffic/psychology ; Avoidance Learning ; Pedestrians/psychology ; Adult ; Automobile Driving ; Decision Making ; Environment Design ; Humans ; Male ; Models, Theoretical ; Motor Vehicles ; Reaction Time ; Risk Factors ; Safety ; Virtual Reality ; Walking ; Young Adult
9.
Accid Anal Prev ; 108: 74-82, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28858775

ABSTRACT

Bicycling is one of the fundamental modes of transportation, especially in developing countries. Because of the lack of effective protection for bicyclists, vehicle-bicycle (V-B) accidents have become a primary contributor to traffic fatalities. Although autonomous emergency braking (AEB) systems have been developed to avoid or mitigate collisions, they need to be further adapted to various conflict situations. This paper analyzes drivers' braking behavior in typical V-B conflicts in China to improve the performance of Bicyclist-AEB systems. Naturalistic driving data were collected, from which the top three V-B accident scenarios in China were extracted: SCR (a bicycle crossing the road from the right while the car drives straight), SCL (a bicycle crossing the road from the left while the car drives straight), and SSR (a bicycle swerving in front of the car from the right while the car drives straight). For safety and data reliability, a driving simulator was employed to reconstruct these three scenarios, and 25 licensed drivers were recruited for braking behavior analysis. Results revealed that drivers' braking behavior was significantly influenced by the V-B conflict type. Pre-decelerating behavior was found in the SCL and SSR conflicts, whereas in SCR the subjects were less vigilant. Brake reaction times were shorter and braking severity was higher in the lateral V-B conflicts (SCR and SCL) than in the longitudinal conflict (SSR). The findings can inform Bicyclist-AEB design and test protocol development to enhance the performance of Bicyclist-AEB systems in mixed traffic, especially in developing countries.


Subjects
Accidents, Traffic/prevention & control ; Automobile Driving/psychology ; Bicycling ; Adult ; China ; Deceleration ; Emergencies ; Female ; Humans ; Male ; Middle Aged ; Reaction Time ; Reproducibility of Results ; Young Adult
10.
Sensors (Basel) ; 17(3)2017 Mar 02.
Article in English | MEDLINE | ID: mdl-28257094

ABSTRACT

This paper presents an online drowsiness detection system for monitoring driver fatigue level under real driving conditions, based on steering wheel angle (SWA) data collected from sensors mounted on the steering lever. The proposed system first extracts approximate entropy (ApEn) features from fixed sliding windows on the real-time steering wheel angle time series. After that, the system linearizes the ApEn feature series through adaptive piecewise linear fitting with a given deviation. Then, the detection system calculates the warping distance between the linearized feature series of the sample data. Finally, the system uses the warping distance to determine the drowsiness state of the driver with a designed binary decision classifier. The experimental data were collected from 14.68 h of driving under real road conditions, covering two fatigue levels: "awake" and "drowsy". The results show that the proposed system is capable of working online with an average accuracy of 78.01%, 29.35% false detections of the "awake" state, and 15.15% false detections of the "drowsy" state. The results also confirm that the proposed method based on the SWA signal is valuable for applications aimed at preventing traffic accidents caused by driver fatigue.
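
For illustration, a hedged implementation of the approximate entropy (ApEn) feature over one sliding window is sketched below; the embedding dimension m, tolerance r, and window settings used in the paper are not reproduced, so the defaults here are assumptions.

import numpy as np

def approximate_entropy(window, m=2, r=None):
    """ApEn of a 1-D signal window (e.g. steering wheel angles).

    ApEn(m, r) = phi(m) - phi(m + 1), where phi(m) is the average log
    frequency of length-m templates staying within Chebyshev distance r.
    """
    x = np.asarray(window, dtype=float)
    n = len(x)
    if r is None:
        r = 0.2 * np.std(x)   # a common default tolerance, not the paper's

    def phi(m):
        templates = np.array([x[i:i + m] for i in range(n - m + 1)])
        dists = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        counts = np.mean(dists <= r, axis=1)   # self-matches included, per ApEn
        return np.mean(np.log(counts))

    return phi(m) - phi(m + 1)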
