A unified framework to control estimation error in reinforcement learning.
Zhang, Yujia; Li, Lin; Wei, Wei; Lv, Yunpeng; Liang, Jiye.
Affiliation
  • Zhang Y; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China. Electronic address: 342564535@qq.com.
  • Li L; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China. Electronic address: lilynn1116@sxu.edu.cn.
  • Wei W; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China. Electronic address: weiwei@sxu.edu.cn.
  • Lv Y; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China. Electronic address: 1679021190@qq.com.
  • Liang J; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China. Electronic address: ljy@sxu.edu.cn.
Neural Netw; 178: 106483, 2024 Oct.
Article in English | MEDLINE | ID: mdl-38954893
ABSTRACT
In reinforcement learning, accurate estimation of the Q-value is crucial for acquiring an optimal policy. However, current successful Actor-Critic methods still suffer from underestimation bias, and a significant estimation bias persists during the critic initialization phase regardless of the method used. To address these challenges and reduce estimation errors, we propose CEILING, a simple framework that is compatible with any model-free Actor-Critic method. The core idea of CEILING is to evaluate the relative accuracy of different estimation methods during training by comparing them against the true Q-value, computed via Monte Carlo. CEILING has two implementations: the Direct Picking Operation and the Exponential Softmax Weighting Operation. The first selects the most accurate method at fixed intervals and applies it in subsequent interactions until the next selection. The second uses a nonlinear weighting function that dynamically assigns larger weights to more accurate methods. Theoretically, we show that our methods yield more accurate and stable Q-value estimates, and we analyze the upper bound of the estimation bias. Building on the two implementations, we propose specific algorithms and their variants, which achieve superior performance on several benchmark tasks.
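The abstract describes the two operations only at a high level, so the following is a minimal Python sketch of how they could work. The error criterion (mean absolute deviation from the Monte Carlo return), the temperature parameter beta, and all function names here are illustrative assumptions, not the authors' published implementation.

import numpy as np

def monte_carlo_return(rewards, gamma=0.99):
    # Discounted Monte Carlo return of a finished episode: the "true"
    # Q-value that CEILING compares the estimators against.
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return np.array(out[::-1])

def direct_picking(q_estimates, q_mc):
    # Direct Picking Operation (sketch): choose the estimator whose
    # predictions are closest to the Monte Carlo target; that estimator
    # is then used for interactions until the next selection step.
    errors = np.abs(np.asarray(q_estimates) - q_mc).mean(axis=-1)
    return int(np.argmin(errors))

def softmax_weighting(q_estimates, q_mc, beta=1.0):
    # Exponential Softmax Weighting Operation (sketch): weight each
    # estimator by exp(-beta * error), so more accurate methods receive
    # larger weights, and return the weighted Q-value estimate.
    q = np.asarray(q_estimates, dtype=float)
    errors = np.abs(q - q_mc).mean(axis=-1)
    logits = -beta * errors
    w = np.exp(logits - logits.max())  # max-shift for numerical stability
    w /= w.sum()
    return w, (w[:, None] * q).sum(axis=0)

# Example (hypothetical numbers): three estimators over a batch of 4 Q-values.
q_hat = [[1.0, 2.0, 0.5, 1.5],    # estimator 0
         [0.95, 2.05, 0.6, 1.5],  # estimator 1
         [1.5, 1.0, 0.0, 2.5]]    # estimator 2
target = np.array([0.9, 2.1, 0.6, 1.45])
best = direct_picking(q_hat, target)           # -> 1 (lowest mean error)
w, q_blend = softmax_weighting(q_hat, target)  # weights favor estimator 1

The sign convention (negative beta times error) ensures that smaller estimation errors receive larger weights, matching the abstract's description of the nonlinear weighting; subtracting the maximum logit before exponentiation is a standard numerical-stability trick for softmax.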
Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Reinforcement, Psychology / Algorithms Limits: Humans Language: En Journal: Neural Netw Journal subject: NEUROLOGY Publication year: 2024 Document type: Article Country of publication: United States
