Off-Policy Prediction Learning: An Empirical Study of Online Algorithms.
Article in En | MEDLINE | ID: mdl-38857133
ABSTRACT
Off-policy prediction, learning the value function for one policy from data generated while following another policy, is one of the most challenging problems in reinforcement learning. This article makes two main contributions: 1) it empirically studies 11 off-policy prediction learning algorithms with emphasis on their sensitivity to parameters, learning speed, and asymptotic error and 2) based on the empirical results, it proposes two step-size adaptation methods, collectively called the Ratchet algorithms, that help the algorithm with the lowest error from the experimental study learn faster. Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. In this article, we empirically compare 11 off-policy prediction learning algorithms with linear function approximation on three small tasks: the Collision task, the Rooms task, and the High Variance Rooms task. The Collision task is a small off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. The Rooms and High Variance Rooms tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as 2^14. To control the high variance caused by the product of the importance sampling ratios, the step size should be set small, which, in turn, slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as 2^14 × 25. The algorithms considered are Off-policy TD, five Gradient-TD algorithms, two Emphatic-TD algorithms, Vtrace, and variants of Tree Backup and ABQ that are applicable to the prediction setting. We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. Tree Backup, Vtrace, and ABTD are not affected by the high variance as much as other algorithms, but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD tends to have lower asymptotic error than other algorithms but might learn more slowly in some cases. Based on the empirical results, we propose two step-size adaptation algorithms, which we collectively refer to as the Ratchet algorithms, with the same underlying idea: keep the step-size parameter as large as possible and ratchet it down only when necessary to avoid overshoot. We show that the Ratchet algorithms are effective by comparing them with other popular step-size adaptation algorithms, such as the Adam optimizer.
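To make the role of the importance sampling ratios concrete, the sketch below shows standard Off-policy TD(lambda) prediction with linear function approximation, the simplest of the algorithms the abstract lists. It is a minimal illustration, not the article's implementation: the transition iterator, feature function phi, ratio function rho_fn, and the parameter values are assumed interfaces chosen for the example.

```python
import numpy as np

def off_policy_td_lambda(transitions, phi, rho_fn, n_features,
                         alpha=0.01, lam=0.9, gamma=0.99):
    """Off-policy TD(lambda) prediction with linear function approximation.

    transitions : iterable of (s, a, r, s_next) tuples generated by the
                  behaviour policy (assumed interface, not from the article).
    phi         : feature function, phi(s) -> np.ndarray of length n_features.
    rho_fn      : rho_fn(s, a) = pi(a|s) / b(a|s), the per-step importance
                  sampling ratio of the target policy pi over the behaviour b.
    """
    w = np.zeros(n_features)   # weights of the linear value estimate v(s) = w . phi(s)
    z = np.zeros(n_features)   # accumulating eligibility trace

    for s, a, r, s_next in transitions:
        rho = rho_fn(s, a)
        # The trace is rescaled by rho on every step, so after k steps it can
        # carry a product of k ratios -- the source of the high variance the
        # article studies (up to 2^14 in the Rooms task).
        z = rho * (gamma * lam * z + phi(s))
        delta = r + gamma * (w @ phi(s_next)) - w @ phi(s)   # TD error
        w = w + alpha * delta * z
    return w
```

Because the trace z is multiplied by rho on every step, a long run of ratios greater than one inflates the update; keeping that inflation stable forces a small step size alpha, which slows learning. This is the trade-off that motivates the Ratchet idea of keeping the step size as large as possible and lowering it only when an overshoot would otherwise occur.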

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Year of publication: 2024 Document type: Article
