We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the ...
Abstract Emphatic temporal-difference (ETD) learning (Sutton et al. 2016) is a pioneering off-policy reinforcement learning method involving the use of thefollowon trace. The recently proposed gradient emphasis learning (GEM, Zhang et al. 2020) algorithm is used to fix the problems of unbounded ...
有人使用了层次学习的方法来训练low-level policy解决subgoal,high-level controller来产生这些subgoal。比如:(paper) Data- efficient hierarchical reinforcement learning 和 (paper) Hierarchical actor-critic 这两个文章(联想:这个东西好像可以和modular pipeline结合起来?) #Idea inverse RL 一个热门工作是GAIL和Info...
We utilize recent advances in reinforcement learning (RL) and variational Bayes (VB), such as off-policy critic learning and unbiased-and-low-variance gradient estimation, and review them in the context of SCGs. The generalized backpropagation extends transported learning signals beyond gradients ...