We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the ...
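To make the emphatic weighting concrete, the following is a minimal illustrative sketch (not the authors' implementation) of an emphatically weighted off-policy policy-gradient sample in the style of Imani et al. (2018): a followon trace F_t is accumulated along the behavior-policy trajectory, and the resulting emphasis M_t scales the usual importance-weighted gradient term. The names rho, interest, followon_prev, and q_value are introduced purely for illustration, and the additional correction terms that Geoff-PAC derives for the counterfactual objective are not shown.

```python
import numpy as np

def emphatic_pg_sample(log_grad, q_value, rho, interest, followon_prev,
                       gamma=0.99, lam=0.0):
    """One emphatically weighted policy-gradient sample (illustrative sketch).

    log_grad      -- gradient of log pi(a_t | s_t) w.r.t. the policy parameters (np.ndarray)
    q_value       -- critic estimate q_hat(s_t, a_t)
    rho           -- importance ratio pi(a_t | s_t) / mu(a_t | s_t)
    interest      -- interest i(s_t), often set to 1 for every state
    followon_prev -- rho_{t-1} * F_{t-1} carried over from the previous step
    """
    # Followon trace: F_t = gamma * rho_{t-1} * F_{t-1} + i(s_t)
    followon = gamma * followon_prev + interest
    # Emphasis: M_t = lam * i(s_t) + (1 - lam) * F_t
    emphasis = lam * interest + (1.0 - lam) * followon
    # Gradient sample: M_t * rho_t * q_hat(s_t, a_t) * grad log pi(a_t | s_t)
    grad_sample = emphasis * rho * q_value * np.asarray(log_grad)
    # Return the sample and rho_t * F_t, to be passed in as followon_prev next step
    return grad_sample, rho * followon
```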
provide generalized lower bounds for the optimal Q-functions. Practical lower bounds should possess several desiderata: (P.1) they can be estimated using off-policy partial trajectories; (P.2) they can bootstrap from learned Q-functions. Generalized SIL with stochastic actor-critic: L_{\text{va...
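The generalized lower bound and the loss whose definition is cut off above are not reproduced here. Purely as an illustration of how (P.1) and (P.2) can be satisfied, the sketch below forms an uncorrected n-step return from an off-policy partial trajectory, bootstraps it with a learned Q-function, and plugs it into a SIL-style clipped value loss; the function names and the bootstrap term are assumptions made for the example.

```python
import numpy as np

def n_step_lower_bound(rewards, q_bootstrap, gamma=0.99):
    """Illustrative n-step target: discounted partial-trajectory rewards (P.1)
    plus a bootstrap from a learned Q-function estimate (P.2)."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * q_bootstrap)

def sil_style_value_loss(target, q_estimate):
    """SIL-style clipped regression: only penalize the critic when the
    lower-bound target exceeds its current estimate."""
    return 0.5 * max(target - q_estimate, 0.0) ** 2

# Example: 3-step off-policy segment, bootstrapped with a learned Q estimate
target = n_step_lower_bound(rewards=[1.0, 0.0, 0.5], q_bootstrap=2.0)
loss = sil_style_value_loss(target, q_estimate=1.2)
```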
Imani E, Graves E, White M (2018) An off-policy policy gradient theorem using emphatic weightings. In: Advances in Neural Information Processing Systems, pp 96–106.
Zhang S, Liu B, Yao H, Whiteson S (2020) Provably convergent two-timescale off-policy actor-critic with function approximation ...
Notably, the vast majority of trainable parameters are shared between the policy πθ and the value-function estimate Vϕ; only their final output layers differ. This approach is relatively common in actor–critic models (Mnih et al., 2016; Schulman et al., 2017), and constitutes the basis of all ...
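A minimal PyTorch-style sketch of this weight sharing is given below; the torso depth, widths, and the discrete-action policy head are assumptions for the example and are not taken from any of the cited architectures. Only the two final linear layers hold parameters specific to πθ or Vϕ, while every other parameter is updated by both the actor and the critic losses.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Illustrative actor-critic with a shared torso and separate output heads."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Shared trunk: these parameters serve both the policy and the value estimate
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Separate final output steps
        self.policy_head = nn.Linear(hidden, n_actions)  # logits of pi_theta(a|s)
        self.value_head = nn.Linear(hidden, 1)            # V_phi(s)

    def forward(self, obs):
        h = self.torso(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        value = self.value_head(h).squeeze(-1)
        return dist, value
```

With this layout, a combined objective such as -log πθ(a|s)·Â + c·(R − Vϕ(s))² backpropagates through the shared torso from both heads, so the actor and critic losses jointly shape the common representation.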