Vanilla Policy Gradient (with GAE-Lambda for advantage estimation)
Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
actor_critic – The constructor method for a PyTorch Module with a step method, an act method, a pi mo...
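The GAE-Lambda advantage estimate named in the title above can be illustrated with a short sketch. This is not the Spinning Up implementation: the function name gae_advantages and its parameters (rewards, values, last_value, gamma, lam) are hypothetical, chosen only for illustration, and the code assumes a single trajectory stored as plain Python lists.

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Sketch of GAE-Lambda for one trajectory (hypothetical helper).

    rewards[t] is the reward at step t, values[t] is the critic's value
    estimate for the state at step t, and last_value is the value used to
    bootstrap after the final step (0.0 if the episode terminated).
    """
    values_ext = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each advantage reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        # Discounted sum of residuals with decay factor gamma * lam
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With lam=1 the estimate reduces to the discounted return minus the value baseline, and with lam=0 it reduces to the one-step TD residual, so lam trades variance against bias.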
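The core idea of policy gradient (PG) methods is to keep raising the sampling probability of actions that yield higher returns and to keep lowering the sampling probability of actions that yield lower returns, so that the policy moves toward an optimal one.
2 Quick Facts
The standard policy gradient algorithm (Vanilla Policy Gradient, VPG) is an on-policy algorithm.
VPG can be used in both discrete and continuous action spaces.
In Spinning Up, the...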
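The core idea stated above (raise the probability of actions with higher return, lower it for actions with lower return) can be shown on a toy problem. The sketch below is an assumption-laden illustration, not VPG as implemented in Spinning Up: it uses a softmax policy over the arms of a multi-armed bandit, a running-average baseline in place of a learned critic, and hypothetical names (softmax, train_bandit, mean_rewards).

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Policy gradient on a toy bandit (hypothetical illustration).

    mean_rewards[k] is the expected reward of arm k; observed rewards
    add Gaussian noise. Returns the final action probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(mean_rewards)
    baseline = 0.0  # running average of observed rewards
    for step in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        baseline += (reward - baseline) / step
        # For a softmax policy, d log pi(action) / d logit_k
        # equals 1[k == action] - probs[k].
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
    return softmax(logits)
```

Calling train_bandit([0.0, 1.0]) should leave most of the probability mass on the second arm, since sampled actions with above-baseline reward have their log-probability pushed up and the others pushed down.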