Here’s what the pseudocode for the algorithm looks like:

Initialize policy parameters
for k = 1 to K do:
    collect N trajectories by rolling out the stochastic policy
    compute for each pair along the trajectorie