where γ is the discount rate, with 0 ≤ γ ≤ 1. The goal of RL is to find a policy π that maximizes the expected total discounted return from each state. The policy is specified by the conditional probability of selecting action a in state s, denoted π(a∣s). In this work...
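
To make the roles of γ and π(a∣s) concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes the standard form of the return, in which the k-th future reward is weighted by γ^k, and a tabular policy stored as a nested dictionary. The names `discounted_return`, `sample_action`, and the two-state `policy` table are hypothetical and chosen only for illustration.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over k, assuming the standard discounted form."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Accumulate backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def sample_action(policy, state, rng=random):
    """Draw an action a with probability policy[state][a], i.e. pi(a|s)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical tabular policy: each row is a distribution over actions.
policy = {
    "s0": {"left": 0.25, "right": 0.75},
    "s1": {"left": 1.0, "right": 0.0},
}
```

With γ = 0 only the immediate reward counts, while γ = 1 weights all rewards equally, so `discounted_return([2, 5, 7], 0.0)` gives 2.0 and `discounted_return([2, 5, 7], 1.0)` gives 14.0. Calling `sample_action(policy, "s0")` returns "right" with probability 0.75 under the assumed table.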