...by creating action gaps similar in size - Microsoft Research
when we try to learn this policy by optimizing for the discounted sum of rewards using a linear representation based on tile coding, the learning behavior shows a very strong dependence on the discount factor (see Figure 1). In particular,...