One possible reason is the use of a constant θ in AMSGrad and the original Adam. Figures 2 and 3 show that the convergence of Generic Adam is extremely sensitive to the choice of the parameter θ_t. A larger r can contribute to a faster convergence rate...
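
To make the role of θ_t concrete, the following is a minimal sketch of an Adam-style update in which the second-moment decay varies with the iteration t. It is an illustration under stated assumptions, not the experimental setup behind the figures: the schedule `theta_schedule`, its constant `c`, the step size α/√t, the `eps` safeguard, and the function names are all hypothetical. The schedule is chosen so that r = 0 recovers a constant θ, while a larger r drives θ_t toward 1 more quickly.

```python
import math


def theta_schedule(t, r, c=0.1):
    """Hypothetical decay schedule: theta_t = 1 - c / t**r.

    r = 0 gives the constant theta = 1 - c; a larger r pushes theta_t
    toward 1 faster. This form is an assumption made for illustration.
    """
    return 1.0 - c / (t ** r)


def adam_like_minimize(grad, x0, steps, r, alpha=0.1, beta=0.9, eps=1e-8):
    """Run an Adam-style update on a scalar problem with time-varying theta_t.

    grad  : callable returning the gradient at x
    r     : exponent controlling how fast theta_t approaches 1 (assumed form)
    alpha : base step size, decayed as alpha / sqrt(t) (assumed)
    eps   : small constant guarding against division by zero (assumed)
    """
    x, m, v = float(x0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        theta_t = theta_schedule(t, r)
        m = beta * m + (1.0 - beta) * g             # first-moment estimate
        v = theta_t * v + (1.0 - theta_t) * g * g   # second-moment estimate
        x -= (alpha / math.sqrt(t)) * m / (math.sqrt(v) + eps)
    return x


if __name__ == "__main__":
    # Toy objective f(x) = x**2 with gradient 2x; compare a few values of r.
    for r in (0.0, 0.5, 1.0):
        x_final = adam_like_minimize(lambda x: 2.0 * x, x0=5.0, steps=2000, r=r)
        print(f"r = {r}: final x = {x_final:.6f}")
```

Running the script prints the final iterate for each r on a toy quadratic, which is only meant to show where θ_t enters the update; it says nothing about the behavior reported in Figures 2 and 3.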