The initial learning rate is 0.002 and is decreased by a factor of 5 whenever the validation error stops decreasing for 2 epochs. All variants use the same schedule, with the total of 30 epochs chosen based on the validation set. We apply gradient clipping at a magnitude of 10, and find it with in place ...
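
The schedule and clipping described here can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the training code used in the experiments: the text does not say whether clipping is applied to the gradient norm or to individual values, so the sketch assumes global L2-norm clipping, and it assumes the learning rate is reduced on the second consecutive epoch without improvement. The names `clip_by_global_norm` and `PlateauSchedule`, and the validation errors in the usage lines, are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale gradient values so their joint L2 norm is at most max_norm.

    Assumption: the text's "magnitude of 10" is read as a bound on the L2 norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


class PlateauSchedule:
    """Divide the learning rate by `factor` after `patience` consecutive
    epochs in which the validation error did not improve on its best value."""

    def __init__(self, lr=0.002, factor=5.0, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_error):
        """Record one epoch's validation error and return the learning rate
        to use for the next epoch."""
        if val_error < self.best:
            self.best = val_error
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr


# Made-up validation errors, for illustration only.
val_errors = [0.50, 0.40, 0.41, 0.42, 0.39]
schedule = PlateauSchedule()
lrs = [schedule.step(e) for e in val_errors]

# A gradient with L2 norm 50 is scaled down to norm 10.
clipped = clip_by_global_norm([30.0, 40.0])
```

In this toy run the error fails to improve in the third and fourth epochs, so the rate drops once, from 0.002 to 0.002 / 5. In a full run the loop would cover the 30 epochs stated above, calling `step` once per epoch after evaluating on the validation set.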