This results in the activation distributions changing much less, so the weights of the 3rd layer are easier to adjust toward their optimal values, which leads to faster training.

Great blog post by Rohan Varma

Label smoothing

Model averaging

Regularizers are thought to ...