The mainstream approach to mitigating this problem used to be learning-rate warm-up: during the first few epochs of training, the learning rate starts from a relatively small value and is increased linearly up to the final learning rate (i.e., the learning rate after scaling up by a factor of k). Based on their own experimental observations, the authors propose a replacement for warm-up, Layer-wise Adaptive Rate Scaling (LARS), which tackles the problem of an overly large learning rate from a new direction.
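As a minimal sketch of the two ideas, the snippet below contrasts a linear warm-up schedule with layer-wise rate scaling, where each layer gets a "local" learning rate proportional to the ratio of its weight norm to its gradient norm. The function names, the trust coefficient `eta`, the weight-decay term, and the warm-up length are illustrative choices for this example, not values taken from the paper.

```python
import numpy as np

def linear_warmup_lr(step, warmup_steps, base_lr, peak_lr):
    """Conventional warm-up: ramp the learning rate linearly from
    base_lr to the k-times-larger peak_lr over the first warmup_steps."""
    if step >= warmup_steps:
        return peak_lr
    return base_lr + (peak_lr - base_lr) * step / warmup_steps

def lars_local_lr(weights, grad, eta=0.001, weight_decay=5e-4, eps=1e-9):
    """Layer-wise adaptive scaling: the local learning rate of a layer is
    proportional to ||w|| / (||grad|| + beta * ||w||), so no layer takes
    a step that is large relative to the size of its own weights."""
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grad)
    return eta * w_norm / (g_norm + weight_decay * w_norm + eps)

# Example: a layer whose gradient norm is large relative to its weight
# norm automatically receives a smaller local learning rate.
w = np.random.randn(256, 128) * 0.01
g = np.random.randn(256, 128)
print(linear_warmup_lr(step=100, warmup_steps=500, base_lr=0.1, peak_lr=3.2))
print(lars_local_lr(w, g))
```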
However, the conventional method prunes redundant parameters at the same rate for all layers. This can create a bottleneck and degrade performance, because the minimum number of parameters a layer needs differs from layer to layer. We propose a layer adaptive ...
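To make the contrast concrete, below is an illustrative sketch of magnitude pruning with per-layer rates instead of one global rate. The layer names, shapes, and per-layer rates are made up for this example; they are not the rates or the specific method proposed in the cited work.

```python
import numpy as np

def magnitude_prune(weights, rate):
    """Zero out the smallest-magnitude fraction `rate` of the weights."""
    flat = np.abs(weights).ravel()
    k = int(rate * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

# Uniform pruning applies one rate everywhere, even to layers that are
# already small; layer-adaptive pruning lets each layer keep the
# parameters it still needs (rates here are purely illustrative).
adaptive_rates = {"conv1": 0.3, "conv2": 0.6, "fc": 0.9}

layers = {
    "conv1": np.random.randn(64, 3, 3, 3),
    "conv2": np.random.randn(128, 64, 3, 3),
    "fc": np.random.randn(1000, 2048),
}

for name, w in layers.items():
    pruned = magnitude_prune(w, adaptive_rates[name])
    kept = np.count_nonzero(pruned) / pruned.size
    print(f"{name}: kept {kept:.0%} of parameters")
```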