We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows...
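As a rough illustration of the update under discussion, the following sketch implements normalized SGD with a momentum buffer; the function name, the averaging constant beta, and the step size lr are illustrative choices, not the paper's exact notation or formula.

```python
import numpy as np

def normalized_sgd_momentum(x0, grad_fn, lr=0.01, beta=0.9, steps=1000):
    """Normalized SGD with momentum: each step moves along the momentum
    direction rescaled to unit norm, so the step length equals lr
    regardless of the gradient's magnitude."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)                                 # stochastic gradient estimate
        m = beta * m + (1.0 - beta) * g                # exponential moving average
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)   # normalized step
    return x
```

For example, `normalized_sgd_momentum(np.ones(5), lambda x: 2 * x)` minimizes a simple quadratic, settling in a small neighborhood of the origin.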
In each iteration, the current gradient $\nabla_x J(x_t^*, y)$ is normalized by its own L1 distance (any distance measure is feasible), because we notice that the scale of the gradients varies in magnitude across iterations (this step is sketched below).

3.2. Attacking ensemble of models

In this section,...
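A minimal sketch of the normalization step described above, assuming the standard momentum-iterative formulation (the decay factor mu, the step size alpha, and the signed update are assumptions; clipping to the allowed perturbation ball is omitted):

```python
import torch

def momentum_attack_step(x_adv, grad, g, mu=1.0, alpha=2.0 / 255):
    """One iteration sketch: divide the current gradient by its L1 norm so
    its scale is comparable across iterations, accumulate it into the
    momentum buffer g, then take a signed step of size alpha."""
    g = mu * g + grad / (grad.abs().sum() + 1e-12)  # L1-normalize and accumulate
    x_adv = x_adv + alpha * g.sign()                # signed perturbation step
    return x_adv, g
```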
†: this entry is with BN frozen, which improves results; see main text.

4.2.3 More Downstream Tasks

Table 6 shows more downstream tasks (implementation details in appendix). Overall, MoCo performs competitively with ImageNet supervised pre-training:

COCO keypoint detection: supervised pre-...
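The "BN frozen" setting mentioned in the footnote above can be reproduced, in spirit, by the sketch below, assuming a PyTorch model; this is not the paper's code.

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    """Put every BatchNorm layer in eval mode so its running statistics stop
    updating, and freeze its affine parameters. Re-apply after any call to
    model.train(), which would otherwise switch the layers back."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()
            for p in m.parameters():
                p.requires_grad_(False)
```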
This output vector is normalized by its L2-norm [61]. This is the representation of the query or key. The temperature τ in Eqn.(1) is set to 0.07 [61]. The data augmentation setting follows [61]: a 224×224-pixel crop is taken from a randomly resiz...
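Eqn.(1) itself is not reproduced in this excerpt; assuming the usual InfoNCE form with one positive key and a queue of negatives, the L2 normalization and temperature described above enter as in the following sketch (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, tau=0.07):
    """L2-normalize query/key representations, compute dot-product
    similarities against one positive and K negatives, divide by the
    temperature tau, and apply softmax cross-entropy with the positive
    at index 0."""
    q = F.normalize(q, dim=1)                        # L2-normalize the query
    k_pos = F.normalize(k_pos, dim=1)                # positive key
    queue = F.normalize(queue, dim=1)                # K negative keys
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)     # N x 1 positive logits
    l_neg = q @ queue.t()                            # N x K negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau  # temperature scaling
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```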
Stochastic gradient descent (SGD) is one of the most heavily used optimizers in deep learning. It is a non-adaptive learning-rate method: the learning rate needs to be manually ...
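A minimal illustration of this point, assuming a PyTorch setup: the learning rate, and any decay schedule, are specified by hand (the numbers below are illustrative, not recommendations).

```python
import torch

model = torch.nn.Linear(10, 1)
# The learning rate is a fixed, manually chosen hyperparameter...
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# ...and any decay over training is also specified by hand.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```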