In this paper, the authors' concrete approach to speeding up training is: (1) use a shifted NTK (neural tangent kernel); (2) prove that in each training iteration, for each input data point, only a very small fraction of the neurons are actually activated; (3) find the activated neurons via geometric search; (4) prove that the new algorithm converges linearly, driving the training loss to 0. ...
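As a quick illustration of point (2), here is a minimal numpy sketch (not the authors' code) showing that with a shifted ReLU max(z - b, 0) and Gaussian initialization, only a small fraction of neurons fire on any given input; the width m, input dimension d, and shift b below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10_000, 128              # illustrative width and input dimension
W = rng.normal(size=(m, d))     # standard Gaussian initialization
x = rng.normal(size=d)
x /= np.linalg.norm(x)          # unit-norm input, as in NTK-style analyses

b = 2.0                         # illustrative shift; larger b -> sparser activations
pre = W @ x                     # each pre-activation is ~ N(0, 1) for unit-norm x
active = pre > b                # the shifted ReLU max(z - b, 0) is nonzero here
print(f"active neurons: {active.sum()} / {m} ({active.mean():.2%})")
# Expected fraction is Phi(-b), roughly 2.3% for b = 2, so a geometric search
# structure only needs to report this small active set at each iteration.
```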
We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) activation using the gradient descent (GD) method. Inspired by a recent line of work, we study the evolution of the network's prediction errors across GD iterations, which can be neatly described in matrix form. ...
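For context, one common way this matrix form is written in the NTK literature (notation mine, and holding only up to a higher-order perturbation term):

```latex
% e_t = u(t) - y stacks the prediction errors over the n training points.
\[
  e_{t+1} \approx \bigl(I - \eta H(t)\bigr) e_t ,
  \qquad
  H_{ij}(t) = \frac{1}{m}\, x_i^\top x_j
  \sum_{r=1}^{m}
    \mathbf{1}\{w_r(t)^\top x_i \ge 0\}\,
    \mathbf{1}\{w_r(t)^\top x_j \ge 0\} ,
\]
% so as long as \lambda_{\min}(H(t)) stays bounded below by some \lambda_0/2,
% \|e_t\| shrinks geometrically at rate (1 - \eta\lambda_0/2) per iteration.
```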
《Overparameterized Neural Networks Can Implement Associative Memory》A Radhakrishnan, M Belkin, C Uhler [MIT & The Ohio State University] (2019) http://t.cn/Aim2zTcV view: http://t.cn/Aim2zTc5
One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU...
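The key object in this line of work (as I recall it; the paper's notation may differ) is the limiting Gram matrix of the ReLU features, which is positive definite whenever no two inputs are parallel:

```latex
\[
  H^{\infty}_{ij}
  = \mathbb{E}_{w \sim \mathcal{N}(0, I)}
    \Bigl[ x_i^\top x_j\,
      \mathbf{1}\{w^\top x_i \ge 0\}\,
      \mathbf{1}\{w^\top x_j \ge 0\} \Bigr] ,
  \qquad
  \lambda_0 := \lambda_{\min}\bigl(H^{\infty}\bigr) > 0 .
\]
% With enough over-parameterization, the empirical Gram matrix stays close to
% H^\infty throughout training, which yields
% \|u(t) - y\|_2^2 \le (1 - \eta\lambda_0/2)^t \, \|u(0) - y\|_2^2 .
```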
We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss for binary classification using gradient descent. We show that, with proper random weight initialization, gradient descent can find the global minima...
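A minimal PyTorch sketch of the setup this abstract describes (illustrative only, not the paper's experiment or initialization scheme): a deep fully connected ReLU network trained with full-batch gradient descent on a binary cross-entropy loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, width, depth = 200, 32, 512, 4           # illustrative sizes
X = torch.randn(n, d)
y = (X[:, 0] > 0).float().unsqueeze(1)         # toy binary labels

# Deep fully connected ReLU network: `depth` linear layers in total.
layers = [nn.Linear(d, width), nn.ReLU()]
for _ in range(depth - 2):
    layers += [nn.Linear(width, width), nn.ReLU()]
layers += [nn.Linear(width, 1)]
net = nn.Sequential(*layers)

loss_fn = nn.BCEWithLogitsLoss()               # cross-entropy on logits
opt = torch.optim.SGD(net.parameters(), lr=0.1)  # plain full-batch GD

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
```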
Convolutional layers are the core building blocks of Convolutional Neural Networks (CNNs). In this paper, we propose to augment a convolutional layer with an additional depthwise convolution, where each input channel is convolved with a different 2D kernel. The composition of the two convolutions ...
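One way to realize such a composition in PyTorch (a sketch under my own ordering and layer sizes; the paper's exact composition may differ): a depthwise convolution, groups=in_channels, followed by a standard convolution.

```python
import torch
import torch.nn as nn

class DepthwiseAugmentedConv(nn.Module):
    """A standard conv layer augmented with an extra depthwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_ch convolves each input channel with its own 2D kernel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.standard = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        return self.standard(self.depthwise(x))

x = torch.randn(1, 16, 32, 32)
layer = DepthwiseAugmentedConv(16, 32)
print(layer(x).shape)   # torch.Size([1, 32, 32, 32])
```

Since there is no nonlinearity between the two convolutions, the pair is a single linear map, so in principle it can be collapsed into one (larger-kernel) convolution at inference time; that is the usual appeal of this kind of augmentation.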
… This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-...
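In standard mean-field notation (mine, not necessarily the paper's), the many-particle idealization reads:

```latex
% A width-m hidden layer is viewed as m particles \theta_i undergoing gradient
% flow on an objective F of their empirical measure:
\[
  \dot{\theta}_i(t) = -\,m\, \nabla_{\theta_i}
    F\!\Bigl(\tfrac{1}{m}\textstyle\sum_{j=1}^{m}\delta_{\theta_j(t)}\Bigr) ,
\]
% and as m -> infinity the empirical measure follows the Wasserstein gradient
% flow of F on the space of probability measures:
\[
  \partial_t \mu_t = \operatorname{div}\!\bigl(\mu_t\, \nabla F'(\mu_t)\bigr) .
\]
```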
… policy-based objectives are unstable. We formally prove upper bounds on the regret of overparameterized value-based learning and lower bounds on the regret for policy-based algorithms. In our experiments with large neural networks, this gap between action-stable value-based objectives and unstable po...
Results (each cell: Error (%) / Params):
  Baseline      37.59 / 258K    28.03 / 2.99M   32.95 / 11.7M
  HS-MHE [49]   34.97 / 258K    25.96 / 2.99M   32.50 / 11.7M
  OPT (GS)      33.02 / 1.36M   OOM / 16.2M     OOM / 46.5M
  S-OPT (GS)    33.70 / 90.9K   25.59 / 1.04M   32.26 / 3.39M

Table 11: Sampling dim. p | Error (%) | Params
  p = d      OOM     16.2M
  p = d/4    25.59   1.04M
  p = d/8    28.61   278K
  ...
For two-layer ReLU neural networks, we prove that these two conditions do in fact hold throughout training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate ...
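For reference, the exact natural gradient update that K-FAC approximates (standard definition; the "two conditions" refer to text elided before this snippet):

```latex
\[
  \theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta L(\theta_t) ,
  \qquad
  F(\theta) = \mathbb{E}\bigl[ \nabla_\theta \log p_\theta(y \mid x)\,
                               \nabla_\theta \log p_\theta(y \mid x)^{\top} \bigr] .
\]
% K-FAC replaces the Fisher matrix F with a layer-wise Kronecker-factored
% approximation so that its inverse can be applied cheaply.
```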