In this paper, the authors speed up training as follows: (1) they use a shifted NTK (neural tangent kernel); (2) they prove that in every training iteration, for each input data point, only a small fraction of the neurons is actually activated; (3) they locate the activated neurons via a geometric search; and (4) they prove that the new algorithm drives the training loss to zero at a linear convergence rate. ...
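A minimal numerical check of the sparse-activation claim, assuming a two-layer setup with random Gaussian first-layer weights, unit-norm inputs, and a shifted ReLU threshold (the specific threshold below is an illustrative choice, not necessarily the paper's):

```python
import numpy as np

# Rough numerical check of the claim that, with a shifted ReLU threshold, only a tiny
# fraction of randomly initialized neurons fires for any given input.
rng = np.random.default_rng(0)
d, m, n = 64, 10_000, 100                        # input dim, width, number of inputs
W = rng.standard_normal((m, d))                  # random Gaussian first-layer weights
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs

b = np.sqrt(2 * np.log(m))                       # illustrative shift; the paper's exact choice may differ
pre = X @ W.T                                    # pre-activations, each entry ~ N(0, 1)
active_frac = (pre > b).mean(axis=1)             # fraction of neurons past the shifted threshold
print(f"average fraction of activated neurons per input: {active_frac.mean():.2e}")
```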
"Overparameterized Neural Networks Can Implement Associative Memory" A Radhakrishnan, M Belkin, C Uhler [MIT & The Ohio State University] (2019) http://t.cn/Aim2zTcV view: http://t.cn/Aim2zTc5
In this paper, we propose two novel preprocessing ideas to bypass this Ω(mnd) barrier: First, by preprocessing the initial weights of the neural networks, we can train the neural network in Õ(m^(1−Θ(1/d)) nd) cost per iteration. Second, by preprocessing the input data points, we can train ...
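For context on where the Ω(mnd) barrier comes from: a naive gradient-descent iteration on a width-m two-layer ReLU network over n points in d dimensions touches every (neuron, point, coordinate) triple. A small sketch under standard assumptions (fixed random ±1 output layer, squared loss), with sizes chosen purely for illustration:

```python
import numpy as np

# One naive gradient-descent iteration on a width-m two-layer ReLU network:
# the first-layer matmul alone performs m*n*d multiply-adds, which is the
# Omega(mnd) per-iteration barrier the preprocessing ideas aim to bypass.
rng = np.random.default_rng(0)
m, n, d = 1024, 256, 32
W = rng.standard_normal((m, d))                  # trainable first-layer weights
a = rng.choice([-1.0, 1.0], size=m)              # fixed random output layer (common in this setting)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

pre = X @ W.T                                    # m*n*d multiply-adds
act = np.maximum(pre, 0.0)                       # ReLU
pred = act @ a / np.sqrt(m)
grad_W = ((pre > 0) * np.outer(pred - y, a)).T @ X / np.sqrt(m)   # roughly another m*n*d
W -= 0.1 * grad_W                                # one gradient step
print("multiply-adds in the forward matmul alone:", m * n * d)
```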
Speaker: Cong Fang (Peking University). Time: Wednesday, April 6, 2022, 20:00 (Beijing time). Title: Convex Formulation of Overparameterized Deep Neural Networks. Speaker bio: Cong Fang is an assistant professor at Peking University. He was a postdoctoral researcher with the University of Pennsylvania in 2021 and ...
One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU...
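A toy illustration of the phenomenon, assuming a wide two-layer ReLU network with a fixed random output layer trained by plain gradient descent on squared loss; width, learning rate, and step count are illustrative choices rather than the paper's setting:

```python
import torch

# A sufficiently wide, randomly initialized two-layer ReLU network trained by plain
# gradient descent drives the squared training loss on random data toward zero.
torch.manual_seed(0)
n, d, m = 50, 10, 4096
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)              # unit-norm inputs
y = torch.randn(n)                               # arbitrary (even random) labels

W = torch.randn(m, d, requires_grad=True)        # trained first layer
a = torch.randint(0, 2, (m,)).float() * 2 - 1    # fixed random +/-1 output layer

lr = 0.3
for step in range(3001):
    pred = torch.relu(X @ W.t()) @ a / m ** 0.5
    loss = 0.5 * ((pred - y) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        W.grad.zero_()
    if step % 1000 == 0:
        print(step, loss.item())
```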
We study the problem of training deep fully connected neural networks with Rectified Linear Unit (ReLU) activation function and cross entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima...
Convolutional layers are the core building blocks of Convolutional Neural Networks (CNNs). In this paper, we propose to augment a convolutional layer with an additional depthwise convolution, where each input channel is convolved with a different 2D kernel. The composition of the two convolutions ...
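A minimal PyTorch sketch of the construction described above: a standard convolution followed by an extra depthwise convolution (groups equal to the number of channels, so each channel gets its own 2D kernel). Kernel sizes and the exact way the two convolutions are composed are illustrative assumptions; the paper may arrange them differently:

```python
import torch
import torch.nn as nn

# A standard convolution augmented with a depthwise convolution: setting
# groups=out_ch makes each output channel use its own separate 2D kernel.
class AugmentedConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.depthwise = nn.Conv2d(out_ch, out_ch, k, padding=k // 2, groups=out_ch)

    def forward(self, x):
        return self.depthwise(self.conv(x))

x = torch.randn(1, 16, 32, 32)
layer = AugmentedConv(16, 32)
print(layer(x).shape)   # torch.Size([1, 32, 32, 32])
```

Since no nonlinearity sits between the two convolutions here, their composition is still a single linear operator, so in principle the extra parameters can be folded back into one convolution once training is done.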
These techniques govern the optimization and generalization behaviors of ultra-wide neural networks. We provide a mathematical proof of VAE convergence under mild assumptions, thus advancing the theoretical understanding of VAE optimization dynamics. Furthermore, we establish a novel connection between the ...
This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-...
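A rough numerical sketch of the particle view this abstract alludes to, assuming (as an illustration, not the paper's exact setup) a two-layer ReLU network whose hidden units are treated as particles under mean-field 1/m scaling, with the gradient flow approximated by small gradient steps:

```python
import numpy as np

# Each hidden unit (a_j, w_j) is a "particle"; the network output averages over particles
# (mean-field 1/m scaling), and small-step gradient descent discretizes the gradient flow.
rng = np.random.default_rng(0)
n, d, m = 40, 5, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
y = np.sin(3 * X[:, 0])                          # arbitrary smooth target

w = rng.standard_normal((m, d))                  # particle positions (input weights)
a = rng.standard_normal(m)                       # particle output weights

lr = 0.05
for step in range(5001):
    pre = X @ w.T                                # (n, m)
    act = np.maximum(pre, 0.0)
    pred = act @ a / m                           # average over particles
    r = pred - y
    grad_a = act.T @ r / m
    grad_w = ((pre > 0) * np.outer(r, a)).T @ X / m
    # the 1/m in the gradient is undone so every particle moves at O(1) speed,
    # mimicking the time rescaling used in mean-field analyses
    a -= lr * m * grad_a
    w -= lr * m * grad_w
    if step % 1000 == 0:
        print(step, 0.5 * (r ** 2).sum())
```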
The inductive bias of a neural network is largely determined by the architecture and the training algorithm. To achieve good generalization, it is therefore important to train a neural network effectively. We propose a novel orthogonal over-parameterized training (OPT) framework that can provably ...
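A minimal sketch of one plausible reading of such an orthogonal over-parameterization, assuming (my assumption; the excerpt does not spell this out) that randomly initialized neuron weights are kept fixed and only an orthogonal transform applied to them is learned; the module name and sizes are hypothetical:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

# Hypothetical illustration of an orthogonally over-parameterized layer: the randomly
# initialized weight matrix V is frozen, and only an orthogonal matrix R (kept orthogonal
# by PyTorch's parametrization utility) is learned, so the effective weight is R @ V.
class OrthogonalOverParamLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.register_buffer("V", torch.randn(out_features, in_features))   # fixed random init
        self.R = orthogonal(nn.Linear(out_features, out_features, bias=False))

    def forward(self, x):
        W_eff = self.R.weight @ self.V          # effective weight stays a rotation of V
        return x @ W_eff.t()

layer = OrthogonalOverParamLinear(8, 16)
out = layer(torch.randn(4, 8))
print(out.shape)    # torch.Size([4, 16])
```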