When the weights in each layer are initialized from a Gaussian distribution \({\mathcal{N}}(0,{\sigma }_{W}^{2})\) and the sizes of the hidden layers tend to infinity, the function f(x, θ) learned by training the network parameters θ with gradient descent, driving the squared loss to zero, is equivalent to the kernel regression predictor associated with the neural tangent kernel (NTK).
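This equivalence can be illustrated on a finite network via its empirical tangent kernel. The sketch below is a minimal illustration, not code from this work; the architecture, widths, activation, and data are all assumptions. It initializes a small MLP with Gaussian weights scaled as \({\sigma }_{W}^{2}/{d}_{{\rm{in}}}\) (the NTK parameterization), computes the empirical kernel \(K({x}_{1},{x}_{2})=\langle {\nabla }_{\theta }f({x}_{1},\theta ),{\nabla }_{\theta }f({x}_{2},\theta )\rangle\), and forms the kernel regression predictor \(K(\cdot ,X)K{(X,X)}^{-1}y\) that the trained network approaches as the widths grow.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes, sigma_w=1.0):
    """Gaussian N(0, sigma_w^2 / d_in) weights (NTK parameterization)."""
    keys = jax.random.split(key, len(sizes) - 1)
    return [sigma_w * jax.random.normal(k, (d_in, d_out)) / jnp.sqrt(d_in)
            for d_in, d_out, k in zip(sizes[:-1], sizes[1:], keys)]

def f(params, x):
    """Scalar-output MLP f(x, theta); x has shape (n, d)."""
    h = x
    for W in params[:-1]:
        h = jnp.tanh(h @ W)
    return (h @ params[-1]).squeeze(-1)

def empirical_ntk(params, x1, x2):
    """K(x1, x2) = <df/dtheta(x1), df/dtheta(x2)>, via per-example Jacobians."""
    def flat_jac(x):
        jac = jax.jacobian(f)(params, x)  # pytree; each leaf: (n, *W.shape)
        leaves = jax.tree_util.tree_leaves(jac)
        return jnp.concatenate([a.reshape(a.shape[0], -1) for a in leaves], axis=1)
    return flat_jac(x1) @ flat_jac(x2).T

# Toy data and a moderately wide network (assumed sizes, for illustration only).
params = init_params(jax.random.PRNGKey(0), [2, 512, 512, 1])
x_train = jax.random.normal(jax.random.PRNGKey(1), (8, 2))
y_train = jnp.sin(x_train[:, 0])
x_test = jax.random.normal(jax.random.PRNGKey(2), (4, 2))

# Kernel regression with the tangent kernel: the limit of training to zero loss.
K_tt = empirical_ntk(params, x_train, x_train)
K_st = empirical_ntk(params, x_test, x_train)
pred = K_st @ jnp.linalg.solve(K_tt, y_train)
```

For the finite widths above the empirical kernel fluctuates around its infinite-width limit, so the kernel predictor only approximates the trained network; as the hidden-layer sizes grow, the kernel concentrates and the correspondence becomes exact.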