Guided by our observations, we then present examples of overparameterized linear classifiers and neural networks trained by gradient descent (GD) where uniform convergence provably cannot "explain generalization" - even if we take into account the implicit bias of GD to the fullest extent possible. ...
Placing more power in the first few modes makes learning faster. When the labels have nonzero noise σ² > 0 (Fig. 2d, e), generalization error is non-monotonic with a peak, a feature that has been named “double-descent”3,37. By decomposing E_g into the bias and the variance ...
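The non-monotonic behaviour described above can be reproduced in a much simpler setting. The following is a minimal sketch, assuming a toy minimum-norm least-squares regression with Gaussian features and noisy labels rather than the kernel-regression theory of the paper; the sample size, noise level, and teacher weights are illustrative choices made only to show the characteristic peak in E_g near the interpolation threshold.

```python
# Minimal sketch (illustrative assumptions, not the paper's setup): double descent
# in minimum-norm least-squares regression with label noise sigma^2 > 0.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, sigma = 40, 2000, 0.5   # sample size and label-noise std (assumed)

def test_error(p, trials=50):
    """Average generalization error of the min-norm least-squares fit with p features."""
    errs = []
    for _ in range(trials):
        w_star = rng.normal(size=p) / np.sqrt(p)           # teacher weights (assumed)
        X = rng.normal(size=(n_train, p))
        y = X @ w_star + sigma * rng.normal(size=n_train)  # noisy labels
        w_hat = np.linalg.pinv(X) @ y                      # minimum-norm interpolator
        Xt = rng.normal(size=(n_test, p))
        errs.append(np.mean((Xt @ (w_hat - w_star)) ** 2))
    return np.mean(errs)

for p in [5, 20, 35, 40, 45, 80, 200]:
    print(f"p = {p:4d}  E_g ≈ {test_error(p):.3f}")  # error peaks near p = n_train
```

With the noise set to zero the peak largely disappears, consistent with the observation that the non-monotonicity is driven by the variance term of the decomposition.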
can be found in Supplementary Tables S1, S2. For each empirical spectral tuning curve, the best fit parameters of the model are iteratively estimated using a standard gradient descent algorithm under the least squares estimation method. We consider a set of N observations of the activities of third ...
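As a rough illustration of the stated fitting procedure, the sketch below fits a hypothetical Gaussian tuning-curve model to synthetic responses by plain gradient descent on a least-squares loss. The model form, parameter names, learning rate, and synthetic data are all assumptions made for the example; the excerpt does not specify them.

```python
# Minimal sketch: iterative least-squares estimation of tuning-curve parameters
# by gradient descent. The Gaussian tuning-curve form and all settings below
# are hypothetical stand-ins for the unspecified model in the excerpt.
import numpy as np

rng = np.random.default_rng(1)

def gaussian_tuning(freq, amp, center, width, baseline):
    """Hypothetical tuning-curve model: a Gaussian bump over frequency."""
    return baseline + amp * np.exp(-0.5 * ((freq - center) / width) ** 2)

# Synthetic "observations": responses of one unit at N stimulus frequencies.
freqs = np.linspace(0.0, 10.0, 25)
responses = gaussian_tuning(freqs, amp=3.0, center=6.0, width=1.5, baseline=0.5)
responses += 0.3 * rng.normal(size=freqs.size)

def loss(params):
    """Least-squares objective between model predictions and observations."""
    amp, center, width, baseline = params
    return np.mean((gaussian_tuning(freqs, amp, center, width, baseline) - responses) ** 2)

def num_grad(params, eps=1e-5):
    """Central-difference gradient, to keep the sketch self-contained."""
    g = np.zeros_like(params)
    for i in range(params.size):
        d = np.zeros_like(params); d[i] = eps
        g[i] = (loss(params + d) - loss(params - d)) / (2 * eps)
    return g

params = np.array([1.0, 5.0, 2.0, 0.0])   # initial guess (assumed)
for step in range(2000):                  # standard gradient descent iterations
    params -= 0.05 * num_grad(params)

print("fitted [amp, center, width, baseline]:", np.round(params, 2))
```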