MIT Press

Figure 5:

Gradient-flat regions also appear on an overparameterized loss. (A) Squared gradient norms across 500 iterations of Newton-MR for 50 separate runs on the loss of an overparameterized network (see section A.1 for details). Runs that terminate with squared gradient norm below 1e-8 are shown in blue; runs that terminate above that cutoff and with $r$ above 0.9 are shown in orange; all other runs are shown in black. Asterisks indicate the trajectories shown in panel B. (B) The relative residual norm $r$ for the approximate Newton update computed by MR-QLP at each iteration, for three representative traces. Values are local averages with a window size of 10 iterations; raw values are plotted with reduced opacity underneath. Top: nonflat, noncritical point (black). Middle: flat, noncritical point (orange). Bottom: flat, critical point (blue). (C) Empirical cumulative distribution functions for the final (top) and maximal (bottom) relative residual norm $r$. Values above the cutoff for approximate gradient-flatness, $r > 0.9$, are in orange. Observations from runs that terminated below the cutoff for critical points, $\|\nabla L(\theta)\|^{2} <$ 1e-8, are indicated with blue ticks. (D) Loss and index for the points found after 500 iterations of Newton-MR. Colors as in the top-left panel; only points with squared gradient norm below 1e-4 are shown.
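
The color coding and smoothing described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `classify_run` and `local_average` are hypothetical, the classification is assumed to use the value of $r$ at the final iteration, and the handling of the window at the ends of a trace (here, dropping incomplete windows) is an assumption.

```python
import numpy as np

# Cutoffs taken from the caption.
GRAD_CUTOFF = 1e-8  # squared gradient norm below this: treated as a critical point
FLAT_CUTOFF = 0.9   # relative residual norm above this: approximately gradient-flat


def classify_run(final_sq_grad_norm, final_r):
    """Return the panel A color for a run (hypothetical helper).

    Assumes the r used for classification is the value at termination.
    """
    if final_sq_grad_norm < GRAD_CUTOFF:
        return "blue"    # terminated below the critical-point cutoff
    if final_r > FLAT_CUTOFF:
        return "orange"  # not critical, but approximately gradient-flat
    return "black"       # all other runs


def local_average(values, window=10):
    """Moving average over `window` iterations, as in panel B.

    Assumption: incomplete windows at the ends are dropped.
    """
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")
```

Because the gradient-norm test is applied first, a run that satisfies both cutoffs is colored blue, which matches the caption's description of orange runs as those that terminate above the gradient-norm cutoff.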
