Figure 5:
Gradient-flat regions also appear on an overparameterized loss. (A) Squared gradient norms across 500 iterations of Newton-MR for 50 separate runs on the loss of an overparameterized network (see section A.1 for details). Runs that terminate with squared gradient norm below 1e-8 are in blue. Runs that terminate above that cutoff and with r above 0.9 are in orange. All other runs are in black. Asterisks indicate the trajectories shown in panel B. (B) The relative residual norm r for the approximate Newton update computed by MR-QLP at each iteration, for three representative traces. Values are local averages with a window size of 10 iterations. Raw values are plotted with reduced opacity underneath. Top: nonflat, noncritical point (black). Middle: flat, noncritical point (orange). Bottom: flat, critical point (blue). (C) Empirical cumulative distribution functions for the final (top) and maximal (bottom) relative residual norm r. Values above the cutoff for approximate gradient-flatness, r > 0.9, are in orange. Observations from runs that terminated below the cutoff for critical points, ∥∇L(θ)∥² < 1e-8, are indicated with blue ticks. (D) Loss and index for the points found after 500 iterations of Newton-MR. Colors as in panel A; only points with squared gradient norm below 1e-4 are shown.
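
The color coding in panel A and the smoothing in panel B can be illustrated with a minimal sketch. It assumes the final squared gradient norm and the final relative residual norm r of each run are already available as plain numbers. The function names `classify_run` and `local_average` are hypothetical and not taken from the original work, and the edge handling of the 10-iteration window is an assumption, since the caption does not specify it.

```python
import numpy as np

GRAD_CUTOFF = 1e-8  # cutoff on the squared gradient norm (from the caption)
FLAT_CUTOFF = 0.9   # cutoff on r for approximate gradient-flatness (from the caption)


def classify_run(final_sq_grad_norm, final_r):
    """Return the panel A color for one run (hypothetical helper)."""
    if final_sq_grad_norm < GRAD_CUTOFF:
        return "blue"    # terminated at an approximate critical point
    if final_r > FLAT_CUTOFF:
        return "orange"  # not critical, but approximately gradient-flat
    return "black"       # neither critical nor gradient-flat


def local_average(values, window=10):
    """Moving average over `window` iterations.

    Assumption: only full windows are kept (mode="valid"), so the output
    is shorter than the input by window - 1 entries.
    """
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")
```

For example, `classify_run(1e-3, 0.95)` falls in the orange group, while any run with a final squared gradient norm below 1e-8 is blue regardless of its r value.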

