Gradient-flat regions also appear on an overparameterized loss. (A) Squared gradient norms across 500 iterations of Newton-MR for 50 separate runs on the loss of an overparameterized network (see section A.1 for details). Runs that terminate with squared gradient norm below 1e-8 are in blue. Runs that terminate above that cutoff and with relative residual norm above 0.9 are in orange. All other runs are in black. Asterisks indicate the trajectories shown in panel B. (B) The relative residual norm for the approximate Newton update computed by MR-QLP at each iteration, for three representative traces. Values are local averages with a window size of 10 iterations. Raw values are plotted with reduced opacity underneath. Top: nonflat, noncritical point (black). Middle: flat, noncritical point (orange). Bottom: flat, critical point (blue). (C) Empirical cumulative distribution functions for the final (top) and maximal (bottom) relative residual norm. Values above the cutoff for approximate gradient-flatness, 0.9, are in orange. Observations from runs that terminated below the cutoff for critical points, 1e-8, are indicated with blue ticks. (D) Loss and index for the points found after 500 iterations of Newton-MR. Colors as in panel A; only points with squared gradient norm below 1e-4 are shown.
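The color coding in panel A amounts to a two-threshold rule, which can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `classify_run` and its parameter names are hypothetical, and it assumes that the quantity compared against 0.9 is the relative residual norm at the final iteration.

```python
def classify_run(final_sq_grad_norm, final_rel_residual,
                 critical_cutoff=1e-8, flat_cutoff=0.9):
    """Assign a panel-A color to one run from its final-iteration values.

    Hypothetical helper: the cutoffs are taken from the caption, and the
    use of the *final* relative residual norm is an assumption.
    """
    # Squared gradient norm below the critical-point cutoff: blue.
    if final_sq_grad_norm < critical_cutoff:
        return "blue"
    # Not critical, but relative residual norm above the flatness cutoff: orange.
    if final_rel_residual > flat_cutoff:
        return "orange"
    # Everything else: black.
    return "black"


# Illustrative values only, not data from the figure.
for sq_norm, rel_res in [(1e-10, 0.95), (1e-3, 0.95), (1e-3, 0.2)]:
    print(sq_norm, rel_res, classify_run(sq_norm, rel_res))
```

Checking the critical-point cutoff first means that a run which is both critical and flat (as in the bottom trace of panel B) is labeled blue, matching the caption.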