Philip M. Long
Journal Articles
Neural Computation (2022) 34 (6): 1488–1499.
Published: 19 May 2022
Abstract
van Rooyen, Menon, and Williamson (2015) introduced a notion of convex loss functions being robust to random classification noise and established that the “unhinged” loss function is robust in this sense. In this letter, we study the accuracy of binary classifiers obtained by minimizing the unhinged loss and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing.
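A short numerical illustration in the spirit of this result (a hypothetical construction, not necessarily the one in the letter): with the unhinged loss ℓ(v) = 1 − v on the margin v = y⟨w, x⟩, minimizing the expected loss over unit-norm linear classifiers reduces to maximizing ⟨w, E[yx]⟩, so the minimizer is w = E[yx]/‖E[yx]‖ (the unit-norm constraint is assumed here, since the unhinged loss is unbounded below without some constraint). On the linearly separable distribution below, a rare set of large-norm points dominates E[yx], and the resulting halfspace misclassifies most of the probability mass.

```python
import numpy as np

rng = np.random.default_rng(0)
C, p, n = 100.0, 0.05, 200_000     # C: scale of the rare points, p: their probability mass

# A linearly separable distribution: the label is the sign of the second coordinate.
bulk = rng.random(n) < (1 - p)     # "ordinary" examples, probability mass 1 - p
pos = rng.random(n) < 0.5          # which side of the separator each example is on
x = np.empty((n, 2))
y = np.where(pos, 1.0, -1.0)
x[bulk & pos] = (-1.0, 1.0)        # positive bulk points
x[bulk & ~pos] = (1.0, -1.0)       # negative bulk points
x[~bulk & pos] = (C, 1.0)          # rare, large-norm positive points
x[~bulk & ~pos] = (-C, -1.0)       # rare, large-norm negative points

# Minimizing the expected unhinged loss E[1 - y<w, x>] subject to ||w|| <= 1
# amounts to maximizing <w, E[y x]>, so the minimizer is w = E[y x] / ||E[y x]||.
w = (y[:, None] * x).mean(axis=0)
w /= np.linalg.norm(w)

print("unhinged-loss minimizer accuracy:", np.mean(np.sign(x @ w) == y))
print("accuracy of the perfect separator (0, 1):", np.mean(np.sign(x[:, 1]) == y))
```

Here the direction (0, 1) separates the data with margin 1, while the unhinged-loss minimizer's accuracy is roughly p = 0.05, well below random guessing.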
Journal Articles
Neural Computation (2019) 31 (12): 2562–2580.
Published: 01 December 2019
Abstract
We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to gaussian distributions. We show that if the activation function φ satisfies a minimal set of assumptions, satisfied by all activation functions that we know are used in practice, then, as the width of the network gets large, the “length process” converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases and the activation function φ. We also show that this convergence may fail for activation functions φ that violate our assumptions. We show how to use this analysis to choose the variance of weight initialization, depending on the activation function, so that hidden variables maintain a consistent scale throughout the network.
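A rough simulation of the length process is sketched below. It tracks the squared lengths (1/n)‖z_l‖² of the pre-activation vectors in one wide random network, assuming weights W_ij ∼ N(0, σ_w²/n) and biases ∼ N(0, σ_b²), and compares them with the deterministic recursion q_{l+1} = σ_w² E_{z∼N(0, q_l)}[φ(z)²] + σ_b², the standard mean-field form of a length map. The article's exact normalization and assumptions may differ, so treat this as an illustration rather than the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.tanh                      # the activation function
width, depth = 2000, 10
sigma_w, sigma_b = 1.5, 0.1        # weight std (before the 1/sqrt(n) scaling) and bias std

# Empirical "length process": squared lengths (1/n)||z_l||^2 of the pre-activation
# vectors in one random network, on one fixed input.
x = rng.standard_normal(width)
h, empirical = x, []
for _ in range(depth):
    W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
    b = rng.normal(0.0, sigma_b, size=width)
    z = W @ h + b
    empirical.append(np.mean(z ** 2))
    h = phi(z)

# Deterministic length map q_{l+1} = sigma_w^2 * E_{z~N(0,q_l)}[phi(z)^2] + sigma_b^2,
# with the Gaussian expectation estimated by Monte Carlo.
g = rng.standard_normal(200_000)
q, theory = np.mean(x ** 2), []
for layer in range(depth):
    prev = q if layer == 0 else np.mean(phi(np.sqrt(q) * g) ** 2)
    q = sigma_w ** 2 * prev + sigma_b ** 2
    theory.append(q)

for layer, (e, t) in enumerate(zip(empirical, theory), start=1):
    print(f"layer {layer}: empirical {e:.3f}   length map {t:.3f}")
```

With these settings the empirical squared lengths track the recursion closely layer by layer; varying σ_w shows how the scale of the hidden variables can drift with depth unless the initialization variance is matched to φ.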
Journal Articles
Neural Computation (2019) 31 (3): 477–502.
Published: 01 March 2019
Abstract
We analyze algorithms for approximating a function f(x) = Φx mapping ℜ^d to ℜ^d using deep linear neural networks, that is, algorithms that learn a function h parameterized by matrices Θ_1, …, Θ_L and defined by h(x) = Θ_L Θ_{L−1} ⋯ Θ_1 x. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least-squares matrix Φ, in the case where the initial hypothesis Θ_1 = ⋯ = Θ_L = I has excess loss bounded by a small enough constant. We also show that gradient descent fails to converge for Φ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If Φ is symmetric positive definite, we show that an algorithm that initializes Θ_i = I learns an ε-approximation of f using a number of updates polynomial in L, the condition number of Φ, and log(d/ε). In contrast, we show that if the least-squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that Φ satisfies u^⊤Φu > 0 for all u but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant u^⊤ Θ_L Θ_{L−1} ⋯ Θ_1 u > 0 for all u, and the other that “balances” Θ_1, …, Θ_L so that they have the same singular values.
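A minimal sketch of the basic setting, assuming the isotropy condition above so that the population quadratic loss reduces (up to constants) to the Frobenius error 0.5·‖Θ_L ⋯ Θ_1 − Φ‖_F²: plain gradient descent with identity initialization, run once for a symmetric positive definite Φ close to the identity and once for a symmetric Φ with a negative eigenvalue. This illustrates the phenomena described above; it is not the paper's algorithms, and in particular it omits the regularizers.

```python
import numpy as np

def product(mats, d):
    """Theta_L ... Theta_1 for mats = [Theta_1, ..., Theta_L]."""
    A = np.eye(d)
    for T in mats:
        A = T @ A
    return A

def gd_identity_init(Phi, L=4, lr=0.05, steps=2000):
    """Gradient descent on 0.5 * ||Theta_L ... Theta_1 - Phi||_F^2 from Theta_i = I."""
    d = Phi.shape[0]
    thetas = [np.eye(d) for _ in range(L)]
    for _ in range(steps):
        G = product(thetas, d) - Phi                 # gradient w.r.t. the end-to-end map
        grads = []
        for i in range(L):
            left = product(thetas[:i], d)            # Theta_i ... Theta_1
            right = product(thetas[i + 1:], d)       # Theta_L ... Theta_{i+2}
            grads.append(right.T @ G @ left.T)       # gradient w.r.t. Theta_{i+1}
        thetas = [T - lr * g for T, g in zip(thetas, grads)]
    return np.linalg.norm(product(thetas, d) - Phi)

rng = np.random.default_rng(0)
d = 5
S = rng.standard_normal((d, d))
S = (S + S.T) / 2
Phi_spd = np.eye(d) + 0.3 * S / np.linalg.norm(S, 2)   # symmetric PD, close to I
Phi_neg = np.diag([1.0, 1.0, 1.0, 1.0, -1.0])          # symmetric, one negative eigenvalue

print("SPD target, final Frobenius error:      ", gd_identity_init(Phi_spd))
print("negative-eigenvalue target, final error:", gd_identity_init(Phi_neg))
```

For the positive definite target the end-to-end map reaches Φ to numerical precision, while for the target with a negative eigenvalue the product Θ_L ⋯ Θ_1 stays bounded away from Φ (the error settles near 1), consistent with the failure of identity-initialized gradient descent described above.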