1–5 of 5 results for Stephen José Hanson
Journal Articles
Neural Computation (2024) 36 (6): 1228–1244.
Published: 10 May 2024
Abstract
Deep learning (DL), a variant of the neural network algorithms originally proposed in the 1980s (Rumelhart et al., 1986), has made surprising progress in artificial intelligence (AI), ranging from language translation and protein folding (Jumper et al., 2021) to autonomous cars and, more recently, human-like language models (chatbots), all of which seemed intractable until very recently. Despite the growing use of DL networks, little is understood about the learning mechanisms and representations that make these networks effective across such a diverse range of applications. Part of the answer must be the huge scale of the architecture and, of course, the large scale of the data, since not much has changed since 1986. But the nature of deep learned representations remains largely unknown. Unfortunately, training sets with millions or billions of tokens have unknown combinatorics, and networks with millions or billions of hidden units cannot easily be visualized, so their mechanisms cannot easily be revealed. In this letter, we explore these challenges with a large DL network (a VGG with 1.24 million weights) on a novel high-density sample task (five unique tokens with more than 500 exemplars per token), which allows us to follow more carefully the emergence of category structure and feature construction. We use various visualization methods to follow the emergence of the classification and the development of the coupling of feature detectors and structures that provide a type of graphical bootstrapping. From these results, we harvest some basic observations of the learning dynamics of DL and propose a new theory of complex feature construction.
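As a rough sketch of how this kind of activation tracking can be set up (this is not the authors' code; torchvision's VGG-11 stands in for the paper's 1.24-million-weight VGG, and the five-class setup and probed layer indices are assumptions), forward hooks can collect hidden-layer activations for a fixed probe batch so that emerging category structure can be inspected across training epochs:

```python
# Minimal sketch (assumptions: torchvision VGG-11, a 5-class task, a fixed probe batch).
import torch
from torchvision import models

model = models.vgg11(weights=None, num_classes=5)  # VGG-style classifier

activations = {}  # layer name -> latest activation tensor

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().flatten(1)  # (batch, features)
    return hook

# Watch a few convolutional stages to follow feature construction over training.
for idx in (3, 8, 13):
    model.features[idx].register_forward_hook(make_hook(f"features.{idx}"))

def snapshot(probe_batch):
    """Run a fixed probe batch through the network and return per-layer activations."""
    model.eval()
    with torch.no_grad():
        model(probe_batch)
    return {k: v.clone() for k, v in activations.items()}

# After each training epoch, call snapshot(probe_batch) and project each layer's
# activations to 2-D (e.g., PCA) to watch class clusters emerge layer by layer.
```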
Journal Articles
Neural Computation (2020) 32 (5): 1018–1032.
Published: 01 May 2020
Abstract
Multilayer neural networks have led to remarkable performance on many kinds of benchmark tasks in text, speech, and image processing. Nonlinear parameter estimation in hierarchical models is known to be subject to overfitting and misspecification. One approach to these estimation and related problems (e.g., saddle points, collinearity, feature discovery) is called Dropout. The Dropout algorithm removes hidden units according to a binomial random variable with probability p prior to each update, creating random “shocks” to the network that are averaged over updates (thus creating weight sharing). In this letter, we reestablish an older parameter search method and show that Dropout is a special case of this more general model, the stochastic delta rule (SDR), published originally in 1990. Unlike Dropout, SDR redefines each weight $w_{ij}$ in the network as a random variable with mean $\mu_{w_{ij}}$ and standard deviation $\sigma_{w_{ij}}$. Each weight random variable is sampled on each forward activation, consequently creating an exponential number of potential networks with shared weights (accumulated in the mean values). Both parameters are updated according to prediction error, thus resulting in weight noise injections that reflect a local history of prediction error and local model averaging. SDR therefore implements a more sensitive, local, gradient-dependent simulated annealing per weight, converging in the limit to a Bayes-optimal network. We run tests on standard benchmarks (CIFAR and ImageNet) using a modified version of DenseNet and show that SDR outperforms standard Dropout in top-5 validation error by approximately 13% with DenseNet-BC 121 on ImageNet, with various validation error improvements in smaller networks. We also show that SDR reaches the accuracy that Dropout attains in 100 epochs in as few as 40 epochs, and that training error improves by as much as 80%.
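A minimal sketch of the mechanism described above (not the authors' implementation; the single linear layer, squared-error loss, and the specific learning-rate and decay values are illustrative assumptions): each weight is a Gaussian random variable, a weight matrix is sampled on every forward pass, the means follow the usual gradient step, and the standard deviations grow with the magnitude of the local gradient and then decay.

```python
# Stochastic delta rule (SDR) sketch for one linear layer, in NumPy.
# Assumptions: squared-error loss; alpha/beta/zeta values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 3
mu = rng.normal(0.0, 0.1, size=(n_in, n_out))   # per-weight means
sigma = np.full((n_in, n_out), 0.05)            # per-weight standard deviations
alpha, beta, zeta = 0.1, 0.01, 0.99             # mean lr, sigma lr, sigma decay

def sdr_step(x, y_target):
    """One update: sample weights, forward pass, compute gradient, update mu and sigma."""
    w = mu + sigma * rng.standard_normal(mu.shape)    # sample one network from the weight distribution
    y = x @ w                                         # linear forward pass
    err = y - y_target                                # prediction error
    grad = np.outer(x, err)                           # dE/dw for squared error
    mu_new = mu - alpha * grad                        # means follow the gradient
    sigma_new = zeta * (sigma + beta * np.abs(grad))  # noise tracks |gradient|, then decays
    return mu_new, sigma_new

x = rng.standard_normal(n_in)
y_target = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    mu, sigma = sdr_step(x, y_target)
```

Because a fresh weight matrix is drawn on every pass while the means accumulate the gradient history, the procedure averages over an exponential number of weight-shared networks, which is the sense in which Dropout falls out as a special case.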
Journal Articles
Neural Computation (2008) 20 (2): 486–503.
Published: 01 February 2008
Abstract
Over the past decade, object recognition work has confounded voxel response detection with potential voxel class identification. Consequently, the claim that there are areas of the brain that are necessary and sufficient for object identification cannot be resolved with the existing associative methods (e.g., the general linear model) that are dominant in brain imaging. In order to explore this controversy, we trained full-brain (40,000 voxels), single-TR (repetition time) classifiers on data from 10 subjects in two different recognition tasks on the most controversial classes of stimuli (house and face) and show 97.4% median out-of-sample (unseen TRs) generalization. This performance allowed us to reliably and uniquely assay the classifier's voxel diagnosticity in all individual subjects' brains. In this two-class case, there may be specific areas diagnostic for house stimuli (e.g., LO) or for face stimuli (e.g., STS); however, in contrast to the detection results common in this literature, neither the fusiform face area nor the parahippocampal place area is shown to be uniquely diagnostic for faces or places, respectively.
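A schematic version of this kind of analysis (not the authors' pipeline; the synthetic data, the L2-regularized logistic regression, and the use of coefficient magnitude as a stand-in for voxel diagnosticity are all assumptions) trains a whole-brain, single-TR classifier and scores it on held-out TRs:

```python
# Sketch of a single-TR, whole-brain two-class (face vs. house) classifier.
# Assumptions: random data stands in for real fMRI; the classifier choice is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trs, n_voxels = 200, 40_000
X = rng.standard_normal((n_trs, n_voxels))   # one row per TR, one column per voxel
y = rng.integers(0, 2, size=n_trs)           # 0 = house, 1 = face label per TR

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# Out-of-sample generalization: accuracy on TRs never seen during fitting.
scores = cross_val_score(clf, X, y, cv=5)
print("held-out accuracy per fold:", scores)

# A crude diagnosticity proxy: voxels with the largest absolute weights after fitting.
clf.fit(X, y)
top_voxels = np.argsort(np.abs(clf.coef_[0]))[::-1][:20]
```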
Journal Articles
Neural Computation (2002) 14 (9): 2245–2268.
Published: 01 September 2002
Abstract
A simple associationist neural network learns to factor abstract rules (i.e., grammars) from sequences of arbitrary input symbols by inventing abstract representations that accommodate unseen symbol sets as well as unseen but similar grammars. The neural network is shown to have the ability to transfer grammatical knowledge to both new symbol vocabularies and new grammars. Analysis of the state space shows that the network learns generalized abstract structures of the input and is not simply memorizing the input strings. These representations are context sensitive, hierarchical, and based on the state variable of the finite-state machines that the neural network has learned. Generalization to new symbol sets or grammars arises from the spatial nature of the internal representations used by the network, which allows new symbol sets to be encoded close to already-learned symbol sets in the network's hidden-unit space. The results run counter to arguments that learning algorithms based on weight adaptation after each exemplar presentation (such as the long-term potentiation found in the mammalian nervous system) cannot in principle extract symbolic knowledge from positive examples, as prescribed by prevailing human linguistic theory and evolutionary psychology.
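The kind of network described above can be sketched as a simple Elman-style recurrent net doing next-symbol prediction on strings from a small finite-state grammar; the grammar, vocabulary, hidden size, and training details below are illustrative assumptions, not the paper's setup:

```python
# Sketch: an Elman-style recurrent network learns next-symbol prediction on strings
# generated by a tiny finite-state grammar (illustrative, not the paper's grammar).
import numpy as np
import torch
import torch.nn as nn

# Finite-state grammar: state -> list of (symbol, next_state); -1 ends the string.
GRAMMAR = {0: [("A", 1), ("B", 2)], 1: [("C", 2), ("A", 1)], 2: [("D", -1)]}
VOCAB = {s: i for i, s in enumerate("ABCD")}
rng = np.random.default_rng(0)

def sample_string():
    state, out = 0, []
    while state != -1:
        sym, state = GRAMMAR[state][rng.integers(len(GRAMMAR[state]))]
        out.append(VOCAB[sym])
    return out

class ElmanNet(nn.Module):
    def __init__(self, n_symbols=4, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(n_symbols, hidden, batch_first=True)  # simple recurrent layer
        self.readout = nn.Linear(hidden, n_symbols)              # next-symbol logits

    def forward(self, onehot_seq):
        states, _ = self.rnn(onehot_seq)
        return self.readout(states)

net = ElmanNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(500):
    seq = sample_string()
    x = torch.eye(4)[seq[:-1]].unsqueeze(0)  # inputs: every symbol but the last
    t = torch.tensor(seq[1:])                # targets: every symbol but the first
    loss = loss_fn(net(x).squeeze(0), t)
    opt.zero_grad(); loss.backward(); opt.step()

# To probe vocabulary transfer in the spirit of the abstract, one could freeze
# the recurrent weights and retrain only the input/output mappings on relabeled strings.
```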
Journal Articles
Neural Computation (2000) 12 (3): 531–545.
Published: 01 March 2000
Abstract
A common misperception within the neural network community is that even with nonlinearities in their hidden layer, autoassociators trained with backpropagation are equivalent to linear methods such as principal component analysis (PCA). Our purpose is to demonstrate that nonlinear autoassociators actually behave differently from linear methods and that they can outperform these methods when used for latent extraction, projection, and classification. While linear autoassociators emulate PCA, and thus exhibit a flat or unimodal reconstruction error surface, autoassociators with nonlinearities in their hidden layer learn domains by building reconstruction error surfaces that, depending on the task, contain multiple local valleys. This interpolation bias allows nonlinear autoassociators to represent appropriate classifications of nonlinear multimodal domains, in contrast to linear autoassociators, which are inappropriate for such tasks. In fact, autoassociators with hidden-unit nonlinearities can be shown to perform nonlinear classification and nonlinear recognition.
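To make the comparison concrete, the sketch below (illustrative assumptions: a two-cluster toy dataset and scikit-learn's MLPRegressor standing in for a backpropagation-trained autoassociator) sets linear PCA reconstruction against a small nonlinear autoassociator with a two-unit bottleneck on a multimodal domain:

```python
# Sketch: linear PCA vs. a nonlinear autoassociator on a two-cluster (multimodal) domain.
# Assumptions: toy data and MLPRegressor-as-autoassociator are illustrative choices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Two curved clusters in 2-D, embedded in 5-D with small noise dimensions.
t = rng.uniform(-1, 1, size=(400, 1))
cluster = rng.integers(0, 2, size=(400, 1))
base = np.hstack([t, t**2 + 3 * cluster])            # parabola, shifted per cluster
X = np.hstack([base, 0.05 * rng.standard_normal((400, 3))])

# Linear baseline: project onto 2 principal components and reconstruct.
pca = PCA(n_components=2).fit(X)
X_pca = pca.inverse_transform(pca.transform(X))

# Nonlinear autoassociator: tanh encoder/decoder around a 2-unit bottleneck,
# trained to reproduce its own input.
auto = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                    max_iter=5000, random_state=0).fit(X, X)
X_auto = auto.predict(X)

print("PCA reconstruction MSE:   ", np.mean((X - X_pca) ** 2))
print("autoassociator MSE:       ", np.mean((X - X_auto) ** 2))
```

The task-dependent shape of the resulting reconstruction error surface, rather than any single run's numbers, is the point the abstract makes about nonlinear versus linear autoassociators.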