Skip Nav Destination
Close Modal
Update search
NARROW
Format
Journal
TocHeadingTitle
Date
Availability
1-17 of 17
Jürgen Schmidhuber
Close
Follow your search
Access your saved searches in your account
Would you like to receive an alert when new items match your search?
Sort by
Journal Articles
Publisher: Journals Gateway
Neural Computation (2023) 35 (4): 593–626.
Published: 18 March 2023
Abstract
View article
PDF
The discovery of reusable subroutines simplifies decision making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in an unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about subroutine boundary points in light of new incoming information. In this work, we propose slot-based transformer for temporal abstraction (SloTTAr), a fully parallel approach that integrates sequence processing transformers with a slot attention module to discover subroutines in an unsupervised fashion while leveraging adaptive computation for learning about the number of such subroutines solely based on their empirical distribution. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, even for sequences containing variable amounts of subroutines, while being up to seven times faster to train on existing benchmarks.
Journal Articles
Publisher: Journals Gateway
Neural Computation (2022) 34 (11): 2232–2272.
Published: 07 October 2022
Abstract
View article
PDF
An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although it is very difficult to provide theoretical performance guarantees for our new approach, we also prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work.
Journal Articles
Publisher: Journals Gateway
Neural Computation (2022) 34 (4): 829–855.
Published: 23 March 2022
Abstract
View article
PDF
Under the Bayesian brain hypothesis, behavioral variations can be attributed to different priors over generative model parameters. This provides a formal explanation for why individuals exhibit inconsistent behavioral preferences when confronted with similar choices. For example, greedy preferences are a consequence of confident (or precise) beliefs over certain outcomes. Here, we offer an alternative account of behavioral variability using Rényi divergences and their associated variational bounds. Rényi bounds are analogous to the variational free energy (or evidence lower bound) and can be derived under the same assumptions. Importantly, these bounds provide a formal way to establish behavioral differences through an α parameter, given fixed priors. This rests on changes in α that alter the bound (on a continuous scale), inducing different posterior estimates and consequent variations in behavior. Thus, it looks as if individuals have different priors and have reached different conclusions. More specifically, α → 0 + optimization constrains the variational posterior to be positive whenever the true posterior is positive. This leads to mass-covering variational estimates and increased variability in choice behavior. Furthermore, α → + ∞ optimization constrains the variational posterior to be zero whenever the true posterior is zero. This leads to mass-seeking variational posteriors and greedy preferences. We exemplify this formulation through simulations of the multiarmed bandit task. We note that these α parameterizations may be especially relevant (i.e., shape preferences) when the true posterior is not in the same family of distributions as the assumed (simpler) approximate density, which may be the case in many real-world scenarios. The ensuing departure from vanilla variational inference provides a potentially useful explanation for differences in behavioral preferences of biological (or artificial) agents under the assumption that the brain performs variational Bayesian inference.
Includes: Supplementary data
Journal Articles
Publisher: Journals Gateway
Neural Computation (2021) 33 (6): 1498–1553.
Published: 13 May 2021
FIGURES
Abstract
View article
PDF
A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enabling sample efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this letter, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.
Journal Articles
Publisher: Journals Gateway
Neural Computation (2012) 24 (11): 2994–3024.
Published: 01 November 2012
FIGURES
| View All (10)
Abstract
View article
PDF
We introduce here an incremental version of slow feature analysis (IncSFA), combining candid covariance-free incremental principal components analysis (CCIPCA) and covariance-free incremental minor components analysis (CIMCA). IncSFA's feature updating complexity is linear with respect to the input dimensionality, while batch SFA's (BSFA) updating complexity is cubic. IncSFA does not need to store, or even compute, any covariance matrices. The drawback to IncSFA is data efficiency: it does not use each data point as effectively as BSFA. But IncSFA allows SFA to be tractably applied, with just a few parameters, directly on high-dimensional input streams (e.g., visual input of an autonomous agent), while BSFA has to resort to hierarchical receptive-field-based architectures when the input dimension is too high. Further, IncSFA's updates have simple Hebbian and anti-Hebbian forms, extending the biological plausibility of SFA. Experimental results show IncSFA learns the same set of features as BSFA and can handle a few cases where BSFA fails.
Journal Articles
Publisher: Journals Gateway
Neural Computation (2010) 22 (12): 3207–3220.
Published: 01 December 2010
FIGURES
Abstract
View article
PDF
Good old online backpropagation for plain multilayer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.
Journal Articles
Publisher: Journals Gateway
Neural Computation (2007) 19 (3): 757–779.
Published: 01 March 2007
Abstract
View article
PDF
In recent years, gradient-based LSTM recurrent neural networks (RNNs) solved many previously RNN-unlearnable tasks. Sometimes, however, gradient information is of little use for training RNNs, due to numerous local minima. For such cases, we present a novel method: EVOlution of systems with LINear Outputs (Evolino). Evolino evolves weights to the nonlinear, hidden nodes of RNNs while computing optimal linear mappings from hidden state to output, using methods such as pseudo-inverse-based linear regression. If we instead use quadratic programming to maximize the margin, we obtain the first evolutionary recurrent support vector machines. We show that Evolino-based LSTM can solve tasks that Echo State nets (Jaeger, 2004a) cannot and achieves higher accuracy in certain continuous function generation tasks than conventional gradient descent RNNs, including gradient-based LSTM.
Journal Articles
Publisher: Journals Gateway
Neural Computation (2000) 12 (10): 2451–2471.
Published: 01 October 2000
Abstract
View article
PDF
Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs). We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive “forget gate” that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. We review illustrative benchmark problems on which standard LSTM outperforms other RNN algorithms. All algorithms (including LSTM) fail to solve continual versions of these problems. LSTM with forget gates, however, easily solves them, and in an elegant way.
Journal Articles
Publisher: Journals Gateway
Neural Computation (1999) 11 (3): 679–714.
Published: 01 April 1999
Abstract
View article
PDF
Low-complexity coding and decoding (LOCOCODE) is a novel approach to sensory coding and unsupervised learning. Unlike previous methods, it explicitly takes into account the information-theoretic complexity of the code generator. It computes lococodes that convey information about the input data and can be computed and decoded by low-complexity mappings. We implement LOCOCODE by training autoassociators with flat minimum search, a recent, general method for discovering low-complexity neural nets. It turns out that this approach can unmix an unknown number of independent data sources by extracting a minimal number of low-complexity features necessary for representing the data. Experiments show that unlike codes obtained with standard autoencoders, lococodes are based on feature detectors, never unstructured, usually sparse, and sometimes factorial or local (depending on statistical properties of the data). Although LOCOCODE is not explicitly designed to enforce sparse or factorial codes, it extracts optimal codes for difficult versions of the “bars” benchmark problem, whereas independent component analysis (ICA) and principal component analysis (PCA) do not. It produces familiar, biologically plausible feature detectors when applied to real-world images and codes with fewer bits per pixel than ICA and PCA. Unlike ICA, it does not need to know the number of independent sources. As a preprocessor for a vowel recognition benchmark problem, it sets the stage for excellent classification performance. Our results reveal an interesting, previously ignored connection between two important fields: regularizer research and ICA-related research. They may represent a first step toward unification of regularization and unsupervised learning.
Journal Articles
Publisher: Journals Gateway
Neural Computation (1997) 9 (8): 1735–1780.
Published: 15 November 1997
Abstract
View article
PDF
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O . 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
Journal Articles
Publisher: Journals Gateway
Neural Computation (1997) 9 (1): 1–42.
Published: 01 January 1997
Abstract
View article
PDF
We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a “flat” minimum of the error function. A flat minimum is a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to “simple” networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a “good” weight prior. Instead we have a prior over input output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation's order of complexity. Automatically, it effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and “optimal brain surgeon/optimal brain damage.”
Journal Articles
Publisher: Journals Gateway
Neural Computation (1996) 8 (4): 773–786.
Published: 01 May 1996
Abstract
View article
PDF
Predictability minimization (PM—Schmidhuber 1992) exhibits various intuitive and theoretical advantages over many other methods for unsupervised redundancy reduction. So far, however, there have not been any serious practical applications of PM. In this paper, we apply semilinear PM to static real world images and find that without a teacher and without any significant preprocessing, the system automatically learns to generate distributed representations based on well-known feature detectors, such as orientation-sensitive edge detectors and off-center–on-surround detectors, thus extracting simple features related to those considered useful for image preprocessing and compression.
Journal Articles
Publisher: Journals Gateway
Neural Computation (1993) 5 (4): 625–635.
Published: 01 July 1993
Abstract
View article
PDF
Prediction problems are among the most common learning problems for neural networks (e.g., in the context of time series prediction, control, etc.). With many such problems, however, perfect prediction is inherently impossible. For such cases we present novel unsupervised systems that learn to classify patterns such that the classifications are predictable while still being as specific as possible. The approach can be related to the IMAX method of Becker and Hinton (1989) and Zemel and Hinton (1991). Experiments include a binary stereo task proposed by Becker and Hinton, which can be solved more readily by our system.
Journal Articles
Publisher: Journals Gateway
Neural Computation (1992) 4 (6): 863–879.
Published: 01 November 1992
Abstract
View article
PDF
I propose a novel general principle for unsupervised learning of distributed nonredundant internal representations of input patterns. The principle is based on two opposing forces. For each representational unit there is an adaptive predictor, which tries to predict the unit from the remaining units. In turn, each unit tries to react to the environment such that it minimizes its predictability. This encourages each unit to filter "abstract concepts" out of the environmental input such that these concepts are statistically independent of those on which the other units focus. I discuss various simple yet potentially powerful implementations of the principle that aim at finding binary factorial codes (Barlow et al . 1989), i.e., codes where the probability of the occurrence of a particular input is simply the product of the probabilities of the corresponding code symbols. Such codes are potentially relevant for (1) segmentation tasks, (2) speeding up supervised learning, and (3) novelty detection. Methods for finding factorial codes automatically implement Occam's razor for finding codes using a minimal number of units. Unlike previous methods the novel principle has a potential for removing not only linear but also nonlinear output redundancy. Illustrative experiments show that algorithms based on the principle of predictability minimization are practically feasible. The final part of this paper describes an entirely local algorithm that has a potential for learning unique representations of extended input sequences.
Journal Articles
Publisher: Journals Gateway
Neural Computation (1992) 4 (2): 243–248.
Published: 01 March 1992
Abstract
View article
PDF
The real-time recurrent learning (RTRL) algorithm (Robinson and Fallside 1987; Williams and Zipser 1989) requires O ( n 4 ) computations per time step, where n is the number of noninput units. I describe a method suited for on-line learning that computes exactly the same gradient and requires fixed-size storage of the same order but has an average time complexity per time step of O ( n 3 ).
Journal Articles
Publisher: Journals Gateway
Neural Computation (1992) 4 (2): 234–242.
Published: 01 March 1992
Abstract
View article
PDF
Previous neural network learning algorithms for sequence processing are computationally expensive and perform poorly when it comes to long time lags. This paper first introduces a simple principle for reducing the descriptions of event sequences without loss of information . A consequence of this principle is that only unexpected inputs can be relevant. This insight leads to the construction of neural architectures that learn to “divide and conquer” by recursively decomposing sequences. I describe two architectures. The first functions as a self-organizing multilevel hierarchy of recurrent networks. The second, involving only two recurrent networks, tries to collapse a multilevel predictor hierarchy into a single recurrent net. Experiments show that the system can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets.
Journal Articles
Publisher: Journals Gateway
Neural Computation (1992) 4 (1): 131–139.
Published: 01 January 1992
Abstract
View article
PDF
Previous algorithms for supervised sequence learning are based on dynamic recurrent networks. This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: The first net learns to produce context-dependent weight changes for the second net whose weights may vary very quickly. The method offers the potential for STM storage efficiency: A single weight (instead of a full-fledged unit) may be sufficient for storing temporal information. Various learning methods are derived. Two experiments with unknown time delays illustrate the approach. One experiment shows how the system can be used for adaptive temporary variable binding.