## Abstract

We introduce here an incremental version of slow feature analysis (IncSFA), combining candid covariance-free incremental principal components analysis (CCIPCA) and covariance-free incremental minor components analysis (CIMCA). IncSFA's feature updating complexity is linear with respect to the input dimensionality, while batch SFA's (BSFA) updating complexity is cubic. IncSFA does not need to store, or even compute, any covariance matrices. The drawback to IncSFA is data efficiency: it does not use each data point as effectively as BSFA. But IncSFA allows SFA to be tractably applied, with just a few parameters, directly on high-dimensional input streams (e.g., visual input of an autonomous agent), while BSFA has to resort to hierarchical receptive-field-based architectures when the input dimension is too high. Further, IncSFA's updates have simple Hebbian and anti-Hebbian forms, extending the biological plausibility of SFA. Experimental results show IncSFA learns the same set of features as BSFA and can handle a few cases where BSFA fails.

## 1. Introduction

Slow feature analysis (SFA; Wiskott & Sejnowski, 2002; Wiskott, Berkes, Franzius, Sprekeler, & Wilbert, 2011) is an unsupervised learning technique that extracts features from an input stream with the objective that the feature responses must change over time but should change as slowly as possible. A temporal stability objective has been used in some other unsupervised learning techniques (Hinton, 1989; Földiák, 1991; Mitchison, 1991; Schmidhuber, 1992a; Bergstra & Bengio, 2009). SFA is different: the slow features are the lowest-order eigenvectors of an eigensystem based on the covariance matrix of input derivative measurements. The eigensystem formulation guarantees that its solution methods reliably converge to a best solution, avoiding entirely the issue of local minima. SFA has shown success in many problems and scenarios: extraction of driving forces of a dynamical system (Wiskott, 2003), nonlinear blind source separation (Sprekeler, Zito, & Wiskott, 2010), as a preprocessor for reinforcement learning (Legenstein, Wilbert, & Wiskott, 2010; Kompella, Pape, Masci, Frank, & Schmidhuber, 2011), and learning of place cells, head direction cells, grid cells, and spatial view cells from high-dimensional visual input (Franzius, Sprekeler, & Wiskott, 2007). Such representations also exist in biological agents (O'Keefe & Dostrovsky, 1971; Taube, Muller, & Ranck, 1990; Rolls, 1999; Hafting, Fyhn, Molden, Moser, & Moser, 2005).

A typical SFA implementation occurs in several stages. The data must all be collected, processed (e.g., nonlinearly expanded, outliers removed), whitened, the covariance matrix of the derivative measurements constructed, and its eigenvectors solved; these eigenvectors, in reverse order, are the slowest features. We present here incremental slow feature analysis (IncSFA; Kompella, Luciw, & Schmidhuber, 2011), which interleaves all the stages so that the slow features can be updated after each derivative measurement. A few earlier techniques with temporal continuity objective were incremental as well (Hinton, 1989; Bergstra & Bengio, 2009), but IncSFA follows the SFA formulation and can track solutions that would be uncovered by batch SFA (BSFA), over which it has the following advantages:

- *Adaptation to changing input statistics.* In the BSFA paradigm, new data cannot be used to modify already learned slow features. If input statistics change, IncSFA can adapt existing features without outside intervention, while BSFA has to discard previous features to process the new data.
- *Computational efficiency.* BSFA techniques rely on batch principal component analysis (PCA; Jolliffe, 1986). For input observations in an *I*-dimensional space, the computational complexity of PCA using the Jacobi method (Forsythe & Henrici, 1958) is of the order *O*(*I*^{3}). IncSFA's updating complexity scales linearly with dimensionality (*O*(*I*)); thus, it has an advantage when the input dimension is large.
- *Space complexity.* First, IncSFA can discard each observation immediately after an update. Second, IncSFA uses covariance-free techniques, in which the data covariance matrices never need to be computed, even in passing: the features are updated directly from the new data, and the *I*(*I* + 1)/2 parameters of the covariance matrix do not have to be estimated.
- *Simplicity.* For extracting features from high-dimensional image sequences, IncSFA presents a simpler solution method than the alternative of deep receptive-field-based BSFA networks. By “simpler,” we mean that IncSFA has just a handful of parameters instead of the multitude associated with the deep nets.
- *Reduced sensitivity to outliers.* Outlier observations in a data set can cause problems for BSFA, as they can corrupt the slow features; in some cases, a feature may even become sensitive to an outlier. In a typical batch implementation, each observation has the same amount of influence on the features. In IncSFA, the influence of a single observation fades as newer observations are experienced, with the learning rate implicitly controlling this forgetting factor. Different learning rate settings can lead to different features: features that emerge from a high learning rate are biased toward slowly changing phenomena that occur with more regularity than if a lower learning rate were used.
- *Biological plausibility.* IncSFA adds further biological plausibility to SFA. SFA itself has been linked to biological systems through its derivation of place cells, grid cells, and so on, but it is difficult to see how BSFA could be realized in the brain. IncSFA's updates can be described in incremental Hebbian and anti-Hebbian forms (discussed in detail in section 5.4).

The disadvantage of IncSFA is data efficiency: it does not use each data point as effectively as BSFA and thus needs more input samples for learning.

The remainder of this letter is organized as follows. Section 2 reviews SFA and its batch solution. Section 3 walks through the new incremental SFA. Section 4 has some illustrative experiments and results. Section 5 discusses supplementary issues, including convergence, parameter setting, and biological plausibility. Section 6 concludes the paper.

## 2. Background

### 2.1. SFA: Intuition.

In this section, we briefly review SFA in an intuitive sense and present its formulation and batch solution method (those already familiar with SFA can skip this section). SFA is a form of unsupervised learning (UL). Like some other feature extraction techniques, it searches for a set of mappings *g*_{i} from data **x** to output components *y*_{i} = *g*_{i}(**x**) that are separate from each other in some sense and express information that is in some sense relevant. In SFA, the features are separated by mutual decorrelation of their outputs, while relevance is defined as minimal but nonzero change over time (slowness). Ordering the functions *g*_{1}, *g*_{2}, …, *g*_{I} by slowness, we can discard all but the *J* < *I* slowest, getting rid of irrelevant information such as quickly changing noise assumed to be useless. (See Figure 1 for a visual example of the meaning of a slow feature.)

SFA-based UL learns instantaneous features from sequential data (Hinton, 1989; Wiskott & Sejnowski, 2002; Doersch, Lee, Huang, & Miller, 2010). Relevance cannot be uncovered without taking time into account, but once it is known, each input frame can be encoded on its own. In this, SFA differs both from many well-known unsupervised feature extractors (Abut, 1990; Jolliffe, 1986; Comon, 1994; Lee & Seung, 1999; Kohonen, 2001; Hinton, 2002), which ignore dynamics, and from UL systems that both learn and apply features to sequences (Schmidhuber, 1992a, 1992b, 1992c; Lindstädt, 1993; Klapper-Rybicka, Schraudolph, & Schmidhuber, 2001; Jenkins & Matarić, 2004; Lee, Largman, Pham, & Ng, 2010; Gisslen, Luciw, Graziano, & Schmidhuber, 2011), which assume that the state of the system itself can depend on past information.

The compact relevant encodings uncovered by SFA reduce the search space for downstream goal-directed learning procedures (Schmidhuber, 1999; Barlow, 2001), especially reinforcement learning. As an example, consider a robot sensing with an onboard camera. Reinforcement learning algorithms applied directly to pixels can be quite inefficient due to the size of the search space. Slow features can encode each image into a small set of useful state variables, and the robot can use these few state variables to quickly develop useful control policies. The state variables from SFA are approximations of low-order eigenvectors of the graph Laplacian (Sprekeler, 2011), that is, proto-value functions (Mahadevan & Maggioni, 2007). This is why they are typically more useful as features for reinforcement learning than other types of features, such as principal components.

### 2.2. SFA: Formulation.

SFA's optimization problem (Wiskott & Sejnowski, 2002; Franzius et al., 2007) is formally written as follows: given an *I*-dimensional sequential input signal **x**(*t*) = [*x*_{1}(*t*), …, *x*_{I}(*t*)]^{T}, find a set of *J* instantaneous real-valued functions **g**(**x**) = [*g*_{1}(**x**), …, *g*_{J}(**x**)]^{T}, which together generate a *J*-dimensional output signal **y**(*t*) = [*y*_{1}(*t*), …, *y*_{J}(*t*)]^{T} with *y*_{j}(*t*) ≔ *g*_{j}(**x**(*t*)), such that for each *j* ∈ {1, …, *J*},

Δ(*y*_{j}) ≔ 〈ẏ_{j}^{2}〉 is minimal, (2.1)

under the constraints

〈*y*_{j}〉 = 0 (zero mean), (2.2)

〈*y*_{j}^{2}〉 = 1 (unit variance), (2.3)

〈*y*_{i} *y*_{j}〉 = 0, ∀ *i* < *j* (decorrelation and order), (2.4)

with 〈 · 〉 and ẏ indicating temporal averaging and the derivative of *y*, respectively.
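These conditions are easy to check numerically. A minimal sketch (the candidate signal, the unit time step, and the discrete derivative are our illustrative choices, not from the text):

```python
import numpy as np

# Candidate output signal y_j(t) over T discrete steps.
T = 10000
t = np.linspace(0, 2 * np.pi, T)
y = np.sqrt(2) * np.sin(t)          # zero mean, unit variance over a full period

def slowness(y):
    """Delta(y) = <y_dot^2>, the temporal average of the squared derivative."""
    y_dot = np.diff(y)              # discrete-time derivative (unit time step)
    return np.mean(y_dot ** 2)

# Constraints 2.2 and 2.3: zero mean, unit variance.
assert abs(np.mean(y)) < 1e-6
assert abs(np.var(y) - 1.0) < 1e-3

# A faster signal of equal variance has a larger Delta value.
assert slowness(np.sqrt(2) * np.sin(5 * t)) > slowness(y)
```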

The problem is to find instantaneous functions *g*_{j} that generate different output signals varying as slowly as possible. Constraints 2.2 and 2.3 together avoid a trivial constant output solution. The decorrelation constraint 2.4 ensures that different functions *g*_{j} do not code for the same features.

### 2.3. Batch SFA.

If the functions *g*_{j} are linear combinations of a finite set of nonlinear functions **h**, then

*y*_{j}(*t*) = *g*_{j}(**x**(*t*)) = **w**_{j}^{T} **h**(**x**(*t*)) ≕ **w**_{j}^{T} **z**(*t*), (2.5)

and the SFA problem becomes finding weight vectors **w**_{j} that minimize the rate of change of the output variables,

Δ(*y*_{j}) = 〈ẏ_{j}^{2}〉 = **w**_{j}^{T} 〈**ż** **ż**^{T}〉 **w**_{j}, (2.6)

subject to the constraints 2.2 to 2.4. The slow feature learning problem has become linear on the derivative signal **ż**(*t*).

If the functions of **h** are chosen such that **z** has unit covariance matrix and zero mean, the three constraints will be fulfilled if and only if the weight vectors **w**_{j} are orthonormal. Equation 2.6 is minimized, and the orthonormality constraint satisfied, by the set of *J* normed eigenvectors of 〈**ż** **ż**^{T}〉 with the *J* smallest eigenvalues (for any *J* ⩽ *I*).

The BSFA technique implements this solution by using batch principal component analysis (PCA; Jolliffe, 1986) twice. Referring back to equation 2.6, to select **h** appropriately, a well-known process called whitening (or sphering) is used to map **x** to a **z** with zero mean and identity covariance matrix, thus decorrelating the signal components and scaling them so that there is unit variance along each principal component (PC) direction. Whitening serves as a bandwidth normalization, so that slowness can truly be measured (slower change will not simply be due to a low-variance direction). Whitening requires the PCs of the input signal (PCA 1). The orthonormal basis that minimizes the rate of output change consists of the minor components (the principal components with smallest eigenvalues) in the derivative space. So another PCA (PCA 2) on **ż** yields the slow features (eigenvectors) and their order (via eigenvalues).
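This two-PCA batch solution can be sketched in a few lines for the linear case (a toy illustration; the signal and all variable names are ours, not from the text):

```python
import numpy as np

def batch_sfa(X, J):
    """Linear batch SFA: whiten X (PCA 1), then take the J minor
    components of the derivative covariance (PCA 2).
    X: (T, I) data matrix, rows are observations."""
    Xc = X - X.mean(axis=0)
    # PCA 1: whitening matrix, so Z = Xc @ S has identity covariance.
    lam, V = np.linalg.eigh(np.cov(Xc.T))
    S = V @ np.diag(1.0 / np.sqrt(lam))
    Z = Xc @ S
    # PCA 2: eigenvectors of the derivative covariance; the smallest
    # eigenvalues correspond to the slowest features.
    Zdot = np.diff(Z, axis=0)
    lam2, W = np.linalg.eigh(np.cov(Zdot.T))   # eigh: ascending eigenvalues
    return S @ W[:, :J]                        # features in input coordinates

# Toy signal: a slow sine hidden among faster components.
T = 2000
t = np.linspace(0, 4 * np.pi, T)
X = np.column_stack([np.sin(t) + 0.1 * np.sin(37 * t), np.sin(29 * t)])
W = batch_sfa(X, 1)
y = (X - X.mean(axis=0)) @ W
# The slowest feature output correlates strongly with the hidden sin(t).
c = abs(np.corrcoef(y[:, 0], np.sin(t))[0, 1])
assert c > 0.95
```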

## 3. Incremental SFA

Like BSFA, IncSFA employs the eigenvector tactic but uses incremental algorithms for the two required PCAs. Therefore, IncSFA can update existing slow feature estimates on any amount of new data, even a single data point **x**(*t*).

To replace PCA 1, IncSFA needs to incrementally whiten the input **x**. We use the state-of-the-art candid covariance-free incremental (CCI) PCA (Weng, Zhang, & Hwang, 2003). CCIPCA incrementally updates both the eigenvectors and eigenvalues necessary for whitening and does not keep an estimate of the covariance matrix. It has been proven to converge to the true PCs (Zhang & Weng, 2001). CCIPCA is optionally used to reduce dimensionality at this intermediate stage by computing only the *K* highest-order eigenvectors.

Except for low-dimensional derivative signals **ż**, CCIPCA cannot replace PCA 2: it would take a long time to converge to the slow features, since they correspond to the least significant components. Minor components analysis (MCA; Oja, 1992) incrementally extracts principal components but with a reversed preference: it finds it easiest to extract the components with the smallest eigenvalues. We use a modified version of Peng's low-complexity MCA updating rule (Peng, Yi, & Luo, 2007); Peng proved its convergence even for constant learning rates, which is good for open-ended learning. MCA with sequential addition (Chen, Amari, & Murata, 2001; Peng & Yi, 2006) extracts multiple slow features in parallel. In IncSFA, this method is modified to be covariance free.

In the algorithm, **W** = (**w**_{1}, …, **w**_{J}) is the matrix of existing slow feature vector estimates for *J* slow features, where each feature is a column vector in the matrix; **V** = (**v**_{1}, …, **v**_{K}) is the matrix of *K* principal component vector estimates used to construct the whitening matrix and for dimensionality reduction; **u**(*t*) is the input observation; and θ contains parameters for setting learning rates, which we discuss later. In general, *K* < *I* and *J* < *K*.

Next, we walk through the components of the algorithm.

### 3.1. Principal Components for Whitening.

The set of eigenvectors **V*** = (**v***_{1}, …, **v***_{K}) is orthonormal and ordered such that λ*_{1} ⩾ λ*_{2} ⩾ ⋅ ⋅ ⋅ ⩾ λ*_{K}.

The whitening matrix is generated by multiplying the matrix of unit-length principal component eigenvectors **V*** by the diagonal matrix **D***, where component *D**_{ii} = 1/√λ*_{i}. After whitening using **z**(*t*) = **D*** **V***^{T} **u**(*t*), the data will be normalized in scale and decorrelated, as seen by the fact that the covariance matrix will be the identity matrix: E[**zz**^{T}] = **I**.
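A sketch of this whitening construction (with a random correlated data set, and an exact eigendecomposition standing in for the CCIPCA estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data standing in for the centered input stream u(t).
U = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 4))
U -= U.mean(axis=0)

# Stand-in for CCIPCA output: eigenvectors and eigenvalues of the input.
lam, V = np.linalg.eigh(np.cov(U.T))
order = np.argsort(lam)[::-1]            # descending: lam*_1 >= ... >= lam*_K
lam, V = lam[order], V[:, order]

D = np.diag(1.0 / np.sqrt(lam))          # D*_ii = 1 / sqrt(lam*_i)
Z = U @ V @ D                            # z(t) = D* V*^T u(t), applied row-wise

# Whitened data has identity covariance: E[z z^T] = I.
assert np.allclose(np.cov(Z.T), np.eye(4), atol=1e-8)
```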

### 3.2. CCIPCA Updating.

For input **u**_{i}, the first PC is the expectation of the normalized response-weighted inputs (equation 3.2). Equation 3.2 can be rewritten as an incremental update in which **v***_{i} is estimated by **v**_{i}(*t*):

**v**_{i}(*t*) = (1 − η^{PCA}) **v**_{i}(*t* − 1) + η^{PCA} **u**_{i}(*t*) **u**_{i}^{T}(*t*) **v**_{i}(*t* − 1)/‖**v**_{i}(*t* − 1)‖, (3.3)

where 0 < η^{PCA} < 1 is the learning rate. In other words, both the eigenvector and eigenvalue of the first PC of **u**_{i} can be found through the sample mean-type updating in equation 3.3. The estimate of the eigenvalue is given by λ_{i} = ‖**v**_{i}(*t*)‖. Using both a learning rate η^{PCA} and a retention rate (1 − η^{PCA}) automatically makes this algorithm invariant to the magnitude of the input vectors.
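A minimal sketch of the update in equation 3.3 for a single component (the constant learning rate and the toy two-dimensional stream are our simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data with a dominant direction: orthonormal axes, stds 3.0 and 0.5.
A = np.array([[3.0, 1.0], [-1.0, 3.0]]) / np.sqrt(10)   # columns: directions
samples = (rng.standard_normal((20000, 2)) * np.array([3.0, 0.5])) @ A.T

eta = 0.005                     # constant eta_PCA, 0 < eta < 1
v = np.array([1.0, 0.0])        # initial estimate, not orthogonal to true PC
for u in samples:
    # Eq. 3.3: retention plus response-weighted input.
    v = (1 - eta) * v + eta * u * (u @ v) / np.linalg.norm(v)

direction = v / np.linalg.norm(v)
true_pc = A[:, 0]
assert abs(direction @ true_pc) > 0.98    # converged to the first PC direction
# The eigenvalue estimate is the vector's length: here lam_1 ~ 3^2 = 9.
assert 6.0 < np.linalg.norm(v) < 12.0
```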

### 3.3. Lower-Order Principal Components.

Any component *i*>1 must not only satisfy equation 3.2 but must also be orthogonal to the higher-order components. The residual method (Kreyszig, 1988; Sanger, 1989) generates observations in a complementary space so that lower-order eigenvectors can be found by the update rule of equation 3.4.

Denote **u**_{i}(*t*) as the observation for component *i*. When *i* = 1, **u**_{1}(*t*) = **u**(*t*). When *i* > 1, **u**_{i} is a residual vector, which has the “energy” of **u**(*t*) from the higher-order components removed. Solving for the first PC in this residual space solves for the *i*th component overall. To create a residual vector, **u**_{i} is projected onto **v**_{i} to get the energy of **u**_{i} that **v**_{i} is responsible for. Then the energy-weighted **v**_{i} is subtracted from **u**_{i} to obtain **u**_{i+1}:

**u**_{i+1}(*t*) = **u**_{i}(*t*) − [**u**_{i}^{T}(*t*) **v**_{i}(*t*)/‖**v**_{i}(*t*)‖] **v**_{i}(*t*)/‖**v**_{i}(*t*)‖. (3.5)
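The residual computation can be written directly from this description (toy values; the function name is ours):

```python
import numpy as np

def residual(u, v):
    """Remove from u the energy that the unit direction of v accounts for:
    u_{i+1} = u_i - (u_i . v_hat) v_hat, with v_hat = v / ||v||."""
    v_hat = v / np.linalg.norm(v)
    return u - (u @ v_hat) * v_hat

u = np.array([2.0, 1.0, -1.0])
v = np.array([3.0, 0.0, 0.0])     # toy estimate of the first PC
u2 = residual(u, v)

# The residual is orthogonal to v, so component 2 is learned in the
# complementary space.
assert abs(u2 @ v) < 1e-12
assert np.allclose(u2, [0.0, 1.0, -1.0])
```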

### 3.4. MCA Updating.

After using CCIPCA components to generate an approximately whitened signal **z**, the derivative is approximated by, for example, **ż**(*t*) ≈ **z**(*t*) − **z**(*t* − 1). In this derivative space, the minor components of 〈**ż** **ż**^{T}〉 are the slow features.

Peng's MCA rule extracts only the first minor component. To extract multiple slow features in parallel, sequential addition shifts each already extracted direction upward in the eigenvalue spectrum, defining a modified matrix **C**_{i}, ∀ *i* > 1, as follows:

**C**_{i}(*t*) = **C**(*t*) + γ(*t*) Σ_{j=1}^{i−1} **w**_{j}(*t*) **w**_{j}^{T}(*t*), (3.7)

where **C** = 〈**ż** **ż**^{T}〉. Note that equation 3.7 introduces parameter γ, which must be larger than the largest eigenvalue of 〈**ż** **ż**^{T}〉. To set γ automatically, we compute the greatest eigenvalue of the derivative signal through another CCIPCA rule that updates only the first PC. Then let γ = λ_{1}(*t*) + ε for small ε.

### 3.5. Covariance-Free MCA.

To make the updates covariance free, the matrix **C**_{i} is never formed; its instantaneous rank-one estimate **ż**(*t*) **ż**^{T}(*t*) is used instead, and each feature is normalized after every update: **w**_{i} ← **w**_{i}/‖**w**_{i}‖ (D. Peng, personal communication, August 2010). Normalization at least ensures nondivergence (see section 5.1). If we normalize, equation 3.8 can be rewritten in an even simpler form with retention rate and learning rate,

**w**_{i}(*t* + 1) = (1 − η^{MCA}) **w**_{i}(*t*) − η^{MCA} [(**ż**^{T}(*t*) **w**_{i}(*t*)) **ż**(*t*) + γ Σ_{j=1}^{i−1} (**w**_{j}^{T}(*t*) **w**_{i}(*t*)) **w**_{j}(*t*)].
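A sketch of the resulting covariance-free updating for all *J* features, under our simplifying assumptions (constant learning rate, fixed γ, toy whitened-derivative data):

```python
import numpy as np

def cimca_step(W, z_dot, eta, gamma):
    """One covariance-free MCA update for all J slow features.
    W: (K, J) current estimates (columns); z_dot: one whitened derivative."""
    K, J = W.shape
    W_new = np.empty_like(W)
    for i in range(J):
        w = W[:, i]
        # Sequential addition: penalize overlap with higher-order features.
        seq = sum((W[:, j] @ w) * W[:, j] for j in range(i))
        w = (1 - eta) * w - eta * ((z_dot @ w) * z_dot + gamma * seq)
        W_new[:, i] = w / np.linalg.norm(w)   # normalize: nondivergence
    return W_new

# Toy whitened-derivative stream: low variance along the "slow" axis [0, 1].
rng = np.random.default_rng(2)
Zdot = rng.standard_normal((30000, 2)) * np.array([2.0, 0.2])
W = np.linalg.qr(rng.standard_normal((2, 2)))[0]   # random orthonormal init
for z_dot in Zdot:
    W = cimca_step(W, z_dot, eta=0.01, gamma=5.0)  # gamma > largest eigenvalue

# The first slow feature aligns with the minor-component direction [0, 1],
# and the features stay nearly decorrelated.
assert abs(W[:, 0] @ np.array([0.0, 1.0])) > 0.98
assert abs(W[:, 0] @ W[:, 1]) < 0.2
```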

#### 3.5.1. Intermediate Dimensionality Reduction.

Since often only a relatively small number of principal components of **x** are needed to explain most of the variance in the data, the other components do not even have to be estimated. With IncSFA, dimensionality reduction can be done during PC estimation, and no time needs to be wasted computing insignificant lower-order PCs. The whitening output dimension *K* must be set by hand.^{1} Setting it requires some prior problem knowledge: the insignificant lower-order PCs may contain data corresponding to the slowest-varying signal in the input. It would be unwise to remove them in this case, since discarding these might eliminate an important slow feature.

#### 3.5.2. On Complexity.

From the algorithm, it can be seen that IncSFA complexity with respect to input dimension is *O*(*I*).

With respect to the *K* eigenvectors after the CCIPCA step and *J* slow feature eigenvectors, every IncSFA update is of complexity *O*(*K* + *J*^{2}). The quadratic complexity on *J* is due to the Gram-Schmidt procedure in the CIMCA algorithm. CCIPCA uses the residual method, which has linear complexity. However, we expect *J* < *K* and *K* < *I*, so the quadratic complexity on *J* should not hurt much.

#### 3.5.3. Parameters.

There are just a handful. One must choose *K* and *J* and set a learning rate schedule. We provide some discussion on learning rates in section 5.2.

## 4. Experiments and Results

Experiments were done either using Python, based on the MDP toolbox (Zito, Wilbert, & Berkes, 2008), or Matlab.^{2}

### 4.1. Proof of Concept.

The input signal, a nonlinear toy problem introduced by Wiskott and Sejnowski (2002), is **x**(*t*) = [sin(*t*) + cos^{2}(11*t*), cos(11*t*)]^{T}, *t* ∈ [0, 2π]. Both input components vary quickly over time (see Figure 2a). The slowest feature hidden in the signal is *y*_{1}(*t*) = *x*_{1}(*t*) − *x*_{2}^{2}(*t*) = sin(*t*). The second slowest feature is a higher harmonic of the driving force.

Each epoch contains 2000 discrete data points, over the entire range of *t*, that are used for learning. A quadratic input expansion is done. A learning rate of η^{MCA} = 0.08 is used.
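The quadratic expansion used here maps the input to all monomials of degree at most two. A minimal helper (ours):

```python
import numpy as np

def quadratic_expand(x):
    """Map x = [x_1, ..., x_I] to [x_1, ..., x_I, x_i * x_j for i <= j],
    giving I + I(I + 1)/2 expanded dimensions."""
    x = np.asarray(x)
    cross = [x[i] * x[j] for i in range(len(x)) for j in range(i, len(x))]
    return np.concatenate([x, cross])

# I = 2 expands to 2 + 3 = 5 dimensions: x1, x2, x1^2, x1*x2, x2^2.
out = quadratic_expand([2.0, 3.0])
assert out.tolist() == [2.0, 3.0, 4.0, 6.0, 9.0]
```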

Both BSFA and IncSFA extract the correct features. Figure 2b shows the root mean square error (RMSE) of three IncSFA feature outputs compared to the corresponding BSFA outputs, over multiple epochs of training, showing that the IncSFA features converge to the correct ones. Figures 2c and 2g show feature outputs of batch SFA and (to the right) IncSFA outputs at 2, 5, and 10 epochs. Figures 2g to 2j show this comparison for the second feature. This basic result shows that it is indeed possible to extract multiple slow features in an online way without storing covariance matrices.

### 4.2. Feature Adaptation to a Changing Environment.

The purpose of this experiment is to illustrate how IncSFA's features adapt to an unpredicted sudden shift in the input process. The input is the same signal as in experiment 1, but broken into two partitions. At epoch 60, the two input lines *x*_{1} and *x*_{2} are switched, so that the *x*_{1} signal suddenly carries what *x*_{2} used to, and vice versa. IncSFA first learns the slow features in the first partition, then adapts to learn the slow features in the second partition.

Here, the signal is sampled 500 times per epoch. The CCIPCA learning rate parameters, also used to set the learning rate of the input average **x̄**(*t*), were set to *t*_{1} = 20, *t*_{2} = 200, *c* = 4, *r* = 5000 (see section 5.2). The MCA learning rate is a constant η^{MCA} = 0.01.

To measure feature accuracy, we use the direction cosine between an estimated feature **w**(*t*) and the true (unit length) feature **w***:

DirCos(*t*) = |**w**^{T}(*t*) **w***| / (‖**w**(*t*)‖ ‖**w***‖).

The direction cosine equals one when the directions align (the feature is correct) and zero when they are orthogonal.
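The measure is straightforward to implement (the absolute value makes it invariant to the sign ambiguity of an eigenvector):

```python
import numpy as np

def direction_cosine(w, w_true):
    """|w . w*| / (||w|| ||w*||): 1 when aligned, 0 when orthogonal."""
    return abs(w @ w_true) / (np.linalg.norm(w) * np.linalg.norm(w_true))

w_true = np.array([1.0, 0.0])
assert direction_cosine(np.array([-3.0, 0.0]), w_true) == 1.0   # aligned
assert direction_cosine(np.array([0.0, 2.0]), w_true) == 0.0    # orthogonal
```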

BSFA results are shown in Figure 4. The first batch feature somewhat catches the metadynamics and could be used to roughly sense the signal switch. However, the dynamics within each partition are not extracted. The BSFA result might be improved by generating embedding-vector time series (Wiskott, 2003) and increasing the nonlinear expansion. But due to the long duration of the signals and the unpredicted nature of the signal switch, time embedding with a fixed delay might not be able to recover the dynamics appreciably.

### 4.3. Recovery from Outliers.

Now we show the effect of a single extreme outlier on both BSFA and IncSFA. Again, the learning rate setup and basic signal from the previous experiments are used, with 500 samples per epoch, over 150 epochs. A single outlier point is inserted at time 100 (only in the first epoch!): *x*_{1}(100) = *x*_{2}(100) = 2000.

Figure 5 shows the first output signal of BSFA and IncSFA. The one outlier point at time 100 (out of 75,000) is enough to corrupt the first feature of BSFA, whereas IncSFA recovers. It is possible to include clipping (Franzius et al., 2007) in BSFA, so that the effect of outliers whose variance statistics differ from the signal's can be overcome.

Outliers generated from another signal source lying within the variance of the main signal can affect the BSFA output in a different way. We refer to a real-world experiment (Kompella, Pape, Masci, Frank, & Schmidhuber, 2011) using AutoIncSFA (where the input to IncSFA is the output of a bottleneck layer of an autoencoder neural net, used for image compression) on an image sequence in which a person moves back and forth in front of a stationary camera. At one point in the training sequence, a door in the background is opened. The BSFA hierarchical network's first slow feature became sensitive to this event, yet the AutoIncSFA network's first slow feature encodes the relative distance of the moving interactor.

### 4.4. High-Dimensional Video with Linear IncSFA.

Using IncSFA, SFA can be utilized in some high-dimensional video processing applications without using deep receptive-field-based networks. CCIPCA provides an intermediate dimensionality reduction, which, when low enough compared to the input dimension, can greatly reduce the computational and space complexities as well as the search space for the slow features via MCA.

As a first experiment to show this, we extract slow features from the visual stream of a rotating agent in a square room with four complex-textured walls (see Figure 6a). Each image has dimension 41 × 41 × 3.

In each episode,^{3} starting from a different orientation, the agent rotates slowly through a full 360 degrees (4-degree shifts from one image to the next). At all times, a slight amount of gaussian noise is added to the image (σ = 8).

Each 5043-dimensional image is fed into a linear IncSFA directly. Only the 40 most significant principal components are computed by CCIPCA, using learning rate parameters *t*_{1} = 20, *t*_{2} = 200, *c* = 4, *r* = 5000 (see section 5.2). Computation of the covariance matrix and its full eigendecomposition (over 5000 eigenvectors and eigenvalues) is therefore avoided. On the 40-dimensional whitened difference signal, only the first five slow features are computed via CIMCA.

Five hundred epochs through the data took approximately 15 minutes using Matlab on a machine with an Intel i3 CPU and 4 GB RAM. This is a frame rate of about 50 fps.

The results of projecting the (noise-free) data onto the first three slow features are shown in Figure 6b. A single linear IncSFA has incrementally compressed this high-dimensional noisy sequence to a nearly unambiguous compact form, learning to ignore the details at the pixel level and attend to the true cyclical nature underlying the image sequence. We say “nearly” since a few sub-sequences have somewhat ambiguous encodings, probably because certain images associated with slightly different angles are very similar.

### 4.5. iCub Experiment.

Again, we use a single layer of IncSFA on high-dimensional vision sequences. Two plastic cups are placed in the iCub robot's field of view. The robot performs motor babbling in one joint of its right arm, using a movement paradigm derived from Franzius et al. (2007). During the course of babbling, it happens to topple both cups, in one of two possible orders. The episode ends a few frames after it has knocked both down. A new episode begins with the cups upright again and the arm in the beginning position. Fifty separate episodes were recorded and the images used as training data.

IncSFA updates from each 80 × 60 (grayscale) image. Only the 20 most significant principal components are computed by CCIPCA, using learning rate parameters *t*_{1} = 20, *t*_{2} = 200, *c* = 2, *r* = 10000 (see section 5.2). Only the first five slow features are computed via CIMCA, with learning rate 0.001. The MCA vectors are normalized after each update during the first 10 episodes but not thereafter (for faster convergence). The algorithm runs for 400 episodes, each randomly selected from the 50 recorded ones. We performed 25 independent runs.

Results are shown in Figure 7. The slowness of the feature outputs was measured on three “testing” episodes, after each episode of training. The upper-left plot shows that all five features get slower as they train over the 400 episodes. Figure 8 shows the average mutual direction cosine between nonidentical pairs of slow features, and we can see that the features quickly become nearly decorrelated.

After training completes, we embed the images in a lower dimension using the learned features. The embedding of trajectories of 20 different episodes is shown with respect to the first two PCs as well as the first two slow features. Since the cups being toppled or upright are the slow events in the scene, IncSFA's encoding is keyed to the objects' state (toppled or upright). PCA does not find such an encoding, being much more sensitive to the arm. Such a clear object-specific low-dimensional encoding, invariant to the robot's arm position, is useful, greatly facilitating training of a subsequent regressor or reinforcement learner. (A video of the experimental result can be found online at http://www.idsia.ch/~luciw/IncSFAArm/IncSFAArm.html.)

### 4.6. Hierarchical IncSFA.

Here, for completeness, we test IncSFA within a deep receptive-field-based network at all layers, although the utility of this approach is unclear.

Deep networks composed of multiple stacked BSFA nodes, each sensitive to only a small part of the input (i.e., receptive fields), are typically used when processing high-dimensional images with SFA. The slow features will be linear combinations of the input-space components. Since there is no guarantee that the useful information is linear in the original sensory space, an expanded space is often used: a quadratic expansion, for example, adds all pairwise products of input components, and a cubic expansion adds all triples. But the degree of expansion required to construct a space in which the interesting information is some linear combination may increase the dimension intractably. Such cases have been handled with multilayer, receptive-field-based networks (Wiskott & Sejnowski, 2002; Franzius et al., 2007), which reduce the complexity of any single SFA module by partitioning each layer spatially into receptive fields, with a low-order (e.g., quadratic) expansion within each receptive field. A succession of low-order expansions over multiple layers leads to an overall expansion of high order.

Hierarchical networks introduce new parameters (e.g., receptive field size, number of layers) that can be difficult to tune. We have tried to show in the last two experiments that there is another applicable tactic: applying IncSFA monolithically to the (possibly even expanded) high-dimensional input, extracting *K* ≪ *I* principal components with CCIPCA and *J* slow features. But we can also use IncSFA within a deep network architecture.

Figure 9 shows an example deep network, motivated by the human visual system and based on the one specified by Franzius et al. (2007). The network is made up of a converging hierarchy of layers of IncSFA nodes, with overlapping rectangular receptive fields. Each IncSFA node finds the slowest output features from its input within the subspace of quadratically expanded inputs.

Input images come from a high-dimensional video stream generated by the iCub simulator (Tikhanoff et al., 2008), an OpenGL-based simulator built specifically for the iCub robot. Our experiment mimics the robot observing a moving interactor agent, which in the simulation takes the form of a flat rectangular board moving back and forth in depth over the range [1, 3] meters in front of the robot, using a movement paradigm based on that of Franzius et al. (2007). Figure 10a shows the experimental setup in the iCub simulator. Figure 10b shows a sample image from the data set. Twenty thousand monocular images are captured from the robot's left eye and downsampled to 83 × 100 pixels (an input dimension of 8300).

A four-layer IncSFA network is used to encode the images. Each SFA node operates on a spatial receptive field of the layer below. The first layer uses 15 × 19 nodes, each with a 10 × 10 image patch receptive field and a 5-pixel overlap; each node on this layer develops 10 slow features. The second layer uses 4 × 5 nodes, each having a 5 × 5 receptive field and developing 5 slow features. The third layer uses two nodes—one sensitive to the top half, the other to the bottom half (5 slow features each). The fourth layer uses a single node and a single slow feature. The network is trained layer-wise from bottom to top, with the lower layers frozen once a new layer begins its training. The CCIPCA output of all nodes is clipped to [− 5, 5] to avoid any outliers that may arise due to close-to-zero eigenvalues in some of the receptive fields that contain unchanging stimuli. Each IncSFA node is trained individually, that is, there is no weight sharing among nodes.

For comparison, a BSFA hierarchical network was also trained on the data. Figure 10 shows BSFA and IncSFA outputs. The expected output is of the form of a sinusoid extending over the range of board positions. IncSFA gives a slightly noisy output, probably due to the constant dimensionality-reduction value for all units in each layer of the network, selected to maintain a consistent input structure for the subsequent layer. Hence, some units with eigenvectors corresponding to very small eigenvalues emerge in the first stage, with receptive fields observing comparatively few input changes, thus slightly corrupting the whitening result and adding small fluctuations to the overall result.

Finally, we evaluate how well the IncSFA feature codes for distance. A supervised quadratic regressor is trained with ground-truth labels on 20% of the data set and tested on the other 80%, measuring the quality of the features for a downstream classifier or reinforcement learner. The RMSE was 0.043 meters.

## 5. Supplementary Topics

### 5.1. On Nondivergence and Convergence.

For CCIPCA, if the standard conditions on the learning rate hold (Papoulis, Pillai, & Unnikrishna, 1965), including its convergence to zero, the first-stage components will converge to the true PCs, leading to a "nearly correct" whitening matrix in reasonable time. So if the input **x** is stationary, the slow feature estimates are likely to become quite close to the true slow features in a reasonable number of updates.

In open-ended learning, convergence is usually not desired. Yet when a learning rate that is always nonzero is used, the stability of the algorithm is reduced. This corresponds to the well-known stability-plasticity dilemma (Grossberg, 1980).

For the MCA stage, Peng et al. (2007) give convergence conditions on the learning rate and the initial estimate (equation 5.1), where **w**(0) is the initial feature estimate, **w*** the true eigenvector associated with the smallest eigenvalue, and λ*_{1} the largest eigenvalue. In other words, the learning rate must not be too large, and the initial estimate must not be orthogonal to the true component.

It is clear that if a whitened signal **z** is drawn from a stationary distribution, the MCA convergence proof (Peng et al., 2007) applies. But typically the whitening matrix is being learned simultaneously. In this early stage, while the CCIPCA vectors are learning, care must be taken to ensure that the slow feature estimates will not diverge.

Peng et al.'s proof guarantees that, for any **w**(0) within a certain bounded set, **w**(*t*) (∀*t* ⩾ 0) will remain in that set throughout the dynamics of the MCA updating. Thus, ‖**w**‖ must be prevented from getting too large until the whitening matrix is close to accurate. With respect to lower-order slow features, there is an additional dependence on the sequential addition technique, parameterized by γ(*t*) = λ_{1}(*t*) + ε. This γ(*t*) also needs time to estimate a value close to the first eigenvalue λ_{1}. Before these estimates become reasonably accurate, the input can knock the vector out of this set.

In IncSFA, **w** is normalized after each update. If ‖**w**(0)‖ = 1, then any learning rate η_{mca} ⩽ 0.5 ensures nondivergence.

Even if **w** remains in this set, the additional constraint **w**^{T}(0)**w*** ≠ 0 is needed for convergence. But this is an easy condition to meet, as it is unlikely that any **w**(*t*) will be exactly orthogonal to the true feature. In practice, it may be advisable to add a small amount of noise to the MCA update, but we did not find this to be necessary.

### 5.2. Learning Rate Scheduling.

#### 5.2.1. Pseudo-Optimality.

For CCIPCA, the learning rate schedule is based around the optimal rate 1/*t*. If we use 1/*t*, equation 3.4 will be the most efficient estimator of the principal component; the most efficient estimator requires, on average, the smallest number of samples for learning (among all unbiased estimators). For several common distribution types (e.g., gaussian), the sample mean is the maximum likelihood estimator of the population mean, and equation 3.4 reformulates the eigenvector estimation problem as a mean estimation problem. Therefore, equation 3.4 with learning rate 1/*t* has a spatiotemporal optimality: at any *t*, the estimate is expected to be the best among all unbiased estimators.

Learning rate 1/*t* is spatiotemporally optimal only if every sample from *t* = 1, 2, …, ∞ is drawn from the same distribution, which will not be the case for the lower-order components and, in general, for autonomous agents. We use an amnesic averaging technique, in which the influence of old samples on the current estimates diminishes over time. We used the three-sectioned amnesic averaging function μ shown in the algorithm, with stages defined by points *t*_{1} and *t*_{2}. In the first stage, the learning rate is 1/*t*. In the second, it is scaled by *c* to speed up learning of lower-order components. In the third, it changes with *t*, eventually converging to 1/*r*.
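Concretely, the three-sectioned amnesic parameter and the resulting CCIPCA rate can be sketched as follows. The piecewise form follows Weng et al.'s amnesic averaging; the particular stage boundaries and constants here are illustrative assumptions, not the paper's settings:

```python
def amnesic(t, t1=20, t2=200, c=3.0, r=1000.0):
    """Three-sectioned amnesic parameter mu(t)."""
    if t <= t1:
        return 0.0                       # stage 1: plain averaging
    if t <= t2:
        return c * (t - t1) / (t2 - t1)  # stage 2: ramp up to c
    return c + (t - t2) / r              # stage 3: rate tends to 1/r

def ccipca_rate(t, **kw):
    """CCIPCA learning rate (1 + mu(t)) / t; retention is (t - 1 - mu(t)) / t."""
    return (1.0 + amnesic(t, **kw)) / t
```

In stage 1 this reduces to the optimal 1/*t*; as *t* grows, the rate approaches the constant 1/*r*, which is what keeps the estimator plastic.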

This amnesic average remains an unbiased estimator of the true PCs, and it allows components to adapt to changing input statistics. But this plasticity introduces an expected error into the IncSFA whitening process that will not vanish with more samples (Weng & Zhang, 2006). Our results show that this is not problematic for many applications, though it typically leads to slight oscillatory behavior around the true features.

#### 5.2.2. Preventing Divergence via the MCA Learning Rate.

To prevent divergence while CCIPCA is still learning, we used a slowly rising learning rate for MCA, starting from low η_{l} at *t* = 0 and rising to high η_{h} at *t* = *T*, as shown in algorithm 5. Ideally, *T* is a point in time when whitening has stabilized.
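One simple realization of such a schedule is a linear ramp; the ramp shape below is an assumption for illustration, since algorithm 5's exact schedule is not reproduced here:

```python
def mca_learning_rate(t, eta_l=1e-3, eta_h=0.1, T=5000):
    """Rise linearly from eta_l at t = 0 to eta_h at t = T, then stay constant.
    eta_l, eta_h, and T are illustrative values."""
    if t >= T:
        return eta_h
    return eta_l + (eta_h - eta_l) * (t / T)
```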

An upper bound η_{b} on the permissible η_{h} is related to the first condition in equation 5.1, where λ*_{1} is the greatest eigenvalue of the signal. Constant values close to but below this bound achieve faster convergence.

### 5.3. Other Methods of Neural Updating in PC and MC Extraction.

Neural layers that compute incremental PCA (IPCA) and MCA build on the work of Amari (1977) and Oja (1982), who showed that a linear neural unit using Hebbian updating could incrementally compute the first principal component of a data set.^{4} Many IPCA algorithms emerged after that. Some well-known ones are Oja and Karhunen's stochastic gradient ascent (SGA; Oja, 1985), Oja's subspace algorithm (Oja, 1989), Sanger's generalized Hebbian algorithm (GHA; Sanger, 1989), the weighted subspace algorithm (Oja, 1992), and CCIPCA. (For comparisons, see Oja, 1992; Hyvärinen, Karhunen, & Oja, 2001; Weng et al., 2003.) CCIPCA (Weng et al., 2003) modified GHA to be “candid,” meaning invariant to input vector magnitude; this made learning rate tuning more intuitive, which increased the practicality of the algorithm for high-dimensional inputs such as in appearance-based computer vision. Another recent IPCA algorithm adds GSO to CCIPCA (Park & Choi, 2008) so that the lower-order components converge more quickly, but at higher complexity.

We created IncSFA with experiments on high-dimensional data in mind. Thus, we chose CCIPCA for IncSFA for the following reasons:

- *Covariance free.* Due to high dimensionality, it is important to keep space complexity down.
- *Avoids Gram-Schmidt orthonormalization (GSO) for enforcing orthogonality.* Instead, we prefer the residual method (Kreyszig, 1988). GSO gives more accurate results for lower-order components, but at quadratic (in the number of components) complexity. The residual method is local (linear complexity) but can be less accurate. Again, due to high dimensionality, we felt it important to avoid quadratic complexity. Further, our experimental results showed that effective slow features could emerge even when the whitening matrix was not perfect.
- *Both eigenvectors and eigenvalues needed.* We need a method that converges to both eigenvectors and eigenvalues, since whitening requires both.
- *Intuitive learning rate tuning.* It is not practical to spend a lot of time tuning learning rates for every different type or set of data.
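To illustrate the covariance-free, GSO-free design, here is a minimal CCIPCA sketch using residual deflation between components. It uses a plain 1/*t*-style rate and omits amnesic averaging, so it is a simplification of equation 3.4 rather than the paper's full algorithm:

```python
import numpy as np

def ccipca(X, k=2):
    """Minimal CCIPCA sketch: candid Hebbian-style updates plus residual
    deflation (no GSO). Row i of V estimates eigenvalue_i * eigenvector_i,
    so the norm of a row estimates its eigenvalue."""
    n, d = X.shape
    V = np.zeros((k, d))
    for t in range(n):
        u = X[t].astype(float).copy()
        for i in range(k):
            if t < i:
                break                      # component i not initialized yet
            if t == i:
                V[i] = u                   # seed with the current residual
            else:
                lr = 1.0 / (t + 1)         # plain 1/t-style rate (no amnesia)
                V[i] = (1 - lr) * V[i] + lr * (u @ V[i]) / np.linalg.norm(V[i]) * u
            vn = V[i] / np.linalg.norm(V[i])
            u = u - (u @ vn) * vn          # residual fed to the next component
    return V
```

Note that the update is invariant to rescaling of **v** (the “candid” property): the Hebbian term is normalized by ‖**v**‖, so no separate eigenvalue bookkeeping is required.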

As for MCA, Xu, Oja, and Suen (1992) were the first to show that a linear neural unit equipped with anti-Hebbian learning could extract minor components. Oja (1992) modified SGA's updating method to an anti-Hebbian variant and showed how it could converge to the MC subspace. Studying the nature of the duality between PC and MC subspaces (Wang & Karhunen, 1996; Chen, Amari, & Lin, 1998), Chen et al. (2001) introduced the sequential addition technique. This enabled linear networks to efficiently extract multiple MCs simultaneously. Building on previous MCA algorithms, Peng et al. (2007) derived the conditions and a learning rule for extracting MCs for a constant learning rate. Sequential addition was added to this rule so that multiple MCs could be extracted (Peng & Yi, 2006).

We use a modified version of Peng's MCA updating method, slightly altered to be covariance free and to use GSO (we simply call it CIMCA). Unlike simple MCA algorithms, Peng's MCA is a deterministic discrete time (DDT) method, which achieves convergence with a constant learning rate. The method has low computational complexity and is shown to work even for a singular or near-singular correlation matrix of the input. This makes it practical, and especially feasible for nonstationary data, where the correlation coefficient behaves like a random variable.

CIMCA gives us the actual minor components (slow features), not just the subspace they span. It allows a constant learning rate, which can be quite high, leading to a quick, reasonable estimate of the true components and making learning rate tuning more intuitive. Unlike at the IPCA stage, here GSO is useful and plausible, since we expect only a few features (minimizing the effect of the quadratic complexity) and we do not care about their magnitude.
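The anti-Hebbian character of minor-component extraction can be illustrated with a generic covariance-free sketch, with normalization after each step as in IncSFA. This is a simplified scheme in the spirit of CIMCA, not Peng's exact rule, and it omits the sequential-addition machinery for lower-order components:

```python
import numpy as np

def minor_component(Z, eta=0.02, seed=0):
    """Covariance-free anti-Hebbian sketch: w <- normalize(w - eta * y * z).
    On a zero-mean stream Z, w converges to the minor component."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[1])
    w /= np.linalg.norm(w)
    for z in Z:
        y = w @ z                   # postsynaptic response
        w = w - eta * y * z         # anti-Hebbian: co-activity weakens weights
        w /= np.linalg.norm(w)      # normalization prevents divergence
    return w
```

The normalization step is what keeps ‖**w**‖ bounded regardless of input magnitude, mirroring the nondivergence discussion in section 5.1.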

There are many different ways of combining an incremental PCA and an incremental MCA. We present our reasons for selecting the methods in IncSFA above. Our motivation should be kept in mind: we intend to apply IncSFA on real-world image sequences and on vision-based robotic platforms, aiming toward autonomous learning, which is open-ended and continuous.

### 5.4. Links to Biological Systems.

BSFA has been shown to derive slow features that operate like biological grid cells from quasi-natural image streams, which are recorded from the camera of a moving agent exploring an enclosure (Franzius et al., 2007). In rats, grid cells are found in entorhinal cortex (EC) (Hafting et al., 2005), which feeds into the hippocampus. Place cells and head-direction cells are found in rat hippocampus (O'Keefe & Dostrovsky, 1971; Taube et al., 1990), while spatial view cells are found in primate hippocampus (Rolls, 1999). Augmenting the BSFA network with an additional competitive learning (CL) layer derives units similar to place, head direction, and spatial view cells.

Although BSFA results exhibit the above biological link, it is not clear how the full SFA technique might be realized in the brain. IncSFA with its Hebbian and anti-Hebbian updating provides a more biologically plausible implementation of the full SFA algorithm.

#### 5.4.1. Hebbian Updating in CCIPCA.

In a basic Hebbian rule, Δ**v** = η *g* **u** (equation 5.6), **u** represents presynaptic (input) activity and *g* postsynaptic activity (a function of the similarity between the synaptic weights **v** and the input potentials **u**). The basic equation 5.6 requires additional care (e.g., normalization of **v**) to ensure stability during updating. To handle this in one step, a learning rate η and a retention rate 1 − η can be used, where 0 ⩽ η ⩽ 1. With this formulation, equation 3.4 is Hebbian: the postsynaptic activity is the normalized response, and the presynaptic activity is the input **u**_{i}.

#### 5.4.2. Anti-Hebbian Updating in CIMCA.

#### 5.4.3. Hebbian Learning on Filtered Output.

There is an alternative to using anti-Hebbian learning. Sprekeler, Michaelis, and Wiskott (2007) reformulated the slowness objective: instead of minimizing the variance of the time derivative of the output signal, they try to maximize the variance of the low-pass filtered output signal. They show analytically that the extraction of the single most slowly varying direction from prewhitened input can be implemented in a linear continuous model with spiking model neurons by means of a modified Hebbian learning rule with a specific learning window.

Hebbian learning between a temporally filtered output and the input is the basis of several other temporal-stability-based learning rules (Földiák, 1991; O'Reilly & Johnson, 1994; Wallis & Rolls, 1997). Links between these and slowness learning are provided by Sprekeler et al. (2007). Although Sprekeler's method extracts only the first feature, it might lead to an alternative route to a fully incremental SFA.
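A toy version of this filtered-output idea: Hebbian updates between the input and an exponentially low-pass-filtered output pick out the slow direction of a whitened two-dimensional stream. All constants here are illustrative, and this discrete sketch is not Sprekeler et al.'s spiking model:

```python
import numpy as np

def slow_direction(Z, eta=0.01, alpha=0.9, seed=0):
    """Hebbian rule on low-pass-filtered output: w <- normalize(w + eta * y_lp * z)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[1])
    w /= np.linalg.norm(w)
    y_lp = 0.0
    for z in Z:
        y = w @ z
        y_lp = alpha * y_lp + (1 - alpha) * y  # exponential low-pass filter
        w = w + eta * y_lp * z                 # Hebbian, but on filtered output
        w /= np.linalg.norm(w)
    return w
```

Because the filter passes the slow component nearly unchanged but attenuates the fast one, the Hebbian growth is strongest along the slow direction, so normalization leaves **w** aligned with it.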

### 5.5. Velocity Estimates of the Input Signal.

The velocity estimates (the derivative signal) in the original SFA technique are approximated via the backward difference **ẋ**(*t*) = **x**(*t*) − **x**(*t* − 1). This method behaves badly in the presence of input noise compared to other, computationally more expensive methods such as higher-order difference estimation, Cauchy's differentiation formula, or the Lanczos derivative (Groetsch, 1998). However, noise is usually not a severe problem, since it changes on a faster timescale than the slowest components and therefore does not show up in the higher-order slow features. We therefore opted for the same backward difference method in IncSFA to keep it computationally simple.
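The backward difference step itself is a one-line operation; a sketch:

```python
import numpy as np

def backward_difference(x):
    """Derivative signal dx(t) = x(t) - x(t-1); one fewer sample than the input.
    (The first sample of a stream has no derivative, cf. note 3.)"""
    x = np.asarray(x, dtype=float)
    return x[1:] - x[:-1]
```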

## 6. Conclusion

This letter describes the novel incremental slow feature analysis technique, which updates slow features incrementally without computing covariance matrices. IncSFA's main advantages are low computational and space complexities. For many instances, there is no need to use IncSFA instead of BSFA. But at higher dimensionalities, IncSFA becomes more and more appealing. For some problems with very high-dimensionality and limited memory, IncSFA could be the only option (e.g., an autonomous robot with limited onboard hardware, which could still learn slow features from its visual stream with IncSFA). Experiments showed how IncSFA enables an adaptive SFA and how it enables SFA to be applied to high-dimensional image streams without using multilayer receptive-field-based BSFA architectures. IncSFA's Hebbian and anti-Hebbian updates add biological plausibility to SFA itself. Our future work aims at applying IncSFA to developmental robotics.

## Acknowledgments

The experimental paradigm used for the distance-encoding high-dimensional video experiment was developed by V.R.K. under the supervision of Mathias Franzius at the Honda Research Institute Europe. We acknowledge Dr. Franzius's contributions in this regard. We thank Alexander Forster and Kail Frank for enabling the experiment with the iCub robot. We thank the anonymous reviewers, whose comments helped improve the letter. Marijn Stollenga and Sohrob Kazerounian also provided useful comments. This work was funded through the 7th framework program of the EU under grants 231722 (IM-Clever project) and 270247 (Neural-Dynamics project) and by Swiss National Science Foundation grant CRSIKO-122697 (Sinergia project).

## References


## Notes

^{1}

One might set *K* such that 95% of the estimated total data variance is kept, for example.

^{2}

Python code is available online at www.idsia.ch/~kompella/codes/incsfa.html. Matlab code is available at www.idsia.ch/~luciw/incsfa.html.

^{3}

IncSFA can be readily extended to episodic tasks, with a minor modification: the derivative signal, which is computed as a difference over a single time step, is not computed for the starting sample of each episode. The first data point in each episode is used for updating the PCs but not the slow feature vectors.

^{4}

Earlier work of a nonneural network flavor had shown how the first PC, including the eigenvalue, could be learned incrementally (Krasulina, 1970).