Abstract

Hierarchical temporal memory (HTM) is a biologically inspired framework that can be used to learn invariant representations of patterns in a wide range of applications. Classical HTM learning is mainly unsupervised, and once training is completed, the network structure is frozen, which makes further training (i.e., incremental learning) difficult. In this letter, we develop a novel technique for HTM (incremental) supervised learning based on gradient descent error minimization. We prove that error backpropagation can be naturally and elegantly implemented through native HTM message passing based on belief propagation. Our experimental results demonstrate that a two-stage training approach composed of unsupervised pretraining and supervised refinement is very effective (both accurate and efficient). This is in line with recent findings on other deep architectures.

1.  Introduction

Hierarchical temporal memory (HTM) is a biologically inspired pattern recognition framework still relatively little known to the research community (George, 2008; George & Hawkins, 2009). It can conveniently be framed into multistage Hubel-Wiesel architectures (Ranzato, Huang, Boureau, & LeCun, 2007), a specific subfamily of deep architectures (Bengio, 2009; Jarrett, Kavukcuoglu, Ranzato, & LeCun, 2009; Arel, Rose, & Karnowski, 2010). HTM tries to mimic the feedforward and feedback projections thought to be crucial for cortical computation (Gilbert & Sigman, 2007; Summerfield & Egner, 2009). Bayesian belief propagation is used in a probabilistic hierarchical network to learn invariant spatiotemporal features of the input data, and theories exist for how such a graphical model could be mapped onto the cortical-thalamic anatomy (Lee & Mumford, 2003; George & Hawkins, 2009; Litvak & Ullman, 2009).

A comprehensive description of the HTM architecture and learning is provided in Maltoni (2011), where HTM is also shown to perform well on some pattern recognition tasks, even though further studies and validations are necessary. In Maltoni (2011), HTM is also compared to the multilayer perceptron (MLP) and the convolutional network (CN) on some pattern recognition problems.

The feedforward processing of HTM is closely related to hierarchical slow feature analysis (SFA) (Wiskott & Sejnowski, 2002; Franzius, Wilbert, & Wiskott, 2011). Similar to SFA, HTM employs a slowness objective during training with the goal of extracting invariant features from sequential training data. However, in contrast to SFA, where slowness is quantified by the temporal derivative of the output signals, HTM takes a Markovian graph partitioning perspective on slowness learning to learn how lower-level features are to be clustered to account for common transformations of the input. Although algorithmically quite different, the connection between the two approaches becomes evident with the insight that SFA can be viewed as a functional approximation of a Laplacian eigenmap (Belkin & Niyogi, 2003), where the underlying graph is defined by the temporal relations of the input samples (Sprekeler, 2011). In HTM the temporal graph is clustered directly using an Ncut (or similar) objective (Shi & Malik, 2000), while in SFA, the graph is used to find a “slow” subspace.

One limitation of classical HTM training and many other related algorithms, including hierarchical SFA, is that once a network is trained, it is hard to learn from new patterns without retraining it from scratch. In other words, a classical HTM is well suited for batch training based on a fixed training set and cannot effectively be trained incrementally with new patterns that were initially unavailable. In fact, every HTM level has to be trained individually and in sequence, starting from the bottom: altering the internal node structure at one network level (e.g., number of coincidences, groups) would invalidate the results of the training at higher levels. In principle, incremental learning could be carried out in a classical HTM by updating only the output level, but this is a naive strategy that in practice works only if the new incoming patterns are very similar to the existing ones in terms of building blocks.

Another limitation of classical HTM is low scalability. In fact, the number of coincidences and groups resulting from unsupervised training grows nearly linearly with the number of training patterns and the length of the training sequences. In several practical applications, this can lead to complex (and slow) architectures.

Since incremental training and scalability are highly desirable properties of a learning system, we were motivated to investigate how the HTM framework could be extended in these directions.

In this letter, we present a two-stage training approach, unsupervised temporal pretraining followed by supervised refinement, that can be used for incremental learning. A new HTM is initially pretrained in batch mode; then its internal structure is incrementally updated as new labeled samples become available. This kind of unsupervised pretraining followed by supervised refinement has been demonstrated to be successful for other deep architectures (Bengio, 2009).

The basic idea of our approach is to perform the batch pretraining using the algorithms described in Maltoni (2011) and to fix the number of coincidences and groups throughout the whole network. During supervised refinement, we adapt the elements of the probability matrices in the output node and the intermediate nodes as if they were the weights of an MLP trained with backpropagation. With the same technique, we can also adapt the coincidences in the nodes of the lowest intermediate level. Since these coincidence vectors can be conceived of as pictorial words, the adaptation can be viewed as a sort of dictionary refinement.

To implement the refinement, we first derived the update rules based on the descent of the error function. Since the HTM architecture is more complex than that of MLP, the resulting equations are not simple. Further complications arise from the fact that most of the manipulated scalars are probabilities and need to be normalized after each update step. Fortunately, we found a surprisingly simple (and computationally light) way to implement the whole process through native HTM belief propagation message passing.

The application of gradient descent optimization of probabilistic networks (e.g., Bayesian networks) has been investigated by others (Laskey, 1990; Neal, 1992; Binder, Koller, Russell, & Kanazawa, 1997; Jensen, 1999). Our work extends these results to an HTM pattern classifier whose structure and message passing, as pointed out in section 2.2, deviate from a classical Bayesian network.

Our experiments show promising results. Furthermore, the proposed two-stage approach not only enables incremental learning, but is also helpful for keeping network complexity under control, thus improving framework scalability.

A preliminary version of this work was presented at the ANNPR12 workshop (Maltoni & Rehn, 2012). Here we extend the incremental learning approach in several directions:

  • Temporal groups are allowed to vary with respect to the initial temporal clustering solution obtained by pretraining.

  • Coincidences at the lowest intermediate level can now be included in the update process.

  • A number of simplifications are introduced to boost efficiency without affecting accuracy. As shown in section 4, these improvements make the overall approach much more effective, especially when the number of training samples is small.

  • Discriminative fine-tuning is investigated in more depth, and the proposed approach is tested on real-world data sets such as USPS and MNIST.

2.  Background

An HTM has a hierarchical tree structure. The tree is built up of nlevels levels (or layers), each composed of one or more nodes. A node in one level is bidirectionally connected to one or more nodes in the level below, and the number of nodes in each level decreases as we ascend the hierarchy. The lowest level, level 0, is the input level, and the highest level, with typically only one node, is the output level. Levels and nodes in between input and output are called intermediate levels and nodes. When an HTM is used for visual inference, as is the case in this study, the input level typically has a retinotopic mapping of the input: each input node is connected to one pixel of the input image, and spatially close pixels are connected to spatially close nodes. Figure 1 shows a graphical representation of a simple HTM and its levels and nodes.

Figure 1:

A four-level HTM designed to work with 16 × 16 pixel images. Level 0 has 16 × 16 input nodes, each associated with a single pixel. Each level 1 node has 16 child nodes (arranged in a 4 × 4 region) and a receptive field of 16 pixels. Each level 2 node has 4 child nodes (2 × 2 region) and a receptive field of 64 pixels. Finally, the single output node at level 3 has 4 child nodes (2 × 2 region) and a receptive field of 256 pixels.

It is possible, and in many cases desirable, for an HTM to have an architecture where every intermediate node has multiple parents. This creates a network where nodes have overlapping receptive fields. Throughout this letter, a nonoverlapping architecture is used instead, where nodes have only one parent, to reduce computational complexity.
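As a side illustration of the nonoverlapping layout in Figure 1, the following short sketch (our own helper, not part of any HTM implementation) computes the node grid and receptive field at each level of the four-level network used throughout this letter:

```python
# Sketch: node grids and receptive fields of the nonoverlapping HTM of Figure 1.
# Each intermediate node groups a fixed block of child nodes, so the node grid
# shrinks and the receptive field (in pixels) grows as we ascend the hierarchy.

def htm_layout(image_side=16, child_blocks=(4, 2, 2)):
    """child_blocks[k] is the side of the block of child nodes feeding one node
    of level k+1 (level 0 is the input level, one node per pixel)."""
    side, rf = image_side, 1                 # level 0: 16 x 16 nodes, 1-pixel field
    layout = [(0, side * side, rf * rf)]
    for level, block in enumerate(child_blocks, start=1):
        side //= block                       # node grid shrinks by the block side
        rf *= block                          # receptive field grows by the same factor
        layout.append((level, side * side, rf * rf))
    return layout

for level, n_nodes, rf_pixels in htm_layout():
    print(f"level {level}: {n_nodes} nodes, receptive field {rf_pixels} pixels")
```

Running the sketch reproduces the numbers in the Figure 1 caption: 256, 16, 4, and 1 nodes with receptive fields of 1, 16, 64, and 256 pixels, respectively.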

2.1.  Information Flow.

In an HTM, the flow of information is bidirectional. Belief propagation is used to pass messages both up (feedforward) and down (feedback) the hierarchy as new evidence is presented to the network. The notation used here for belief propagation (see Figure 2) follows Pearl (1988) and was adapted to HTMs by George (2008):

  • Evidence coming from below is denoted e⁻. For visual inference, this is an image or video frame presented to level 0 of the network.

  • Evidence from the top is denoted e⁺ and can be viewed as contextual information. This can, for instance, be input from other sensory modalities, high-level attention or emotion signals (Bar et al., 2006; Gilbert & Sigman, 2007; Summerfield & Egner, 2009), or the absolute knowledge given by the supervisor training the network.

  • Feedforward messages passed up the hierarchy are denoted by λ, and feedback messages flowing down are denoted by π.

  • Messages entering a node are denoted by λ⁻ and π⁺, and messages leaving a node are denoted by λ⁺ and π⁻; hence, λ⁻ and λ⁺ are feedforward messages, while π⁺ and π⁻ are feedback messages. For messages entering and leaving a node from below, a subscript index can be used to identify the child node from (to) which these messages are received (sent).

Figure 2:

Notation for message passing between HTM nodes. Subscript indices are used to denote the child node from/(to) which messages are received/(sent).

When the purpose of an HTM is that of a classifier, the feedforward message of the output node is the posterior probability that the input, e⁻, belongs to one of the problem classes. We denote this posterior by P(wi | e⁻), where wi is one of the nw problem classes, i ∈ {1, …, nw}.

2.2.  Internal Node Structure and Pretraining.

HTM training is performed level by level, starting from the first intermediate level. The input level does not need any training; it just forwards the input. Intermediate-level training is temporally supervised, guided by the temporal dynamics of the input (as in SFA), while the output-level training is supervised in the traditional sense using class labels. (For a detailed description, including pseudocode, refer to Maltoni, 2011.)

For every intermediate node (see Figure 3), a set C of so-called coincidence patterns (or just coincidences) and a set G of coincidence groups have to be learned. A coincidence, ci, is a vector representing a prototypical activation pattern of the node's children. For nodes in level 1, with input nodes as children, this corresponds to an image patch of the same size as the node's receptive field. For nodes higher up in the hierarchy, with intermediate nodes as children, each element of a coincidence, ci[h], is the index of a coincidence group in child h. Coincidence groups, also called temporal groups, are clusters of coincidences likely to originate from simple variations of the same input pattern. Coincidences found in the same group can be spatially dissimilar but are likely to be found close in time when a pattern is smoothly moved through the node's receptive field. By clustering coincidences in this way, exploiting the temporal smoothness of the input, invariant representations of the input space can be learned (George, 2008). The assignment of coincidences to groups within each node is encoded in a probability matrix PCG, where each element, PCGij, represents the probability of a coincidence cj given a group gi (or, analogously, the likelihood of a group given a coincidence). These probabilities are the elements we will manipulate to incrementally train a network whose coincidences and groups have previously been learned and fixed.

Figure 3:

Graphical representation of the information processing within an intermediate node. The two central boxes, PCG and C, constitute the node memory at the end of training.

George (2008) proposed the original algorithm used for temporal clustering in HTM, which indirectly maximizes an objective function related to temporal coincidence similarity. Maltoni (2011) instead proposed the MaxStab algorithm, which defines a within-group stability function similar to the Ncut objective (Shi & Malik, 2000) and tries to directly maximize it through a greedy heuristic. Moreover, in George (2008) the clustering is disjoint (i.e., each coincidence belongs to a single cluster), while in Maltoni (2011) clusters are fuzzy, that is, a coincidence belongs to a cluster with a given probability (encoded in PCG). Experiments reported in Maltoni (2011) show that MaxStab and Fuzzy clustering lead to better performance compared to the approach George (2008) proposed.

The structure and training of the output node differ from those of the intermediate nodes. In particular, the output node does not have groups, only coincidences. Instead of memorizing groups and group likelihoods, it stores a probability matrix PCW, whose elements represent the probability of a coincidence cj given the class wi (or, analogously, the likelihood of a class given a coincidence). This matrix is learned in a supervised fashion by counting how many times every coincidence is the most active one (“the winner”) in the context of each class. The output node also keeps a vector of class priors, P(wi), used to calculate the final class posterior.

HTM has a formulation that very closely resembles that of a Bayesian network. However, since the intermediate nodes are composed of both coincidences and groups, the message passing we will describe is a slightly modified version of the original belief propagation algorithm used for standard Bayesian networks (Pearl, 1988). An attempt to formulate a multistage Hubel-Wiesel architecture following a stricter definition of a Bayesian network, similar to HTM, has been made by Dura-Bernal, Wennekers, and Denham (2012).
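For concreteness, the following sketch shows the node memory described in this section as plain data containers. The field names and the layout of PCG and PCW as group-by-coincidence and class-by-coincidence matrices are our own reading of the description above, not the reference implementation:

```python
import numpy as np

class IntermediateNode:
    """Memory of an intermediate HTM node after pretraining (cf. Figure 3)."""
    def __init__(self, coincidences, pcg):
        self.C = np.asarray(coincidences)   # nc coincidences; at level 1 these are image
                                            # patches, above they index child groups
        self.PCG = np.asarray(pcg)          # ng x nc matrix, PCG[i, j] = P(c_j | g_i)

class OutputNode:
    """Memory of the output node: no groups, only class likelihoods and priors."""
    def __init__(self, coincidences, pcw, priors):
        self.C = np.asarray(coincidences)   # nc coincidences over the children's groups
        self.PCW = np.asarray(pcw)          # nw x nc matrix, PCW[i, j] = P(c_j | w_i)
        self.prior = np.asarray(priors)     # class priors P(w_i)
```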

2.3.  Feedforward Message Passing.

Inference in an HTM is conducted through feedforward belief propagation. When a node receives a set of messages from its m children, λ⁻1, …, λ⁻m, a degree of certainty over each of the nc coincidences in the node is computed. This quantity is represented by a vector y and can be seen as the activation of the node coincidences. The degree of certainty over coincidence ci, y[i], is
\[
y[i] \;=\;
\begin{cases}
\exp\!\left(-\dfrac{\lVert \lambda^{-} - c_i \rVert^{2}}{\sigma^{2}}\right) & \text{if the node is at level 1}\\[8pt]
\displaystyle\prod_{j=1}^{m} \lambda^{-}_{j}\bigl[c_i[j]\bigr] & \text{if the node is at a level above 1}
\end{cases}
\tag{2.1}
\]
The interpretation of equation 2.1 depends on the level:

  • If the node belongs to level 1 (hence, its child nodes are input nodes), the input message λ⁻ is essentially an image patch and the coincidences are prototype image patches. In this case, y[i] encodes the spatial similarity between two image patches and can be conveniently computed as a gaussian distance, where σ is a parameter controlling how quickly the activation level decays when λ⁻ deviates from ci.

  • If the node belongs to a level above 1 (its child nodes are intermediate nodes), the input message is a probability vector and the coincidences can be conceived of as feature selectors: each element ci[j] is the index of a single group among the groups of the jth child node. In this case, y[i] is proportional to the probability of the co-occurrence of the subevidence (each piece of subevidence coming from a child) in the context of ci. Assuming the subevidence to be independent, the probability is obtained by the product rule, λ⁻j[ci[j]] being the subevidence coming from the jth child node.

If the node is an intermediate node, it computes its feedforward message λ⁺, which is a vector of length ng corresponding to P(G | e⁻), where G is the set of all coincidence groups in the node and ng is the cardinality of G. The ith component of vector λ⁺ is
\[
\lambda^{+}[i] \;=\; \sum_{j=1}^{n_c} P(c_j \mid g_i)\; y[j]
\tag{2.2}
\]
where nc is the number of coincidences stored in the node.
The feedforward message from the output node, the network output, is an nw-dimensional vector denoting the posterior class probability; component i of this vector is computed in the following way:
\[
P(w_i \mid e^{-}) \;=\; \alpha\, P(w_i) \sum_{j=1}^{n_c} P(c_j \mid w_i)\; y[j]
\tag{2.3}
\]
where α is a normalization constant such that Σi P(wi | e⁻) = 1.
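The feedforward pass of equations 2.1 to 2.3 can then be sketched as follows. This is a minimal reading of the equations as reconstructed above, reusing the containers sketched in section 2.2; the helper names are ours and it is not the reference implementation:

```python
import numpy as np

def coincidence_activation(node, lam_in, sigma=1.0, level_one=False):
    """Equation 2.1: degree of certainty y over the node's coincidences."""
    if level_one:
        # children are input nodes: lam_in is an image patch, coincidences are patches
        d2 = np.sum((node.C - lam_in) ** 2, axis=1)
        return np.exp(-d2 / sigma ** 2)
    # children are intermediate nodes: lam_in is a list of group-probability vectors
    y = np.ones(len(node.C))
    for j, lam_j in enumerate(lam_in):
        y *= lam_j[node.C[:, j]]            # product rule over the children's subevidence
    return y

def feedforward_message(node, y):
    """Equation 2.2: lambda_plus[i] = sum_j P(c_j | g_i) * y[j]."""
    return node.PCG @ y

def class_posterior(out_node, y):
    """Equation 2.3: class posterior, normalized to sum to one."""
    p = out_node.prior * (out_node.PCW @ y)
    return p / p.sum()
```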

2.4.  Feedback Message Passing.

Given a message from the parent, π⁺, corresponding to P(G | e⁺), the top-down activation of each coincidence, z, is
\[
z[i] \;=\; \sum_{k=1}^{n_g} P(c_i \mid g_k)\, \frac{\pi^{+}[k]}{\lambda^{+}[k]}
\tag{2.4}
\]
where the division by λ⁺ discounts the contribution of the node's own feedforward information from the incoming contextual information (Pearl, 1988).
The belief in coincidence ci is then given by
\[
\mathrm{Bel}[i] \;=\; \beta\; y[i]\; z[i]
\tag{2.5}
\]
where β is a normalization constant such that Σi Bel[i] = 1.
The message π⁻ sent by an intermediate node to its children is computed using this belief distribution. The kth component of the message to a specific child node j is
\[
\pi^{-}_{j}[k] \;=\; \sum_{i=1}^{n_c} \mathrm{Bel}[i]\; I_{j,k}(c_i)
\tag{2.6}
\]
where Ij,k is the indicator function defined as
\[
I_{j,k}(c_i) \;=\;
\begin{cases}
1 & \text{if } c_i[j] = k\\
0 & \text{otherwise}
\end{cases}
\tag{2.7}
\]
The top-down message sent from the output node is computed in a similar way:
formula
2.8
Equations 2.6 and 2.8 will be important when, in the next section, we show how to incrementally update the PCG and PCW matrices to produce better estimates of the class posterior given some evidence from above.
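Analogously, the feedback step for an intermediate node can be sketched as follows, following our reading of equations 2.4 to 2.7 (the discounting by λ⁺ and the indicator-based routing to the children are taken from the text above; the helper names are ours):

```python
import numpy as np

def feedback_messages(node, y, lam_plus, pi_plus, eps=1e-12):
    """Sketch of equations 2.4-2.7: given the incoming top-down message pi_plus
    (defined over the node's groups), compute the belief over coincidences and
    the message pi_minus sent to each child."""
    # Eq. 2.4: top-down coincidence activation, discounting the node's own lambda_plus
    z = node.PCG.T @ (pi_plus / (lam_plus + eps))
    # Eq. 2.5: belief, normalized to sum to one
    bel = y * z
    bel /= bel.sum()
    # Eqs. 2.6-2.7: for child j, accumulate belief over coincidences that select group k
    n_children = node.C.shape[1]
    pi_minus = []
    for j in range(n_children):
        n_groups_child = int(node.C[:, j].max()) + 1   # assumes group indices 0..n_g-1
        msg = np.zeros(n_groups_child)
        np.add.at(msg, node.C[:, j], bel)              # msg[k] += bel[i] where c_i[j] = k
        pi_minus.append(msg)
    return bel, pi_minus
```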

3.  HTM Supervised Refinement

Here we introduce a novel way to optimize an already trained HTM. The algorithm, HSR (HTM supervised refinement), shares many features with traditional backpropagation used to train multilayer perceptrons and is inspired by weight fine-tuning methods applied to other deep belief architectures (Bengio, 2009). It exploits the belief propagation equations we have already presented to propagate an error message from the output node back through the network. This enables each node to locally update its internal probability matrix in a way that minimizes the difference between the estimated class posterior of the network and the posterior given from above, by a supervisor.

Our goal is to minimize the expected quadratic difference between the network output posterior given the evidence from below, e⁻, and the posterior given the evidence from above, e⁺. For this purpose, we employ empirical risk minimization (Vapnik, 1989), resulting in the following loss function:
\[
L(e^{-}, e^{+}) \;=\; \sum_{i=1}^{n_w} \big( P(w_i \mid e^{+}) - P(w_i \mid e^{-}) \big)^{2}
\tag{3.1}
\]
where nw is the number of classes, P(wi | e⁺) is the class posterior given the evidence from above, and P(wi | e⁻) is the posterior produced by the network using the input as evidence (i.e., inference). The loss function is also a function of all the network parameters involved in the inference process. In most cases, e⁺ is a supervisor with absolute knowledge about the true class wt; thus, P(wi | e⁺) = 1 if wi = wt, and 0 otherwise.
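In code, the per-pattern loss is simply the following (assuming the quadratic form of equation 3.1 as reconstructed above, with a one-hot target for an absolute supervisor):

```python
import numpy as np

def hsr_loss(posterior, target):
    """Quadratic loss of equation 3.1 between the supervisor's posterior P(w | e+)
    (one-hot when the true class is known with certainty) and the network's
    posterior P(w | e-) computed by equation 2.3."""
    return float(np.sum((target - posterior) ** 2))
```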

To minimize the empirical risk, we first find the direction in which to alter the node probability matrices to decrease the loss and then apply gradient descent.

3.1.  Output Node Update.

For the output node, which does not memorize coincidence groups, we update the class likelihoods in the PCW matrix through the gradient descent rule:
\[
\mathrm{PCW}_{ks} \;\leftarrow\; \mathrm{PCW}_{ks} \;+\; \eta \left( -\frac{\partial L}{\partial\, \mathrm{PCW}_{ks}} \right)
\tag{3.2}
\]
where η is the learning rate and PCWks denotes the element P(ck | ws) of PCW. The negative gradient of the loss function is given by
\[
-\frac{\partial L}{\partial\, \mathrm{PCW}_{ks}} \;=\; 2 \sum_{i=1}^{n_w} \big( P(w_i \mid e^{+}) - P(w_i \mid e^{-}) \big)\, \frac{\partial P(w_i \mid e^{-})}{\partial\, \mathrm{PCW}_{ks}},
\]
which can be shown (see appendix  A for a derivation) to be equivalent to:
\[
-\frac{\partial L}{\partial\, \mathrm{PCW}_{ks}} \;=\; 2\,\alpha\, P(w_s)\; y[k]\; Q(w_s)
\tag{3.3}
\]
\[
Q(w_s) \;=\; \big( P(w_s \mid e^{+}) - P(w_s \mid e^{-}) \big) \;-\; \sum_{i=1}^{n_w} \big( P(w_i \mid e^{+}) - P(w_i \mid e^{-}) \big)\, P(w_i \mid e^{-})
\tag{3.4}
\]

where α is the normalization constant of equation 2.3. We call Q(ws) the error message for class ws given some top-down and bottom-up evidence.

Finally, note that since PCW is a probability matrix, it has to be renormalized after each update; one has to ensure that Σk P(ck | ws) = 1 for all ws.
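Putting equations 3.2 to 3.4 together, the output-node refinement step can be sketched as follows. Constant factors are absorbed into the learning rate, and the clipping used to keep the entries nonnegative is our own simplification (the letter only prescribes renormalization); the helper name is hypothetical:

```python
import numpy as np

def hsr_output_update(out_node, y, posterior, target, eta=0.001):
    """Sketch of the output-node update: compute the error message Q (eq. 3.4),
    take a gradient step on PCW (eqs. 3.2-3.3), renormalize, and return Q so it
    can be propagated to the intermediate nodes."""
    delta = target - posterior                      # P(w | e+) - P(w | e-)
    Q = delta - np.dot(delta, posterior)            # correction term for the normalization
    grad = np.outer(out_node.prior * Q, y)          # rows: classes, columns: coincidences
    out_node.PCW = np.clip(out_node.PCW + eta * grad, 0.0, None)
    out_node.PCW /= out_node.PCW.sum(axis=1, keepdims=True)   # sum_k P(c_k | w_s) = 1
    return Q
```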

3.2.  Intermediate Nodes Update.

For each intermediate node, we update probabilities in the PCG matrix through gradient descent:
\[
\mathrm{PCG}_{pq} \;\leftarrow\; \mathrm{PCG}_{pq} \;+\; \eta \left( -\frac{\partial L}{\partial\, \mathrm{PCG}_{pq}} \right)
\tag{3.5}
\]
For intermediate nodes at the level immediately below the output level, it can be shown (see appendix  B) that
formula
3.6
where the message in equation 3.7 is the child portion of the message sent from the output node to its children, but with Q(ws) replacing the posterior (compare equation 3.7 with equation 2.8):
formula
3.7
Finally, it can be shown that this generalizes to all levels of an HTM and that all intermediate nodes can be updated using messages from their immediate parent. The derivation can be found in appendix  C. In particular, the error message sent from an intermediate node to its child nodes is given by
formula
3.8

Note that PCG has to be renormalized after each update such that Σp P(cp | gq) = 1 for all gq.

3.3.  Coincidence Update in the Lowest Intermediate Level.

The coincidences of the higher intermediate levels are discrete vectors and therefore cannot be updated using gradient descent. However, the coincidences of the lowest intermediate level are continuous image patches. Following steps similar to those above, it is therefore possible to update them in a way that minimizes the empirical loss. For each intermediate node in level 1, we update the coincidences using gradient descent:
\[
c_p[q] \;\leftarrow\; c_p[q] \;+\; \eta \left( -\frac{\partial L}{\partial\, c_p[q]} \right), \qquad q = 1, \ldots, n_d,
\tag{3.9}
\]
where nd is the coincidence dimensionality (image patch size). The loss gradient with respect to element q of coincidence cp can be shown to take the following form (see appendix  D),
formula
3.10
where BelQ[p] is the “error belief” in coincidence cp defined as
formula
3.11

3.4.  HSR Pseudocode.

The results above allow us to define an efficient and elegant way to adapt the probabilities in an already trained HTM using belief propagation message passing. A batch version of the HSR algorithm is provided here as pseudocode:
formula
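As a rough illustration (not the authors' pseudocode), the batch loop can be sketched on a toy two-node network, one level-1 node feeding the output node, reusing the helpers introduced in the sketches of sections 2.2, 2.3, and 3.1. The PCG and coincidence updates of sections 3.2 and 3.3 would be accumulated and applied in the same way:

```python
import numpy as np

def hsr_batch_epoch(mid, out, batch, eta_out=0.001, sigma=1.0):
    """Toy batch-HSR epoch: accumulate the PCW gradient over all labeled patterns,
    then update and renormalize once (per-pattern updates would instead give the
    online version discussed below)."""
    grad_pcw = np.zeros_like(out.PCW)
    for patch, label in batch:
        y_mid = coincidence_activation(mid, patch, sigma=sigma, level_one=True)  # eq. 2.1
        lam_plus = feedforward_message(mid, y_mid)                               # eq. 2.2
        y_out = coincidence_activation(out, [lam_plus])                          # eq. 2.1, level > 1
        posterior = class_posterior(out, y_out)                                  # eq. 2.3
        delta = np.eye(len(posterior))[label] - posterior                        # supervisor minus network
        Q = delta - np.dot(delta, posterior)                                     # error message, eq. 3.4
        grad_pcw += np.outer(out.prior * Q, y_out)                               # accumulate eq. 3.3
    out.PCW = np.clip(out.PCW + eta_out * grad_pcw, 0.0, None)                   # gradient step
    out.PCW /= out.PCW.sum(axis=1, keepdims=True)                                # renormalize PCW
```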

By updating the probability matrices for every training example instead of at the end of the presentation of a group of patterns, an online version of the algorithm is obtained. Both batch and online versions of HSR are investigated in section 4.

In many cases, it is preferable for the nodes in the lower intermediate levels to share memory, so-called node sharing (Maltoni, 2011). This speeds up training and forces all the nodes of the level to respond in the same way when the same stimulus is presented at different places in the receptive field. In a level operating in node-sharing mode, the PCG update, equation 3.5, must be performed only for the master node; that is, the gradient is averaged over all node positions.
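In node-sharing mode, the only change to the update step is that the gradients computed at the individual node positions are combined before being applied to the master node, for example (a sketch, assuming the per-position gradient matrices have already been computed):

```python
import numpy as np

def master_node_gradient(per_position_grads):
    """Node sharing: average the PCG gradients computed at all node positions of
    a level and apply the result to the shared master node only."""
    return np.mean(np.stack(per_position_grads), axis=0)
```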

4.  Experiments

To verify the efficacy of the HSR algorithm, we designed a number of experiments. These are performed using the SDIGIT data set, a machine-generated digit recognition problem (Maltoni, 2011). SDIGIT patterns (16 × 16 pixel, gray-scale images) are generated by geometric transformations of class prototypes called primary patterns. The possibility of randomly generating new patterns makes this data set suitable for evaluating incremental learning algorithms. By varying the amount of scaling and rotation applied to the primary patterns, we can also control the problem difficulty. Such flexibility cannot be obtained by using other well-known data sets such as NORB (Jarrett et al., 2009) or CIFAR-10 (Krizhevsky, 2009).

A training set of n patterns includes, for each of the 10 digits, the primary pattern and a further (n/10) − 1 patterns generated by scaling and rotating the primary pattern according to random triplets (sx, sy, r). Note that during HTM pretraining, each of the n patterns is translated to all possible locations according to a zigzag exploration strategy, thus providing the necessary temporal continuity (see Maltoni, 2011, for details).

The creation of a test set starts by translating each of the 10 primary patterns to all positions that allow it to be fully contained (with a 2-pixel background offset) in the 16 × 16 window, thus obtaining m patterns; then, for each of the m patterns, (n/10) − 1 further patterns are generated by transforming the pattern according to random triplets (sx, sy, r). The total number of patterns in the test set is then m × n/10. Examples of generated SDIGIT test set patterns are shown in Figure 4.

Figure 4:

Example of SDIGIT test set patterns. Ten patterns from every class are shown.

Table 1 (reprinted from Maltoni, 2011) summarizes HTM performance on the SDIGIT problem and compares it against other well-known classification approaches. HTM accuracy is 71.37%, 87.56%, and 94.61% with 50, 100, and 250 training patterns, respectively. Our goal is to understand if and how accuracy can be improved by incrementally training the network through the HSR approach.

Table 1:
HTM Compared Against Other Techniques on the SDIGIT Problem.
Training set                     Approach   Training    Test        Training    Test       Size
                                            acc. (%)    acc. (%)    time        time       (MB)

(50, 0.70, 1.0, 0.7, 1.0, 40),   NN         100         57.92       <1 sec      00:00:04   3.50
1788 translated patterns         MLP        100         61.15       00:12:42    00:00:03   1.90
                                 LeNet5     100         67.28       00:07:13    00:00:11   0.39
                                 HTM        100         71.37*      00:00:08    00:00:13   0.58

(100, 0.70, 1.0, 0.7, 1.0, 40),  NN         100         73.63       <1 sec      00:00:07   6.84
3423 translated patterns         MLP        100         75.37       00:34:22    00:00:03   1.90
                                 LeNet5     100         79.31       00:10:05    00:00:11   0.39
                                 HTM        100         87.56*      00:00:25    00:00:23   1.00

(250, 0.70, 1.0, 0.7, 1.0, 40),  NN         100         86.50       <1 sec      00:00:20   17.0
8705 translated patterns         MLP        99.93       86.08       00:37:32    00:00:03   1.90
                                 LeNet5     100         89.17       00:14:37    00:00:11   0.39
                                 HTM        100         94.61*      00:02:04    00:00:55   2.06

Notes: SDIGIT test set: 6,200 patterns, 10 classes. The table is reprinted from Maltoni (2011). Three experiments are performed with an increasing number of training patterns: 50, 100, and 250. The test set is common across the experiments and includes 6,200 patterns. NN, MLP, and LeNet5 refer to nearest neighbor, multilayer perceptron, and convolutional network, respectively. HTM refers to a four-level hierarchical temporal memory (whose architecture is shown in Figure 1) trained in the MaxStab configuration with fuzzy grouping enabled (see Maltoni, 2011, for details). The best test accuracy for each training set is marked with an asterisk.

For this purpose, we employ an experimental procedure where we first pretrain a new network using a data set (with n patterns) and then, for a number of epochs, ne, generate new data sets and apply HSR. At each epoch, one can apply HSR for several iterations to favor convergence. However, we experimentally found that a good trade-off between convergence time and overfitting can be achieved by performing just two HSR iterations for each epoch (niter = 2). The classification accuracy is calculated using the patterns generated for every epoch, but before the network is updated using those patterns. In this way, we emulate a situation where the network is trained on sequentially arriving patterns.

The experimental procedure can be summarized with this pseudocode:
formula
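The procedure can be rendered schematically as follows. The four callables are placeholders for the SDIGIT generator, the pretraining of Maltoni (2011), the inference of section 2.3, and one HSR iteration of section 3.4; they are not defined here:

```python
def incremental_learning_run(generate_set, pretrain, evaluate, hsr_iteration,
                             n_pretrain=50, n_update=6200, n_epochs=20, n_iter=2):
    """Sketch of the experimental protocol: pretrain once, then for each epoch
    generate a fresh update set, measure accuracy on it *before* any update
    (so accuracy is always computed on unseen data), and run n_iter HSR iterations."""
    htm = pretrain(generate_set(n_pretrain))
    accuracies = []
    for _ in range(n_epochs):
        update_set = generate_set(n_update)
        accuracies.append(evaluate(htm, update_set))
        for _ in range(n_iter):
            hsr_iteration(htm, update_set)
    return accuracies
```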

4.1.  HSR Configurations and Training Strategies.

The HSR algorithm described in section 3.4 can be (computationally) simplified by excluding some of the variables from the update procedure or by annihilating negligible values after each update. The following configurations will be considered throughout the experiments:

  • Partial update. The full HSR approach, as described in section 3.4, updates the PCW matrix at the output level, the PCG matrices at the intermediate levels, and the coincidences, C, at level 1. In the following, we will also test configurations where some of the above data are excluded from the HSR update.

  • Fixed versus variable groups. Equation 3.5 requires that each PCGpq element be updated, potentially leading to a large number of updates. Since PCG encodes the clustering solution (i.e., the coincidence group memberships), changing a zero element to a nonzero value will alter the initial (pretraining) cluster assignment. One possible simplification consists of keeping the fuzzy coincidence group clustering fixed. In the following, we refer to the case where only the PCGpq elements that were nonzero after pretraining are updated by HSR as fixed groups, and to the case where all PCGpq elements are updated as variable groups.

  • Top percent groups. When this option is enabled, the update is limited to the PCGpq elements for which y[p] is large. In fact, when equation 3.7 is considered, the amount of update is proportional to y[p], and we can assume that the gradient is negligible when y[p] is small. To implement this strategy, we compute and sort y[p] at each node update. Starting from the top value, we then keep only y[p] elements until their sum exceeds a certain percentage, tpg, of the whole y[p] sum (tpg being a parameter). A sketch of this option and of sparse PCG follows this list.

  • Sparse PCG. When this option is enabled, the PCG matrices are sparsified after each update. This is performed by retaining, for each coincidence p, only the groups to which p has the strongest association. For this purpose, we sort PCGpq for each p, and then, starting from the top (i.e., the most probable), we keep adding groups until the cumulative probability is above spcg (spcg being a parameter). Note that the sparsification must be performed before the renormalization of the PCG matrices (see the pseudocode in section 3.4).
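The two options can be sketched as follows (a minimal reading of the descriptions above; PCG is laid out as in the sketch of section 2.2, with one row per group, and both thresholds are treated as fractions of the total mass of the sorted vector):

```python
import numpy as np

def top_percent_coincidences(y, tpg=0.5):
    """Top percent groups option: select the coincidences p with the largest
    activations y[p], keeping them until their cumulative activation first exceeds
    the fraction tpg of the total; only the corresponding PCG entries are updated."""
    order = np.argsort(y)[::-1]
    cumulative = np.cumsum(y[order])
    n_keep = int(np.searchsorted(cumulative, tpg * y.sum())) + 1
    mask = np.zeros_like(y, dtype=bool)
    mask[order[:n_keep]] = True
    return mask

def sparsify_group_probs(group_probs, spcg=0.97):
    """Sparse PCG option: for one coincidence (a column of PCG), keep its most
    probable groups until their cumulative probability exceeds the fraction spcg
    of the total and zero the rest; PCG is renormalized afterward, as noted above."""
    order = np.argsort(group_probs)[::-1]
    cumulative = np.cumsum(group_probs[order])
    n_keep = int(np.searchsorted(cumulative, spcg * group_probs.sum())) + 1
    sparse = np.zeros_like(group_probs)
    sparse[order[:n_keep]] = group_probs[order[:n_keep]]
    return sparse
```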

Independent of the HSR configuration, different strategies and parameters can be adopted for the HSR training:

  • Batch versus online updating. See section 3.4.

  • Error versus all selection strategy. For the error selection strategy, supervised refinement is performed only for patterns that were misclassified by the current HTM (i.e., in the previous iteration), while for the all selection strategy, updating is performed for all patterns.

  • Learning rate. In general, the effect of the learning rate on performance is similar to that observed in other neural network architectures (e.g., MLP). A nonmonotonic convergence (suffering from oscillations) is a symptom of too high a learning rate; in this case, network convergence is not guaranteed. A very slow convergence, on the other hand, is the result of a learning rate that is too small. One striking finding of our experiments is that the learning rate (see equations 3.2, 3.5, and 3.9) for the output node should be kept much lower than that for the intermediate nodes, and the learning rate for the intermediate nodes much lower than that for the coincidences. We experimentally selected the learning rates for the SDIGIT data set according to this ordering. In any case, selecting proper values for the learning rates is not critical, and comparable accuracy levels can be achieved by using different learning rates in a certain range.

4.2.  HSR Accuracy.

4.2.1.  Basic Configuration.

In the basic configuration, the update includes PCW at the output level and PCG at the intermediate levels, while the coincidences at level 1 are excluded. Groups are kept fixed, and the options top percent groups and sparse PCG are not enabled. This is the same configuration as tested in Maltoni and Rehn (2012). There, as here, the training and update sets are:
formula

In particular, each update set is composed of 6200 patterns; examples of the resulting pattern variations are shown in Figure 4. This can be considered a medium-difficult problem because of the pattern variation and interclass similarity (rotated 1 and 7; small 6, 8, and 9), but the high number of training patterns makes the iterative learning quite effective. The accuracy gain due to HSR is studied for different strategies: batch versus online and error versus all. Figure 5 shows the accuracy gain achieved by HSR over 20 epochs of incremental learning, starting from an HTM pretrained with n=50 patterns. Accuracy at epoch 1 corresponds to the accuracy after pretraining, which is about 72%. A few epochs of HSR training are then sufficient to raise accuracy to 93% to 95%. The growth then slows and, after 20 epochs, the network accuracy is in the range 97% to 98.5% for the different configurations (see Table 2). It is worth remembering that the accuracy reported for each epoch is always measured using previously unseen data.

Figure 5:

HSR accuracy (basic configuration) over 20 epochs for different strategies, starting with an HTM pretrained with n=50 patterns. Each point is the average of 20 runs (95% mean confidence intervals are plotted). Each update set contains 6,200 patterns.

Training over all the patterns (with respect to training over misclassified patterns only) provides a small advantage (1% to 2%). Online updating seems to yield slightly better performance during the first few epochs, but then the accuracies of online and batch updating are almost equivalent.

Table 2 compares the computation time of the different strategies. Applying supervised refinement only to misclassified patterns significantly reduces computation time, while switching between batch and online strategies is not relevant for efficiency. Therefore, considering that the accuracy of the errors strategy is not far from that of the all strategy, the errors strategy is a good option when training on large data sets.

Table 2:
Average HSR Accuracy (at Epoch 20) and Computation Times for the Basic Configuration.
Strategy         Accuracy      HSR time: entire data set       HSR time: 1 pattern,
                 (epoch 20)    (6,200 patterns), 1 iteration   1 iteration

Batch, all       98.49%        19.27 sec                       3.11 ms
Batch, error     97.05%         8.37 sec                       1.35 ms
Online, all      98.11%        22.75 sec                       3.66 ms
Online, error    96.97%         8.27 sec                       1.34 ms

Note: Time values refer to our C# (.Net) implementation under Windows 7 on a Xeon CPU W3550 at 3.07 GHz.

4.2.2.  From Fixed to Variable Groups.

In this round of experiments, we start from the previous configuration with the goal of studying the impact of using variable groups. To make the incremental learning more difficult, we reduce the size of each set to one-tenth; hence, each update set is now composed of 620 patterns (instead of 6,200). Figure 6 compares the accuracy of fixed groups and variable groups for both the batch all and batch errors training strategies. It is worth noting that:

  • The accuracy of fixed groups is 7 to 10 percentage points worse than in the previous experiment, due to the reduction in the update set size.

  • The accuracy of variable groups is significantly better than that of fixed groups and reaches an accuracy comparable to the previous experiment, 97% to 99% (even though the update sets are 10 times smaller).

  • The gap between batch all and batch errors remains small for variable groups (1% to 2% after 10 epochs) but increases for fixed groups (about 5%) as a consequence of the reduction in the number of patterns.

Figure 6:

HSR accuracy over 20 epochs for different configurations, starting with an HTM pretrained with n=50 patterns. Each point is the average of 20 runs (95% mean confidence intervals are plotted). Each update set here contains 620 patterns.

These experiments show that the iterative adaptation of the fuzzy clustering found during pretraining, which is implicit in the variable groups configuration, is an effective way to improve accuracy. This configuration also allows us to limit the update to the error patterns without degrading accuracy much. However, a closer look at the experiment statistics reveals that the extra accuracy comes at the expense of training efficiency and network complexity (compare the first two rows of Table 3). In particular:

  • HSR computing time (average time for one iteration over 620 patterns) is 0.95 seconds for fixed groups and 19.03 seconds for variable groups. This is mainly due to the much higher number of PCGpq elements that must be updated. Note that for fixed groups, the iteration time is much lower than that reported in Table 2 because of the smaller size of the update sets.

  • For fixed groups, the average PCG density (percentage of nonzero elements) across intermediate nodes is 0.98% and 2.12% at levels 1 and 2, respectively. Since only nonzero PCGpq elements are updated, the average sparsity remains the same across epochs, independent of the learning strategy. For the variable groups configuration, the average PCG density grows to 90.14% and 71.66% at levels 1 and 2, respectively. The data above refer to the batch error strategy and are measured at epoch 20; in fact, the experiments show that the average density remains quite stable across epochs. In conclusion, variable groups lead to a dramatic increase in PCG density, thus increasing network complexity. Unfortunately, a complex network (large and dense PCG) is memory demanding and makes message passing (both feedforward and feedback) less efficient.

An effective way to achieve good accuracy without sacrificing efficiency or increasing complexity is to enable the top percent groups and sparse PCG options. Figure 7 shows the accuracy of this configuration for different values of the parameters tpg and spcg (see section 4.1). Rows 3 to 5 in Table 3 report the corresponding efficiency and complexity. In particular, the parameterization tpg = 50% and spcg = 97% preserves the accuracy of variable groups quite well and at the same time exhibits efficiency and complexity not far from those of fixed groups.

Figure 7:

HSR accuracy over 20 epochs when the top percent groups and sparse PCG options are enabled. Each point is the average of 20 runs.

Table 3:
Average HSR Accuracy, Computation Times, and PCG Density for Different Configurations.
Configuration   Update      Fixed vs.         Top percent    Sparse PCG    Accuracy     HSR time          PCG density, levels 1 and 2
                            variable groups   groups (tpg)   (spcg)        (epoch 20)   (620 patterns,    (epoch 20)
                                                                                        1 iteration)

1               PCW+PCG     Fixed groups      Not enabled    Not enabled   88.86%        0.95 sec         0.98%, 2.12%
2               PCW+PCG     Variable groups   Not enabled    Not enabled   97.92%       19.03 sec         90.14%, 71.66%
3               PCW+PCG     Variable groups   50%            Not enabled   98.00%        5.04 sec         22.71%, 27.34%
4               PCW+PCG     Variable groups   25%            Not enabled   97.28%        3.37 sec         12.65%, 17.99%
5               PCW+PCG     Variable groups   50%            97%           97.56%        1.83 sec         1.89%, 4.04%
6               PCW+PCG+C   Variable groups   50%            97%           97.70%        2.06 sec         2.05%, 4.35%

Notes: Time values refer to our C# (.Net) implementation under Windows 7 on a Xeon CPU W3550 at 3.07 GHz. The strategy Batch, Errors was used for all experiments.

4.2.3.  Enabling Coincidence Update.

In this round of experiments, we enable coincidence update at level 1, and we observe that:

  • The benefit in terms of accuracy is negligible; see the last row in Table 3 and Figure 8.

  • The extra complexity is small, since coincidences are shared across nodes and therefore the amount of extra data to update is limited.

In order to better understand which specific updates are most relevant for the accuracy improvement seen in Figure 8, we also report results achieved with configurations characterized by partial update. We note that coincidence update is relevant when groups are fixed but becomes negligible otherwise. Variable groups are crucial for HSR, and this underlines the importance of temporal groups in HTM. In particular, group update at level 2 seems to be the most relevant; this is probably due to the difficulty of finding (during unsupervised pretraining) an optimal temporal clustering solution at high levels, where the dimensionality of the pattern space increases.

Figure 8:

HSR accuracy over 20 epochs for configurations characterized by partial update.

4.3.  HTM Scalability.

One drawback of the current HTM framework is its limited scalability. In fact, the network complexity increases considerably with the number and dimensionality of the training data. All the experiments reported in Maltoni (2011) clearly show that the number of coincidences and groups rapidly increases with the number of patterns in the training sequences. Table 4 shows the accuracy and the total number of coincidences and groups in an HTM pretrained with an increasing number of patterns. As expected, the accuracy increases with the training set size, but after 250 patterns, the accuracy improvement slows down while the network memory (coincidences and groups) continues to grow markedly, leading to bulky networks.

Table 4:
HTM Statistics after Pretraining.
Number of pretraining   Accuracy after   Coincidences;
patterns                pretraining      groups

50                      71.37%            7,193;   675
100                     87.56%           13,175; 1,185
250                     94.61%           29,179; 2,460
500                     93.55%           53,127; 4,215
750                     96.97%           73,277; 5,569
1,000                   97.44%           92,366; 6,864

Note: The first three rows are consistent with Table 3 of Maltoni (2011).

Figure 9 shows the accuracy improvement obtained by HSR for HTMs pretrained over 50, 100, and 250 patterns. For this experiment, we use configuration 5 in Table 3, which is not the most accurate but represents a good trade-off between accuracy and efficiency. It is worth remembering that HSR does not alter the number of coincidences and groups in the pretrained network. Therefore, the complexity in terms of the number of coincidences and groups after any number of epochs is the same as that of the pretrained HTMs (refer to Table 4). Furthermore, since configurations with top percent groups and sparse PCG enabled allow us to keep the PCG density low, we can conclude that HSR does not significantly increase the overall HTM model complexity.

Figure 9:

HSR accuracy over 20 epochs when using an HTM pretrained with 50, 100, and 250 patterns. The HSR configuration is 5 of Table 3. Each point is the average of 20 runs.

In Figure 9 it is interesting to note that the accuracy of the HTMs pretrained with 100 and 250 patterns is initially significantly higher than that of the HTM pretrained with 50 patterns. However, after about 10 epochs, the gap narrows considerably, and after 20 epochs of supervised refinement even a simple network (pretrained on 50 patterns) performs as well as an HTM with more than 10 times its number of coincidences and groups (last row of Table 4).

4.4.  HSR for Discriminative Fine-Tuning.

In the previous experiments, we investigated the efficacy of HSR as an incremental learning algorithm for HTM. The experimental design was chosen to simulate a situation where the training data arrive in minibatches. However, HSR is also applicable in situations where all the training data are available from the start. It then takes the role of a class-supervised fine-tuning algorithm, which complements the initial pretraining, analogous to the gradient-based fine-tuning used to improve the performance of deep belief nets (Bengio, 2009). It is worth noting that in this scenario, HSR does not exploit the temporal structure of the data.

We designed three experiments: the first is based on the SDIGIT data set, and the second and third were carried out with the USPS (Hastie, Tibshirani, & Friedman, 2001) and MNIST (LeCun, Bottou, Bengio, & Haffner, 1998) data sets, respectively. While we do not believe that USPS and MNIST are ideal for HTM (because of the lack of temporal variation) or, in general, well suited for studying invariance (because most of the pattern variations are covered by the huge training sets), providing results on these data sets is important to verify that HTM + HSR is capable of handling large real-world data sets. It also allows HTM + HSR to be compared with other techniques.

In general, when HSR is used for discriminative tuning, we must avoid a pretraining training set accuracy close to 100%, since in such a case the error surface is nearly flat and there is no room for gradient-based fine-tuning. This happens if the difficulty of the pretraining training set is low (data overfitting) and if the same set is also used for HSR. To circumvent this problem, three different strategies (possibly combined) can be used:

  1. Use a more complex pretraining data set without increasing the network complexity. This can be achieved by providing training data with larger intraclass variability.

  2. Reduce the model complexity. For a network with a smaller number of free parameters, the risk of pretraining data overfitting is lower. This can be achieved in practice by controlling the number of free parameters (i.e., the number of coincidences and groups that are created during pretraining).

  3. Split the training set in two portions. Use the former for pretraining and the latter for HSR. When dealing with large training sets, this is also extremely useful to limit the network complexity.

As shown in the following experiments, these three strategies are quite effective and allow HTM to efficiently be trained on large data sets (increased scalability).

4.4.1.  SDIGIT.

To evaluate how much HSR is able to increase the performance of a pretrained HTM, we used the same 6,200-pattern test set as for Figure 5 but changed how we generated the training set. In fact, we noted that when the training set is generated with the zigzag exploration strategy (Maltoni, 2011), pretraining tends to achieve a training set accuracy close to 100%. To make the training set more challenging, we here employed a random walk in the space of possible translations, rotations, and scales instead of using only deterministic translations (as in the zigzag case). Even though the number of generated patterns in the training set is the same, this random walk exploration strategy tends to produce much more variability in the training set, leading to an initial training set accuracy of about 75%. It is worth noting that we are not degrading the initial accuracy on the test set to leave room for HSR learning; in fact, with the new random walk exploration, the initial accuracy on the test set is a few percentage points higher than in the zigzag case.

In Figure 10 the test set accuracy over 20 epochs is displayed for both the batch all and batch error strategies. The accuracy increase from fine-tuning in both of these cases is significant (remember that here we are not showing new patterns to the HTM across the epochs).

Figure 10:

HSR accuracy over 20 epochs when HSR is used for discriminative tuning (i.e., supervised learning on the same training set used for pretraining). For this experiment, we used HSR configuration 5 in Table 3.

4.4.2.  USPS.

USPS is a well-known handwritten digit classification problem (Hastie et al., 2001). USPS patterns are 16 × 16 pixel, 8-bit gray-scale images; the training and test sets contain 7,291 and 2,007 patterns, respectively. Although shape variability is relevant, most of the test set variations are covered by the large training set, and therefore even a simple approach such as the nearest-neighbor classifier achieves good classification results (94.42%). In Maltoni (2011), an HTM trained without supervision is compared with other approaches on USPS. In particular, it was shown that HTM accuracy (in the range 95% to 97% depending on the configuration; see rows 1 and 2 in Table 5) is close to that of other state-of-the-art techniques. However, due to the high number of training patterns, the network complexity is quite high, and consequently training and classification are slow.

Table 5:
Unsupervised HTM Configurations (#1–2) Reported in Maltoni (2011) and HTM+HSR Configuration (#3) Compared to Other Techniques (#5–9) on USPS (see Maltoni, 2011).
#    Approach                         Pretraining/HSR patterns   Size (MB)   Test accuracy (%)   Training time   Test time
                                                                                                 (hh:mm:ss)      (hh:mm:ss)

1    HTM Unsup. (Maltoni, 2011)       7,291 for pretraining      3.52        95.57               00:26:45        00:00:30
2    HTM OV Unsup. (Maltoni, 2011)    7,291 for pretraining      32.89       97.06               05:41:40        00:03:23
3    HTM Unsup. + HSR                 250 for pretraining,       0.205       95.17               00:06:14        00:00:02
                                      7,291 for HSR
5    NN                               -                          -           94.42               -               -
6    MLP                              -                          -           94.52               -               -
7    LeNet5                           -                          -           96.36               -               -
8    SVM                              -                          -           96.00               -               -
9    Virtual SVM                      -                          -           96.80               -               -
10   Human performance                -                          -           97.50               -               -

Notes: USPS training set: 7291 patterns; test set: 2,007 patterns. #10 refers to human performance.

Row 3 in Table 5 shows the performance of HSR fine-tuning. In this experiment, we employed only 250 patterns for unsupervised pretraining and the full training set for HSR. Over 15 epochs, the accuracy increased from 88.59% (the initial value after pretraining) to 95.17%. The final HTM+HSR accuracy is slightly worse than that of the (nonoverlapped) unsupervised HTM configurations reported in Maltoni (2011), but the model complexity, here measured as total parameter size in MB for direct comparison with Maltoni (2011), is more than one order of magnitude lower. We are confident that analogous results could be achieved for the overlapped architecture (currently not supported by our HSR implementation), thus reaching state-of-the-art accuracy on USPS.

4.4.3.  MNIST.

MNIST is another largely used handwritten digit data set (LeCun et al., 1998). The training and test sets contain 60,000 and 10,000 patterns, respectively.

Unlike SDIGIT and USPS, MNIST patterns are 28 × 28 pixel, 8-bit gray-scale images, so a different network architecture must be designed. After a few experiments, we decided to crop patterns to 24 × 24 pixels (without rescaling) and to use a 24 × 24 → 6 × 6 → 2 × 2 → 1 architecture, where the numbers represent the node grid at each level, including the input level.

Pretraining an HTM on the full MNIST data set is a difficult and time-consuming task. In our experiments, to limit the number of coincidences and groups, we trained levels 1 and 2 on a reduced training set and level 3 on the full training set. Rows 1 and 2 in Table 6 denote two architectures whose levels 1 and 2 were trained without supervision on 500 and 1,000 patterns, respectively. Even though configuration 2 reaches a slightly higher accuracy (close to 98%), its complexity (here measured as the number of free parameters) is much higher. In our experiments, using more than 1,000 patterns for unsupervised training of levels 1 and 2 does not further increase accuracy but causes a considerable increase in complexity.

Table 6:
Unsupervised HTM Configurations (#1–2) and HTM+HSR Configurations (#3–4) Compared Against Other Techniques (#5–11) on MNIST (see http://yann.lecun.com/exdb/mnist/).
#    Approach                                  Pretraining/HSR patterns    Free         Test           Training time   Test time
                                                                           parameters   accuracy (%)   (hh:mm:ss)      (hh:mm:ss)

1    HTM unsupervised (4 saccades)             500 for levels 1-2,         223 K        97.93          00:12:33        00:05:09
                                               60,000 for level 3
2    HTM unsupervised (4 saccades)             1,000 for levels 1-2,       378 K        97.97          00:34:20        00:08:02
                                               60,000 for level 3
3    HTM unsupervised + HSR (4 saccades)       250 for pretraining,        61 K         98.60          09:47:00        00:01:10
                                               60,000 for HSR
4    HTM unsupervised + HSR (4 saccades)       250 for pretraining,        26 K         98.12          03:39:09        00:00:46
                                               60,000 for HSR
5    Linear classifier (1-layer NN)            -                           -            88.00          -               -
6    KNN (L2)                                  -                           -            95.00          -               -
7    PCA + quadratic classifier                -                           -            96.70          -               -
8    SVM, gaussian kernel                      -                           -            98.60          -               -
9    3-layer NN                                -                           -            97.05          -               -
10   LeNet5                                    -                           -            99.05          -               -
11   Large conv. net + unsupervised            -                           -            99.47          -               -
     pretraining (Jarrett et al., 2009)

Notes: MNIST: training set: 60,000 patterns – test set: 10,000 patterns. HTM final inference (on the test set only) was performed with an HTM-specific “4 saccades” strategy, where inference is repeated moving the input pattern four times (a few pixels) and the posterior probabilities of the output node are iteratively updated (see HTM saccading in Maltoni, 2011). As to state-of-the-art techniques, we here list only methods operating on the original training data without preprocessing or training set extensions with artificially deformed patterns.

Rows 3 and 4 in Table 6 show the performance of HSR supervised tuning. Here, we employed 250 patterns for unsupervised pretraining for all levels, while the full training set was used for HSR. In configuration 3, 25 epochs raised the accuracy from 85.4% (initial values after pretraining) to 98.6%. Configuration 4 (whose architectural parameters were tuned to reduce coincidences and groups) reaches a slightly lower accuracy but with a much simpler architecture. In both cases, the final accuracy is better than unsupervised HTM configurations (rows 1 and 2) and network complexity is considerably lower: one order of magnitude for configuration 4.

When comparing HTM + HSR accuracy with other approaches (rows 5–11 in Table 6), we note that the proposed approach performs better than most of the baseline approaches (including KNN and the 3-layer NN), and the final accuracy is not far from the state of the art (row 11). It is worth noting that the architecture and parameters used for these experiments are not optimized for the problem at hand (we simply cloned the SDIGIT architecture and extended it to work with larger images). We believe there is room for improvement (see also the discussion on overlapped architectures in section 5).

5.  Discussion and Conclusion

In this letter, we propose a new algorithm (HSR) for incremental supervised training of hierarchical temporal memory. It can be used to optimize network behavior with sequentially arriving data or to fine-tune the network parameters on a single training set. It is computationally efficient and easy to implement due to its close connection to the native belief propagation message passing of HTM networks.

The term Q(ws), the error message sent from above to the output node (see equation 3.4), is the information that is propagated back through the network and lies at the heart of the algorithm. Its interpretation is not obvious. The first part, the difference between the ground-truth posterior and the network posterior, is easy to understand; the second part is less intuitive. It can be viewed as a correction term that emerges because, owing to the normalization into a probability, the gradient is nonzero even for the nonmatching components (see appendix A). An adjustment of PCWks proportional to the posterior difference alone is therefore suboptimal. A simplified example demonstrating how this correction term leads to better convergence is given in appendix E. We have also verified experimentally that the correction term is vital for HSR to produce good results.
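To make the structure of this error message concrete, the snippet below computes, in the simplified normalized-score setting of appendix E, the difference term together with the constant correction term. It is only a sketch of that simplified case, not of the full Q(ws) of equation 3.4, which also involves the network's internal messages.

```python
import numpy as np

def simplified_error_signal(p_true, p_net):
    """Difference between target and estimated posterior, plus the constant
    correction term that arises from normalization (appendix E setting)."""
    diff = p_true - p_net
    correction = np.dot(diff, p_net)    # same constant for every component
    return diff - correction, correction

p_true = np.array([1.0, 0.0, 0.0])      # supervisor posterior (one-hot)
p_net = np.array([0.2, 0.5, 0.3])       # posterior estimated by the network
signal, C = simplified_error_signal(p_true, p_net)
print(signal, C)                        # the naive signal would be just p_true - p_net
```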

The difference in suitable learning rates between the intermediate levels and the output level is also an important finding. It can probably be explained by the fact that the PCW matrix of the output node has a more direct influence on the network posterior than the PCG matrices of the intermediate nodes. Moreover, the output node memory is trained with supervision during pretraining, while the intermediate nodes are trained without supervision, which suggests that there is more room for supervised fine-tuning in the intermediate nodes. The large improvements seen in the experiments of section 4 are for the most part due to the refinement of the intermediate-node PCG matrices, which encode fuzzy clustering solutions. In particular, optimization of the PCG matrices at level 2 has proven to be the most important part of HSR. This is probably because at higher levels, where the dimensionality is very high, the fuzzy clustering resulting from the initial unsupervised pretraining is far from optimal.

In HTM, coincidences work as pictorial words for localized feature extraction. Fine-tuning these coincidences leads to only a marginal accuracy improvement, and a visual analysis of the coincidences before and after fine-tuning reveals minor changes in the corresponding pictorial words.3 This finding is in accordance with results obtained by others. In particular, Jarrett et al. (2009) have shown that multistage architectures can work well even if the overcomplete set of filters at the first level is randomly selected, while Coates, Lee, and Ng (2011) used a simple but very effective "unsupervised" k-means approach to compute centroids to be used as first-level feature extractors. In the context of HTM, we believe that the suboptimality of the coincidences can in large part be compensated for by a proper association to temporal groups along the hierarchy.
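As an aside, the k-means strategy of Coates, Lee, and Ng (2011) mentioned above can be sketched in a few lines. This is an illustration of that cited approach, not of HTM coincidence learning; the patch size, number of centroids, and normalization are arbitrary choices of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_feature_dictionary(images, patch_size=6, n_centroids=64,
                              n_patches=10000, seed=0):
    """Learn first-level feature extractors as k-means centroids of random
    image patches, in the spirit of Coates, Lee, & Ng (2011)."""
    rng = np.random.default_rng(seed)
    h, w = images.shape[1:]
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(h - patch_size + 1)
        x = rng.integers(w - patch_size + 1)
        patch = img[y:y + patch_size, x:x + patch_size].astype(np.float64).ravel()
        patches.append(patch - patch.mean())   # simple per-patch normalization
    km = KMeans(n_clusters=n_centroids, n_init=10, random_state=seed)
    km.fit(np.array(patches))
    return km.cluster_centers_                 # each row acts as a "pictorial word"
```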

Several methods can also be used to speed up HSR. One approach is to train the network only with patterns that were misclassified in the previous iteration, which reduces the number of updates compared to using all patterns at every iteration. Experiments suggest that this error-selection strategy gives a few percent lower accuracy than selecting all patterns, while the boosting strategy lies in between selecting all patterns and selecting only errors. In this letter, we also found experimentally that limiting the update to the most active groups and sparsifying the PCG matrices after each update lead to a relevant gain in efficiency and network simplicity without sacrificing accuracy.
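A minimal sketch of the error-selection strategy is given below; classify and hsr_update are hypothetical placeholders standing in for HTM inference and the HSR update step, which are not reproduced here.

```python
def hsr_epoch_on_errors(network, patterns, labels, classify, hsr_update):
    """One HSR epoch that updates the network only on currently
    misclassified patterns, reducing the number of updates per epoch."""
    n_updates = 0
    for x, y in zip(patterns, labels):
        if classify(network, x) != y:   # update only on errors
            hsr_update(network, x, y)
            n_updates += 1
    return n_updates
```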

In general, HSR has proven to work well for the SDIGIT, USPS, and MNIST problems, and the results give us reason to believe that this kind of supervised fine-tuning can be extended to more difficult object recognition problems such as NORB and CIFAR-10. In particular, for the discriminative tuning scenario, we show that HTM + HSR can reach (or exceed) the accuracy of unsupervised HTM with a network architecture that is one order of magnitude simpler, thus improving scalability.

Future work will focus on extending HSR to HTMs with an overlapping architecture. Results in Maltoni (2011) show that a consistent improvement can be achieved by enabling receptive field overlapping in HTM, even though this decreases efficiency. This efficiency drop should, however, be mitigated by the network simplifications made possible by HSR. Extending HSR to deal with multiple parents should not constitute a problem: instead of having a single message from one parent, the top-down messages from multiple parents can be summed to form the overall top-down input for a given child node. We are confident that HSR can effectively deal with this generalization and plan to provide a formal demonstration and supporting experiments in future work.

Appendix A:  Derivation of the Update Rule for the Output Node

For a summary of the notation in all appendixes, see Table 7.
formula
A.1
To evaluate this derivative, we need to consider two cases:
Table 7:
Summary of the Notation Used Throughout the Appendixes.
 The class posterior given by the supervisor. This can be assumed to be 1 if wc is the true class of the input pattern, 0 otherwise.
 The posterior for class wc, estimated by the network through inference.
L(e, e+): The loss function for one training example. This is also a function of all the network parameters: C and PCG for all intermediate nodes; C, PCW, and P(wc) for the output node.
PCWks: One element of the probability matrix PCW memorized by the output node, corresponding to the probability of coincidence k given class ws.
PCG(h)pq: One element of the probability matrix PCG memorized by intermediate node h, corresponding to the probability of coincidence p given group q.
P(wc): The prior for class wc memorized by the output node.
p(e): The absolute input pattern density; this corresponds to the normalizing factor in equation 2.3.
y(h)[k]: Activation of coincidence k in an intermediate node h. Superscripts are used to distinguish coincidence activations in different nodes. For the output node the superscript is dropped; y[k] refers to the activation of coincidence k in the output node.
 Element s in the feedforward message vector from an intermediate node h to its parent. Note that here the "in" and "out" notation is dropped, except for feedforward messages from the input level. For feedforward messages, the superscript denotes from which node the message is sent.
 Element s in the feedback message sent to an intermediate node h from its parent. Note that here the "in" and "out" notation is dropped. For feedback messages, the superscript denotes to which node the message is sent.
c(h)j[r]: Element r of the jth coincidence in node h. When referring to coincidences in the output node, the superscript is dropped.
1. When c=s, we get
formula
A.2
2. When c ≠ s, we obtain
formula
A.3
We can combine both cases, equations A.2 and A.3, in equation A.1 by introducing an indicator function I(i=s), which is 1 when i=s and 0 otherwise:
formula
This leads to the definition of the error message Q(ws):
formula
A.4
Finally, we end up at equation 3.3:
formula

Appendix B:  Derivation of the Update Rule for Intermediate Nodes at the Next-to-Last Level

To find how the loss varies with respect to the probabilities stored in an intermediate node h at the next-to-last level (i.e., h is a child of the output node), we need to compute the following derivative:
formula
B.1
where
formula
B.2
By inserting equation B.2 in equation B.1, we obtain
formula
By renaming the summation indexes in the second row of the above equation, we then obtain
formula
Given the definition of Q(w), equation A.4, this can be rewritten as
formula
B.3
The derivative of the coincidence activations is now derived separately. Using equation 2.1 (with the superscript notation used throughout this appendix), we obtain:
formula
B.4
where m is the number of children of the output node. Since the derivative of a product is
formula
equation B.4 can be written as
formula
B.5
The derivative of the bottom-up message from child r to the output node is (refer to equation 2.2)
formula
where n(r)c is the number of coincidences in child node r of the output node. To express the condition of equality with cj[h], we can use the indicator function (see equation 7). Then equation B.5 can be rewritten as
formula
B.6
Finally, by inserting equation B.6 in equation B.3, we get
formula
B.7
Furthermore, by noting that the expression within brackets is very similar to the top-down message from the output node to one of its children (see equation 2.8), we introduce a new quantity, the error message from the output node:
formula
Hence, we can rewrite equation B.7 as
formula
thus obtaining the update rule of equation 3.6.

Appendix C:  Generalization of the Update Rule to Intermediate Nodes at Lower Levels

Here we show that the update rule derived for the next-to-last level generalizes to lower levels. Let h be a node at the next-to-last level and d a node one level below, with d a child of h. We need to derive the following derivative:
formula
C.1
Following the same steps leading to equation B.3 in appendix  B, we here obtain
formula
C.2
where
formula
C.3
Now we expand, first,
formula
C.4
and then the remaining derivative, to obtain (activations are calculated in the same way, see equation B.6, at all intermediate levels):
formula
C.5
Inserting equation C.5 into equation C.4 and then into equation C.3 gives
formula
and since d is a child node of h (only),
formula
Reinserting this into equation C.2 gives
formula
C.6
We now derive the remaining factor:
formula
C.7
Instead of summing over all groups in h and checking the condition, we can set f = cj[h] (only one group in h is part of cj). This gives us
formula
C.8
By replacing the corresponding term with Q(wc) in equation C.8, we obtain
formula
C.9
Finally, by combining equations C.9 and C.6, we obtain the update rule we were looking for:
formula
C.10
The derivation shown here can be generalized to lower levels (in a network with nlevels > 4) by noting that equation C.7 can be written recursively as
formula
C.11

Appendix D:  Derivation of the Update Rule for Coincidences at the Lowest Intermediate Level

Here we derive the coincidence update rule for the lowest intermediate level in a four-level HTM. Let h be an intermediate node and d a node at the lowest intermediate level, with d being a child of h. We want to find the derivative of the loss with respect to c(d)p[q], the qth element of coincidence p in node d:
formula
D.1
Following the same steps leading to equation B.3 in appendix  B, we obtain
formula
D.2
We then expand all the derivatives:
formula
(where m is the number of children of the output node)
formula
(where m(h) is the number of children of node h)
formula
where the remaining factor is the input message/image reaching node d from its child nodes at level 0 (the input level). We then reinsert everything into equation D.2:
formula
D.3
Here we reverse the trick used in appendix C (where, instead of summing over all groups in h, we set f = cj[h], since only one group in h is part of cj) in order to recover a sum over the groups in h.
Given that the top-down error message to node h is
formula
and
formula
equation D.3 can be rewritten as
formula
(and by using the “sum over groups” trick again)
formula
D.4
By noting that
formula
equation D.4 can be rewritten as
formula

Appendix E:  Importance of the Error Message Correction Term

To give a better understanding of the error message, Q(ws), we here present a simplified situation where an error gradient with the same structure emerges.

Let P1(xj), the distribution that needs to be optimized, be proportional to a set of variables aj:
formula
Let P2(xj) be the true distribution we want P1(xj) to be as close to as possible. Given the same quadratic loss function as in equation 3.1, the loss gradient is
formula
Following the same steps as in appendix A, we then obtain
formula
This leads to a gradient of the same form as Q(ws):
formula
E.1
We call this quantity C, the correction term; interestingly, it is constant for all ai and emerges from the normalization of the aj into P1(xj).

To investigate the importance of C for the convergence of the gradient descent optimization, we simulated a case with two classes (N=2), where P2(x0)=1 and P2(x1)=0. The initial values are P1(x0)=0 and P1(x1)=1, giving maximum initial loss. The result is plotted in Figure 11.

Figure 11:

Gradient descent in a simplified case with and without the derived correction term, C, of equation E.1. The inclusion of C clearly improves convergence as it gives a larger step size. The learning rate was here set to 0.1.


Clearly, a gradient proportional to P2(xi) − P1(xi) alone leads to slower convergence than the full gradient of equation E.1.
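Since the equations and the plot of this appendix are only summarized here, the following minimal simulation sketches the described two-class experiment under stated assumptions: P1 is parameterized through positive scores aj, the scores are clamped to stay positive, and the step follows the gradient structure discussed above; the initial scores, the clamping, and the number of steps are our own choices.

```python
import numpy as np

def descend(with_correction, steps=50, lr=0.1):
    """Two-class case: P2 = (1, 0); initial scores give P1 close to (0, 1),
    i.e., maximum initial loss, as in the simulation described above."""
    a = np.array([1e-6, 1.0])                  # unnormalized scores defining P1
    p2 = np.array([1.0, 0.0])                  # target distribution
    losses = []
    for _ in range(steps):
        p1 = a / a.sum()
        diff = p2 - p1
        if with_correction:
            step = (diff - np.dot(diff, p1)) / a.sum()  # gradient with the constant C
        else:
            step = diff                                  # naive P2 - P1 update
        a = np.maximum(a + lr * step, 1e-12)             # keep scores positive
        losses.append(0.5 * np.sum((p2 - a / a.sum()) ** 2))
    return losses

# Loss over the first iterations, with and without the correction term C:
# the corrected gradient decreases the loss noticeably faster.
print(descend(True)[:5])
print(descend(False)[:5])
```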

References

Arel, I., Rose, D. C., & Karnowski, T. P. (2010). Deep machine learning: A new frontier in artificial intelligence research. IEEE Computational Intelligence Magazine, 5(4), 13–18. doi:10.1109/MCI.2010.938364

Bar, M., Kassam, K. S., Ghuman, A. S., Boshyan, J., Schmid, A. M., Dale, A. M., Halgren, E. (2006). Top-down facilitation of visual recognition. Proceedings of the National Academy of Sciences of the United States of America, 103(2), 449–454. doi:10.1073/pnas.0507062103

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396. doi:10.1162/089976603321780317

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. doi:10.1561/2200000006

Binder, J., Koller, D., Russell, S., & Kanazawa, K. (1997). Adaptive probabilistic networks with hidden variables. Machine Learning, 29, 213–244. doi:10.1023/A:1007421730016

Coates, A., Lee, H., & Ng, A. Y. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics. N.p.: AUAI Press.

Dura-Bernal, S., Wennekers, T., & Denham, S. L. (2012). Top-down feedback in an HMAX-like cortical model of object perception based on hierarchical Bayesian networks and belief propagation. PLoS One, 7(11), e48216. doi:10.1371/journal.pone.0048216

Franzius, M., Wilbert, N., & Wiskott, L. (2011). Invariant object recognition and pose estimation with slow feature analysis. Neural Computation, 23(9), 2289–2323. doi:10.1162/NECO_a_00171

George, D. (2008). How the brain might work: A hierarchical and temporal model for learning and recognition. Stanford, CA: Stanford University Press.

George, D., & Hawkins, J. (2009). Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology, 5(10), e1000532. doi:10.1371/journal.pcbi.1000532

Gilbert, C. D., & Sigman, M. (2007). Brain states: Top-down influences in sensory processing. Neuron, 54(5), 677–696. doi:10.1016/j.neuron.2007.05.019

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer.

Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., & LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision (pp. 2146–2153). Piscataway, NJ: IEEE. doi:10.1109/ICCV.2009.5459469

Jensen, F. (1999). Gradient descent training of Bayesian networks. Symbolic and Quantitative Approaches to Reasoning and Uncertainty, 1638, 190–200. doi:10.1007/3-540-48747-6_18

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master's thesis, University of Toronto. http://www.cs.toronto.edu/~kriz

Laskey, K. B. (1990). Adapting connectionist learning to Bayes networks. International Journal of Approximate Reasoning, 4(4), 261–282. doi:10.1016/0888-613X(90)90002-J

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=726791

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A, 20(7), 1434–1448.

Litvak, S., & Ullman, S. (2009). Cortical circuitry implementing graphical models. Neural Computation, 21(11), 3010–3056. doi:10.1162/neco.2009.05-08-783

Maltoni, D. (2011). Pattern recognition by hierarchical temporal memory (DEIS Technical Report). Bologna: University of Bologna. http://cogprints.org/9187/

Maltoni, D., & Rehn, E. M. (2012). Incremental learning by message passing in hierarchical temporal memory. In N. Mana, F. Schwenker, & E. Trentin (Eds.), Proc. 5th Workshop on Artificial Neural Networks in Pattern Recognition (Vol. 7477, pp. 24–35). New York: Springer. doi:10.1007/978-3-642-33212-8_3

Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1), 71–113. doi:10.1016/0004-3702(92)90065-6

Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Francisco: Morgan Kaufmann.

Ranzato, M., Huang, F. J., Boureau, Y.-L., & LeCun, Y. (2007). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). Piscataway, NJ: IEEE. doi:10.1109/CVPR.2007.383157

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8), 888–905. doi:10.1109/34.868688

Sprekeler, H. (2011). On the relation of slow feature analysis and Laplacian eigenmaps. Neural Computation, 23(12), 3287–3302. doi:10.1162/NECO_a_00214

Summerfield, C., & Egner, T. (2009). Expectation (and attention) in visual cognition. Trends in Cognitive Sciences, 13(9), 403–409. doi:10.1016/j.tics.2009.06.003

Vapnik, V. (1989). Statistical learning theory. New York: Wiley-Interscience.

Wiskott, L., & Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770. doi:10.1162/089976602317318938

Notes

1

Here, we deviate from the original HTM notation of George (2008), where − and + are used to characterize messages coming from or leaving toward below (−) or above (+). In fact, the original notation can sometimes be ambiguous because − and + are used to denote both evidential or contextual information and whether a message is entering or leaving a node; for example, a − superscript on an outgoing message indicates that the message is leaving the node from below, but this can be misinterpreted as if the message were carrying evidential instead of contextual information.

2

We first rotate the pattern around the image center and then rescale it in the two directions.

3

In our experiments, we also observed that the network is less sensitive to coincidence changes than to changes of the intermediate-node PCG values, so larger gradient descent steps in these directions are tolerated. This could justify the use of a larger learning rate.