## Abstract

Hierarchical temporal memory (HTM) is a biologically inspired framework that can be used to learn invariant representations of patterns in a wide range of applications. Classical HTM learning is mainly unsupervised, and once training is completed, the network structure is frozen, thus making further training (i.e., incremental learning) quite critical. In this letter, we develop a novel technique for HTM (incremental) supervised learning based on gradient descent error minimization. We prove that error backpropagation can be naturally and elegantly implemented through native HTM message passing based on belief propagation. Our experimental results demonstrate that a two-stage training approach composed of unsupervised pretraining and supervised refinement is very effective (both accurate and efficient). This is in line with recent findings on other deep architectures.

## 1. Introduction

Hierarchical temporal memory (HTM) is a biologically inspired pattern recognition framework fairly unknown to the research community (George, 2008; George & Hawkins, 2009). It can conveniently be framed into multistage Hubel-Wiesel architectures (Ranzato, Huang, Boureau, & LeCun, 2007), a specific subfamily of deep architectures (Bengio, 2009; Jarrett, Kavukcuoglu, Ranzato, & LeCun, 2009; Arel, Rose, & Karnowski, 2010). HTM tries to mimic the feedforward and feedback projections thought to be crucial for cortical computation (Gilbert & Sigman, 2007; Summerfield & Egner, 2009). Bayesian belief propagation is used in a probabilistic hierarchical network to learn invariant spatiotemporal features of the input data, and theories exist for how such a graphical model could be mapped onto the cortical-thalamic anatomy (Lee & Mumford, 2003; George & Hawkins, 2009; Litvak & Ullman, 2009).

A comprehensive description of the HTM architecture and learning is provided in Maltoni (2011), where HTM is also shown to perform well on some pattern recognition tasks, even though further studies and validations are necessary. In Maltoni (2011), HTM is also compared to the multilayer perceptron (MLP) and the convolutional network (CN) on several pattern recognition problems.

The feedforward processing of HTM is closely related to hierarchical slow feature analysis (SFA) (Wiskott & Sejnowski, 2002; Franzius, Wilbert, & Wiskott, 2011). Similar to SFA, HTM employs a slowness objective during training with the goal of extracting invariant features from sequential training data. However, in contrast to SFA, where slowness is quantified by the temporal derivative of the output signals, HTM takes a Markovian graph partitioning perspective on slowness learning to learn how lower-level features are to be clustered to account for common transformations of the input. Although algorithmically quite different, the connection between the two approaches becomes evident with the insight that SFA can be viewed as a functional approximation of a Laplacian eigenmap (Belkin & Niyogi, 2003), where the underlying graph is defined by the temporal relations of the input samples (Sprekeler, 2011). In HTM the temporal graph is clustered directly using an Ncut (or similar) objective (Shi & Malik, 2000), while in SFA, the graph is used to find a “slow” subspace.

One limitation of classical HTM training and many other related algorithms, including hierarchical SFA, is that once a network is trained, it is hard to learn from new patterns without retraining it from scratch. In other words, a classical HTM is well suited for batch training based on a fixed training set and cannot effectively be trained incrementally with new patterns that were initially unavailable. In fact, every HTM level has to be trained individually and in sequence, starting from the bottom: altering the internal node structure at one network level (e.g., number of coincidences, groups) would invalidate the results of the training at higher levels. In principle, incremental learning could be carried out in a classical HTM by updating only the output level, but this is a naive strategy that in practice works only if the new incoming patterns are very similar to the existing ones in terms of building blocks.

Another limitation of classical HTM is low scalability. In fact, the number of coincidences and groups resulting from unsupervised training grows near linearly with the number of training patterns and the length of training sequences. In several practical applications, this can lead to complex (and slow) architectures.

Since incremental training and scalability are highly desirable properties of a learning system, we were motivated to investigate how the HTM framework could be extended in these directions.

In this letter, we present a two-stage training approach, unsupervised temporal pretraining plus supervised refinement, that can be used for incremental learning. A new HTM is initially pretrained (batch); then its internal structure is incrementally updated as new labeled samples become available. This kind of unsupervised pretraining followed by supervised refinement has been demonstrated to be successful for other deep architectures (Bengio, 2009).

The basic idea of our approach is to perform the batch pretraining using the algorithms described in Maltoni (2011) and to fix the number of coincidences and groups throughout the whole network. During supervised refinement, we adapt the elements of the probability matrices in the output node and the intermediate nodes as if they were the weights of an MLP trained with backpropagation. With the same technique, we can also adapt the coincidences in nodes of the lowest intermediate level. Since these coincidence vectors can be conceived as pictorial words, the adaptation can be viewed as a sort of dictionary refinement.

To implement the refinement, we first derived the update rules based on the descent of the error function. Since the HTM architecture is more complex than that of MLP, the resulting equations are not simple. Further complications arise from the fact that most of the manipulated scalars are probabilities and need to be normalized after each update step. Fortunately, we found a surprisingly simple (and computationally light) way to implement the whole process through native HTM belief propagation message passing.

The application of gradient descent optimization of probabilistic networks (e.g., Bayesian networks) has been investigated by others (Laskey, 1990; Neal, 1992; Binder, Koller, Russell, & Kanazawa, 1997; Jensen, 1999). Our work extends these results to an HTM pattern classifier whose structure and message passing, as pointed out in section 2.2, deviate from a classical Bayesian network.

Our experiments show promising results. Furthermore, the proposed two-stage approach not only enables incremental learning, but is also helpful for keeping network complexity under control, thus improving framework scalability.

A preliminary version of this work was presented at the ANNPR12 workshop (Maltoni & Rehn, 2012). Here we extend the incremental learning approach in several directions:

- Temporal groups are allowed to vary with respect to the initial temporal clustering solution obtained by pretraining.

- Coincidences at the lowest intermediate level can now be included in the update process.

- A number of simplifications are introduced to boost efficiency without affecting accuracy. As shown in section 4, these improvements make the overall approach much more effective, especially when the number of training samples is small.

- Discriminative fine-tuning is investigated in more depth, and the proposed approach is tested on real-world data sets such as USPS and MNIST.

## 2. Background

An HTM has a hierarchical tree structure. The tree is built up of *n _{levels}* levels (or layers), each composed of one or more nodes. A node in one level is bidirectionally connected to one or more nodes in the level below, and the number of nodes in each level decreases as we ascend the hierarchy. The lowest level is the input level, and the highest level, with typically only one node, is the output level. Levels and nodes between input and output are called intermediate levels and nodes. When an HTM is used for visual inference, as is the case in this study, the input level typically has a retinotopic mapping of the input: each input node is connected to one pixel of the input image, and spatially close pixels are connected to spatially close nodes. Figure 1 shows a graphical representation of a simple HTM and its levels and nodes.

It is possible, and in many cases desirable, for an HTM to have an architecture where every intermediate node has multiple parents. This creates a network where nodes have overlapping receptive fields. Throughout this letter, a nonoverlapping architecture is used instead, where nodes have only one parent, to reduce computational complexity.
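As a toy illustration of the nonoverlapping layout used here, the following sketch (our own, with illustrative fan-in and level counts not taken from the paper) computes the grid of nodes at each level when every parent pools a fixed square block of children:

```python
# Sketch of a nonoverlapping HTM topology (fan-in and level counts are
# illustrative): each node has exactly one parent, and the number of nodes
# shrinks as we ascend the hierarchy.

def build_topology(input_side=16, fanin_side=4, n_levels=3):
    """Return, per level, the side of the square node array.

    Level 0 is the input level (one node per pixel); each higher-level
    node pools a fanin_side x fanin_side block of children.
    """
    sides = [input_side]
    for _ in range(1, n_levels):
        assert sides[-1] % fanin_side == 0, "children must tile exactly"
        sides.append(sides[-1] // fanin_side)
    return sides

sides = build_topology()  # e.g. [16, 4, 1]: 256 input nodes, 16 nodes, 1 node
```

With an overlapping architecture, by contrast, the child blocks of adjacent parents would share nodes, which is why that variant is more expensive to compute.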

### 2.1. Information Flow.

In an HTM, the flow of information is bidirectional. Belief propagation is used to pass messages both up (feedforward) and down (feedback) the hierarchy as new evidence is presented to the network. The notation used here for belief propagation (see Figure 2) follows Pearl (1988) and was adapted to HTMs by George (2008):

- Evidence coming from below is denoted *e*^{−}. For visual inference, this is an image or video frame presented to the input level of the network.

- Evidence from the top is denoted *e*^{+} and can be viewed as contextual information. This can, for instance, be input from other sensory modalities, high-level attention or emotion signals (Bar et al., 2006; Gilbert & Sigman, 2007; Summerfield & Egner, 2009), or the absolute knowledge given by the supervisor training the network.

- Feedforward messages passed up the hierarchy are denoted by λ, and feedback messages flowing down are denoted by π.

- Messages entering a node are denoted by λ_{in} and π_{in}, and messages leaving a node are denoted by λ_{out} and π_{out}; hence, λ_{in} and λ_{out} are feedforward messages, while π_{in} and π_{out} are feedback messages. For messages entering and leaving a node from below, a subscript index can be used to identify the child node from (to) which these messages are received (sent).

When the purpose of an HTM is that of a classifier, the feedforward message of the output node is the posterior probability that the input, *e*^{−}, belongs to one of the problem classes. We denote this posterior by *P*(*w _{i}*|*e*^{−}), where *w _{i}* is one of the *n _{w}* classes.

### 2.2. Internal Node Structure and Pretraining.

HTM training is performed level by level, starting from the first intermediate level. The input level does not need any training; it just forwards the input. Intermediate-level training is temporally supervised, guided by the temporal dynamics of the input (as in SFA), while the output-level training is supervised in the traditional sense using class labels. (For a detailed description, including pseudocode, refer to Maltoni, 2011.)

For every intermediate node (see Figure 3), a set, **C**, of so-called coincidence patterns (or just coincidences) and a set, **G**, of coincidence groups have to be learned. A coincidence, **c**_{i}, is a vector representing a prototypical activation pattern of the node's children. For nodes in the first intermediate level, with input nodes as children, this corresponds to an image patch of the same size as the node's receptive field. For nodes higher up in the hierarchy, with intermediate nodes as children, each element of a coincidence, **c**_{i}[*h*], is the index of a coincidence group in child *h*. Coincidence groups, also called temporal groups, are clusters of coincidences likely to originate from simple variations of the same input pattern. Coincidences found in the same group can be spatially dissimilar but are likely to be found close in time when a pattern is smoothly moved through the node's receptive field. By clustering coincidences in this way, exploiting the temporal smoothness of the input, invariant representations of the input space can be learned (George, 2008). The assignment of coincidences to groups within each node is encoded in a probability matrix **PCG**, where each element represents the probability of a coincidence **c**_{j} given a group **g**_{i} (or, analogously, the likelihood of a group given a coincidence). These probabilities are the elements we will manipulate to incrementally train a network whose coincidences and groups have previously been learned and fixed.

George (2008) proposed the original algorithm used for temporal clustering in HTM, which indirectly maximizes an objective function related to temporal coincidence similarity. Maltoni (2011) instead proposed the MaxStab algorithm, which defines a within-group stability function similar to the Ncut objective (Shi & Malik, 2000) and tries to directly maximize it through a greedy heuristic. Moreover, in George (2008) the clustering is disjoint (i.e., each coincidence belongs to a single cluster), while in Maltoni (2011) clusters are fuzzy, that is, a coincidence belongs to a cluster with a given probability (encoded in **PCG)**. Experiments reported in Maltoni (2011) show that MaxStab and Fuzzy clustering lead to better performance compared to the approach George (2008) proposed.

The structure and training of the output node differ from those of the intermediate nodes. In particular, the output node has no groups, only coincidences. Instead of memorizing groups and group likelihoods, it stores a probability matrix **PCW**, whose elements represent the probability of coincidence **c**_{j} given the class *w _{i}* (or, analogously, the likelihood of a class given a coincidence). This matrix is learned in a supervised fashion by counting how many times every coincidence is the most active one ("the winner") in the context of each class. The output node also keeps a vector of class priors, *P*(*w _{i}*), used to calculate the final class posterior.

HTM has a formulation that very closely resembles that of a Bayesian network. However, since the intermediate nodes are composed of both coincidences and groups, the message passing we will describe is a slightly modified version of the original belief propagation algorithm used for standard Bayesian networks (Pearl, 1988). An attempt to formulate a multistage Hubel-Wiesel architecture following a stricter definition of a Bayesian network, similar to HTM, has been made by Dura-Bernal, Wennekers, and Denham (2012).
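To make the stored quantities concrete, here is a minimal sketch (field names, sizes, and the random initialization are ours, purely for illustration) of the probability matrices an intermediate node and the output node maintain:

```python
import numpy as np

# Illustrative containers for the quantities of section 2.2: an intermediate
# node stores a fuzzy assignment matrix PCG, with row i holding P(c_j | g_i)
# for every coincidence j; the output node stores PCW, with row i holding
# P(c_j | w_i) for every class i, plus a vector of class priors P(w_i).
# Sizes and random values below are arbitrary, not taken from the paper.

rng = np.random.default_rng(0)
n_c, n_g, n_w = 8, 3, 2            # coincidences, groups, classes

PCG = rng.random((n_g, n_c))
PCG /= PCG.sum(axis=1, keepdims=True)   # each row: distribution over coincidences

PCW = rng.random((n_w, n_c))
PCW /= PCW.sum(axis=1, keepdims=True)   # each row: distribution over coincidences

priors = np.full(n_w, 1.0 / n_w)        # uniform class priors P(w_i)
```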

### 2.3. Feedforward Message Passing.

Given the feedforward messages coming from a node's *m* children, a degree of certainty over each of the *n _{c}* coincidences in the node is computed. This quantity is represented by a vector **y** and can be seen as the activation of the node coincidences. The interpretation of the degree of certainty **y**[*i*] over coincidence **c**_{i} (equation 2.1) depends on the level:

- If the node belongs to level 1 (hence, its child nodes are input nodes), the input message is essentially an image patch and coincidences are prototype image patches. In this case, **y**[*i*] encodes the spatial similarity between two image patches and can be conveniently computed as a gaussian distance, **y**[*i*] = exp(−‖*e*^{−} − **c**_{i}‖²/σ²), where σ is a parameter controlling how quickly the activation level decays when the input deviates from **c**_{i}.

- If the node belongs to a level >1 (its child nodes are intermediate nodes), each input message is a probability vector, and coincidences can be conceived as feature selectors: each element **c**_{i}[*j*] is the index of a single group among the groups of the *j*th child node. In this case, **y**[*i*] is proportional to the probability of the co-occurrence of the subevidence (each piece of subevidence coming from a child) in the context of **c**_{i}. Assuming the subevidence to be independent, the probability is obtained by the product rule, **y**[*i*] = ∏_{j} λ_{j}[**c**_{i}[*j*]], λ_{j} being the subevidence coming from the *j*th child node.

The feedforward message the node sends to its parent is a vector of length *n _{g}*, whose components express the evidence for each group: the *i*th component pools the coincidence activations according to their group memberships, λ_{out}[*i*] ∝ ∑_{j} *PCG*_{ij} · **y**[*j*], where **G** is the set of all coincidence groups in the node, *n _{g}* is the cardinality of **G**, and *n _{c}* is the number of coincidences stored in the node.

### 2.4. Feedback Message Passing.

When a feedback message enters an intermediate node from its parent, a contextual activation of the node's groups, **z**, is computed by dividing the incoming message, element by element, by the node's own feedforward output message; this division discounts the contribution of the node's feedforward information from the incoming contextual information (Pearl, 1988).

The feedback message sent to the *h*th child node is then obtained by accumulating, for each group of that child, the contributions of all coincidences **c**_{i} whose element **c**_{i}[*h*] selects that group; this selection is expressed through an indicator function. The top-down message sent from the output node is computed in a similar way, with **PCW** playing the role of **PCG**. Equations 2.6 and 2.8 will be important when we, in the next section, show how to incrementally update the **PCG** and **PCW** matrices to produce better estimates of the class posterior given some evidence from above.

## 3. HTM Supervised Refinement

Here we introduce a novel way to optimize an already trained HTM. The algorithm, HSR (HTM supervised refinement), shares many features with traditional backpropagation used to train multilayer perceptrons and is inspired by weight fine-tuning methods applied to other deep belief architectures (Bengio, 2009). It exploits the belief propagation equations we have already presented to propagate an error message from the output node back through the network. This enables each node to locally update its internal probability matrix in a way that minimizes the difference between the estimated class posterior of the network and the posterior given from above, by a supervisor.

The goal of HSR is to reduce the discrepancy between the class posterior that the network infers from the evidence from below, *e*^{−}, and the posterior given the evidence from above, *e*^{+}. To this purpose, we employ empirical risk minimization (Vapnik, 1989), resulting in a loss function that measures, over the *n _{w}* classes, the discrepancy between *P*(*w _{i}*|*e*^{+}), the class posterior given the evidence from above, and *P*(*w _{i}*|*e*^{−}), the posterior produced by the network using the input as evidence (i.e., inference). The loss function is also a function of all network parameters involved in the inference process. In most cases, *e*^{+} is a supervisor with absolute knowledge about the true class *w _{t}*; thus, *P*(*w _{i}*|*e*^{+}) equals 1 if *w _{i}* = *w _{t}* and 0 otherwise.

To minimize the empirical risk, we first find the direction in which to alter the node probability matrices to decrease the loss and then apply gradient descent.

### 3.1. Output Node Update.

The gradient of the loss with respect to the elements of **PCW** can be expressed compactly in terms of a quantity *Q*(*w _{s}*), which we call the error message for class *w _{s}* given some top-down and bottom-up evidence.

Finally, note that since **PCW** is a probability matrix, it has to be renormalized after each update: one has to ensure that, for every class, the probabilities of the coincidences sum to one.

### 3.2. Intermediate Nodes Update.

Note that **PCG** has to be renormalized after each update as well, such that, for every group, the probabilities of the coincidences sum to one.

### 3.3. Coincidence Update in the Lowest Intermediate Level.

At the lowest intermediate level, each coincidence **c**_{p} is a vector of *n _{d}* elements, where *n _{d}* is the coincidence dimensionality (the image patch size). The loss gradient with respect to element *q* of coincidence **c**_{p} can be shown to take a simple form (see appendix D) involving **Bel**_{Q}[*p*], the "error belief" in coincidence **c**_{p}.

### 3.4. HSR Pseudocode.

By updating the probability matrices for every training example instead of at the end of the presentation of a group of patterns, an online version of the algorithm is obtained. Both batch and online versions of HSR are investigated in section 4.

In many cases, it is preferable for the nodes in lower intermediate levels to share memory—so-called node sharing (Maltoni, 2011). This speeds up training and forces all the nodes of the level to respond in the same way when the same stimulus is presented at different places in the receptive field. In a level operating in node-sharing mode, the **PCG** update, equation 3.5, must be performed only for the master node; that is, the update is averaged over all node positions.

## 4. Experiments

To verify the efficacy of the HSR algorithm, we designed a number of experiments. These are performed using the SDIGIT data set, a machine-generated digit recognition problem (Maltoni, 2011). SDIGIT patterns (16 × 16 pixel, gray-scale images) are generated by geometric transformations of class prototypes called primary patterns. The possibility of randomly generating new patterns makes this data set suitable for evaluating incremental learning algorithms. By varying the amount of scaling and rotation applied to the primary patterns, we can also control the problem difficulty. Such flexibility cannot be obtained with other well-known data sets such as NORB (Jarrett et al., 2009) or CIFAR-10 (Krizhevsky, 2009).

A training set of *n* patterns includes, for each of the 10 digits, the primary pattern plus (*n*/10)−1 further patterns generated by scaling and rotating the primary pattern according to random triplets (*s _{x}*, *s _{y}*, *r*) drawn from controlled ranges. Note that during HTM pretraining, each of the *n* patterns is translated to all possible locations according to a zigzag exploration strategy, thus providing the necessary temporal continuity (see Maltoni, 2011, for details).

The creation of a test set starts by translating each of the 10 primary patterns to all positions that allow the pattern to be fully contained (with a 2-pixel background offset) in the 16 × 16 window, thus obtaining *m* patterns; then, for each of the *m* patterns, (*n*/10)−1 further patterns are generated by transforming the pattern according to random triplets (*s _{x}*, *s _{y}*, *r*). The total number of patterns in the test set is then *m* × *n*/10. Examples of generated SDIGIT test patterns are shown in Figure 4.
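The triplet-based generation of pattern variations can be sketched as follows; the ranges are illustrative (loosely modeled on the parameter tuples appearing in Table 1), not the exact ones used to build the data sets:

```python
import numpy as np

# Sketch of SDIGIT-style variation parameters: each generated pattern is
# described by a random triplet (s_x, s_y, r) of horizontal scale, vertical
# scale, and rotation, drawn from controllable ranges. The default ranges
# below are illustrative, not the paper's exact settings.

def random_triplets(n, sx=(0.7, 1.0), sy=(0.7, 1.0), rot=(-40.0, 40.0), seed=0):
    rng = np.random.default_rng(seed)
    return np.column_stack([
        rng.uniform(*sx, n),    # horizontal scaling s_x
        rng.uniform(*sy, n),    # vertical scaling s_y
        rng.uniform(*rot, n),   # rotation r, in degrees
    ])
```

Narrowing the ranges makes the problem easier; widening them makes it harder, which is how the data set's difficulty is controlled.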

Table 1 (reprinted from Maltoni, 2011) summarizes HTM performance on the SDIGIT problem and compares it against other well-known classification approaches. HTM accuracy is 71.37%, 87.56%, and 94.61% with 50, 100, and 250 training patterns, respectively. Our goal is to understand if and how accuracy can be improved by incrementally training the network through the HSR approach.

| Training set | Approach | Training accuracy (%) | Test accuracy (%) | Training time | Test time | Size (MB) |
|---|---|---|---|---|---|---|
| (50,0.70,1.0,0.7,1.0,40), 1788 translated patterns | NN | 100 | 57.92 | <1 sec | 00:00:04 | 3.50 |
| | MLP | 100 | 61.15 | 00:12:42 | 00:00:03 | 1.90 |
| | LeNet5 | 100 | 67.28 | 00:07:13 | 00:00:11 | 0.39 |
| | HTM | 100 | **71.37** | 00:00:08 | 00:00:13 | 0.58 |
| (100,0.70,1.0,0.7,1.0,40), 3423 translated patterns | NN | 100 | 73.63 | <1 sec | 00:00:07 | 6.84 |
| | MLP | 100 | 75.37 | 00:34:22 | 00:00:03 | 1.90 |
| | LeNet5 | 100 | 79.31 | 00:10:05 | 00:00:11 | 0.39 |
| | HTM | 100 | **87.56** | 00:00:25 | 00:00:23 | 1.00 |
| (250,0.70,1.0,0.7,1.0,40), 8705 translated patterns | NN | 100 | 86.50 | <1 sec | 00:00:20 | 17.0 |
| | MLP | 99.93 | 86.08 | 00:37:32 | 00:00:03 | 1.90 |
| | LeNet5 | 100 | 89.17 | 00:14:37 | 00:00:11 | 0.39 |
| | HTM | 100 | **94.61** | 00:02:04 | 00:00:55 | 2.06 |


Notes: SDIGIT test set: 6,200 patterns, 10 classes. The table is reprinted from Maltoni (2011). Three experiments are performed with an increasing number of training patterns: 50, 100, and 250; the test set is common across the experiments. NN, MLP, and LeNet5 refer to nearest neighbor, multilayer perceptron, and convolutional network, respectively. HTM refers to a four-level hierarchical temporal memory (whose architecture is shown in Figure 1) trained in MaxStab configuration with Fuzzy grouping enabled (see Maltoni, 2011, for details). The best test accuracy for each training set is in bold.

To this purpose, we employ an experimental procedure where we first pretrain a new network using a data set of *n* patterns and then, for a number of epochs, *n _{e}*, generate new data sets and apply HSR. At each epoch, one can apply HSR for several iterations to favor convergence. However, we experimentally found that a good trade-off between convergence time and overfitting can be achieved by performing just two HSR iterations for each epoch (*n _{iter}* = 2). The classification accuracy is calculated using the patterns generated for each epoch, but before the network is updated using those patterns. In this way, we emulate a situation where the network is trained on sequentially arriving patterns.

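The evaluation protocol just described, where accuracy is measured on each new batch before that batch is used for refinement, can be sketched as follows; the classifier interface is hypothetical:

```python
# Sketch of the test-then-update protocol (also known as prequential
# evaluation): every incoming batch is first used to measure accuracy and
# only afterwards to refine the network, for a fixed number of iterations
# per epoch. The predict/update interface is our own placeholder.

def prequential(clf, batches, n_iter=2):
    accuracies = []
    for X, labels in batches:
        preds = [clf.predict(x) for x in X]      # evaluate BEFORE updating
        acc = sum(p == t for p, t in zip(preds, labels)) / len(labels)
        accuracies.append(acc)
        for _ in range(n_iter):                  # then refine (HSR-style)
            clf.update(X, labels)
    return accuracies
```

This guarantees that every reported accuracy value is computed on previously unseen data, emulating sequentially arriving patterns.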
### 4.1. HSR Configurations and Training Strategies.

The HSR algorithm described in section 3.4 can be (computationally) simplified by excluding some of the variables from the update procedure or by annihilating negligible values after each update. The following configurations will be considered throughout the experiments:

- *Partial update*. The full HSR approach, as described in section 3.4, updates the **PCW** matrix at the output level, the **PCG** matrices at intermediate levels, and the coincidences, *C*, at the lowest intermediate level. In the following, we will also test configurations where some of these data are excluded from the HSR update.

- *Fixed* versus *variable groups*. Equation 3.5 requires that each *PCG*_{pq} element is updated, potentially leading to a large number of updates. Since **PCG** encodes the clustering solution (i.e., coincidence group memberships), changing a zero element to a nonzero value will alter the initial (pretraining) cluster assignment. One possible simplification consists of keeping the fuzzy coincidence group clustering fixed. In the following, we refer to the case where only the *PCG*_{pq} elements that were nonzero after pretraining are updated by HSR as *fixed groups*, and to the case where all *PCG*_{pq} are updated as *variable groups*.

- *Top percent groups*. When this option is enabled, the update is limited to *PCG*_{pq} elements for which **y**[*p*] is large. In fact, when equation 3.7 is considered, the amount of update is proportional to **y**[*p*], and we can assume that the gradient is negligible when **y**[*p*] is small. To implement this strategy, we compute and sort the **y**[*p*] values at each node update. Starting from the top value, we then keep only **y**[*p*] elements until their sum exceeds a certain percentage, *tpg*, of the whole **y**[*p*] sum (*tpg* being a parameter).

- *Sparse PCG*. When this option is enabled, the **PCG** matrices are sparsified after each update. This is performed by retaining, for each coincidence *p*, only the groups to which *p* has the strongest association. To this purpose, we sort *PCG*_{pq} for each *p*, and then, starting from the top (i.e., the most probable), we keep adding groups until the cumulative probability exceeds *spcg* (*spcg* being a parameter). Note that the sparsification must be performed before the renormalization of the **PCG** matrices (see the pseudocode in section 3.4).
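The sparsification step of the *sparse PCG* option, for instance, can be sketched as follows (operating, for illustration, on the group-association vector of a single coincidence):

```python
import numpy as np

# Sketch of the "sparse PCG" option: for one coincidence, keep only its
# strongest group associations until their cumulative probability reaches the
# spcg threshold, zero out the rest, and renormalize what remains.

def sparsify(assoc, spcg=0.9):
    order = np.argsort(assoc)[::-1]          # strongest association first
    kept = np.zeros_like(assoc)
    cum = 0.0
    for idx in order:
        kept[idx] = assoc[idx]
        cum += assoc[idx]
        if cum >= spcg:                      # stop once cumulative mass reached
            break
    return kept / kept.sum()                 # renormalize after sparsification
```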

Independent of the HSR configuration, different strategies and parameters can be adopted for the HSR training:

- *Batch versus online updating*. See section 3.4.

- *Error versus all selection strategy*. For the *error* selection strategy, supervised refinement is performed only for patterns that were misclassified by the current HTM (i.e., in the previous iteration), while for the *all* selection strategy, updating is performed for all patterns.

- *Learning rate*. In general, the effect of the learning rate on performance is similar to that in other neural network architectures (e.g., MLP). A nonmonotonic convergence (suffering from oscillations) is a symptom of too high a learning rate; in this case, network convergence is not guaranteed. Conversely, a very slow convergence is the result of a learning rate that is too small. One striking finding of our experiments is that the learning rate (see equations 3.2, 3.5, and 3.9) for the output node should be kept much lower than that for the intermediate nodes, and the learning rate for the intermediate nodes much lower than that for the coincidences. We experimentally found optimal learning rates for the SDIGIT data set following this ordering. In any case, selecting proper values for the learning rates is not critical, and comparable accuracy levels can be achieved with learning rates varying within a certain range.

### 4.2. HSR Accuracy.

#### 4.2.1. Basic Configuration.

In the basic configuration, HSR updates **PCW** at the output level and **PCG** at the intermediate levels, while the coincidences at the lowest intermediate level are excluded. Groups are kept fixed, and the options *top percent groups* and *sparse PCG* are not enabled. This is the same configuration as tested in Maltoni and Rehn (2012), with the same training and update sets: each update set is composed of 6,200 patterns; examples of the resulting pattern variations are shown in Figure 4. This can be considered a problem of medium difficulty because of the pattern variation and interclass similarity (rotated 1 and 7; small 6, 8, and 9), but the high number of training patterns makes the iterative learning quite effective. The accuracy gain due to HSR is studied for different strategies: *batch* versus *online* and *error* versus *all*. Figure 5 shows the accuracy gain achieved by HSR over 20 epochs of incremental learning, starting from an HTM pretrained with *n* = 50 patterns. Accuracy at epoch 1 corresponds to the accuracy after pretraining, which is about 72%. A few epochs of HSR training are then sufficient to raise accuracy to 93% to 95%. The growth then slows, and after 20 epochs, the network accuracy is in the range of 97% to 98.5% for the different configurations (see Table 2). It is worth remembering that the accuracy reported for each epoch is always measured on previously unseen data.

Training over all the patterns (compared with training over misclassified patterns only) provides a small advantage (1% to 2%). *Online* updating seems to yield slightly better performance during the first few epochs, but thereafter the accuracies of *online* and *batch* updating are almost equivalent.

Table 2 compares the computation time of the different strategies. Applying supervised refinement only to misclassified patterns significantly reduces computation time, while switching between *batch* and *online* strategies has little effect on efficiency. Therefore, considering that the accuracy of the *errors* strategy is not far from that of the *all* strategy, the *errors* strategy is a good option when training on large data sets.

| Strategy | Accuracy (epoch 20) | HSR time: entire data set (6,200 patterns), 1 iteration | HSR time: 1 pattern, 1 iteration |
|---|---|---|---|
| Batch, all | 98.49% | 19.27 sec | 3.11 ms |
| Batch, error | 97.05% | 8.37 sec | 1.35 ms |
| Online, all | 98.11% | 22.75 sec | 3.66 ms |
| Online, error | 96.97% | 8.27 sec | 1.34 ms |


Note: Time values refer to our C# (.Net) implementation under Windows 7 on a Xeon CPU W3550 at 3.07 GHz.

#### 4.2.2. From Fixed to Variable Groups.

In this round of experiments, we start from the previous configuration with the goal of studying the impact of using *variable groups*. To make the incremental learning more difficult, we reduce the size of each set by a factor of 10; hence, each update set is now composed of 620 patterns (instead of 6,200). Figure 6 compares the accuracy of *fixed groups* and *variable groups* for both the *batch all* and *batch errors* training strategies. It is worth noting that:

- The accuracy of *fixed groups* is 7 to 10 percentage points worse than in the previous experiment, due to the reduction in update set size.
- The accuracy of *variable groups* is significantly better than that of *fixed groups* and reaches 97% to 99%, comparable to the previous experiment (even though the update sets are 10 times smaller).
- The gap between *batch all* and *batch errors* remains small for *variable groups* (1% to 2% after 10 epochs) but increases for *fixed groups* (about 5%) as a consequence of the reduced number of patterns.

These experiments show that the iterative adaptation of the fuzzy clustering found during pretraining, which is implicit in the *variable groups* configuration, is an effective way to improve accuracy. This configuration also allows us to limit the update to the error patterns without much accuracy degradation. However, a closer look at the experiment statistics reveals that the extra accuracy comes at the expense of training efficiency and network complexity (compare the first two rows of Table 3). In particular:

- HSR computing time (average time for one iteration over 620 patterns) is 0.95 and 19.03 seconds for *fixed* and *variable groups*, respectively. This is mainly due to the much higher number of *PCG _{pq}* elements that must be updated. Note that for *fixed groups*, the iteration time is much lower than that reported in Table 2 because of the smaller size of the update sets.
- For *fixed groups*, the average **PCG** density (percentage of nonzero elements) across intermediate nodes is 0.98% and 2.12% at levels 1 and 2, respectively. Since only nonzero *PCG _{pq}* elements are updated, the average sparsity remains the same across epochs, independent of the learning strategy. For the *variable groups* configuration, the average **PCG** density grows to 90.14% and 71.66% at levels 1 and 2, respectively. These data refer to the *batch errors* strategy and are measured at epoch 20; in fact, the experiments show that the average density remains quite stable across epochs. In conclusion, *variable groups* lead to a dramatic increase in **PCG** density and thus in network complexity. Unfortunately, a complex network (large and dense **PCG**) is memory demanding and makes message passing (both feedforward and feedback) less efficient.

An effective way to achieve good accuracy without sacrificing efficiency and complexity is to enable the *top percent groups* and *sparse PCG* options. Figure 7 shows the accuracy of this configuration for different values of the parameters *tpg* and *spcg* (see section 4.1). Rows 3 to 5 in Table 3 report the corresponding efficiency and complexity. In particular, the parameterization *tpg* = 50% and *spcg* = 97% preserves the accuracy of *variable groups* quite well while exhibiting efficiency and complexity not far from those of *fixed groups*.

| # | Update | Fixed vs. Variable Groups | Top Percent Groups (*tpg*) | Sparse PCG (*spcg*) | Accuracy (epoch 20) | HSR Time (620 patterns), 1 Iteration | PCG Density (epoch 20) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | PCW+PCG | Fixed groups | Not enabled | Not enabled | 88.86% | 0.95 sec | 0.98%, 2.12% |
| 2 | PCW+PCG | Variable groups | Not enabled | Not enabled | 97.92% | 19.03 sec | 90.14%, 71.66% |
| 3 | PCW+PCG | Variable groups | 50% | Not enabled | 98.00% | 5.04 sec | 22.71%, 27.34% |
| 4 | PCW+PCG | Variable groups | 25% | Not enabled | 97.28% | 3.37 sec | 12.65%, 17.99% |
| 5 | PCW+PCG | Variable groups | 50% | 97% | 97.56% | 1.83 sec | 1.89%, 4.04% |
| 6 | PCW+PCG+C | Variable groups | 50% | 97% | 97.70% | 2.06 sec | 2.05%, 4.35% |


Notes: Time values refer to our C# (.Net) implementation under Windows 7 on a Xeon CPU W3550 at 3.07 GHz. The strategy *Batch*, *Errors* was used for all experiments.
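A toy sketch of the *sparse PCG* option (our own reconstruction; the exact procedure is not spelled out in the text): after an update, keep the largest entries of each PCG row until a fraction *spcg* of its mass is covered, zero the rest, and renormalize.

```python
def sparsify_pcg_row(row, spcg=0.97):
    """Keep the largest entries of a PCG row until `spcg` of its total
    mass is covered, zero the remaining entries, and renormalize.
    This bounds the density growth caused by variable-groups updates."""
    total = sum(row)
    if total == 0.0:
        return list(row)
    # Visit entries from largest to smallest, accumulating mass.
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += row[i]
        if mass >= spcg * total:
            break
    sparse = [row[i] if i in kept else 0.0 for i in range(len(row))]
    norm = sum(sparse)
    return [v / norm for v in sparse]
```

Because the retained entries are renormalized, the row remains a valid probability distribution while the number of nonzero elements, and hence the **PCG** density, stays small.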

#### 4.2.3. Enabling Coincidence Update.

In this round of experiments, we enable coincidence update at level 1; the resulting accuracy across epochs is shown in Figure 8.

In order to better understand which specific updates are most relevant for the accuracy improvement seen in Figure 8, we also report results achieved with configurations characterized by partial updates. We note that coincidence update is relevant when groups are fixed but becomes negligible otherwise. Variable groups are crucial for HSR, and this underlines the importance of temporal groups in HTM. In particular, group update at level 2 seems to be the most relevant one; this is probably due to the difficulty of finding (during unsupervised pretraining) an optimal temporal clustering at high levels, where the dimensionality of the pattern space increases.

### 4.3. HTM Scalability.

One drawback of the current HTM framework is scalability. In fact, the network complexity considerably increases with the number and dimensionality of the training data. All the experiments reported in Maltoni (2011) clearly show that the number of coincidences and groups rapidly increases with the number of patterns in the training sequences. Table 4 shows the accuracy and the total number of coincidences and groups in an HTM pretrained with an increasing number of patterns. As expected, the accuracy increases with the training set size, but after 250 patterns, the accuracy improvement slows down while the network memory (coincidences and groups) continues to grow markedly, leading to bulky networks.

| Number of Pretraining Patterns | Accuracy after Pretraining | Coincidences; Groups |
| --- | --- | --- |
| 50 | 71.37% | 7193; 675 |
| 100 | 87.56% | 13,175; 1185 |
| 250 | 94.61% | 29,179; 2460 |
| 500 | 93.55% | 53,127; 4215 |
| 750 | 96.97% | 73,277; 5569 |
| 1000 | 97.44% | 92,366; 6864 |


Figure 9 shows the accuracy improvement achieved by HSR for HTMs pretrained over 50, 100, and 250 patterns. For this experiment, we use configuration 5 in Table 3, which is not the most accurate but represents a good trade-off between accuracy and efficiency. It is worth remembering that HSR does not alter the number of coincidences and groups in the pretrained network. Therefore, after any number of epochs, the complexity in terms of the number of coincidences and groups is the same as that of the pretrained HTMs (refer to Table 4). Furthermore, since configurations with *top percent groups* and *sparse PCG* enabled allow us to keep the **PCG** density low, we can conclude that HSR does not significantly increase the overall HTM model complexity.

In Figure 9 it is interesting to note that the accuracy of HTMs pretrained with 100 and 250 patterns is initially much higher than that of the HTM pretrained with 50 patterns. However, after about 10 epochs the gap narrows considerably, and after 20 epochs of supervised refinement even a simple network (pretrained on 50 patterns) performs as well as an HTM with more than 10 times its number of coincidences and groups (last row of Table 4).

### 4.4. HSR for Discriminative Fine-Tuning.

In the previous experiments, we have investigated the efficiency of HSR as an incremental learning algorithm for HTM. The experimental design was chosen to simulate a situation where the training data arrive in minibatches. However, HSR is also applicable in situations where all the training data are available from the start. It then takes the role of a class-supervised fine-tuning algorithm, which complements the initial pretraining, analogous to the gradient-based fine-tuning used to improve the performance of deep belief nets (Bengio, 2009). It is worth noting that in this scenario, HSR is not exploiting temporal structure of the data.

We designed three experiments: the first is based on the SDIGIT data set, and the second and third have been carried out with the USPS (Hastie, Tibshirani, & Friedman, 2001) and MNIST data sets (LeCun, Bottou, Bengio, & Haffner, 1998), respectively. While we do not believe that USPS and MNIST are ideal for HTM (because of the lack of temporal variations) and in general are not well suited for studying invariance (because most of the pattern variations are covered by the huge training set), providing results on these data sets is important to verify that HTM + HSR is capable of handling large real-world data sets. It also allows HTM + HSR to be compared with other techniques.

In general, when HSR is used for discriminative tuning, we must avoid a pretraining-set accuracy close to 100%, since in that case the error surface is nearly flat and there is no room for gradient-based fine-tuning. This happens when the pretraining set is too easy (leading to overfitting) and the same set is also used for HSR. To circumvent this problem, three different strategies (possibly combined) can be used:

1. Use a more complex pretraining data set without increasing the network complexity. This can be achieved by providing training data with larger intraclass variability.

2. Reduce the model complexity. For a network with a smaller number of free parameters, the risk of overfitting the pretraining data is lower. In practice, this can be achieved by controlling the number of free parameters (i.e., the number of coincidences and groups that are created during pretraining).

3. Split the training set in two portions: use the first for pretraining and the second for HSR. When dealing with large training sets, this is also extremely useful to limit the network complexity.

As shown in the following experiments, these three strategies are quite effective and allow HTM to efficiently be trained on large data sets (increased scalability).

#### 4.4.1. SDIGIT.

To evaluate how much HSR is able to increase the performance of a pretrained HTM, we used the same 6200-pattern test set as for Figure 5, but changed how we generated the training set. In fact, we noted that when the training set is generated with the zigzag exploration strategy (Maltoni, 2011), pretraining tends to achieve a training set accuracy close to 100%. To make the training set more challenging, we here employed a random walk in the space of possible translations, rotations, and scales instead of using only deterministic translations (as in the zigzag case). Even though the number of generated patterns in the training set is the same, this random walk exploration strategy tends to produce much more variability in the training set, leading to an initial training set accuracy of about 75%. It is worth noting that we are not degrading the initial accuracy on the test set to leave room for HSR learning. In fact, with the new random walk exploration, the initial accuracy on the test set is a few percentage points higher than for the zigzag case.
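A random walk exploration of this kind can be sketched in a few lines. The step sizes and bounds below are illustrative assumptions, not the values used for SDIGIT in the paper:

```python
import random

def random_walk_poses(n, seed=0):
    """Generate n (dx, dy, angle, scale) transformations by a bounded
    random walk in the space of translations, rotations, and scales.
    All bounds and step sizes here are illustrative assumptions."""
    rng = random.Random(seed)  # seeded for reproducibility
    dx = dy = angle = 0.0
    scale = 1.0
    poses = []
    for _ in range(n):
        dx = max(-4.0, min(4.0, dx + rng.uniform(-1.0, 1.0)))
        dy = max(-4.0, min(4.0, dy + rng.uniform(-1.0, 1.0)))
        angle = max(-30.0, min(30.0, angle + rng.uniform(-5.0, 5.0)))
        scale = max(0.7, min(1.3, scale + rng.uniform(-0.05, 0.05)))
        poses.append((dx, dy, angle, scale))
    return poses
```

Unlike the deterministic zigzag scan, consecutive poses drift through the transformation space, so the generated sequence covers it far less systematically and with much more variability.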

In Figure 10 the test set accuracy over 20 epochs is displayed for both the *batch all* and *batch error* strategies. The accuracy increase from fine-tuning in both of these cases is significant (remember that here we are not showing new patterns to the HTM across the epochs).

#### 4.4.2. USPS.

USPS is a well-known handwritten digit classification problem (Hastie et al., 2001). USPS patterns are 16 × 16 pixel, 8-bit grayscale images; the training and test sets contain 7,291 and 2,007 patterns, respectively. Although shape variability is relevant, most of the test set variations are covered by the large training set, and therefore even a simple approach such as the nearest-neighbor classifier achieves good classification results (94.42%). In Maltoni (2011) an HTM trained without supervision is compared with other approaches on USPS. In particular, it was shown that HTM accuracy (in the range 95% to 97% depending on the configuration; see rows 1 and 2 in Table 5) is close to that of other state-of-the-art techniques. However, due to the high number of training patterns, the network complexity is quite high, and consequently training and classification are slow.
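The nearest-neighbor baseline mentioned above can be stated in a few lines (a toy implementation for small lists of feature vectors, not the evaluation code behind the 94.42% figure):

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Classify x with the label of its nearest training pattern under
    squared L2 distance (1-NN), the simple baseline discussed in the
    text."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(train_x[i], x)))
    return train_y[best]
```

The method has no training phase at all; its strength on USPS comes entirely from the training set densely covering the test set variations.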

| # | Approach | Pretraining/HSR Patterns | MB | Accuracy (%), Test | Train Time (hh:mm:ss) | Test Time (hh:mm:ss) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | HTM Unsup (Maltoni, 2011) | 7,291 for pretraining | 3.52 | 95.57 | 00:26:45 | 00:00:30 |
| 2 | HTM OV Unsup (Maltoni, 2011) | 7,291 for pretraining | 32.89 | 97.06 | 05:41:40 | 00:03:23 |
| 3 | HTM Unsup + HSR | 250 for pretraining, 7,291 for HSR | 0.205 | 95.17 | 00:06:14 | 00:00:02 |
| 5 | NN | - | - | 94.42 | - | - |
| 6 | MLP | - | - | 94.52 | - | - |
| 7 | LeNet5 | - | - | 96.36 | - | - |
| 8 | SVM | - | - | 96.00 | - | - |
| 9 | Virtual SVM | - | - | 96.80 | - | - |
| 10 | Human performance | - | - | 97.50 | - | - |


Notes: USPS training set: 7,291 patterns; test set: 2,007 patterns. Row 10 refers to human performance.

Row 3 in Table 5 shows the performance of HSR fine-tuning. In this experiment, we employed only 250 patterns for unsupervised pretraining and the full training set for HSR. Over 15 epochs, the accuracy increased from 88.59% (the initial value after pretraining) to 95.17%. The final HTM + HSR accuracy is slightly worse than that of the (nonoverlapped) unsupervised HTM configurations reported in Maltoni (2011), but the model complexity, here measured as total parameter size in MB for direct comparison with Maltoni (2011), is more than one order of magnitude lower. We are confident that analogous results could be achieved for the overlapped architecture (currently not supported by our HSR implementation), thus reaching state-of-the-art accuracy on USPS.

#### 4.4.3. MNIST.

MNIST is another largely used handwritten digit data set (LeCun et al., 1998). The training and test sets contain 60,000 and 10,000 patterns, respectively.

Unlike SDIGIT and USPS, MNIST patterns are 28 × 28 pixel, 8-bit grayscale images, so a different network architecture must be designed. After a few experiments, we decided to crop patterns to 24 × 24 pixels (without rescaling) and to use a 24 × 24 → 6 × 6 → 2 × 2 → 1 architecture, where the numbers represent the node grids in each layer, including the input layer.
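The level arithmetic of such a hierarchy can be sketched as follows; the fan-in factors (4 × 4, then 3 × 3, then 2 × 2) are our inference from the stated grid sizes, since the text gives only the resulting node counts:

```python
def htm_level_sides(input_side, fanins):
    """Side lengths of the square node grids in an HTM hierarchy,
    where each level groups f x f children into one parent node.
    The fan-in factors are caller-supplied assumptions."""
    sides = [input_side]
    for f in fanins:
        if sides[-1] % f != 0:
            raise ValueError(
                "grid side %d not divisible by fan-in %d" % (sides[-1], f))
        sides.append(sides[-1] // f)
    return sides
```

For example, `htm_level_sides(24, [4, 3, 2])` yields the grid sides 24, 6, 2, and 1 of the MNIST architecture above.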

Pretraining an HTM on the full MNIST data set is a difficult and time-consuming task. In our experiments, to limit the number of coincidences and groups, we trained levels 1 and 2 on a reduced training set and level 3 on the full training set. Rows 1 and 2 in Table 6 report two architectures whose levels 1 and 2 were trained without supervision on 500 and 1000 patterns, respectively. Although configuration 2 reaches a slightly higher accuracy (close to 98%), its complexity (here measured as the number of free parameters) is much higher. In our experiments, using more than 1000 patterns for unsupervised training of levels 1 and 2 does not further increase accuracy but causes a considerable increase in complexity.

| # | Approach | Pretraining/HSR Patterns | Free Parameters | Accuracy (%), Test | Train Time (hh:mm:ss) | Test Time (hh:mm:ss) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | HTM unsupervised (4 saccades) | 500 levels 1 and 2; 60,000 level 3 | 223 K | 97.93 | 00:12:33 | 00:05:09 |
| 2 | HTM unsupervised (4 saccades) | 1000 levels 1 and 2; 60,000 level 3 | 378 K | 97.97 | 00:34:20 | 00:08:02 |
| 3 | HTM unsupervised + HSR (4 saccades) | 250 pretraining; 60,000 for HSR | 61 K | 98.60 | 09:47:00 | 00:01:10 |
| 4 | HTM unsupervised + HSR (4 saccades) | 250 pretraining; 60,000 for HSR | 26 K | 98.12 | 03:39:09 | 00:00:46 |
| 5 | Linear classifier (1-layer NN) | - | - | 88.00 | - | - |
| 6 | KNN (L2) | - | - | 95.00 | - | - |
| 7 | PCA + quadratic classifier | - | - | 96.70 | - | - |
| 8 | SVM, Gaussian kernel | - | - | 98.60 | - | - |
| 9 | 3-layer NN | - | - | 97.05 | - | - |
| 10 | LeNet5 | - | - | 99.05 | - | - |
| 11 | Large Conv. Net + unsupervised pretraining (Jarrett et al., 2009) | - | - | 99.47 | - | - |


Notes: MNIST training set: 60,000 patterns; test set: 10,000 patterns. HTM final inference (on the test set only) was performed with an HTM-specific "4 saccades" strategy, where inference is repeated while moving the input pattern four times (by a few pixels) and the posterior probabilities of the output node are iteratively updated (see HTM saccading in Maltoni, 2011). As for state-of-the-art techniques, we list only methods operating on the original training data, without preprocessing or training set extension with artificially deformed patterns.

Rows 3 and 4 in Table 6 show the performance of HSR supervised tuning. Here, we employed 250 patterns for unsupervised pretraining of all levels, while the full training set was used for HSR. In configuration 3, 25 epochs raised the accuracy from 85.4% (the initial value after pretraining) to 98.6%. Configuration 4 (whose architectural parameters were tuned to reduce the number of coincidences and groups) reaches a slightly lower accuracy but with a much simpler architecture. In both cases, the final accuracy is better than that of the unsupervised HTM configurations (rows 1 and 2), and the network complexity is considerably lower: one order of magnitude for configuration 4.

When comparing HTM + HSR accuracy with other approaches (rows 5–11 in Table 6), we note that the proposed approach performs better than most of the baseline approaches (including KNN and the 3-layer NN), and the final accuracy is not far from the state of the art (row 11). It is worth noting that the architecture and parameters we used for these experiments are not optimized for the problem at hand (we simply cloned the SDIGIT architecture and extended it to work with larger images). We believe there is room for improvement (see also the discussion on overlapped architectures in section 5).

## 5. Discussion and Conclusion

In this letter, we propose a new algorithm (HSR) for incremental supervised training of hierarchical temporal memory. It can be used to optimize network behavior with sequentially arriving data or to fine-tune the network parameters on a single training set. It is computationally efficient and easy to implement due to its close connection to the native belief propagation message passing of HTM networks.

The term *Q*(*w _{s}*), the error message sent from above to the output node (see equation 3.4), is the information that is propagated back through the network and lies at the heart of the algorithm. Its interpretation is not obvious. The first part, the difference between the ground truth and the network posterior, is easy to understand; the second part is more mysterious. It can be viewed as a correction term that emerges because, due to the normalization into a probability, the gradient is nonzero (see appendix A). An adjustment of *PCW _{ks}* that is simply proportional to the posterior difference is therefore suboptimal. A simplified example demonstrating how this correction term leads to better convergence is given in appendix E. We have also experimentally verified that the correction term is vital for HSR to produce good results.

The difference in suitable learning rate between the intermediate levels and the output level is also an important finding. It can probably be explained by the fact that the **PCW** matrix of the output node has a more direct influence on the network posterior than the **PCG** matrices of the intermediate nodes. Moreover, the output node memory is trained with supervision already during pretraining, while the intermediate nodes are trained without supervision, which suggests that there is more room for supervised fine-tuning in the intermediate nodes. The large improvements seen in the experiments of section 4 are for the most part due to refinement of the intermediate node **PCG** matrices, which encode fuzzy clustering solutions. In particular, optimization of the **PCG** matrices at level 2 has proven to be the most important part of HSR. This is probably because at higher levels, where the dimensionality is very high, the fuzzy clustering resulting from the initial unsupervised pretraining is far from optimal.

In HTM, coincidences work as pictorial words for localized feature extraction. Fine-tuning these coincidences leads to a marginal accuracy improvement, and a visual analysis of the coincidences before and after fine-tuning reveals only minor changes in the corresponding pictorial words.^{3} This finding is in accordance with results obtained by others. In particular, Jarrett et al. (2009) have shown that multistage architectures can work well even if the overcomplete set of filters at the first level is randomly selected, while Coates, Lee, and Ng (2011) used a simple but very effective "unsupervised" k-means approach to compute centroids to be used as first-level feature extractors. In the context of HTM, we believe that the suboptimality of the coincidences can in large part be compensated by their proper association to temporal groups along the hierarchy.

Several methods can be used to speed up HSR. One approach is to train the network only with patterns that were misclassified in the previous iteration. This reduces the number of updates compared to using all patterns at every iteration. Experiments suggest that the *errors* selection strategy gives a few percentage points lower performance than selecting all patterns, while the *boosting* strategy lies in between. In this letter, we also experimentally found that limiting the update to the most active groups and sparsifying the **PCG** matrices after each update lead to a relevant improvement in efficiency and network simplicity without sacrificing accuracy.

In general, HSR has proven to work well on the SDIGIT, USPS, and MNIST problems, and the results give us reason to believe that this kind of supervised fine-tuning can be extended to more difficult object recognition problems such as NORB and CIFAR-10. In particular, for the discriminative tuning scenario, we have shown that HTM + HSR can reach (or exceed) unsupervised HTM accuracy with a network architecture that is one order of magnitude simpler, thus improving scalability.

Future work will focus on extending HSR to HTMs with an overlapped architecture. Results in Maltoni (2011) show that a consistent improvement can be achieved by enabling receptive field overlapping in HTM, even if this leads to decreased efficiency. However, this efficiency drop should be mitigated by the network simplifications made possible by HSR. Extending HSR to deal with multiple parents should not constitute a problem: instead of a single top-down message from one parent, the top-down messages from multiple parents can be summed to form the feedback message for a given child node. We are confident that HSR can effectively deal with this generalization. We plan to provide a formal demonstration and a number of supporting experiments in a future work.
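The multiple-parent extension described above amounts to an elementwise sum of message vectors; a minimal sketch (with our own naming):

```python
def combined_feedback(parent_messages):
    """Elementwise sum of the top-down messages a child node receives
    from multiple parents, the straightforward extension suggested in
    the text for overlapped (multiple-parent) architectures."""
    assert parent_messages, "need at least one parent message"
    # zip(*...) pairs up element s of every parent's message vector.
    return [sum(vals) for vals in zip(*parent_messages)]
```

With a single parent this reduces to the original message, so the nonoverlapped case is recovered as a special case.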

## Appendix A: Derivation of the Update Rule for the Output Node

| Symbol | Meaning |
| --- | --- |
|  | The class posterior given by the supervisor. This can be assumed to be 1 if *w _{c}* is the true class of the input pattern, and 0 otherwise. |
|  | The posterior for class *w _{c}*, estimated by the network through inference. |
| *L*(*e*^{−}, *e*^{+}) | The loss function for one training example. This is also a function of all the network parameters: **C** and **PCG** for all intermediate nodes; **C**, **PCW**, and *P*(*w _{c}*) for the output node. |
| *PCW _{ks}* | One element of the probability matrix **PCW** memorized by the output node. |
| *PCG*^{(h)}_{pq} | One element of the probability matrix **PCG** memorized by intermediate node *h*. |
| *P*(*w _{c}*) | The prior for class *w _{c}* memorized by the output node. |
| *p*(*e*^{−}) | The absolute input pattern density; this corresponds to the normalizing factor in equation 2.3. |
| *y*^{(h)}[*k*] | Activation of coincidence *k* in an intermediate node *h*. Superscripts are used to distinguish coincidence activations in different nodes. For the output node the superscript is dropped; *y*[*k*] refers to activation of coincidence *k* in the output node. |
|  | Element *s* in the feedforward message vector from an intermediate node *h* to its parent. Note that the "in" and "out" notation is dropped here, except for feedforward messages from the input level. For feedforward messages, the superscript denotes the node from which the message is sent. |
|  | Element *s* in the feedback message sent to an intermediate node *h* from its parent. Note that the "in" and "out" notation is dropped here. For feedback messages, the superscript denotes the node to which the message is sent. |
| *c*^{(h)}_{j}[*r*] | Element *r* of the *j*th coincidence in node *h*. When referring to coincidences in the output node, the superscript is dropped. |
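Given these definitions, the quadratic loss of equation 3.1 can plausibly be written as follows. This is our reconstruction from the definitions above and from the simplified example of appendix E, with the supervisor posterior written as *P*\*(*w _{c}*) and the network posterior as *P*(*w _{c}* | *e*^{−}); the paper's own rendering may differ in notation:

```latex
L(e^{-}, e^{+}) = \sum_{c} \bigl( P^{*}(w_c) - P(w_c \mid e^{-}) \bigr)^{2}
```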


Here *I*(*i* = *s*) denotes the indicator function, which is 1 when *i* = *s* and 0 otherwise. This leads to the definition of the error message *Q*(*w _{s}*), and finally we end up at equation 3.3.

## Appendix B: Derivation of the Update Rule for Intermediate Nodes at the Next-to-Last Level

Using the error message *Q*(*w _{s}*) of equation A.4, this can be rewritten in a more convenient form. The derivative of the coincidence activations is derived separately: using equation 2.1 (with the superscript notation used throughout this appendix), we obtain an expression in which *m* is the number of children of the output node, and applying the rule for the derivative of a product, equation B.4 can be written as a sum. The derivative of the bottom-up message from child *r* to the output node follows from equation 2.2, where *n*^{(r)}_{c} is the number of coincidences in child node *r* of the output node. To express the membership condition on **c**_{j}[*h*], we can use the indicator function (see equation 7); then equation B.5 can be rewritten accordingly. Finally, by inserting equation B.6 in equation B.3, we get equation B.7. Furthermore, by noting that the expression within brackets is very similar to the top-down message from the output node to one of its children (see equation 2.8), we introduce a new quantity, the error message from the output node. Hence, we can rewrite equation B.7, thus obtaining the update rule of equation 3.6.

## Appendix C: Generalization of the Update Rule to Intermediate Nodes at Lower Levels

Let *h* be a node at a given intermediate level and *d* a node at the level below, with *d* a child of *h*. Following the same steps leading to equation B.3 in appendix B, we here obtain the corresponding derivative, and we then expand its inner terms one at a time (activations are calculated in the same way at all intermediate levels; see equation B.6). Inserting equation C.5 into equation C.4 and then into equation C.3, and noting that *d* is a child node of *h* (only), we can reinsert the result into equation C.2. For the remaining factor, instead of summing over all the groups in *h* and checking which one matches, we can set *f* = **c**_{j}[*h*], since only one group in *h* is part of **c**_{j}. By replacing the appropriate factor with *Q*(*w _{s}*) in equation C.8, we obtain equation C.9. Finally, by combining equations C.9 and C.6, we obtain the update rule we were looking for. The derivation shown here can be generalized to lower levels (in a network with *n*_{levels} > 4) by noting that equation C.7 can be written recursively.

## Appendix D: Derivation of the Update Rule for Coincidences at the Lowest Intermediate Level

Let *h* be a node at level 2 and *d* a node at level 1, with *d* being a child of *h*. We want to find the derivative of the loss with respect to *c*^{(d)}_{p}[*q*], the *q*th element of coincidence *p* in node *d*. Following the same steps leading to equation B.3 in appendix B, we obtain equation D.2, and we then expand all the derivatives, where *m* is the number of children of the output node, *m*^{(h)} is the number of children of node *h*, and the innermost term is the input message (image) arriving at node *d* from its child nodes at level 0 (the input level). We then reinsert everything into equation D.2. Here we reverse the trick from appendix C (instead of summing over all the groups in *h* and checking which one matches, we can set *f* = **c**_{j}[*h*], since only one group in *h* is part of **c**_{j}) to recover a sum over the groups in *h*: equation D.3 can then be rewritten by using the "sum over groups" trick again, and equation D.4 can finally be rewritten to yield the update rule.

## Appendix E: Importance of the Error Message Correction Term

To give a better understanding of the error message, *Q*(*w*_{s}), we here present a simplified situation where an error gradient with the same structure emerges.

Let *P*_{1}(*x*_{j}), the distribution that needs to be optimized, be proportional to a set of variables *a*_{j}:

*P*_{1}(*x*_{j}) = *a*_{j} / Σ_{i} *a*_{i}.

Let *P*_{2}(*x*_{j}) be the true distribution we want *P*_{1}(*x*_{j}) to be as close to as possible. Given the same quadratic loss function as in equation 3.1, the loss gradient is

∂*L*/∂*a*_{s} = −Σ_{j} (*P*_{2}(*x*_{j}) − *P*_{1}(*x*_{j})) ∂*P*_{1}(*x*_{j})/∂*a*_{s}.

Following the same steps as in appendix A, we then find ∂*P*_{1}(*x*_{j})/∂*a*_{s} to be (δ_{js} − *P*_{1}(*x*_{j})) / Σ_{i} *a*_{i}, leading to a gradient of the same form as *Q*(*w*_{s}):

∂*L*/∂*a*_{s} = −(1 / Σ_{i} *a*_{i}) [(*P*_{2}(*x*_{s}) − *P*_{1}(*x*_{s})) − *C*],  with *C* = Σ_{j} (*P*_{2}(*x*_{j}) − *P*_{1}(*x*_{j})) *P*_{1}(*x*_{j}).  (E.1)

We call *C* the correction term, which interestingly is constant for all *a*_{s} and emerges due to the normalization of the *a*_{j} into *P*_{1}(*x*_{j}).
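As a sanity check on this derivation, the gradient of equation E.1 can be compared against a numerical finite-difference gradient of the quadratic loss. The sketch below (variable names are our own; NumPy assumed) performs this comparison for an arbitrary positive vector *a*:

```python
import numpy as np

def p1(a):
    """Normalize the free variables a into the distribution P1."""
    return a / a.sum()

def loss(a, p2):
    """Quadratic loss between the target P2 and P1 (cf. equation 3.1)."""
    return 0.5 * np.sum((p2 - p1(a)) ** 2)

def analytic_grad(a, p2):
    """Gradient of the loss w.r.t. a, in the form of equation E.1."""
    s = a.sum()
    err = p2 - p1(a)            # P2(x_j) - P1(x_j)
    c = np.sum(err * p1(a))     # correction term C, identical for every a_s
    return -(err - c) / s

def numeric_grad(a, p2, eps=1e-6):
    """Central finite differences, component by component."""
    g = np.zeros_like(a)
    for s in range(len(a)):
        ap, am = a.copy(), a.copy()
        ap[s] += eps
        am[s] -= eps
        g[s] = (loss(ap, p2) - loss(am, p2)) / (2 * eps)
    return g

a = np.array([2.0, 1.0, 0.5])    # arbitrary positive starting point
p2 = np.array([1.0, 0.0, 0.0])   # target distribution
print(np.allclose(analytic_grad(a, p2), numeric_grad(a, p2), atol=1e-6))  # → True
```

The agreement confirms that the correction term *C* is exactly the contribution produced by differentiating through the normalization.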

To investigate the importance of *C* for the convergence of the gradient descent optimization, we simulated a case with two classes (*N* = 2) where *P*_{2}(*x*_{0}) = 1 and *P*_{2}(*x*_{1}) = 0. The initial value of *P*_{1}(*x*_{i}) is *P*_{1}(*x*_{0}) = 0 and *P*_{1}(*x*_{1}) = 1, to give maximum initial loss. The result is plotted in Figure 11.
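The exact simulation settings behind Figure 11 (learning rate, number of iterations) are not reported here, so the following is only a minimal sketch of the setup, assuming plain gradient descent on the *a*_{i} with an arbitrary learning rate; the naive variant drops both the correction term *C* and the 1/Σ_{i} *a*_{i} scaling:

```python
import numpy as np

def descend(with_correction, steps=300, eta=0.1):
    """Gradient descent on a for the two-class case (N=2) of Figure 11."""
    p2 = np.array([1.0, 0.0])        # P2(x_0) = 1, P2(x_1) = 0
    a = np.array([1e-6, 1.0])        # P1 ~ (0, 1): maximum initial loss
    history = []
    for _ in range(steps):
        p1 = a / a.sum()
        err = p2 - p1
        if with_correction:
            c = np.sum(err * p1)                  # correction term C
            grad = -(err - c) / a.sum()           # equation E.1
        else:
            grad = -err                           # naive: P2 - P1 only
        a = np.clip(a - eta * grad, 1e-9, None)   # keep the a_i positive
        history.append(0.5 * np.sum((p2 - a / a.sum()) ** 2))
    return history

corrected = descend(True)
naive = descend(False)
print(corrected[-1], naive[-1])   # both losses approach zero
```

With these hypothetical settings both variants eventually converge; the point of Figure 11 is the difference in convergence speed, which depends on the simulation details.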

Clearly, a gradient proportional to *P*_{2}(*x*_{i}) − *P*_{1}(*x*_{i}) alone leads to slower convergence than the correct gradient of equation E.1.

_{i}## References

## Notes

^{1}

Here, we deviate from the original HTM notation of George (2008), where − and + are used to characterize messages coming from below (−) or from above (+). The original notation can sometimes be ambiguous because − and + are used to denote both evidential/contextual information and the direction in which a message enters or leaves a node; for example, a − on a message leaving a node from below can be misinterpreted as indicating that the message carries evidential instead of contextual information.

^{2}

We first rotate the pattern around the image center and then rescale it in the two directions.

^{3}

In our experiments, we also observed that the network is less sensitive to coincidence changes than to intermediate-node **PCG** values, so larger gradient descent steps in these directions are tolerated. This could justify the use of a larger learning rate.
