## Abstract

The Wilkie, Stonham, and Aleksander recognition device (WiSARD) $n$-tuple classifier is a multiclass weightless neural network capable of learning a given pattern in a single step. Its architecture is determined by the number of classes it should discriminate. A target class is represented by a structure called a discriminator, which is composed of $N$ RAM nodes, each of them addressed by an $n$-tuple. Previous studies sought to mitigate an important problem of the WiSARD $n$-tuple classifier: the saturation of its RAM nodes when trained on a large data set. Finding the VC dimension of the WiSARD $n$-tuple classifier was one of those studies. Although no exact value was found, tight bounds were discovered. Later, the bleaching technique was proposed as a means to avoid saturation. Recent empirical results with the bleaching extension showed that the WiSARD $n$-tuple classifier can achieve high accuracy with low variance in a great range of tasks. Theoretical studies had not been conducted with that extension previously. This work presents the exact VC dimension of the basic two-class WiSARD $n$-tuple classifier, which grows linearly with the number $N$ of RAM nodes in a discriminator and exponentially with their addressing tuple length $n$: precisely $N(2^n-1)+1$. The exact VC dimension of the bleaching extension to the WiSARD $n$-tuple classifier, whose value is the same as that of the basic model, is also produced. This result confirms that the bleaching technique is indeed an enhancement to the basic WiSARD $n$-tuple classifier, as it does no harm to the generalization capability of the original paradigm.

## 1  Introduction

The Wilkie, Stonham, and Aleksander recognition device (WiSARD; Aleksander, Thomas, & Bowden, 1984) is a versatile weightless artificial neural network (WANN). Its origins date back to the $n$-tuple classifier, which was initially proposed by Bledsoe and Browning (1959) and then formally defined in Steck (1962). WiSARD has a modular architecture and trains unseen patterns in a single pass, making it an efficient learning machine. However, little attention was given to this model because it could not handle large loads of data. The bleaching technique (Grieco, Lima, Gregorio, & França, 2010), a recent enhancement developed for WiSARD, solved this issue. Applications built with this technique demonstrated its ability to cope with such volumes of data (Carneiro, França, & Lima, 2015; Carneiro, Pedreira, França, & Lima, 2017; Cardoso et al., 2016). When compared with state-of-the-art systems, these applications showed competitive results, sometimes outperforming them.

Theoretical research and architecture improvements were performed on the basic WiSARD $n$-tuple classifier (Bledsoe & Bisson, 1962; Roy & Sherman, 1967; Ullmann, 1969, 1971; Stonham, 1977; Tarling & Rohwer, 1993; Bradshaw & Aleksander, 1996; Bradshaw, 1996, 1997; Mitchell, Bishop, & Minchinton, 1996; Rohwer & Morciniec, 1996, 1997, 1998; Gregorio, 1997; Jørgensen & Linneberg, 1999; Linneberg & Jørgensen, 1999; Wickert & França, 2001; Azhar & Dimond, 2004; Grieco et al., 2010; Carvalho, Carneiro, França, & Lima, 2013). One of the main issues raised by those studies was how to find a means to mitigate memory saturation. As WiSARD was fed more data (especially noisy data), it progressively filled every position of its memory nodes, and its capability to discriminate patterns deteriorated.

Among the studies intended to lessen the effects of saturation, it is worth mentioning the work of Bradshaw (1996), where lower and upper bounds for the VC dimension of the WiSARD $n$-tuple classifier were calculated. No exact value for this measure was found. Bradshaw's (1996) research provided fertile ground for further analyses in this field. Unfortunately, the solution to the saturation problem proposed there was not that successful, as it relied on a convergence process with no guarantee that convergence would actually occur.

Grieco et al. (2010) devised a way of mitigating saturation, called the bleaching technique. It allowed the network to be exposed to loads of data (noisy and free of noise) and yet keep the reliability of its pattern discrimination capability. This improvement considerably enhanced both the accuracy and the precision of WiSARD $n$-tuple classifier applications, with no performance penalty in their training procedures and only a small one in the classification step (de Souza, França, & Lima, 2014, 2015; Carneiro et al., 2015; Nascimento et al., 2015; Cardoso et al., 2016).

That technique granted WiSARD competitiveness with trending learning systems by achieving high accuracy with low variance in experimental work. However, there was no theoretical background for bleaching up to this point. This letter aims to provide a mathematical foundation for that extension of the weightless classifier by analyzing the generalization capacity of both basic and bleaching recognition schemes.

The letter is structured as follows. Section 2 introduces some basic definitions of the VC theory and presents WiSARD and bleaching $n$-tuple classifiers. A formal mathematical definition of both models is provided in section 3. Their VC dimensions are calculated in sections 4 and 5. The results are then discussed in section 6, where some conclusions are drawn and comparisons with weighted learning schemes are made. Section 7 summarizes the work presented and proposes future work.

## 2  Conceptual Background

This section presents some VC theory basic concepts and a brief introduction to both architectures of the WiSARD model. They provide a cornerstone for the calculations presented in sections 4 and 5.

### 2.1  VC Dimension Definitions

To measure the capacity of a learning system like the WiSARD network, definitions are required for some basic VC theory concepts. For instance, the VC dimension is intrinsically related to the notion of set shattering, which in turn depends on simpler ideas, like dichotomies.

Let $X$ be a set of data points to be classified and $\Theta$ a set of parameterizations of a learning machine $L$. $L$ classifies data points $x \in X$ with labels $y \in \{-1,1\}$ according to a parameter vector $\theta \in \Theta$. In other words, a learning system $L$ can be seen as a function $g : X \times \Theta \to \{-1,1\}$.

Definition 1.
Let $x_1, x_2, \ldots, x_\ell \in X$. The set of dichotomies $L$ can realize on $X$ is defined as
$\Delta_\Theta(X) = \left\{ \left( g(x_1;\theta), g(x_2;\theta), \ldots, g(x_\ell;\theta) \right) : \theta \in \Theta \right\}.$
2.1

That is, the dichotomies are the distinct forms a learning machine can split a set of points into two, each assigned to a different class. The number of dichotomies a learning system $L$ realizes on a set $X$ plays an important role in VC dimension study. The growth function and the concept of shattering derive from that quantity.

Definition 2.

A set $X$ is said to be shattered by a learning system $L$ if $\mathrm{card}(\Delta_\Theta(X)) = 2^{\mathrm{card}(X)}$.

In other words, if a learning machine $L$ shatters $X$, then for every $X' \subseteq X$, there is at least one parameterization $\theta \in \Theta$ of $L$ such that every $x \in X'$ is classified as 1 and every other element of $X$ is classified as $-1$. $X$ is shattered by $L$ if $L$ can realize all possible dichotomies of $X$.

Definition 3.
The growth function $\Pi_\Theta : \mathbb{N} \to \mathbb{N}$ is defined as the maximum number of dichotomies $L$ can implement on samples of size $\ell$; that is,
$\Pi_\Theta(\ell) = \max_{X :\, \mathrm{card}(X) = \ell} \mathrm{card}(\Delta_\Theta(X)).$
2.2

Definitions 2 and 3 imply that the growth function $\Pi_\Theta(\ell)$ is $2^\ell$ if there is at least one set $X$, $\mathrm{card}(X) = \ell$, that is shattered by $L$. Note that if $L$ shatters a set $X$, it also shatters every subset $X' \subset X$. So if $\Pi_\Theta(\ell) = 2^\ell$, then $\Pi_\Theta(k) = 2^k$ for every $k \le \ell$. However, the growth function may have a nonexponential nature if $\ell$ is large enough that there is no set $L$ can shatter. The largest value of $\ell$ for which the growth function $\Pi_\Theta(\ell)$ is $2^\ell$ is called the VC dimension of learning machine $L$ (Vapnik & Chervonenkis, 1971; Vapnik, 1998; Abu-Mostafa, Magdon-Ismail, & Lin, 2012).

Definition 4.

The VC dimension $d_{VC}$ of learning machine $L$ is defined as the cardinality of the largest set $X$ that is shattered by $L$. If for every $\ell$, $L$ shatters a set of such cardinality, then $d_{VC}(L)$ is infinite.
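These definitions can be made concrete with a deliberately simple hypothesis class outside the WiSARD family: one-dimensional threshold functions $g(x;t) = 1$ if $x > t$, and $-1$ otherwise. The sketch below (the function name is ours, purely illustrative) enumerates the dichotomies such a classifier realizes on a finite set of points:

```python
def threshold_dichotomies(points):
    """Dichotomies g(x; t) = 1 if x > t else -1 realizes on real points,
    obtained by sweeping one representative threshold per gap."""
    points = sorted(points)
    thresholds = [points[0] - 1] + points  # below all points, then at each point
    return {tuple(1 if x > t else -1 for x in points) for t in thresholds}

# Two points admit only 3 of the 4 possible labelings: (1, -1) is unrealizable,
# so no pair of points is shattered and this class has d_VC = 1.
labs = threshold_dichotomies([0.0, 1.0])
```

Here $\Pi_\Theta(\ell) = \ell + 1 < 2^\ell$ for $\ell \ge 2$, while any single point is shattered, so the VC dimension of this toy class is 1.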

### 2.2  The WiSARD Model

WiSARD is an $n$-tuple classifier with two key features: the splitting of its input into various $n$-tuples and the storage of its learned knowledge in memory nodes, which are addressed by those $n$-tuples. This section covers the basic WiSARD model and its enhanced version produced by the bleaching technique.

The learning of new patterns is made through simple changes in the content of the system's memory nodes. There is no need for convergence-dependent processes or complex calculations. In other words, $n$-tuple classifiers have efficient training procedures and so are adequate for empirical studies where there is a large number of hyperparameters to be tuned and massive loads of data to be learned.

#### 2.2.1  The Basic WiSARD $n$-Tuple Classifier

The $n$-tuple method was initially proposed by Bledsoe and Browning (1959) and formally defined in Steck (1962). Its most widely known implementation, WiSARD, was built in 1982 (Aleksander et al., 1984), showing that it was actually possible to assemble the $n$-tuple classifier.

The WiSARD $n$-tuple classifier is considered a weightless neural network because of the resemblance of its memory nodes to actual neurons. The weightless paradigm is characterized by the storage of the network's learned knowledge inside its neuronal nodes, whereas weighted models store it in synaptic connections. WiSARD is also known as a RAM-based neural network because its memory nodes were implemented with actual random access memories (RAMs). For this reason, the network memory nodes are also known as RAM nodes or simply RAMs.

System architecture. The basic WiSARD $n$-tuple classifier is a Boolean neural network that receives bit strings as inputs and produces similarity scores associated with target classes. This is made through the use of structures called discriminators. Each discriminator is associated with a single class.

The discriminators have a simple structure: a pseudo-random mapping shuffles the bits of the network input and splits them into tuples, which address memory nodes. Figure 1a portrays a picture mapped into the series of tuples $10{:}01{:}10{:}10{:}01{:}10$. An $n$-tuple classifier is defined according to its tuple length $n$ and the number $N$ of memory nodes in each discriminator.

Figure 1: Basic WiSARD classifier.

In a canonical $n$-tuple classifier, the same pseudo-random mapping can be used for every discriminator. This way, $N$ $n$-tuples address all memory nodes of each discriminator. Figure 1b depicts a canonical $n$-tuple classifier, where the same series of tuples produced in Figure 1a addresses the memory nodes of both discriminators, $D_{-1}$ and $D_1$.

The VC dimension is a measure used for systems capable of disambiguating between two classes—functions that partition a set of points into two subsets, each one assigned to a particular class. This letter explores solely $n$-tuple classifiers whose set of classes $C$ has only two of them, denoted $-1$ and 1.

Training procedure. Training starts by the network setup, where all memory positions are initialized with 0. Every time a training observation and its corresponding class are sent to the network, the model selects the discriminator of that class and marks every memory position addressed by the observation. Positions marked this way have their stored values set to 1. At the end of the training procedure, every memory position should store a 1 if it was addressed at least once, and 0 otherwise.

Recognition procedure. If a new input pattern is presented to the network, every discriminator responds with a similarity score, representing how close this new pattern is to the learned knowledge stored in the discriminator. The similarity score is defined as the number of memory nodes whose addressed positions store 1, that is, those positions that were accessed at least once during the training step. This score is represented by summation devices $\Sigma_{-1}$ and $\Sigma_1$ in Figure 1b.

The network opts for the discriminator whose score is the highest and assigns its class to the input pattern. If there are at least two discriminators that share the highest response, then the network outputs that a tie happened. In the latter case, the classification system may apply some policy to choose one class or use a draw procedure.

The discriminator similarity score is its number of addressed memory positions storing 1. This implies that the number of nodes $N$ (or the size of the addressing tuples $n$) plays an important role in defining the generalization capability of the WiSARD classifier. High values of $N$ (or small ones of $n$) reduce the chance that the similarity score will get too low when a newly presented pattern differs by only a few bits from what the network learned. For example, given a network presented with a 30-bit input pattern where exactly one of its RAMs does not recognize it: (1) if there are 30 RAMs addressed by 1-bit tuples, the similarity score of the classifier will be $29/30$; (2) if there are 15 RAMs addressed by 2-bit tuples, the similarity lowers to $14/15$; and (3) if there is only one RAM addressed by the whole network input, the similarity will be 0. So the greater $N$ is, the more the WiSARD system generalizes.
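The training and recognition procedures above can be sketched in a few lines. The following is a minimal NumPy illustration under our own naming (the original device was hardware; class and method names here are assumptions), with ties resolved in favor of class 1 as one possible draw policy:

```python
import numpy as np

class WiSARD:
    """Two-class basic WiSARD sketch: N RAM nodes per discriminator, n-bit tuples."""

    def __init__(self, N, n, seed=0):
        self.N, self.n = N, n
        # pseudo-random mapping over the N*n input bits, fixed at setup
        self.pi = np.random.default_rng(seed).permutation(N * n)
        # one binary N x 2^n memory matrix per discriminator
        self.M = {c: np.zeros((N, 2 ** n), dtype=int) for c in (-1, 1)}

    def _addresses(self, x):
        # shuffle the input, split it into N n-tuples, read each as an address
        tuples = np.asarray(x)[self.pi].reshape(self.N, self.n)
        return tuples @ (2 ** np.arange(self.n - 1, -1, -1))

    def train(self, x, y):
        # single-pass learning: set every addressed position of class y to 1
        self.M[y][np.arange(self.N), self._addresses(x)] = 1

    def classify(self, x):
        addr = self._addresses(x)
        # similarity score: number of addressed positions storing 1
        s = {c: int(self.M[c][np.arange(self.N), addr].sum()) for c in (-1, 1)}
        return 1 if s[1] >= s[-1] else -1  # tie resolved as class 1
```

Training an instance with $N=6$, $n=2$ on one pattern per class and presenting the same patterns back yields the trained classes, since each pattern scores $N$ on its own discriminator.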

#### 2.2.2  The WiSARD Extended with Bleaching

Basic $n$-tuple classifiers tend to experience misclassification problems when trained with large amounts of data. This occurs because many memory positions end up set to 1 even if they were accessed only once, due to a slightly noisy observation. This problem is known as saturation and has been considered one of the major disadvantages of the basic $n$-tuple classifier since its conception (Bledsoe & Browning, 1959).

Some attempts were made to overcome this drawback (Bledsoe & Bisson, 1962; Ullmann, 1971; Tarling & Rohwer, 1993; Bradshaw, 1996; Azhar & Dimond, 2004). A solution, the bleaching technique, was devised in 2010 (Grieco et al., 2010). Its contribution as a major upgrade to the basic architecture allowed the production of accurate and precise applications (de Souza et al., 2014; Carneiro et al., 2015, 2017; Nascimento et al., 2015; Cardoso et al., 2016).

Training procedure. The bleaching $n$-tuple classifier training procedure differs from the basic one by the storage of any integer value in the memory positions instead of only 0 and 1. Every element of the memory nodes initializes as 0 during the network setup, as in the binary process. When a position is addressed at training, its stored value is incremented by 1, whereas it would only be set to 1 in the standard procedure, independent of its original value. Summarizing, at the end of the training step, each memory position stores how many times it was accessed.

Recognition procedure. The integer values stored in the memory nodes give the network a better representation of the trained data. However, another structure must be added to the basic architecture for proper classification. The bleaching technique introduces a threshold $β$ (known as the bleaching threshold). It is responsible for defining which memory positions should contribute to the similarity score of the class.

Threshold $β$ is a nonnegative integer number. The canonical bleaching $n$-tuple classifier employs a single threshold for the entire network. If a memory position is addressed during the classification step, it should be further subjected to $β$. A node fires 1 if its addressed position stores a value greater than $β$; otherwise, it fires 0. Therefore, the similarity score of a discriminator of a fixed-threshold bleaching $n$-tuple classifier is the number of its accessed positions whose stored value is greater than $β$. Figure 2 portrays the classification of an input pattern by a bleaching $n$-tuple classifier with $β=2$.

Figure 2: Bleaching classifier.

In practical applications, however, a dynamic threshold is employed instead. The threshold is initialized as $\beta=0$ at the network setup. Thus, at first glance, the dynamic-threshold bleaching $n$-tuple classifier works like the basic classifier, where the score of a discriminator is defined as the number of accessed positions whose stored value is greater than 0. If the selection criterion is not satisfied at the classification of a pattern, $\beta$ is incremented by 1 and new scores are calculated. For a two-class bleaching $n$-tuple classifier, the selection criterion consists of having a single discriminator whose score is greater than that of its opposing class; the iterative procedure of incrementing $\beta$ continues until there is no longer a tie between the discriminator scores.
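The dynamic-threshold decision can be sketched as a short loop. This is our own illustrative helper (the function name and the convention of returning 0 for an unbreakable tie are assumptions), operating on integer memory matrices and the addresses produced by an input:

```python
import numpy as np

def bleach_classify(M, addr, N):
    """Dynamic-threshold bleaching decision for a two-class classifier.
    M maps each class (-1 or 1) to its integer N x 2^n memory matrix;
    addr holds the N addresses produced by the input's n-tuples.
    Returns the chosen class, or 0 when no threshold breaks the tie."""
    rows = np.arange(N)
    beta = 0
    while True:
        # score: number of nodes whose addressed position stores a value > beta
        s = {c: int((M[c][rows, addr] > beta).sum()) for c in (-1, 1)}
        if s[-1] != s[1]:
            return 1 if s[1] > s[-1] else -1
        if s[1] == 0:  # both scores zero: raising beta cannot break the tie
            return 0   # fall back to some draw policy
        beta += 1
```

The loop always terminates: either the scores differ at some threshold, or both eventually drop to zero and the draw policy takes over.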

## 3  Mathematical Formulation

The basic and bleaching $n$-tuple classifiers are defined according to $N$, their number of RAM nodes, and $n$, the address length of these nodes. Every notation involving those models is derived from these two parameters. This section is split into three parts: one concerns the procedure of encoding the input into $n$-tuples, and the other two are related to the models themselves. In each part, a mathematical notation is provided and an example is offered for clarity.

### 3.1  Input Encoding

Both models treat a given input in an identical way, shuffling and splitting it into $N$ addressing $n$-tuples. When this procedure is put in mathematical form, it is seen as a two-part process, in which an input vector $x_k \in \{0,1\}^{Nn}$ is first transformed into a tuple matrix $T_k \in \{0,1\}^{N \times n}$ and then into a feature matrix $Z_k \in \{0,1\}^{N \times 2^n}$.

#### 3.1.1  Notation

Let $X$ be an ordered set of input data points, $X = \{x_1, x_2, \ldots\}$. An input $x_k \in X$ must be such that $x_k \in \{0,1\}^{Nn}$ for it to be read by the $n$-tuple classifier. Also, let $y$ be a generic vector of classes $y_k \in \{-1,1\}$ with as many elements as $X$, so that there is a specific correspondence between every $k$th input data point $x_k$ and the $k$th element of $y$.

Given a permutation $\pi$ of $Nn$ elements, $\pi : \{1,\ldots,Nn\} \to \{1,\ldots,Nn\}$, let $T_k \in \{0,1\}^{N \times n}$ be an $N \times n$ tuple matrix obtained from input data point $x_k$ through a bijective map, such that every element $t_{k,i,j}$ of $T_k$ is equal to the $\pi((i-1)n+j)$th element of $x_k$. This procedure characterizes the pseudo-random mapping of the WiSARD model, in which an input is shuffled and then reshaped as an ordered set of $N$ $n$-digit binary addresses, here represented by an $N \times n$ binary matrix.

Let $Z$ be an ordered set of feature matrices, $Z = \{Z_1, Z_2, \ldots\}$. A feature matrix $Z_k \in \{0,1\}^{N \times 2^n}$ is obtained from tuple matrix $T_k \in \{0,1\}^{N \times n}$ via an addressing function $A : \{0,1\}^{N \times n} \to \{0,1\}^{N \times 2^n}$. $A$ is a row-wise function; it performs the same operation $a(\cdot)$ for every row $t_{k,i} \in T_k$, resulting in a corresponding row $z_{k,i} \in Z_k$. $a(\cdot)$ produces an indicator vector, such that every element $z_{k,i,j}$ of $z_{k,i}$ is defined according to
$z_{k,i,j} = \begin{cases} 1, & \text{if } j = 1 + \sum_{m=1}^{n} t_{k,i,m} \cdot 2^{n-m} \\ 0, & \text{otherwise,} \end{cases}$
3.1
where $t_{k,i,m}$ is the $m$th element of $n$-tuple $t_{k,i}$. That is, each column $j$ of $Z_k$ is assigned to a particular $n$-tuple, and its cells $z_{k,i,j}$ are set to 1 if $t_{k,i}$ is the $n$-tuple associated with column $j$ and set to 0 otherwise.

It is worth noting that the resulting matrix $Z_k$ has a single element per row equal to 1 and the remaining ones equal to 0. This is a direct consequence of the fact that each $n$-tuple of $T_k$ addresses one and only one memory position.

For readability, throughout this letter, data input point $x_k$ and any of its derived matrices may appear as $t_{k,1}{:}t_{k,2}{:}\cdots{:}t_{k,N}$, where $t_{k,i}$ is the $n$-tuple represented by the $i$th row of tuple matrix $T_k$, which addresses memory nodes $d_{-1,i}$ and $d_{1,i}$. These $n$-tuples can be $0^n$, an all-zero $n$-tuple; $1^n$, an all-one one; or $\{0,1\}^n$, a generic binary one. The notation introduced here is provided in Table 1.

Table 1: Notation for the Input Encoding.

| Symbol | Description |
| --- | --- |
| $X$ | Ordered set of input data points |
| $x_k$ | $k$th element of $X$ |
| $y$ | Vector of classes, one for each element of $X$ |
| $y_k$ | $k$th element of $y$; expected class of $x_k$ |
| $\pi$ | Permutation that characterizes the input mapping |
| $T_k$ | $k$th element of $T$; tuple matrix associated with input $x_k$ |
| $t_{k,1}{:}\ldots{:}t_{k,N}$ | $n$-tuple representation of $x_k$ |
| $t_{k,i}$ | $i$th $n$-tuple of $x_k$, which addresses $d_{c,i}$; $i$th row of $T_k$ |
| $Z$ | Ordered set of feature matrices |
| $Z_k$ | $k$th element of $Z$; feature matrix associated with input $x_k$ |
| $z_{k,i}$ | $i$th row of $Z_k$ |
| $z_{k,i,j}$ | $j$th element of $z_{k,i}$ |
| $A$ | Addressing function |
| $\{0,1\}^n$ | Generic binary $n$-tuple |
| $0^n$ | All-zero $n$-tuple |
| $1^n$ | All-one $n$-tuple |

#### 3.1.2  Example

Figure 1a depicts the mapping of a 12-bit input (represented in a $4×3$ grid) to six $n$-tuples of length 2. This mapping is done so that the given input can be introduced to an $n$-tuple classifier with $N=6$ memory nodes in each discriminator, which are addressed by $n$-tuples of length $n=2$, such as those in Figures 1b and 2.

The input in the retina of Figure 1a can be written in an $n$-tuple representation as $x_k = 10{:}01{:}10{:}10{:}01{:}10$, which can also be displayed as a tuple matrix:
$T_k = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}.$
3.2
$T_k$, in turn, is transformed into a feature matrix,
$Z_k = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},$
3.3
through the use of addressing function $A$. By assigning the first column of $Z_k$ to address 00, the second one to 01, the third to 10, and the last to 11, one can verify that every $n$-tuple 10 (the first, third, fourth, and last rows of $T_k$) is represented by matrix row $[0\;0\;1\;0]$. The remaining rows of $Z_k$ represent $n$-tuple 01.
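This mapping can be checked mechanically. Below is a small NumPy sketch of the addressing function (the function name is ours; columns are 0-indexed here, whereas equation 3.1 is 1-indexed), applied to the $T_k$ of equation 3.2:

```python
import numpy as np

def feature_matrix(T):
    """Addressing function A (equation 3.1): map each n-tuple row of T to an
    indicator row holding a single 1 at the column given by the tuple's
    binary address (0-indexed here, 1-indexed in the text)."""
    T = np.asarray(T)
    N, n = T.shape
    addr = T @ (2 ** np.arange(n - 1, -1, -1))  # binary address of each row
    Z = np.zeros((N, 2 ** n), dtype=int)
    Z[np.arange(N), addr] = 1
    return Z

# The tuple matrix T_k of equation 3.2 reproduces the Z_k of equation 3.3
T_k = np.array([[1, 0], [0, 1], [1, 0], [1, 0], [0, 1], [1, 0]])
Z_k = feature_matrix(T_k)
```

Each row of the result holds a single 1, reflecting the fact that each $n$-tuple addresses exactly one memory position.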

### 3.2  The Basic WiSARD $n$-Tuple Classifier

The mathematical formulation of the basic WiSARD $n$-tuple classifier focuses on the representation of its discriminators and on how to characterize the WiSARD model as a function that receives a feature matrix $Z_k$ and outputs a class $c \in \{-1,1\}$.

#### 3.2.1  Notation

Let $W_{N,n}$ be the family of functions that denotes every two-class $n$-tuple classifier defined by parameters $N$ and $n$ ($W$ stands for WiSARD). Also let $W_{N,n}(\cdot\,; M_{-1}, M_1) \in W_{N,n}$ be a function representing an $n$-tuple classifier instance defined as above, whose learned knowledge on any class $c$ is characterized by memory matrix $M_c \in \{0,1\}^{N \times 2^n}$. Every network instance $W_{N,n}(\cdot\,; M_{-1}, M_1)$ has two discriminators, denoted $D_{-1}$ and $D_1$. They are, respectively, represented by $M_{-1}$ and $M_1$, the memory matrices that store their learned knowledge. $M_c$ is defined as $M_c = [m_{c,i,j}]_{N \times 2^n}$, where $m_{c,i,j} \in \{0,1\}$.

For any feature matrix $Z_k$ mapped from an input $x_k \in X$, $W_{N,n}(Z_k; M_{-1}, M_1)$ produces a score vector $s(Z_k) = [s_{-1}(Z_k), s_1(Z_k)]^T$, where $s_c(Z_k) \in \{0,1,\ldots,N\}$ is the score of $D_c$, defined as
$s_c(Z_k) = \sum_{i=1}^{N} \sum_{j=1}^{2^n} m_{c,i,j} \, z_{k,i,j},$
3.4
where $z_{k,i,j}$ is the element at the $i$th row and $j$th column of $Z_k$. Given the score vector $s$, function $W_{N,n}(Z_k; M_{-1}, M_1)$ returns the corresponding class as
$W_{N,n}(Z_k; M_{-1}, M_1) = \operatorname*{arg\,max}_{c \in \{-1,1\}} s_c(Z_k).$
3.5

The notation introduced here is condensed in Table 2.

Table 2: Notation for the Basic $n$-Tuple Classifier.

| Symbol | Description |
| --- | --- |
| $N$ | Number of nodes |
| $n$ | Addressing tuple length |
| $W_{N,n}$ | $N$-node $n$-tuple classifying model |
| $c$ | Generic class ($-1$ or 1) |
| $M_{-1}$ | Memory matrix of class $-1$ |
| $M_1$ | Memory matrix of class 1 |
| $M_c$ | Memory matrix of class $c$ |
| $m_{c,i,j}$ | Element at $i$th row and $j$th column of $M_c$ |
| $W_{N,n}(\cdot\,; M_{-1}, M_1)$ | $N$-node $n$-tuple classifier with memory matrices $M_{-1}$ and $M_1$ |
| $D_{-1}$ | Discriminator for class $-1$ |
| $D_1$ | Discriminator for class 1 |
| $D_c$ | Discriminator for class $c$ |
| $d_{c,i}$ | $i$th memory node of $D_c$ |
| $s_{-1}(Z_k)$ | Score of $D_{-1}$ for $Z_k$ |
| $s_1(Z_k)$ | Score of $D_1$ for $Z_k$ |
| $s(Z_k)$ | Score vector $s(Z_k) = [s_{-1}(Z_k), s_1(Z_k)]^T$ |

#### 3.2.2  Example

The WiSARD of Figure 1b is characterized by its number of RAM nodes, $N=6$; their address length, $n=2$; and the contents of its two discriminators, $D_{-1}$ and $D_1$, which are mathematically represented by memory matrices
$M_{-1} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$
3.6
and
$M_1 = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},$
3.7
respectively. Each row of these matrices corresponds to the content of its respective RAM node. For instance, the first row of $M_1$ is $[0\;0\;1\;0]$, which is exactly the same sequence of values in the memory positions of $d_{1,1}$.

The $n$-tuple classifier of Figure 1b produces scores $s_{-1}(Z_k) = 1$ and $s_1(Z_k) = 6$, which are the outputs of $D_{-1}$ and $D_1$. Because the chosen class is the one whose score is the highest, that $n$-tuple classifier should opt for 1 as the class to be assigned to input $x_k = 10{:}01{:}10{:}10{:}01{:}10$.
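Equation 3.4 reduces the score to an elementwise product followed by a sum, so the numbers in this example can be reproduced directly (matrix values transcribed from equations 3.3, 3.6, and 3.7; variable names are ours):

```python
import numpy as np

# Feature matrix Z_k (equation 3.3) and memory matrices (equations 3.6 and 3.7)
Z = np.array([[0,0,1,0], [0,1,0,0], [0,0,1,0], [0,0,1,0], [0,1,0,0], [0,0,1,0]])
M_neg = np.array([[0,1,0,0], [0,0,1,0], [0,1,0,0], [0,0,1,0], [0,0,1,0], [0,0,0,1]])
M_pos = np.array([[0,0,1,0], [0,1,0,0], [0,0,1,0], [0,0,1,0], [0,1,0,0], [0,0,1,0]])

# Equation 3.4: s_c = sum over all (i, j) of m_{c,i,j} * z_{k,i,j}
s_neg = int((M_neg * Z).sum())  # 1
s_pos = int((M_pos * Z).sum())  # 6, so class 1 is chosen
```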

### 3.3  The WiSARD $n$-Tuple Classifier with Bleaching

The mathematical formulation of the WiSARD $n$-tuple classifier with bleaching focuses on the same concerns as its basic counterpart. The introduction of bleaching threshold $β$ leads to a noticeable change in the definition of the score measure of this classifier.

#### 3.3.1  Notation

Let $B_{N,n}$ be the family of functions that denotes every two-class bleaching $n$-tuple classifier defined by parameters $N$ and $n$ ($B$ stands for bleaching). Let $B_{N,n}(\cdot\,; M_{-1}, M_1) \in B_{N,n}$ be a function representing a bleaching $n$-tuple classifier instance, whose learned knowledge on any class $c$ is characterized by memory matrix $M_c \in \mathbb{N}^{N \times 2^n}$. Like its binary counterpart, every network instance $B_{N,n}(\cdot\,; M_{-1}, M_1)$ has two discriminators, denoted $D_{-1}$ and $D_1$. They are, respectively, represented by $M_{-1}$ and $M_1$, the memory matrices that store their learned knowledge. $M_c$ is defined as $M_c = [m_{c,i,j}]_{N \times 2^n}$, where $m_{c,i,j} \in \mathbb{N}$.

Let $s^\beta(Z_k) = [s_{-1}^\beta(Z_k), s_1^\beta(Z_k)]^T$ be the score vector of the bleaching $n$-tuple classifier for a given fixed threshold $\beta$. The scores $s_c^\beta \in \{0,1,\ldots,N\}$ represent the number of nodes of discriminator $D_c$ whose addressed positions store values greater than $\beta$. So they can be defined as
$s_c^\beta = \sum_{i=1}^{N} \sum_{j=1}^{2^n} \mathbf{1}_{(\beta,\infty)}(m_{c,i,j}) \, z_{k,i,j},$
3.8
where $\mathbf{1}_S(x)$ is the indicator function that returns 1 if $x \in S$ and 0 otherwise. $m_{c,i,j}$ and $z_{k,i,j}$ are, respectively, elements of matrices $M_c$ and $Z_k$.

For any feature matrix $Z_k$ mapped from an input $x_k$, $B_{N,n}$ returns the corresponding class as
$B_{N,n}(Z_k; M_{-1}, M_1) = \operatorname*{arg\,max}_{c \in \{-1,1\}} s_c^\beta(Z_k),$
3.9
where $\beta$ is the smallest threshold that satisfies $s_{-1}^\beta(Z_k) \neq s_1^\beta(Z_k)$. In other words, the chosen class is the one whose score is the highest, given the lowest bleaching threshold $\beta$ for which there is no tie between the discriminators. Similar to the basic model, if there is no such threshold, the classification system may apply some policy to choose one class or use a draw procedure. The notation introduced here is condensed in Table 3.
Table 3: Notation for the Bleaching $n$-Tuple Classifier.

| Symbol | Description |
| --- | --- |
| $B_{N,n}$ | Bleaching $N$-node $n$-tuple classifying model |
| $B_{N,n}(\cdot\,; M_{-1}, M_1)$ | Bleaching $N$-node $n$-tuple classifier with memory matrices $M_{-1}$ and $M_1$ |
| $\beta$ | Bleaching threshold |
| $s_{-1}^\beta(Z_k)$ | Score of $D_{-1}$ for $Z_k$ and threshold $\beta$ |
| $s_1^\beta(Z_k)$ | Score of $D_1$ for $Z_k$ and threshold $\beta$ |
| $s^\beta(Z_k)$ | Score vector $s^\beta(Z_k) = [s_{-1}^\beta(Z_k), s_1^\beta(Z_k)]^T$ |

#### 3.3.2  Example

Figure 2 presents a bleaching $n$-tuple classifier with two discriminators, $D_{-1}$ and $D_1$. They are mathematically represented by memory matrices
$M_{-1} = \begin{bmatrix} 0 & 2 & 0 & 1 \\ 0 & 0 & 3 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 2 & 0 & 1 \end{bmatrix}$
3.10
and
$M_1 = \begin{bmatrix} 0 & 0 & 2 & 1 \\ 0 & 2 & 0 & 1 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 2 & 0 & 1 \\ 0 & 0 & 2 & 1 \end{bmatrix},$
3.11
respectively. One can note that the same relation between memory matrix rows and RAM node contents mentioned in section 3.2.1 applies to the bleaching variants of $M_{-1}$ and $M_1$. Figure 2 shows the scores produced by the bleaching $n$-tuple classifier when subjected to a bleaching threshold $\beta = 2$. In this case, the learning machine produces scores $s_{-1}^2(Z_k) = 0$ and $s_1^2(Z_k) = 2$.
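As with the basic model, equation 3.8 is a short computation; the scores of this example can be reproduced directly (matrix values transcribed from equations 3.3, 3.10, and 3.11; variable names are ours):

```python
import numpy as np

# Feature matrix Z_k (equation 3.3) and integer memory matrices (3.10 and 3.11)
Z = np.array([[0,0,1,0], [0,1,0,0], [0,0,1,0], [0,0,1,0], [0,1,0,0], [0,0,1,0]])
M_neg = np.array([[0,2,0,1], [0,0,3,0], [0,2,1,0], [0,2,1,0], [0,0,3,0], [0,2,0,1]])
M_pos = np.array([[0,0,2,1], [0,2,0,1], [0,0,3,0], [0,0,3,0], [0,2,0,1], [0,0,2,1]])

beta = 2
# Equation 3.8: count addressed positions whose stored value exceeds beta
s_neg = int(((M_neg > beta) * Z).sum())  # 0
s_pos = int(((M_pos > beta) * Z).sum())  # 2
```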

## 4  The VC Dimension of the Basic WiSARD $n$-Tuple Classifier

Studies on the VC dimension of the basic WiSARD $n$-tuple classifier were first conducted by Bradshaw (1996). The work intended to find generalization bounds on this learning system as a means to allow potential comparisons with other machine learning models, as well as to find ways to improve its generalization and mitigate the saturation problem.

Bradshaw (1996) achieved an exact value for the VC dimension of the maximal discriminator $n$-tuple classifier, that is, a network with a single discriminator, which accepts an input if its score is maximum and rejects it otherwise. Bradshaw (1996) found $d_{VC} = N(2^n - 1)$ for this model, where $d_{VC}$ is its VC dimension, and $N$ and $n$ are the number of nodes and the addressing tuple length, respectively.

No exact value was found for the VC dimension of the two-discriminator $n$-tuple classifier. Bradshaw's (1996) studies established lower and upper bounds for this dimension, asserting that $N(2^n - 1) \le d_{VC} \le \log_2 3 \cdot N 2^n$. Finally, Bradshaw (1996) conjectured that an exact value for the VC dimension of the two-discriminator architecture is attainable and that it is $d_{VC} = N(2^n - 1)$.

It is important to raise some points before entering the proof itself. First, only $n$-tuple classifiers with two discriminators are considered, despite the multiclass potential of the model. This is done because the VC dimension is defined on the idea of dichotomies (see section 2.1). Second, the addressing function is pseudo-randomly generated at the network setup and is not changed afterward. Inserting new knowledge into the network does not affect its pseudo-random mapping, so the mapping function does not count as an effective parameter of the $n$-tuple classifying learning model and plays no role in its learning capacity. This way, every time an input observation should be considered, its equivalent feature matrix is used instead, which eases the comprehension of the proof without jeopardizing it. Third, an arbitrary class must be chosen in the case of a draw; this work uses 1 as the opted class. Finally, a sketch of the proof is presented next.

### 4.1  Sketch of the Proof

The proof of the VC dimension is divided into two parts: the proof of its lower bound and the proof of its upper bound. The flowchart in Figure 3 follows the idea of the proof. To prove the first part, a set of points is defined such that its cardinality is the intended lower bound. Then it is shown that for each possible class attribution, there is one $n$-tuple classifier capable of classifying that set of points accordingly. This proves that the $n$-tuple classifying model can shatter that set of points, concluding this part of the proof. To prove the upper bound, a system of linear inequalities is produced. This system is based on the score measure of equation 3.4. It represents the classification of a generic set of points into a generic set of classes of the same cardinality, larger than the intended upper bound. Next, it is proved that there is no solution to such a system of inequalities. There is thus at least one assignment of classes under which no classifier can correctly label a set of points that large, so such a set is not shattered. This concludes the proof of the upper bound. Given that the lower and upper bounds are identical, the VC dimension has an exact value.

Figure 3:

Proof flowchart for determination of the VC dimension of the basic WiSARD $n$-tuple classifier.


### 4.2  Lower Bound

As explained in section 4.1, the proof of the lower bound of the VC dimension is attained by construction. Given an ordered set of points $X=\{x_1,x_2,\ldots\}$, with every point $x_k$ having an associated feature matrix $Z_k$, for every class vector $y=[y_1,y_2,\cdots]^T$ of the same cardinality as $X$, there must exist an $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)\in\mathcal{W}_{N,n}$, such that $W_{N,n}(Z_k;M_{-1},M_1)=y_k$ for all $k$. $X$ must be such that its cardinality is equal to the intended lower bound for the VC dimension.

Lemma 1.

The VC dimension, $d_{VC}$, of a model $\mathcal{W}_{N,n}$, as defined in section 3.2.1, is bounded below by $N(2^n-1)+1$.

Proof.
Let $X$ be an ordered set of input data points defined as
$X=\overbrace{\{0,1\}^n{:}\,0^n{:}\cdots{:}\,0^n}^{N}\;\cup\;0^n{:}\{0,1\}^n{:}\cdots{:}\,0^n\;\cup\;\cdots\;\cup\;0^n{:}\,0^n{:}\cdots{:}\{0,1\}^n$
4.1
and $Z$ its corresponding set of feature matrices.

It is worth noting that $X$ is the set of every input point with at most one non-null addressing $n$-tuple. Its first line refers to all input points whose first addressing $n$-tuple can be any binary sequence, but the remaining ones must be all-zero $n$-tuples. Its second line refers to the input points whose second $n$-tuple can be any sequence of binary values and the remaining ones must be composed only of zeros. The same pattern applies to every other line of $X$.
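The structure just described can be enumerated mechanically. The snippet below (an illustrative sketch; `build_X` is our own name) generates every input with at most one non-null addressing $n$-tuple and verifies that the cardinality of the resulting set is $N(2^n-1)+1$:

```python
from itertools import product

def build_X(N, n):
    """All inputs with at most one non-null addressing n-tuple (eq. 4.1)."""
    zero = (0,) * n
    X = {zero * N}                       # the all-zero input, counted once
    for i in range(N):                   # place any n-tuple in slot i
        for t in product((0, 1), repeat=n):
            X.add(zero * i + t + zero * (N - 1 - i))
    return X

# Each slot contributes 2**n points, all sharing the all-zero input,
# hence the cardinality N * (2**n - 1) + 1.
for N, n in [(2, 2), (3, 2), (2, 3), (4, 4)]:
    assert len(build_X(N, n)) == N * (2 ** n - 1) + 1
```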

The main intuition behind the choice of $X$ lies in the fact that it is the largest set of input points avoiding an undesirable property: containing a subset of four points in which each point differs from two of the others by a single $n$-tuple—for example, the set composed of $00{:}00$, $01{:}00$, $01{:}01$, and $00{:}01$. Such a set of points, or any set that contains it, cannot be shattered: there is no WiSARD $n$-tuple classifier that can classify $00{:}00$ and $01{:}01$ as 1 and $01{:}00$ and $00{:}01$ as $-1$.

For clarity, it is assumed, without loss of generality, that $x_1$, the first element of $X$, is $0^n{:}\,0^n{:}\cdots{:}\,0^n$, and $Z_1$ is its corresponding feature matrix. This way, the matching class of this input should be $y_1$. To attest that $\mathcal{W}_{N,n}$ shatters $X$, one should, for every class vector $y$, be capable of building an $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)\in\mathcal{W}_{N,n}$, such that $W_{N,n}(Z_k;M_{-1},M_1)=y_k$ for all $k$. In order to do it, this procedure is done in two cases, each one related to a distinct set of class vectors: case I, the class vectors in which $y_1=1$, and case II, those in which $y_1=-1$. In other words, case I attests whether $\mathcal{W}_{N,n}$ can realize half of all possible dichotomies of $X$, and case II whether $\mathcal{W}_{N,n}$ can realize the other half. It is worth noting that the union of both cases comprises all $2^{\mathrm{card}(X)}$ possible dichotomies that can be realized on set $X$, and so if $\mathcal{W}_{N,n}$ can realize both sets of dichotomies on $X$, then $\mathcal{W}_{N,n}$ shatters $X$.

Case I, $y_1=1$: Given a class vector $y=[1,y_2,y_3,\ldots]^T$, let $M_{-1}$ and $M_1$ be memory matrices whose elements $m_{c,i,j}$ are given by¹
$m_{c,i,j}=\begin{cases}1, & \text{if } t_{k,i}\neq 0^n \text{ addresses the } j\text{th position of } d_{c,i} \text{ and } y_k=c,\\ 0, & \text{otherwise,}\end{cases}$
4.2
and let there be an $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)\in\mathcal{W}_{N,n}$ whose memory matrices are defined according to equation 4.2.

If an input $x_k\in X\setminus\{0^n{:}\,0^n{:}\cdots{:}\,0^n\}$ is presented to the classifier, it yields a score vector $s(Z_k)=[1,0]^T$ if $y_k=-1$, or $s(Z_k)=[0,1]^T$ if $y_k=1$. This happens because only one non-null position is addressed. If $W_{N,n}(\cdot\,;M_{-1},M_1)$ receives $x_1=0^n{:}\,0^n{:}\cdots{:}\,0^n$, its score vector is $s(Z_1)=[0,0]^T$, which leads to a draw, making the network opt for class 1 as default. In short, for any class vector $y=[1,y_2,y_3,\ldots]^T$, there exists a basic $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)$, such that it accurately classifies any $x_k\in X\setminus\{0^n{:}\,0^n{:}\cdots{:}\,0^n\}$ as $y_k$ and also classifies $x_1=0^n{:}\,0^n{:}\cdots{:}\,0^n$ as $y_1=1$.

Case II, $y_1=-1$: Given a class vector $y=[-1,y_2,y_3,\ldots]^T$, let $M_{-1}$ and $M_1$ be memory matrices whose elements $m_{c,i,j}$ are given by
$m_{c,i,j}=\begin{cases}1, & \text{if } t_{k,i}\neq 0^n \text{ addresses the } j\text{th position of } d_{c,i} \text{ and } y_k=c, \text{ or if } (c,i,j)=(-1,1,1),\\ 0, & \text{otherwise,}\end{cases}$
4.3
where $(c,i,j)=(-1,1,1)$ represents the memory position of $d_{-1,1}$ that is addressed by $0^n$. Let there also be an $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)\in\mathcal{W}_{N,n}$ whose memory matrices are defined according to equation 4.3.

Equation 4.3 differs from equation 4.2 in a single element. This difference alters the score vector $s$ by adding 1 to $s_{-1}$ whenever the first $n$-tuple of an input $x_k$ is $0^n$. In other words, if an input $x_k\in\{\{0,1\}^n{:}\,0^n{:}\cdots{:}\,0^n\}\setminus\{0^n{:}\,0^n{:}\cdots{:}\,0^n\}$ is presented to the classifier, it yields the same score vectors it did in case I: $s(Z_k)=[1,0]^T$ if $y_k=-1$, or $s(Z_k)=[0,1]^T$ if $y_k=1$. For the other inputs, the score vector changes as follows. In case I, when the classifier received $x_1=0^n{:}\,0^n{:}\cdots{:}\,0^n$, it yielded score vector $s(Z_1)=[0,0]^T$, leading to a draw. In case II, no draw happens because the classifier outputs score vector $s(Z_1)=[1,0]^T$, favoring class $-1$. The remaining possible inputs are those of type $x_k\in X\setminus\{\{0,1\}^n{:}\,0^n{:}\cdots{:}\,0^n\}$. When subjected to these inputs, the classifier in case I produced score vectors $s(Z_k)=[1,0]^T$ if $y_k=-1$ and $s(Z_k)=[0,1]^T$ if $y_k=1$. Because the first $n$-tuple of $x_k$ is $0^n$, the classifier in case II generates $s(Z_k)=[2,0]^T$ if $y_k=-1$, and $s(Z_k)=[1,1]^T$ if $y_k=1$. The former favors its corresponding class because $s_{-1}(Z_k)>s_1(Z_k)$. In the latter, a tie happens, and as default, the network opts for class 1. In short, for any class vector $y=[-1,y_2,\ldots]^T$, there exists a basic $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)$, such that (1) it accurately classifies any $x_k\in\{\{0,1\}^n{:}\,0^n{:}\cdots{:}\,0^n\}\setminus\{0^n{:}\,0^n{:}\cdots{:}\,0^n\}$ as $y_k$, (2) it does the same for $x_k\in X\setminus\{\{0,1\}^n{:}\,0^n{:}\cdots{:}\,0^n\}$, and (3) it classifies $x_1=0^n{:}\,0^n{:}\cdots{:}\,0^n$ as $y_1=-1$.

In summary, for every class vector $y$, there exists an $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)\in\mathcal{W}_{N,n}$, such that $W_{N,n}(Z_k;M_{-1},M_1)=y_k$ for all $k$. That is, $\mathcal{W}_{N,n}$ realizes all $2^{\mathrm{card}(X)}$ dichotomies of $X$. So $\mathcal{W}_{N,n}$ shatters $X$, and the VC dimension of this model is bounded below by $\mathrm{card}(X)=N2^n-(N-1)=N(2^n-1)+1$.

$□$
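For small classifiers, the whole construction can be verified exhaustively. The sketch below (variable and helper names are ours) builds the memory matrices of equations 4.2 and 4.3 for every class vector over $X_{2,2}$ and checks that each resulting classifier realizes its dichotomy, that is, that $\mathcal{W}_{2,2}$ shatters $X_{2,2}$:

```python
from itertools import product

N, n = 2, 2
zero = (0,) * n

# X: every input with at most one non-null addressing n-tuple (eq. 4.1),
# reordered so that x1 is the all-zero input.
X = sorted({zero * i + t + zero * (N - 1 - i)
            for i in range(N) for t in product((0, 1), repeat=n)})
X.remove(zero * N)
X.insert(0, zero * N)

def addresses(x):
    return [int(''.join(map(str, x[i * n:(i + 1) * n])), 2) for i in range(N)]

def classify(mem, x):
    s = {c: sum(mem[c][i][a] for i, a in enumerate(addresses(x))) for c in (-1, 1)}
    return 1 if s[1] >= s[-1] else -1        # draw defaults to class 1

for y in product((-1, 1), repeat=len(X)):    # every possible dichotomy of X
    mem = {c: [[0] * 2 ** n for _ in range(N)] for c in (-1, 1)}
    for xk, yk in zip(X, y):
        for i, a in enumerate(addresses(xk)):
            if a != 0:                       # eq. 4.2: null tuples set nothing
                mem[yk][i][a] = 1
    if y[0] == -1:                           # eq. 4.3: case II correction
        mem[-1][0][0] = 1
    assert all(classify(mem, xk) == yk for xk, yk in zip(X, y))
```

The loop runs over all $2^7=128$ class vectors, confirming by brute force the shattering argument of lemma 1 for $N=n=2$.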

As a graphical representation of this proof, we offer examples for each particular case ($y_1=1$ and $y_1=-1$). The examples describe dichotomies that a $\mathcal{W}_{2,2}$ $n$-tuple classifying model ($N=2$ and $n=2$) realizes on a set of input points. Applying $N=2$ and $n=2$ to the set $X$, displayed in equation 4.1, one gets
$X_{2,2}=\{00{:}00,\,01{:}00,\,10{:}00,\,11{:}00,\,00{:}01,\,00{:}10,\,00{:}11\}.$
4.4
Two of the $2^7$ dichotomies $\mathcal{W}_{2,2}$ realizes on $X_{2,2}$ are presented, one for each particular case. In both chosen dichotomies, $10{:}00$, $00{:}10$, and $00{:}11$ are classified as 1, and $01{:}00$, $11{:}00$, and $00{:}01$ as $-1$. Input point $00{:}00$ (the 2-node 2-tuple representation of input $x_1=0^n{:}\,0^n{:}\cdots{:}\,0^n$) is classified as $y_1$, which is 1 in case I and $-1$ in case II. These dichotomies are set out in Table 4.
Table 4:
Dichotomies of $X_{2,2}$.

| Classified as 1 | Classified as $-1$ | Classified as $y_1$ |
|---|---|---|
| $10{:}00$ | $01{:}00$ | $00{:}00$ |
| $00{:}10$ | $11{:}00$ | |
| $00{:}11$ | $00{:}01$ | |

The dichotomies displayed in Table 4 imply the WiSARD configurations of Figures 4a and 4b, which are produced by equations 4.2 and 4.3, respectively. Because the corresponding class of inputs $10{:}00$, $00{:}10$, and $00{:}11$ is 1, position 10 of $d_{1,1}$ and positions 10 and 11 of $d_{1,2}$ store the value 1 in both figures. Analogously, inputs $01{:}00$, $11{:}00$, and $00{:}01$ are associated with class $-1$, and so positions 01 and 11 of $d_{-1,1}$ and position 01 of $d_{-1,2}$ also store 1 in both figures. Finally, in Figure 4b, the content of position 00 of $d_{-1,1}$ is also 1, due to equation 4.3.

Figure 4:

WiSARD configurations.


Five inputs are provided for this example: $00{:}00$, $01{:}00$, $10{:}00$, $00{:}01$, and $00{:}10$. The first one is the 2-node 2-tuple representation of $x_1=0^n{:}\,0^n{:}\cdots{:}\,0^n$. The classifier in case I (depicted in Figure 4a) outputs score vector $s(Z_1)=[0,0]^T$, and that in case II (displayed in Figure 4b) outputs $s(Z_1)=[1,0]^T$. The generation of those score vectors is illustrated in Figure 5. The second and third inputs are of type $x_k\in\{\{0,1\}^n{:}\,0^n{:}\cdots{:}\,0^n\}\setminus\{0^n{:}\,0^n{:}\cdots{:}\,0^n\}$, respectively classified as $-1$ and 1. Both classifiers would produce the score vector $s(Z_k)=[1,0]^T$ for $01{:}00$ and $s(Z_k)=[0,1]^T$ for $10{:}00$ (see Figures 6 and 7). The last two examples are inputs of type $x_k\in X\setminus\{\{0,1\}^n{:}\,0^n{:}\cdots{:}\,0^n\}$, also respectively classified as $-1$ and 1. In case I, the classification of these inputs leads to the same score vectors yielded from the classification of $01{:}00$ (see Figures 6a and 8a) and $10{:}00$ (see Figures 7a and 9a). In case II, the score vectors generated this way should be $s(Z_k)=[2,0]^T$ for $00{:}01$ and $s(Z_k)=[1,1]^T$ for $00{:}10$, as can be checked in Figures 8b and 9b, respectively.

Figure 5:

Classification of $00:00$.


Figure 6:

Classification of $01:00$.


Figure 7:

Classification of $10:00$.


Figure 8:

Classification of $00:01$.


Figure 9:

Classification of $00:10$.


### 4.3  Upper Bound

The VC dimension upper bound of an $n$-tuple classifying model is obtained through a linear algebra approach. In section 4.1, we advanced that a system of linear inequalities should be derived from the score formula presented in equation 3.4. The relation between the values of the scores $s_{-1}(Z_k)$ and $s_1(Z_k)$ plays a vital role in determining the suitable class for a given input, and this is the main intuition behind the development of that system of inequalities.

Proposition 1.
An $n$-tuple classifying model $\mathcal{W}_{N,n}$ shatters a set of feature matrices $Z$, $\mathrm{card}(Z)=\ell$, if and only if, given a generic class vector $y\in\{-1,1\}^\ell$, there exists a set of memory matrix elements $m_{c,i,j}\in\{0,1\}$ that solves
$\sum_{i=1}^{N}\left(1-\sum_{j=2}^{2^n}z_{k,i,j}\right)\left(m_{1,i,1}-m_{-1,i,1}\right)+\sum_{i=1}^{N}\sum_{j=2}^{2^n}z_{k,i,j}\left(m_{1,i,j}-m_{-1,i,j}\right)=y_k\mu_k-\frac{1}{2},$
4.5
for all $k\in\{1,\ldots,\ell\}$ and some $\mu_k\in\{\frac{1}{2},\frac{3}{2},\ldots,\frac{2N+1}{2}\}$, where $z_{k,i,j}$ and $y_k$ are defined as in section 3.1.1.
Proof.
Let $\nu_k$ be the difference between scores $s_1(Z_k)$ and $s_{-1}(Z_k)$, that is,
$s_1(Z_k)-s_{-1}(Z_k)=\nu_k,\qquad \sum_{i=1}^{N}\sum_{j=1}^{2^n}z_{k,i,j}\left(m_{1,i,j}-m_{-1,i,j}\right)=\nu_k.$
4.6
Because both $z_{k,i,j}$ and $m_{c,i,j}$ can be either 1 or 0, each addend of equation 4.6 can be $-1$, 0, or 1. It is worth remembering, however, that for any $i$, $z_{k,i,j}=1$ for exactly one $j$ and is otherwise 0. As a consequence, $\nu_k$ can assume any integer value in the interval $[-N,N]$. If it is negative, the classifier should choose class $-1$; otherwise, it should choose 1 (even if there is a draw). Also, given the property concerning the elements $z_{k,i,j}\in Z_k$, it is straightforward that
$z_{k,i,q}=1-\sum_{\substack{j=1\\ j\neq q}}^{2^n}z_{k,i,j}.$
4.7
In order to represent $\nu_k$ as a function of label $y_k$, the variable $\mu_k\in\{\frac{1}{2},\frac{3}{2},\ldots,\frac{2N+1}{2}\}$ is introduced. This way,
$\nu_k=y_k\mu_k-\frac{1}{2}.$
4.8
This way, one gets the difference of the scores as a linear function of $y_k$. This can be checked by verifying that when $y_k=1$, $\nu_k\in\{0,1,\ldots,N\}$, that is, any possible nonnegative score difference between discriminators with $N$ memory nodes. It comprises the case of a tie between discriminators ($\nu_k=0$) up to the case where every node of $D_1$ fires and none of $D_{-1}$ does ($\nu_k=N$). When $y_k=-1$, $\nu_k\in\{-1,-2,\ldots,-N-1\}\supset\{-1,-2,\ldots,-N\}$, that is, any possible negative score difference between discriminators with $N$ memory nodes. It comprises all possible cases where the score of $D_{-1}$ is greater than that of $D_1$, up to the case where every node of $D_{-1}$ fires and none of $D_1$ does ($\nu_k=-N$). Although identity 4.8 allows $\nu_k$ to be $-N-1$, that value never truly occurs. This poses no problem, nevertheless, as every score difference $\nu_k$ can be successfully represented as a linear function of its respective class $y_k$, as intended.
Combining equations 4.6, 4.7 (for $q=1$, without loss of generality), and 4.8 yields
$\sum_{i=1}^{N}\left(1-\sum_{j=2}^{2^n}z_{k,i,j}\right)\left(m_{1,i,1}-m_{-1,i,1}\right)+\sum_{i=1}^{N}\sum_{j=2}^{2^n}z_{k,i,j}\left(m_{1,i,j}-m_{-1,i,j}\right)=y_k\mu_k-\frac{1}{2}.$
4.9

So $M_{-1}$ and $M_1$ must be such that equation 4.9 holds true for an $n$-tuple classifier $W_{N,n}(\cdot\,;M_{-1},M_1)$ to correctly categorize a feature matrix $Z_k$. As an immediate consequence, if there exist $M_{-1}$ and $M_1$ such that equation 4.9 holds true for all $k\in\{1,\ldots,\ell\}$, then $W_{N,n}(\cdot\,;M_{-1},M_1)$ accurately assigns every feature matrix $Z_k\in Z$ to its corresponding class $y_k\in y$, because the score difference agrees in sign with $y_k$, regardless of the value of $\mu_k$.

Finally, if those same memory matrices satisfy equation 4.9 for all $k\in\{1,\ldots,\ell\}$ and for any generic class vector $y\in\{-1,1\}^\ell$, then for all $2^\ell$ possible instances of $y$, $W_{N,n}(\cdot\,;M_{-1},M_1)$ can precisely classify every $Z_k\in Z$. In other words, the model $\mathcal{W}_{N,n}$ shatters $Z$.

$□$
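The encoding of score differences used in the proof above can be checked numerically. The sketch below (variable names are ours) verifies that, with $\mu_k\in\{\frac{1}{2},\frac{3}{2},\ldots,\frac{2N+1}{2}\}$, identity 4.8 yields exactly the range $\{0,\ldots,N\}$ when $y_k=1$ and $\{-1,\ldots,-N-1\}$ when $y_k=-1$ (here for $N=4$):

```python
from fractions import Fraction

N = 4
half = Fraction(1, 2)
mus = [half + k for k in range(N + 1)]      # μ ∈ {1/2, 3/2, ..., (2N+1)/2}

# ν = y·μ − 1/2, evaluated for both labels over all admissible μ.
pos = {1 * mu - half for mu in mus}
neg = {-1 * mu - half for mu in mus}

assert pos == set(range(0, N + 1))          # y = 1  → ν ∈ {0, ..., N}
assert neg == set(range(-N - 1, 0))         # y = -1 → ν ∈ {-N-1, ..., -1}
```

The spurious value $-N-1$ appears in the second set, matching the remark that identity 4.8 allows it even though it never occurs as an actual score difference.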

Next, it is intended to show that if the number $\ell$ of equations of the existing system (see equation 4.5) is larger than a given upper bound, then there are no memory matrices $M_{-1}$ and $M_1$ that satisfy it for every possible class vector $y\in\{-1,1\}^\ell$.

Proposition 2.

There exist no memory matrices $M_{-1}$ and $M_1$ for which the system of equations 4.5 holds true for a generic class vector $y\in\{-1,1\}^\ell$ if the number of equations $\ell>N(2^n-1)+1$.

Proof.

The system of equations 4.5 is linear in $m_{c,i,j}$, and so it can be expressed in the form $Ax=b$. According to the Rouché–Capelli theorem (Meyer, 2000), a system of linear equations has a solution if and only if the rank of its coefficient matrix $A$ is equal to the rank of its augmented matrix $[A\mid b]$. Because many columns of the coefficient matrix of system 4.5 can be produced through linear combinations of others, eliminations must be done to determine its rank.

At first, the system of equations has $N2^{n+1}$ variables, and so the rank of its coefficient matrix is at most that same value. The coefficients of $m_{-1,i,j}$ and $m_{1,i,j}$ are, respectively, $-z_{k,i,j}$ and $z_{k,i,j}$ when $j\ge2$, and $-\left(1-\sum_{j=2}^{2^n}z_{k,i,j}\right)$ and $1-\sum_{j=2}^{2^n}z_{k,i,j}$ when $j=1$. So there is a linear dependency between the coefficients of $m_{-1,i,j}$ and $m_{1,i,j}$, and the coefficients of all variables $m_{-1,i,j}$ can thus be eliminated through a linear combination with those of $m_{1,i,j}$. This way, $N2^n$ columns are linearly dependent on others (one column for every pair $(i,j)$, $1\le i\le N$ and $1\le j\le 2^n$).

Further eliminations can be done. For every $i\ge2$, the coefficients of $m_{1,i,1}$ (namely, $z_{k,i,1}=1-\sum_{j=2}^{2^n}z_{k,i,j}$) can be represented as a linear combination of those of $m_{1,1,1}$ ($z_{k,1,1}=1-\sum_{j=2}^{2^n}z_{k,1,j}$) and those of $m_{1,1,j}$ ($z_{k,1,j}$) and $m_{1,i,j}$ ($z_{k,i,j}$) for every $j\ge2$. This can be verified in
$1-\sum_{j=2}^{2^n}z_{k,i,j}=\left(1-\sum_{j=2}^{2^n}z_{k,1,j}\right)+\sum_{j=2}^{2^n}z_{k,1,j}-\sum_{j=2}^{2^n}z_{k,i,j}.$
4.10
This procedure determines that, for every $i\ge2$, there is one column that is linearly dependent on others. This way, $N-1$ columns are linearly dependent on others and can be eliminated.

In summary, two significant elimination processes were made to determine that there are at most $N2^{n+1}-N2^n-(N-1)=N(2^n-1)+1$ linearly independent columns. Therefore, when the system of equations 4.5 has more than $N(2^n-1)+1$ equations, the rank of its coefficient matrix is at most $N(2^n-1)+1$. It is worth noting that no further linear relations could be retrieved, because that would drive the VC dimension below the lower bound determined by Lemma 1; in other words, no $n$-tuple classifier like the one assembled in that lemma could exist, which is impossible.

Because there is no prior linear relation between the class vector $y$ and the ordered set of feature matrices $Z$, the rank of the augmented matrix is greater than that of the coefficient matrix if their corresponding system has more than $N(2^n-1)+1$ equations. Hence, there is no solution for such a system of linear equations.

$□$
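The rank argument above can be confirmed computationally for a small instance. The sketch below (helper names are ours; exact arithmetic via `fractions.Fraction`) builds the coefficient matrix of system 4.5 over all $2^{Nn}$ possible inputs for $N=n=2$ and checks that its rank is $N(2^n-1)+1=7$:

```python
from fractions import Fraction
from itertools import product

N, n = 2, 2

def z_row(x):
    """Flattened feature matrix Z_k: one-hot encoding of each node's address."""
    row = []
    for i in range(N):
        a = int(''.join(map(str, x[i * n:(i + 1) * n])), 2)
        row += [1 if j == a else 0 for j in range(2 ** n)]
    return row

def rank(mat):
    """Row rank via exact Gaussian elimination over the rationals."""
    mat = [[Fraction(v) for v in row] for row in mat]
    r = 0
    for col in range(len(mat[0])):
        piv = next((i for i in range(r, len(mat)) if mat[i][col]), None)
        if piv is None:
            continue
        mat[r], mat[piv] = mat[piv], mat[r]
        for i in range(len(mat)):
            if i != r and mat[i][col]:
                f = mat[i][col] / mat[r][col]
                mat[i] = [a - f * b for a, b in zip(mat[i], mat[r])]
        r += 1
    return r

# Coefficient matrix of system 4.5 over ALL 2**(N*n) inputs: the column of
# m(1,i,j) carries z_{k,i,j} and the column of m(-1,i,j) carries -z_{k,i,j}.
A = [z_row(x) + [-v for v in z_row(x)]
     for x in product((0, 1), repeat=N * n)]

assert rank(A) == N * (2 ** n - 1) + 1      # here: 7
```

Since the matrix covers every possible input point, any subsystem built from a specific set of points has rank at most this value, which is the bound used in the proof.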

Due to the results of propositions 1 and 2, it is straightforward to determine an upper bound for the VC dimension of the basic $n$-tuple classifier:

Lemma 2.

The VC dimension, $d_{VC}$, of a model $\mathcal{W}_{N,n}$, as defined in section 3.2.1, is bounded above by $N(2^n-1)+1$.

Proof.

There is no solution for the system of equations 4.5 if it has at least $N(2^n-1)+2$ equations. This implies that $\mathcal{W}_{N,n}$ does not shatter any set of feature matrices $Z$ with $\mathrm{card}(Z)=N(2^n-1)+2$.

$□$

Theorem 1.

The VC dimension, $d_{VC}$, of a model $\mathcal{W}_{N,n}$, as defined in section 3.2.1, is given by $N(2^n-1)+1$.

Proof.

The VC dimension of $\mathcal{W}_{N,n}$ is bounded above and below by $N(2^n-1)+1$. Then it is exactly $N(2^n-1)+1$.

$□$

## 5  The VC Dimension of the WiSARD with Bleaching

The bleaching technique provided an advantage to the $n$-tuple classifier by mitigating its proneness to saturation. Results of experiments conducted with this architecture indicated a noticeable improvement over what could be achieved by the basic learning model (Grieco et al., 2010; de Souza et al., 2014, 2015; Carneiro et al., 2015; Nascimento et al., 2015; Cardoso et al., 2016). Historically, the $n$-tuple classifier accuracy tended to worsen once the amount of data to be learned by the system was large enough. The capability of learning from massive data without getting saturated showed that the classifier could achieve higher accuracies (with lower variance) as more data were presented to it. The model, however, lacked a theoretical foundation. A study of its generalization capacity could bring a better comprehension of how the network works and could also provide insights into how it could be further improved.

The proof of the VC dimension of the bleaching $n$-tuple classifier is divided into two parts, the proof of its lower bound and the proof of its upper bound, as depicted in Figure 10. The lower bound is immediately obtained from the fact that the bleaching $n$-tuple classifier is a generalization of its basic counterpart. The discriminators of the bleaching $n$-tuple classifier yield scores that depend on a given threshold $\beta$. The proof of the upper bound introduces a new score measure, which does not depend on a bleaching threshold. After it is properly defined, it must be proved that this score reproduces the intended outcome of the class selection criterion of the two-class bleaching $n$-tuple classifier, as introduced in section 2.2.2.

Figure 10:

Proof flowchart for determination of the VC dimension of bleaching $n$-tuple classifier.


The use of this score measure leads to a system of linear equations similar to the one of proposition 1. By a procedure similar to that of proposition 2, one deduces that the system of equations has no solution if it has more equations than the intended upper bound of the VC dimension. Therefore, this upper bound is attained. Again the lower and upper bounds are the same, and so the VC dimension of the bleaching $n$-tuple classifier has an exact value.

The proof of the VC dimension of the bleaching $n$-tuple classifying model starts by proving its lower bound:

Corollary 1.

The VC dimension, $d_{VC}$, of a model $\mathcal{B}_{N,n}$, as defined in section 3.3.1, is bounded below by $N(2^n-1)+1$.

Proof.

The bleaching $n$-tuple classifier is a generalization of the basic WiSARD, for the former works exactly like the latter if its memory matrices store only binary values. Consequently, $d_{VC}$ is bounded below by $N(2^n-1)+1$, the VC dimension of the basic $n$-tuple classifier, as indicated by theorem 1.

$□$

The proof of the upper bound of the VC dimension of the bleaching learning model requires the introduction of a score measure that does not depend on the bleaching threshold $\beta$, for the reasons exposed at the beginning of this section.

Definition 5.

Let $\Lambda_{-1}=[\lambda_{-1,i,j}]_{N\times2^n}$ and $\Lambda_1=[\lambda_{1,i,j}]_{N\times2^n}$ be matrices with the same dimensions as $M_{-1}$ and $M_1$, respectively. For a given class $c\in\{-1,1\}$, the elements of $\Lambda_c$ are defined as $\lambda_{c,i,j}=-N^{-m_{c,i,j}}$.

Definition 6.
Let $s(Z_k)=[s_{-1}(Z_k),s_1(Z_k)]^T$ be a $\beta$-independent score vector. For a given class $c\in\{-1,1\}$,
$s_c(Z_k)=\sum_{i=1}^{N}\sum_{j=1}^{2^n}\lambda_{c,i,j}z_{k,i,j}=-\sum_{i=1}^{N}\sum_{j=1}^{2^n}N^{-m_{c,i,j}}z_{k,i,j}.$
5.1
Corollary 2.
$B_{N,n}(Z_k;M_{-1},M_1)=\underset{c\in\{-1,1\}}{\operatorname{arg\,max}}\;s_c(Z_k).$
5.2
Proof.

Let $v_{-1,i}$ and $v_{1,i}\in\mathbb{N}$ respectively be the values of the accessed position of the $i$th node of $D_{-1}$ and $D_1$, hereafter denoted node scores. Suppose, without loss of generality, that $v_{-1,i}\ge v_{-1,j}$ and $v_{1,i}\ge v_{1,j}$ if $i<j$, that is, the nodes of both discriminators are sorted so that the highest values of the accessed positions correspond to the ones with lowest indices.

Suppose also, without loss of generality, that $B_{N,n}(Z_k;M_{-1},M_1)=1$, that is, that 1 was the class chosen by the classifier. According to the mathematical formulation given in section 3.3.1, there is a bleaching threshold $\beta$ that is the smallest one satisfying $s_1^\beta(Z_k)>s_{-1}^\beta(Z_k)$.

For a bleaching $n$-tuple classifier to produce such scores, there must exist an $N'\le N$ such that the learning machine has the following node scores: (1) $v_{1,i}\ge\beta+1$ and $v_{-1,i}\ge\beta$, for every $i<N'$; (2) $v_{1,N'}\ge\beta+1$; (3) $v_{-1,N'}=\beta$; and (4) $v_{1,i}=v_{-1,i}\le\beta$, for every $i>N'$. A classifier with these node scores opts for class 1 (as intended), because $\beta$ is the smallest threshold for which the tie is broken, with $s_1^\beta(Z_k)=N'$ and $s_{-1}^\beta(Z_k)\le N'-1$.

Because it was supposed, without loss of generality, that the classifier would select class 1, the $\beta$-independent score $s_1(Z_k)$ must be greater than $s_{-1}(Z_k)$ for the corollary statement to be proven true. As
$s_1(Z_k)\ge-\sum_{i=N'+1}^{N}N^{-v_{1,i}}-N'N^{-(\beta+1)},$
5.3
$s_{-1}(Z_k)<-\sum_{i=N'+1}^{N}N^{-v_{-1,i}}-N^{-\beta},$
5.4
and
$v_{1,i}=v_{-1,i}\le\beta,\quad\text{for every } i>N',$
5.5
then
$s_1(Z_k)-s_{-1}(Z_k)>-N'N^{-(\beta+1)}+N^{-\beta}\ge-N\cdot N^{-(\beta+1)}+N^{-\beta}=-N^{-\beta}+N^{-\beta}=0.$
5.6
That is, $s_1(Z_k)>s_{-1}(Z_k)$.

Supposing the selected class to be $-1$ would lead to an analogous proof, resulting in $s_{-1}(Z_k)>s_1(Z_k)$, as expected. Finally, in a draw scenario, every node score $v_{1,i}$ would equal $v_{-1,i}$, for every $i$; this way, the $\beta$-independent scores $s_1(Z_k)$ and $s_{-1}(Z_k)$ would also be equal.

Summarizing, when $s_1(Z_k)>s_{-1}(Z_k)$, the network opts for class 1; if $s_{-1}(Z_k)>s_1(Z_k)$, it chooses $-1$; and if both $\beta$-independent scores are identical, the classifier outputs that there is a draw. This confirms the identity proposed by this corollary.

$□$
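The equivalence stated by the corollary lends itself to a direct numerical check. The sketch below (function names are ours) compares, over random node scores, the class chosen by raising the bleaching threshold until the tie breaks against the argmax of the $\beta$-independent scores $s_c=-\sum_i N^{-v_{c,i}}$. It assumes the threshold convention used in the proof, where a node fires if its content exceeds $\beta$; the base of the exponent must be the number of nodes $N$, and draws are reported as 0 by both routines:

```python
from fractions import Fraction
import random

def bleach_classify(v_neg, v_pos):
    """Class choice by the bleaching rule: raise the threshold β from 0
    until the discriminator scores (number of contents > β) differ."""
    for beta in range(max(v_neg + v_pos) + 1):
        s_neg = sum(v > beta for v in v_neg)
        s_pos = sum(v > beta for v in v_pos)
        if s_pos != s_neg:
            return 1 if s_pos > s_neg else -1
    return 0                                 # genuine draw

def beta_free_classify(v_neg, v_pos, N):
    """Class choice by comparing the β-independent scores s_c = -Σ N^(-v)."""
    s = {c: -sum(Fraction(1, N ** v) for v in vals)
         for c, vals in ((-1, v_neg), (1, v_pos))}
    if s[1] > s[-1]:
        return 1
    return -1 if s[-1] > s[1] else 0

random.seed(0)
N = 4
for _ in range(1000):
    v_neg = [random.randint(0, 5) for _ in range(N)]
    v_pos = [random.randint(0, 5) for _ in range(N)]
    assert bleach_classify(v_neg, v_pos) == beta_free_classify(v_neg, v_pos, N)
```

Exact rational arithmetic is used so that the score comparison is free of floating-point error.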

The use of $\beta$-independent scores makes the function $B_{N,n}(\cdot\,;M_{-1},M_1)$ equivalent to $W_{N,n}(\cdot\,;\Lambda_{-1},\Lambda_1)$ (compare definition 6 and corollary 2 to equations 3.4 and 3.5). Consequently, a system of linear equations with characteristics similar to those of equation 4.5 can be derived from the relation between the $\beta$-independent scores, as noted at the beginning of this section.

Lemma 3.

The VC dimension, $d_{VC}$, of a model $\mathcal{B}_{N,n}$, as defined in section 3.3.1, is bounded above by $N(2^n-1)+1$.

Proof.

Analogous to proposition 1, the formulas of the $\beta$-independent score measures $s_{-1}(\cdot)$ and $s_1(\cdot)$ lead to a system of linear equations with the same coefficient matrix. The column representing the right-hand side of the equations is likewise linearly independent of those of the coefficient matrix. Therefore, such a system lacks a solution if it has at least $N(2^n-1)+2$ equations. Then $\mathcal{B}_{N,n}$ does not shatter any set of feature matrices $Z$ with $\mathrm{card}(Z)=N(2^n-1)+2$.

$□$

Theorem 2.

The VC dimension, $d_{VC}$, of a model $\mathcal{B}_{N,n}$, as defined in section 3.3.1, is given by $N(2^n-1)+1$.

Proof.

The VC dimension of $\mathcal{B}_{N,n}$ is bounded above and below by $N(2^n-1)+1$. Then it is exactly $N(2^n-1)+1$.

$□$

## 6  Discussion

Even an exact VC dimension for WiSARD does not provide a good rule of thumb for defining $N$ and $n$. Empirical results from applications of the WiSARD $n$-tuple classifier strongly suggest that the data set size could be used to further the analysis of WiSARD learning capacity, even though the model does not require regularization. Another related topic is the sparsity of the data set elements.

Calculations of the VC dimensions of the WiSARD and bleaching $n$-tuple classifiers show that their values are the same, indicating that the bleaching technique mitigates saturation while doing little to no harm to the network generalization capacity. Their values, however, are higher than what would be expected from previous experimental work. It is worth noting that Bradshaw's (1996) conjecture was very close to the exact value found in this work. For instance, the WiSARD network employed in Carneiro et al. (2017) has discriminators with more than 95 nodes, each one addressed by an $n$-tuple of length 88. The VC dimension of this learning machine is approximately $3\cdot10^{28}$, which is far greater than the size of the largest data set it was subjected to (fewer than $1.3\cdot10^6$ elements). The determination of the VC dimension of the basic and bleaching $n$-tuple classifiers was performed for dichotomous classifications. Despite the polychotomous nature of part-of-speech tagging, the analysis given here provides insight into the learning capacity of those learning machines in a practical application. Moreover, multiclass classification is classically executed in the literature through algorithms that build on dichotomous classifiers, such as multiclass support vector machines (SVMs) (Hsu & Lin, 2002).
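The magnitude quoted for that tagger follows directly from the $N(2^n-1)+1$ formula, as a quick arithmetic check shows (the figures are those cited above):

```python
# VC dimension of a discriminator pair with N = 95 nodes, each addressed
# by an n = 88-bit tuple, per the formula N * (2**n - 1) + 1.
N, n = 95, 88
d_vc = N * (2 ** n - 1) + 1

assert 2.9e28 < d_vc < 3.0e28    # approximately 3·10^28
assert d_vc > 1.3e6              # dwarfs the largest data set cited
```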

The VC dimensions of both models suggest a similarity between them and other learning machines whose input is transformed into a feature vector, which is affected by the model effective parameters to produce the system output. For instance, an extreme learning machine (ELM; Huang, 2003; Huang, Wang, & Lan, 2011) has pseudo-randomly generated weights that connect its input layer to its hidden one. These weights are analogous to the addressing function introduced in section 3.1.1. In addition, the content of the hidden-layer neurons is affected by a weighted sum, which produces the ELM output. A similar process happens in WiSARD and bleaching $n$-tuple classifiers, where the feature vector generated by the addressing function is subjected to the system memory matrices in order to yield the network response. A comparison between the VC dimensions of those models also indicates a similarity between them. The VC dimension of ELMs is determined by the number of weights connecting its hidden to its output layer (Liu, Gao, & Li, 2012). In a similar fashion, the number of addressable positions of an $n$-tuple classifier is tightly related to its VC dimension.

Despite their similarities, the weightless recognition methods differ in some points from their weighted counterparts. The latter rely on optimization methods and update all their weights in a training procedure, while the former change only the memory matrix elements that are addressed when the network is being trained. In this way, WiSARD and bleaching $n$-tuple classifiers can generalize quite well even if the amount of training data is far smaller than their VC dimensions. The same does not happen with ELMs (Liu et al., 2012). Actually, both weightless learning systems seem to have fewer effective parameters than their VC dimensions imply: they have a pay-only-for-what-you-use policy, where the only positions to be updated are the ones revealed as relevant during the training procedure.

One might get a glimpse of the impact of this policy by calculating the VC dimension of a particular subset of WiSARD and bleaching $N$-node $n$-tuple classifying models. If a WiSARD network is exposed to a training data set $D$ composed of $\omega_1$ observations associated with class 1 and $\omega_{-1}$ with class $-1$, then every node $d_{1,i}$ of discriminator $D_1$ should have at most $\omega_1$ positions storing 1. Analogously, every node $d_{-1,i}$ of $D_{-1}$ should have at most $\omega_{-1}$ positions storing 1. This way, by appending the inequalities
$\sum_{j=1}^{2^n}m_{c,i,j}\le\omega_c,\quad\forall i\in\{1,\ldots,N\}\text{ and } c\in\{-1,1\},$
6.1
to the system introduced in proposition 1, one could obtain a capacity measure lower than the calculated VC dimension, which would better depict how that policy affects WiSARD generalization capacity. The bleaching $n$-tuple classifier has a very characteristic property: if a learning machine of this sort is exposed to that same training data set, then the sum of the contents of every node $d_{c,i}$ of a discriminator $D_c$ is always $\omega_c$. To attain the desired measure, it would be necessary to append the equations
$\sum_{j=1}^{2^n}m_{c,i,j}=\omega_c,\quad\forall i\in\{1,\ldots,N\}\text{ and } c\in\{-1,1\},$
6.2
to its equivalent linear system. But because the system relies on the modified memory matrices $\Lambda_{-1}$ and $\Lambda_1$, and not on $M_{-1}$ and $M_1$, appending those equations would introduce a nonlinearity that would toughen its resolution. That task becomes easier if those equations are replaced by
$-2^n+1-N^{-\omega_c}\le\sum_{j=1}^{2^n}\lambda_{c,i,j}\le-2^nN^{-2^{-n}\omega_c},\quad\forall i\in\{1,\ldots,N\}\text{ and } c\in\{-1,1\}.$
6.3
However, this might pose another problem: it is quite improbable to find an exact solution, so this measure would have to be expressed by lower and upper bounds.
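A possible justification for the bounds in inequality 6.3, assuming the constraint $\sum_j m_{c,i,j}=\omega_c$ of equation 6.2 and the convexity of $m\mapsto N^{-m}$, is sketched below; this derivation is ours and not part of the original argument.

```latex
% For fixed i and c, bound  \sum_j N^{-m_{c,i,j}}  subject to
% \sum_j m_{c,i,j} = \omega_c  over 2^n nonnegative entries.
% Since m \mapsto N^{-m} is convex, the sum attains its maximum at a vertex
% of the simplex (all mass on a single position) and its minimum, by
% Jensen's inequality, at the barycenter (mass evenly spread):
\begin{equation*}
  2^n N^{-2^{-n}\omega_c}
  \;\le\; \sum_{j=1}^{2^n} N^{-m_{c,i,j}}
  \;\le\; (2^n - 1) + N^{-\omega_c} .
\end{equation*}
% Negating and recalling \lambda_{c,i,j} = -N^{-m_{c,i,j}} yields 6.3.
```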

## 7  Conclusion

This letter provides the exact values of the VC dimension of WiSARD and of bleaching $n$-tuple classifiers. Because they are identical, we conclude that the bleaching technique enhances the WiSARD model while doing little to no harm to its generalization capability. This work also draws a parallel with other pattern recognition machines, showing similarities and differences between these and the weightless systems. We argue as well on how the pay-only-for-what-you-use policy of the $n$-tuple classifiers could be interpreted as an advantage, for it would imply that the model generalization capability should grow according to the data set employed for training.

Other selection criteria can be used for the dynamic threshold $β$ of the bleaching $n$-tuple classifier, implying the possibility of different tie-breaking policies other than the one introduced in section 2.2.2 and formally defined in section 3.3.1. The study of the generalization capacity of those other policies and the investigation on which one is best suited for a given task is an almost direct consequence of this research and is proposed as future work. A deeper analysis of the impact of the training data set size on the generalization capacity of WiSARD and bleaching $n$-tuple classifiers is also suggested as a further improvement to this work. Because of their multidiscriminator natures, the basic and bleaching recognition machines could benefit from studies on their generalization capacities employing multicategorical extensions to the VC dimension.

## Note

1

Equation 4.2 represents matrices $M_{-1}$ and $M_1$ that define a WiSARD model $W_{N,n}(\cdot\,;M_{-1},M_1)\in\mathcal{W}_{N,n}$ realizing one particular dichotomy of $X$. It resembles the basic $n$-tuple method training rule, except that in equation 4.2 not all addressed memory positions are set to 1, only those whose addressing tuple is not $0^n$.

## Acknowledgments

The authors would like to thank CNPq, FAPERJ, and FINEP Brazilian research agencies. This study was also financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. The authors would also like to thank Nicholas P. Bradshaw, Ph.D., for his contribution and support.

## References

Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012). *Learning from data*. AMLBook.

Aleksander, I., Thomas, W. V., & Bowden, P. A. (1984). WISARD: A radical step forward in image recognition. *Sensor Review*, 4, 120–124.

Azhar, H. B., & Dimond, K. (2004). A stochastic search algorithm to optimize an N-tuple classifier by selecting its inputs. In A. Campilho & M. Kamel (Eds.), *Lecture Notes in Computer Science: Vol. 3211. Image analysis and recognition* (pp. 556–563). Berlin: Springer.

Bledsoe, W. W., & Bisson, C. L. (1962). Improved memory matrices for the $n$-tuple pattern recognition method. *IRE Transactions on Electronic Computers*, EC-11(3), 414–415.

Bledsoe, W. W., & Browning, I. (1959). Pattern recognition and reading by machine. In *Papers of the Eastern Joint IRE-AIEE-ACM Computer Conference* (pp. 225–232). New York: ACM.

Bradshaw, N. P. (1996). *An analysis of learning in weightless neural systems*. Ph.D. diss., Imperial College London.

Bradshaw, N. P. (1997). The effective VC dimension of the $n$-tuple classifier. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), *Artificial Neural Networks—ICANN'97* (pp. 511–516). Lecture Notes in Computer Science, Vol. 1327. Berlin/Heidelberg: Springer.

Bradshaw, N. P., & Aleksander, I. (1996). Improving the generalization of the $n$-tuple classifier using the effective VC dimension. *Electronics Letters*, 32(20), 1904–1905.

Cardoso, D. O., Carvalho, D. S., Alves, D. S. F., de Souza, D. F. P., Carneiro, H. C. C., Pedreira, C. E., Lima, P. M. V., & França, F. M. G. (2016). Financial credit analysis via a clustering weightless neural classifier. *Neurocomputing*, 183, 70–78.

Carneiro, H. C. C., França, F. M. G., & Lima, P. M. V. (2015). Multilingual part-of-speech tagging with weightless neural networks. *Neural Networks*, 66, 11–21.

Carneiro, H. C. C., Pedreira, C. E., França, F. M. G., & Lima, P. M. V. (2017). A universal multilingual weightless neural network tagger via quantitative linguistics. *Neural Networks*, 91, 85–101.

Carvalho, D. S., Carneiro, H. C. C., França, F. M. G., & Lima, P. M. V. (2013). B-bleaching: Agile overtraining avoidance in the WiSARD weightless neural classifier. In M. Verleysen (Ed.), *Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning* (pp. 515–520). Louvain-la-Neuve, Belgium: i6doc.

de Souza, D. F. P., França, F. M. G., & Lima, P. M. V. (2014). Spatio-temporal pattern classification with Kernel Canvas and WiSARD. In *Proceedings of the 2014 Brazilian Conference on Intelligent Systems*. Piscataway, NJ: IEEE.

de Souza, D. F. P., França, F. M. G., & Lima, P. M. V. (2015). Real-time music tracking based on a weightless neural network. In *Proceedings of the 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems*. Piscataway, NJ: IEEE.

Gregorio, M. D. (1997). *On the reversibility of multi-discriminator systems* (Technical Report 125/97). Pozzuoli, Italy: Istituto di Cibernetica.

Grieco, B. P. A., Lima, P. M. V., Gregorio, M. D., & França, F. M. G. (2010). Producing pattern examples from "mental" images.
Neurocomputing
,
73
(
7–9
),
1057
1064
.
Hsu
,
C.-W.
, &
Lin
,
C.-J.
(
2002
).
A comparison of methods for multi-class support vector machines
.
IEEE Transactions on Neural Networks
,
13
(
2
),
415
425
.
Huang
,
G.-B.
(
2003
).
Learning capability and storage capacity of two-hidden-layer feedforward networks
.
IEEE Transactions on Neural Networks
,
14
(
2
),
274
281
.
Huang
,
G.-B.
,
Wang
,
D. H.
, &
Lan
,
Y.
(
2011
).
Extreme learning machines: A survey
.
International Journal of Machine Learning and Cybernetics
,
2
(
2
),
107
122
.
Jørgensen
,
T. M.
, &
Linneberg
,
C.
(
1999
).
Theoretical analysis and improved decision criteria for the $n$-tuple classifier
.
IEEE Transactions on Pattern Analysis and Machine Intelligence
,
21
(
4
),
336
347
.
Linneberg
,
C.
, &
Jørgensen
,
T. M.
(
1999
).
Cross-validation techniques for $n$-tuple-based neural networks
. In
T.
,
M. L.
, &
J. M.
Kinser
(Eds.),
Proceedings of the Ninth Workshop on Virtual Intelligence/Dynamic Neural Networks
.
Bellingham, WA
:
International Society of Optical Engineering
.
Liu
,
X.
,
Gao
,
C.
, &
Li
,
P.
(
2012
).
A comparative analysis of support vector machines and extreme learning machines
.
Neural Networks
,
33
,
58
66
.
Meyer
,
C. D.
(
2000
).
Matrix analysis and applied linear algebra
.
:
Society for Industrial and Applied Mathematics
.
Mitchell
,
R. J.
,
Bishop
,
J. M.
, &
Minchinton
,
P. R.
(
1996
).
Optimising memory usage in $n$-tuple neural networks
.
Mathematics and Computers in Simulation
,
40
(
5–6
),
549
563
.
Nascimento
,
D. N.
,
de Carvalho
,
R. L.
,
Mora-Camino
,
F.
,
Lima
,
P. M. V.
, &
França
,
F. M. G.
(
2015
).
A WiSARD-based multi-term memory framework for online tracking of objects
. In
M.
Verleysen
(Ed.),
Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
(pp.
19
24
).
Louvain-la-Neuve, Belgium
:
i6doc
.
Rohwer
,
R.
, &
Morciniec
,
M.
(
1996
).
A theoretical and experimental account of $n$-tuple classifier performance
.
Neural Computation
,
8
(
3
),
629
642
.
Rohwer
,
R.
, &
Morciniec
,
M.
(
1997
). The generalisation cost of RAMnets. In
M.
Mozer
,
M.
Jordan
, &
T.
Petsche
(Eds.),
Advances in neural information processing systems, 9
(pp.
253
259
).
Cambridge, MA
:
MIT Press
.
Rohwer
,
R.
, &
Morciniec
,
M.
(
1998
).
The theoretical and experimental status of the $n$-tuple classifier
.
Neural Networks
,
11
(
1
),
1
14
.
Roy
,
R. J.
, &
Sherman
,
J.
(
1967
).
Two viewpoints of $k$-tuple pattern recognition
.
IEEE Transactions on Systems Science and Cybernetics
,
3
(
2
),
117
120
.
Steck
,
G. P.
(
1962
).
Stochastic model for the Browning-Bledsoe pattern recognition scheme
.
IRE Transactions on Electronic Computers
,
EC-11
(
2
),
274
282
.
Stonham
,
T. J.
(
1977
).
Improved Hamming-distance analysis for digital learning networks
.
Electronics Letters
,
13
(
6
), 155.
Tarling
,
R.
, &
Rohwer
,
R.
(
1993
).
Efficient use of training data in the $n$-tuple recognition method
.
Electronics Letters
,
29
(
24
), 2093.
Ullmann
,
J. R.
(
1969
).
Experiments with the $n$-tuple method of pattern recognition
.
IEEE Trans. Comput.
,
18
(
12
),
1135
1137
.
Ullmann
,
J. R.
(
1971
).
Reduction of the storage requirements of Bledsoe and Browning's n-tuple method of pattern recognition
.
Pattern Recognition
,
3
(
3
),
297
306
.
Vapnik
,
V. N.
(
1998
).
Statistical learning theory
.
New York
:
Wiley
.
Vapnik
,
V. N.
, &
Chervonenkis
,
A. Y.
(
1971
).
On the uniform convergence of relative frequencies of events to their probabilities
.
Theory of Probability and Its Applications
,
16
(
2
),
264
280
.
Wickert
,
I.
, &
França
,
F. M. G.
(
2001
). AUTOWISARD: Unsupervised modes for the WISARD. In
J.
Mira
&
A.
Prieto
(Eds.),
Connectionist models of neurons, learning processes, and artificial intelligence
(pp.
435
441
).
Berlin
:
Springer
.