Abstract

The Wilkie, Stonham, and Aleksander recognition device (WiSARD) n-tuple classifier is a multiclass weightless neural network capable of learning a given pattern in a single step. Its architecture is determined by the number of classes it should discriminate. A target class is represented by a structure called a discriminator, which is composed of N RAM nodes, each of them addressed by an n-tuple. Previous studies were carried out in order to mitigate an important problem of the WiSARD n-tuple classifier: having its RAM nodes saturated when trained by a large data set. Finding the VC dimension of the WiSARD n-tuple classifier was one of those studies. Although no exact value was found, tight bounds were discovered. Later, the bleaching technique was proposed as a means to avoid saturation. Recent empirical results with the bleaching extension showed that the WiSARD n-tuple classifier can achieve high accuracies with low variance in a great range of tasks. No theoretical study had previously been conducted on that extension. This work presents the exact VC dimension of the basic two-class WiSARD n-tuple classifier, which is linearly proportional to the number of RAM nodes belonging to a discriminator and exponentially proportional to the length of their addressing tuples, precisely $N(2^n - 1) + 1$. The exact VC dimension of the bleaching extension to the WiSARD n-tuple classifier, whose value is the same as that of the basic model, is also produced. Such a result confirms that the bleaching technique is indeed an enhancement to the basic WiSARD n-tuple classifier, as it does no harm to the generalization capability of the original paradigm.

1  Introduction

The Wilkie, Stonham, and Aleksander recognition device (WiSARD; Aleksander, Thomas, & Bowden, 1984) is a versatile weightless artificial neural network (WANN). Its origins date back to the n-tuple classifier, which was initially proposed by Bledsoe and Browning (1959) and then formally defined in Steck (1962). WiSARD has a modular architecture and trains unseen patterns in a single pass, making it an efficient learning machine. However, little attention was given to this model because it could not handle large loads of data. The bleaching technique (Grieco, Lima, Gregorio, & França, 2010), a recent enhancement developed for WiSARD, solved this issue. Applications built employing this technique indicated their ability to handle large loads of data (Carneiro, França, & Lima, 2015; Carneiro, Pedreira, França, & Lima, 2017; Cardoso et al., 2016). When compared with state-of-the-art systems, these applications showed competitive results, sometimes outperforming them.

Theoretical research and architecture improvements were performed on the basic WiSARD n-tuple classifier (Bledsoe & Bisson, 1962; Roy & Sherman, 1967; Ullmann, 1969, 1971; Stonham, 1977; Tarling & Rohwer, 1993; Bradshaw & Aleksander, 1996; Bradshaw, 1996, 1997; Mitchell, Bishop, & Minchinton, 1996; Rohwer & Morciniec, 1996, 1997, 1998; Gregorio, 1997; Jørgensen & Linneberg, 1999; Linneberg & Jørgensen, 1999; Wickert & França, 2001; Azhar & Dimond, 2004; Grieco et al., 2010; Carvalho, Carneiro, França, & Lima, 2013). One of the main issues raised by those studies was how to find a means to mitigate memory saturation. As WiSARD was fed more data (especially noisy data), it gradually filled every position of its memory nodes, and its capability of discriminating patterns deteriorated.

Among the studies intended to lessen the effects of saturation, it is worth mentioning the work of Bradshaw (1996), where lower and upper bounds for the VC dimension of the WiSARD n-tuple classifier were calculated. No exact value for this measure was found. Bradshaw's (1996) research provided fertile ground for further analyses in this field. Unfortunately, the solution to the saturation problem proposed there was not that successful, as it relied on a convergence process with no guarantee that convergence would actually occur.

Grieco et al. (2010) devised a way of mitigating saturation, called the bleaching technique. It allowed the network to be exposed to large loads of data (noisy or noise free) while keeping the reliability of its pattern discrimination capability. This improvement considerably enhanced both the accuracy and the precision of WiSARD n-tuple classifier applications, with no performance penalty in their training procedures and only a small one in the classification step (de Souza, França, & Lima, 2014, 2015; Carneiro et al., 2015; Nascimento et al., 2015; Cardoso et al., 2016).

That technique granted WiSARD competitiveness with trending learning systems by achieving high accuracy with low variance in experimental work. However, there was no theoretical background for bleaching up to this point. This letter aims to provide a mathematical foundation for that extension of the weightless classifier by analyzing the generalization capacity of both basic and bleaching recognition schemes.

The letter is structured as follows. Section 2 introduces some basic definitions of the VC theory and presents WiSARD and bleaching n-tuple classifiers. A formal mathematical definition of both models is provided in section 3. Their VC dimensions are calculated in sections 4 and 5. The results are then discussed in section 6, where some conclusions are drawn and comparisons with weighted learning schemes are made. Section 7 summarizes the work presented and proposes future work.

2  Conceptual Background

This section presents some VC theory basic concepts and a brief introduction to both architectures of the WiSARD model. They provide a cornerstone for the calculations presented in sections 4 and 5.

2.1  VC Dimension Definitions

To measure the capacity of a learning system like the WiSARD network, definitions are required for some basic VC theory concepts. For instance, the VC dimension is intrinsically related to the notion of set shattering, which in turn depends on simpler ideas, like dichotomies.

Let $X$ be a set of data points to be classified and $\Theta$ a set of parameterizations of a learning machine $L$. $L$ classifies data points $x \in X$ with labels $y \in \{-1,1\}$ according to a parameter vector $\theta \in \Theta$. In other words, a learning system $L$ can be seen as a function $g: X \times \Theta \to \{-1,1\}$.

Definition 1.
Let $x_1, x_2, \ldots, x_\ell \in X$. The set of dichotomies $L$ can realize on $X$ is defined as
$$\Delta_\Theta(X) = \{(g(x_1;\theta), g(x_2;\theta), \ldots, g(x_\ell;\theta)) : \theta \in \Theta\}.$$ (2.1)

That is, the dichotomies are the distinct forms a learning machine can split a set of points into two, each assigned to a different class. The number of dichotomies a learning system L realizes on a set X plays an important role in VC dimension study. The growth function and the concept of shattering derive from that quantity.

Definition 2.

A set $X$ is said to be shattered by a learning system $L$ if $\mathrm{card}(\Delta_\Theta(X)) = 2^{\mathrm{card}(X)}$.

In other words, if a learning machine $L$ shatters $X$, then for every $X' \subseteq X$, there is at least one parameterization $\theta \in \Theta$ of $L$, such that every $x \in X'$ is classified as 1 and every other element of $X$ is classified as $-1$. $X$ is shattered by $L$ if $L$ can realize all possible dichotomies of $X$.

Definition 3.
The growth function $\Pi_\Theta: \mathbb{N} \to \mathbb{N}$ is defined as the maximum number of dichotomies $L$ can implement on samples of size $\ell$; that is, the growth function is defined as
$$\Pi_\Theta(\ell) = \max_{X : \mathrm{card}(X) = \ell} \mathrm{card}(\Delta_\Theta(X)).$$ (2.2)

Definitions 2 and 3 imply that the growth function $\Pi_\Theta(\ell)$ is $2^\ell$ if there is at least one set $X$, $\mathrm{card}(X) = \ell$, that is shattered by $L$. Note that if $L$ shatters a set $X$, it also shatters every subset $X' \subseteq X$. So if $\Pi_\Theta(\ell) = 2^\ell$, then $\Pi_\Theta(k) = 2^k$ for every $k \leq \ell$. However, the growth function may have a nonexponential nature if $\ell$ is large enough that there is no set of that cardinality $L$ can shatter. The largest value of $\ell$ for which the growth function $\Pi_\Theta(\ell)$ is $2^\ell$ is called the VC dimension of learning machine $L$ (Vapnik & Chervonenkis, 1971; Vapnik, 1998; Abu-Mostafa, Magdon-Ismail, & Lin, 2012).

Definition 4.

The VC dimension $d_{VC}$ of learning machine $L$ is defined as the cardinality of the largest set $X$ that is shattered by $L$. If for every $\ell$, $L$ shatters a set of such cardinality, then $d_{VC}(L)$ is infinite.
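To make definitions 1 to 4 concrete, the following minimal sketch (an illustration added here, not taken from the letter) brute-forces the dichotomies that a toy family of classifiers, 1-D threshold rules of both orientations, realizes on small point sets:

    def dichotomies(points, classifiers):
        """Distinct label vectors the family realizes on the given points."""
        return {tuple(clf(x) for x in points) for clf in classifiers}

    def shattered(points, classifiers):
        return len(dichotomies(points, classifiers)) == 2 ** len(points)

    # Toy family: sign(x - t) and sign(t - x) for a few thresholds t.
    thresholds = [t + 0.5 for t in range(4)]
    classifiers = [lambda x, t=t, s=s: s * (1 if x > t else -1)
                   for t in thresholds for s in (-1, 1)]

    print(shattered([1.0, 2.0], classifiers))        # True: two points are shattered
    print(shattered([1.0, 2.0, 3.0], classifiers))   # False: (+,-,+) is unrealizable, so d_VC = 2 here

Sections 4 and 5 replace this brute-force counting with constructive and linear-algebraic arguments, since enumeration is infeasible for WiSARD-sized parameter spaces.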

2.2  The WiSARD Model

WiSARD is an n-tuple classifier with two key features: the splitting of its input into various n-tuples and the storage of its learned knowledge in memory nodes, which are addressed by those n-tuples. This section contemplates the basic WiSARD model and its enhanced version produced by the bleaching technique.

The learning of new patterns is made through simple changes in the content of the system's memory nodes. There is no need for convergence-dependent processes or complex calculations. In other words, n-tuple classifiers have efficient training procedures and so are adequate for empirical studies where there is a large number of hyperparameters to be tuned and massive loads of data to be learned.

2.2.1  The Basic WiSARD n-Tuple Classifier

The n-tuple method was initially proposed by Bledsoe and Browning (1959) and formally defined in Steck (1962). Its most widely known implementation, WiSARD, was made in 1982 (Aleksander et al., 1984). It showed that it was actually possible to assemble the n-tuple classifier.

The WiSARD n-tuple classifier is considered a weightless neural network for the resemblance of its memory nodes to actual neurons. The weightless paradigm is characterized by the storage of the network learned knowledge inside its neuronal nodes, whereas weighted models store it in synaptic connections. WiSARD is also known as a RAM-based neural network because the implementation of its memory nodes was effected by actual random access memories (RAMs). Due to this, the network memory nodes are also known as RAM nodes or simply RAMs.

System architecture. The basic WiSARD n-tuple classifier is a Boolean neural network that receives bit strings as inputs and produces similarity scores associated with target classes. This is made through the use of structures called discriminators. Each discriminator is associated with a single class.

The discriminators have a simple structure: a pseudo-random mapping that shuffles the bits of the network input and splits them into tuples, which address memory nodes. Figure 1a portrays a picture mapped into the series of tuples 10:01:10:10:01:10. An n-tuple classifier is defined according to its tuple length n and the number of memory nodes in each discriminator N.

Figure 1: Basic WiSARD classifier.

In a canonical n-tuple classifier, the same pseudo-random mapping can be used for every discriminator. This way, the same N n-tuples address all memory nodes of each discriminator. Figure 1b depicts a canonical n-tuple classifier, where the same series of tuples produced in Figure 1a addresses the memory nodes of both discriminators, $D_{-1}$ and $D_1$.

The VC dimension is a measure used for systems capable of disambiguating between two classes—functions that partition a set of points into two subsets, each one assigned to a particular class. This letter explores solely n-tuple classifiers whose set of classes C has only two of them, denoted -1 and 1.

Training procedure. Training starts by the network setup, where all memory positions are initialized with 0. Every time a training observation and its corresponding class are sent to the network, the model selects the discriminator of that class and marks every memory position addressed by the observation. Positions marked this way have their stored values set to 1. At the end of the training procedure, every memory position should store a 1 if it was addressed at least once, and 0 otherwise.

Recognition procedure. If a new input pattern is presented to the network, every discriminator responds with a similarity score, representing how close this new pattern is to the learned knowledge stored in the discriminator. The similarity score is defined as the number of memory nodes whose addressed positions store 1, that is, those positions that were accessed at least once during the training step. This score is represented by summation devices Σ-1 and Σ1 in Figure 1b.

The network opts for the discriminator whose score is the highest and assigns its class to the input pattern. If there are at least two discriminators that share the highest response, then the network outputs that a tie happened. In the latter case, the classification system may apply some policy to choose one class or use a draw procedure.

The discriminator similarity score is its number of addressed memory positions that store 1. This implies that the number of nodes N (or the size of the addressing tuples n) plays an important role in defining the generalization capability of the WiSARD classifier. High values of N (or small ones of n) reduce the chance that the similarity score will get too low if a newly presented pattern differs by only a few bits from what the network learned. For example, given a network that is presented with a 30-bit input pattern and only one of its RAMs does not recognize it, then (1) if there are 30 RAMs addressed by 1-bit tuples, the similarity score of the classifier will be 29/30; (2) if there are 15 RAMs addressed by 2-bit tuples, the similarity lowers to 14/15; and (3) if there is only one RAM addressed by the whole network input, the similarity will be 0. So the greater N is, the more the WiSARD system generalizes.
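The following minimal sketch (an illustrative implementation with hypothetical helper names and inputs, not the reference implementation of Aleksander et al., 1984) condenses the setup, training, and recognition procedures just described for a two-class basic WiSARD:

    import numpy as np

    rng = np.random.default_rng(0)
    N, n = 6, 2                          # RAM nodes per discriminator, tuple length
    pi = rng.permutation(N * n)          # pseudo-random mapping, fixed at setup

    def to_addresses(x):
        """Shuffle the N*n input bits and split them into N n-bit addresses."""
        bits = np.asarray(x)[pi].reshape(N, n)
        return bits @ (2 ** np.arange(n - 1, -1, -1))

    memory = {c: np.zeros((N, 2 ** n), dtype=np.uint8) for c in (-1, 1)}

    def train(x, y):
        memory[y][np.arange(N), to_addresses(x)] = 1      # mark addressed positions

    def classify(x):
        addr = to_addresses(x)
        scores = {c: int(memory[c][np.arange(N), addr].sum()) for c in (-1, 1)}
        return 1 if scores[1] >= scores[-1] else -1       # ties resolved as class 1

    # Hypothetical 12-bit observations, one per class.
    train([1,0, 0,1, 1,0, 1,0, 0,1, 1,0], 1)
    train([0,1, 1,0, 0,1, 0,1, 1,0, 0,1], -1)
    print(classify([1,0, 0,1, 1,0, 1,0, 0,1, 1,0]))       # 1

The tie policy (defaulting to class 1) is one of the arbitrary draw policies mentioned above; section 4 adopts the same convention.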

2.2.2  The WiSARD Extended with Bleaching

Basic n-tuple classifiers tend to experience misclassification problems when trained with a large amount of data. This occurs because many memory positions end up being set to 1 even if they were accessed only once, due to a slightly noisy observation. This problem is known as saturation and has been considered one of the major disadvantages of the basic n-tuple classifier since its conception (Bledsoe & Browning, 1959).

Some attempts were made to overcome this drawback (Bledsoe & Bisson, 1962; Ullmann, 1971; Tarling & Rohwer, 1993; Bradshaw, 1996; Azhar & Dimond, 2004). A solution, the bleaching technique, was devised in 2010 (Grieco et al., 2010). Its contribution as a major upgrade to the basic architecture allowed the production of accurate and precise applications (de Souza et al., 2014; Carneiro et al., 2015, 2017; Nascimento et al., 2015; Cardoso et al., 2016).

Training procedure. The bleaching n-tuple classifier training procedure differs from the basic one by the storage of any integer value in the memory positions instead of only 0 and 1. Every element of the memory nodes initializes as 0 during the network setup, as in the binary process. When a position is addressed at training, its stored value is incremented by 1, whereas it would only be set to 1 in the standard procedure, independent of its original value. Summarizing, at the end of the training step, each memory position stores how many times it was accessed.

Recognition procedure. The integer values stored in the memory nodes give the network a better representation of the trained data. However, another structure must be added to the basic architecture for proper classification. The bleaching technique introduces a threshold β (known as the bleaching threshold). It is responsible for defining which memory positions should contribute to the similarity score of the class.

Threshold β is a nonnegative integer number. The canonical bleaching n-tuple classifier employs a single threshold for the entire network. If a memory position is addressed during the classification step, it should be further subjected to β. A node fires 1 if its addressed position stores a value greater than β; otherwise, it fires 0. Therefore, the similarity score of a discriminator of a fixed-threshold bleaching n-tuple classifier is the number of its accessed positions whose stored value is greater than β. Figure 2 portrays the classification of an input pattern by a bleaching n-tuple classifier with β=2.

Figure 2: Bleaching classifier.

In practical applications, however, a dynamic threshold is employed instead. The threshold is initialized as β=0 at the network setup. Thus, at first glance, the dynamic-threshold bleaching n-tuple classifier works like the basic classifier, where the score of a discriminator is defined as the number of accessed positions whose stored value is greater than 0. If a selection criterion for a given class is not satisfied at the classification of a pattern, β is incremented by 1 and a new score is calculated. The iterative procedure continues until that criterion is satisfied. The selection criterion of a two-class bleaching n-tuple classifier consists of having a single discriminator whose score is greater than that of its opposing class. The iterative procedure of incrementing β continues until there is no longer a tie between the discriminator scores.
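A minimal sketch of the bleaching variant follows (again an illustration with hypothetical helper names and inputs, not the implementation of Grieco et al., 2010): memory positions count how many times they were accessed, and classification raises the threshold beta until the discriminator scores untie.

    import numpy as np

    rng = np.random.default_rng(0)
    N, n = 6, 2
    pi = rng.permutation(N * n)

    def to_addresses(x):
        bits = np.asarray(x)[pi].reshape(N, n)
        return bits @ (2 ** np.arange(n - 1, -1, -1))

    memory = {c: np.zeros((N, 2 ** n), dtype=np.int64) for c in (-1, 1)}

    def train(x, y):
        memory[y][np.arange(N), to_addresses(x)] += 1     # count accesses

    def classify(x):
        addr = to_addresses(x)
        stored = {c: memory[c][np.arange(N), addr] for c in (-1, 1)}
        beta = 0
        while True:
            s = {c: int((stored[c] > beta).sum()) for c in (-1, 1)}
            if s[1] != s[-1]:
                return 1 if s[1] > s[-1] else -1
            if s[1] == 0:                 # both scores exhausted: an unresolved draw
                return 1                  # assumed draw policy: default to class 1
            beta += 1

    train([1,0, 0,1, 1,0, 1,0, 0,1, 1,0], 1)
    train([1,0, 0,1, 1,0, 1,0, 0,1, 1,0], 1)
    train([1,0, 0,1, 1,0, 1,0, 0,1, 0,1], -1)
    print(classify([1,0, 0,1, 1,0, 1,0, 0,1, 1,0]))       # 1

With binary contents (every count capped at 1), this procedure reduces to the basic classifier of section 2.2.1, which is the observation behind corollary 1.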

3  Mathematical Formulation

The basic and bleaching n-tuple classifiers are defined according to N, their number of RAM nodes, and n, the address length of these nodes. Every notation involving those models is derived from these two parameters. This section is split into three parts: one concerns the procedure of encoding the input into n-tuples, and the other two are related to the models themselves. In each part, a mathematical notation is provided and an example is offered for clarity.

3.1  Input and Addresses

Both models treat a given input in an identical way, shuffling and splitting it into N addressing n-tuples. When this procedure is put in mathematical form, it is seen as a two-part process, in which an input vector $x_k \in \{0,1\}^{Nn}$ is first transformed into a tuple matrix $T_k \in \{0,1\}^{N \times n}$ and then into a feature matrix $Z_k \in \{0,1\}^{N \times 2^n}$.

3.1.1  Notation

Let $X$ be an ordered set of input data points, $X = (x_1, x_2, \ldots)$. An input $x_k \in X$ must be such that $x_k \in \{0,1\}^{Nn}$ for it to be read by the n-tuple classifier. Also, let $y$ be a generic vector of classes $y_k \in \{-1,1\}$ with as many elements as $X$, so that there is a specific correspondence between every $k$th input data point $x_k$ and the $k$th element of $y$.

Given a permutation $\pi$ of $Nn$ elements, $\pi: \{1,\ldots,Nn\} \to \{1,\ldots,Nn\}$, let $T_k \in \{0,1\}^{N \times n}$ be an $N \times n$ tuple matrix obtained from input data point $x_k$ through a bijective map, such that every element $t_{k,i,j}$ of $T_k$ is equal to the $\pi((i-1)n+j)$th element of $x_k$. This procedure characterizes the pseudo-random mapping of the WiSARD model, in which an input is shuffled and then reshaped as an ordered set of N n-digit binary addresses, here represented by an $N \times n$ binary matrix.

Let $Z$ be an ordered set of feature matrices, $Z = (Z_1, Z_2, \ldots)$. A feature matrix $Z_k \in \{0,1\}^{N \times 2^n}$ is obtained from tuple matrix $T_k \in \{0,1\}^{N \times n}$ via an addressing function $A: \{0,1\}^{N \times n} \to \{0,1\}^{N \times 2^n}$. $A$ is a row-wise function; it performs the same operation $a(\cdot)$ for every row $t_{k,i}$ of $T_k$, resulting in a corresponding row $z_{k,i}$ of $Z_k$. $a(\cdot)$ produces an indicator vector, such that every element $z_{k,i,j}$ of $z_{k,i}$ is defined according to
$$z_{k,i,j} = \begin{cases} 1, & \text{if } j = 1 + \sum_{m=1}^{n} t_{k,i,m} \cdot 2^{n-m}, \\ 0, & \text{otherwise}, \end{cases}$$ (3.1)
where $t_{k,i,m}$ is the $m$th element of n-tuple $t_{k,i}$. That is, each column $j$ of $Z_k$ is assigned to a particular n-tuple, and its cells $z_{k,i,j}$ are set to 1 if $t_{k,i}$ is the n-tuple associated with column $j$ and set to 0 otherwise.

It is worth noting that the resulting matrix $Z_k$ has a single element per row equal to 1 and the remaining ones equal to 0. This is a direct consequence of the fact that each n-tuple of $T_k$ addresses one and only one memory position.

For readability, throughout this letter, input data point $x_k$ and any of its derived matrices may appear as $t_{k,1}:t_{k,2}:\cdots:t_{k,N}$, where $t_{k,i}$ is the n-tuple represented by the $i$th row of tuple matrix $T_k$, which addresses memory nodes $d_{-1,i}$ and $d_{1,i}$. These n-tuples can be $0^n$, an all-zero n-tuple; $1^n$, an all-one one; or $\{0,1\}^n$, a generic binary one. The notation introduced here is provided in Table 1.

Table 1:
Input and Addressing Notation.
$X$: Ordered set of input data points
$x_k$: $k$th element of $X$
$y$: Vector of classes, one for each element of $X$
$y_k$: $k$th element of $y$; expected class of $x_k$
$\pi$: Permutation that characterizes the input mapping
$T_k$: $k$th element of $T$; tuple matrix associated with input $x_k$
$t_{k,1}:\cdots:t_{k,N}$: n-tuple representation of $x_k$
$t_{k,i}$: $i$th n-tuple of $x_k$, which addresses $d_{c,i}$; $i$th row of $T_k$
$Z$: Ordered set of feature matrices
$Z_k$: $k$th element of $Z$; feature matrix associated with input $x_k$
$z_{k,i}$: $i$th row of $Z_k$
$z_{k,i,j}$: $j$th element of $z_{k,i}$
$A$: Addressing function
$\{0,1\}^n$: Generic binary n-tuple
$0^n$: All-zero n-tuple
$1^n$: All-one n-tuple

3.1.2  Example

Figure 1a depicts the mapping of a 12-bit input (represented in a 4×3 grid) to six n-tuples of length 2. This mapping is done so that the given input can be introduced to an n-tuple classifier with N=6 memory nodes in each discriminator, which are addressed by n-tuples of length n=2, such as those in Figures 1b and 2.

The input in the retina of Figure 1a can be written in n-tuple representation as $x_k = 10{:}01{:}10{:}10{:}01{:}10$, which can also be displayed as a tuple matrix:
$$T_k = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}.$$ (3.2)
$T_k$, in turn, is transformed into a feature matrix,
$$Z_k = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},$$ (3.3)
through the use of addressing function $A$. By assigning the first column of $Z_k$ to address 00, the second to 01, the third to 10, and the last to 11, one can verify that every n-tuple 10 (the first, third, fourth, and last rows of $T_k$) is represented by matrix row 0010. The remaining rows of $Z_k$ represent n-tuple 01.
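A short sketch (assuming numpy and, for simplicity, an identity mapping $\pi$; both are illustrative choices made here) reproduces equations 3.2 and 3.3 from the tuple sequence of Figure 1a:

    import numpy as np

    N, n = 6, 2
    x_k = np.array([1,0, 0,1, 1,0, 1,0, 0,1, 1,0])   # 10:01:10:10:01:10
    pi = np.arange(N * n)                            # identity permutation here

    T_k = x_k[pi].reshape(N, n)                      # tuple matrix, equation 3.2
    addresses = T_k @ (2 ** np.arange(n - 1, -1, -1))

    Z_k = np.zeros((N, 2 ** n), dtype=int)           # addressing function A
    Z_k[np.arange(N), addresses] = 1                 # feature matrix, equation 3.3

    print(addresses)   # [2 1 2 2 1 2] -> columns 10, 01, 10, 10, 01, 10
    print(Z_k)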

3.2  The Basic WiSARD n-Tuple Classifier

The mathematical formulation of the basic WiSARD n-tuple classifier focuses on the representation of its discriminators and how to characterize the WiSARD model as a function that receives a feature matrix $Z_k$ and outputs a class $c \in \{-1,1\}$.

3.2.1  Notation

Let $W_{N,n}$ be the family of functions that denotes every two-class n-tuple classifier defined by parameters N and n (W stands for WiSARD). Also let $W_{N,n}(\cdot\,; M_{-1}, M_1) \in W_{N,n}$ be a function representing an n-tuple classifier instance defined as above, whose learned knowledge on any class $c$ is characterized by memory matrix $M_c \in \{0,1\}^{N \times 2^n}$. Every network instance $W_{N,n}(\cdot\,; M_{-1}, M_1)$ has two discriminators, denoted $D_{-1}$ and $D_1$. They are, respectively, represented by $M_{-1}$ and $M_1$, the memory matrices that store their learned knowledge. $M_c$ is defined as $M_c = [m_{c,i,j}]_{N \times 2^n}$, where $m_{c,i,j} \in \{0,1\}$.

For any feature matrix $Z_k$ mapped from an input $x_k \in X$, $W_{N,n}(Z_k; M_{-1}, M_1)$ produces a score vector $s(Z_k) = [s_{-1}(Z_k), s_1(Z_k)]^T$, where $s_c(Z_k) \in \{0,1,\ldots,N\}$ is the score of $D_c$, defined as
$$s_c(Z_k) = \sum_{i=1}^{N}\sum_{j=1}^{2^n} m_{c,i,j}\, z_{k,i,j},$$ (3.4)
where $z_{k,i,j}$ is the element at the $i$th row and $j$th column of $Z_k$. Given the score vector $s$, function $W_{N,n}(Z_k; M_{-1}, M_1)$ returns the corresponding class as
$$W_{N,n}(Z_k; M_{-1}, M_1) = \arg\max_{c \in \{-1,1\}} s_c(Z_k).$$ (3.5)

The notation introduced here is condensed in Table 2.

Table 2:
Notation for the Basic n-Tuple Classifier.
$N$: Number of nodes
$n$: Addressing tuple length
$W_{N,n}$: $N$-node n-tuple classifying model
$c$: Generic class ($-1$ or $1$)
$M_{-1}$: Memory matrix of class $-1$
$M_1$: Memory matrix of class $1$
$M_c$: Memory matrix of class $c$
$m_{c,i,j}$: Element at $i$th row and $j$th column of $M_c$
$W_{N,n}(\cdot\,; M_{-1}, M_1)$: $N$-node n-tuple classifier with memory matrices $M_{-1}$ and $M_1$
$D_{-1}$: Discriminator for class $-1$
$D_1$: Discriminator for class $1$
$D_c$: Discriminator for class $c$
$d_{c,i}$: $i$th memory node of $D_c$
$s_{-1}(Z_k)$: Score of $D_{-1}$ for $Z_k$
$s_1(Z_k)$: Score of $D_1$ for $Z_k$
$s(Z_k)$: Score vector $s(Z_k) = [s_{-1}(Z_k), s_1(Z_k)]^T$

3.2.2  Example

The WiSARD of Figure 1b is characterized by its number of RAM nodes, $N = 6$; their address length, $n = 2$; and the contents of its two discriminators, $D_{-1}$ and $D_1$, which are mathematically represented by memory matrices
$$M_{-1} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$ (3.6)
and
$$M_1 = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},$$ (3.7)
respectively. Each row of these matrices corresponds to the content of its respective RAM node. For instance, the first row of $M_1$ is 0010, which is exactly the same sequence of values in the memory positions of $d_{1,1}$.

The n-tuple classifier of Figure 1b produces scores $s_{-1}(Z_k) = 1$ and $s_1(Z_k) = 6$, which are the outputs of $D_{-1}$ and $D_1$. Because the chosen class is the one whose score is the highest, that n-tuple classifier should opt for 1 as the class to be applied to input $x_k = 10{:}01{:}10{:}10{:}01{:}10$.
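These scores can be reproduced with a few lines of numpy (an illustrative sketch, assuming the matrices of equations 3.3, 3.6, and 3.7):

    import numpy as np

    Z_k = np.array([[0,0,1,0], [0,1,0,0], [0,0,1,0],
                    [0,0,1,0], [0,1,0,0], [0,0,1,0]])          # equation 3.3
    M_neg = np.array([[0,1,0,0], [0,0,1,0], [0,1,0,0],
                      [0,0,1,0], [0,0,1,0], [0,0,0,1]])        # M_{-1}, equation 3.6
    M_pos = np.array([[0,0,1,0], [0,1,0,0], [0,0,1,0],
                      [0,0,1,0], [0,1,0,0], [0,0,1,0]])        # M_1, equation 3.7

    s_neg = int((M_neg * Z_k).sum())          # equation 3.4: s_{-1}(Z_k) = 1
    s_pos = int((M_pos * Z_k).sum())          # s_1(Z_k) = 6
    print(s_neg, s_pos, 1 if s_pos >= s_neg else -1)   # 1 6 1, as in equation 3.5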

3.3  The WiSARD n-Tuple Classifier with Bleaching

The mathematical formulation of the WiSARD n-tuple classifier with bleaching focuses on the same concerns as its basic counterpart. The introduction of bleaching threshold β leads to a noticeable change in the definition of the score measure of this classifier.

3.3.1  Notation

Let $B_{N,n}$ be the family of functions that denotes every two-class bleaching n-tuple classifier defined by parameters N and n (B stands for bleaching). Let $B_{N,n}(\cdot\,; M_{-1}, M_1) \in B_{N,n}$ be a function representing a bleaching n-tuple classifier instance, whose learned knowledge on any class $c$ is characterized by memory matrix $M_c \in \mathbb{N}^{N \times 2^n}$. Like its binary counterpart, every network instance $B_{N,n}(\cdot\,; M_{-1}, M_1)$ has two discriminators, denoted $D_{-1}$ and $D_1$. They are, respectively, represented by $M_{-1}$ and $M_1$, the memory matrices that store their learned knowledge. $M_c$ is defined as $M_c = [m_{c,i,j}]_{N \times 2^n}$, where $m_{c,i,j} \in \mathbb{N}$.

Let $s^\beta(Z_k) = [s^\beta_{-1}(Z_k), s^\beta_1(Z_k)]^T$ be the score vector of the bleaching n-tuple classifier for a given fixed threshold $\beta$. The scores $s^\beta_c(Z_k) \in \{0,1,\ldots,N\}$ represent the number of nodes of discriminator $D_c$ whose addressed positions store values greater than $\beta$. So they can be defined as
$$s^\beta_c(Z_k) = \sum_{i=1}^{N}\sum_{j=1}^{2^n} \mathbf{1}_{(\beta,\infty)}(m_{c,i,j})\, z_{k,i,j},$$ (3.8)
where $\mathbf{1}_S(x)$ is the indicator function that returns 1 if $x \in S$ and 0 otherwise. $m_{c,i,j}$ and $z_{k,i,j}$ are, respectively, elements of matrices $M_c$ and $Z_k$.
For any feature matrix $Z_k$ mapped from an input $x_k$, $B_{N,n}$ returns the corresponding class as
$$B_{N,n}(Z_k; M_{-1}, M_1) = \arg\max_{c \in \{-1,1\}} s^\beta_c(Z_k),$$ (3.9)
where $\beta$ is the smallest threshold that satisfies $s^\beta_{-1}(Z_k) \neq s^\beta_1(Z_k)$. In other words, the chosen class is the one whose score is the highest, given a bleaching threshold $\beta$ that is the lowest one for which there is no tie between the discriminators. Similar to the basic model, if there is no such threshold, the classification system may apply some policy to choose one class or use a draw procedure. The notation introduced here is condensed in Table 3.
Table 3:
Notation for Bleaching n-Tuple Classifier.
$B_{N,n}$: Bleaching $N$-node n-tuple classifying model
$B_{N,n}(\cdot\,; M_{-1}, M_1)$: Bleaching $N$-node n-tuple classifier with memory matrices $M_{-1}$ and $M_1$
$\beta$: Bleaching threshold
$s^\beta_{-1}(Z_k)$: Score of $D_{-1}$ for $Z_k$ and threshold $\beta$
$s^\beta_1(Z_k)$: Score of $D_1$ for $Z_k$ and threshold $\beta$
$s^\beta(Z_k)$: Score vector $s^\beta(Z_k) = [s^\beta_{-1}(Z_k), s^\beta_1(Z_k)]^T$

3.3.2  Example

Figure 2 presents a bleaching n-tuple classifier with two discriminators, $D_{-1}$ and $D_1$. They are mathematically represented by memory matrices
$$M_{-1} = \begin{bmatrix} 0 & 2 & 0 & 1 \\ 0 & 0 & 3 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 2 & 0 & 1 \end{bmatrix}$$ (3.10)
and
$$M_1 = \begin{bmatrix} 0 & 0 & 2 & 1 \\ 0 & 2 & 0 & 1 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 2 & 0 & 1 \\ 0 & 0 & 2 & 1 \end{bmatrix},$$ (3.11)
respectively. One can note that the same relation between memory matrix rows and RAM node contents mentioned in section 3.2.2 applies to the bleaching variants of $M_{-1}$ and $M_1$. Figure 2 shows the scores produced by the bleaching n-tuple classifier when subjected to a bleaching threshold $\beta = 2$. In this case, the learning machine produces scores $s^2_{-1}(Z_k) = 0$ and $s^2_1(Z_k) = 2$.
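The same computation for the bleaching classifier of Figure 2 follows equation 3.8 (an illustrative sketch, assuming the matrices of equations 3.3, 3.10, and 3.11):

    import numpy as np

    Z_k = np.array([[0,0,1,0], [0,1,0,0], [0,0,1,0],
                    [0,0,1,0], [0,1,0,0], [0,0,1,0]])          # equation 3.3
    M_neg = np.array([[0,2,0,1], [0,0,3,0], [0,2,1,0],
                      [0,2,1,0], [0,0,3,0], [0,2,0,1]])        # M_{-1}, equation 3.10
    M_pos = np.array([[0,0,2,1], [0,2,0,1], [0,0,3,0],
                      [0,0,3,0], [0,2,0,1], [0,0,2,1]])        # M_1, equation 3.11

    beta = 2
    s_neg = int(((M_neg > beta) * Z_k).sum())   # s^2_{-1}(Z_k) = 0
    s_pos = int(((M_pos > beta) * Z_k).sum())   # s^2_1(Z_k) = 2
    print(s_neg, s_pos)                         # 0 2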

4  The VC Dimension of the Basic WiSARD n-Tuple Classifier

Studies on the VC dimension of the basic WiSARD n-tuple classifier were first conducted by Bradshaw (1996). The work intended to find generalization bounds on this learning system as a means to allow potential comparisons with other machine learning models, as well as to find ways to improve its generalization and mitigate the saturation problem.

Bradshaw (1996) achieved an exact value for the VC dimension of the maximal discriminator n-tuple classifier, that is, a network with a single discriminator, which accepts an input if its score is maximum and rejects it otherwise. Bradshaw (1996) found $d_{VC} = N(2^n - 1)$ for this model, where $d_{VC}$ is its VC dimension, and N and n are the number of nodes and the addressing tuple length, respectively.

No exact value was found for the VC dimension of the two-discriminator n-tuple classifier. Bradshaw's (1996) studies fixed lower and upper bounds for this dimension, asserting that it should be $N(2^n - 1) \leq d_{VC} \leq \log_2 3 \cdot N 2^n$. Finally, Bradshaw (1996) suggested as a conjecture that an exact value for the VC dimension of the two-discriminator architecture is attainable and that it is $d_{VC} = N(2^n - 1)$.

It is important to raise some points before entering the proof itself. First, only n-tuple classifiers with two discriminators are considered, despite the multiclass potential of the model. This is done because the VC dimension is defined on the idea of dichotomies (see section 2.1). Second, the addressing function is pseudo-randomly generated at the network setup and is not changed afterward. Inserting new knowledge into the network does not affect its pseudo-random mapping, so the mapping function does not count as an effective parameter of the n-tuple classifying model and plays no role in its learning capacity. This way, every time an input observation should be considered, its equivalent feature matrix is used instead, which eases the comprehension of the proof without jeopardizing it. Third, an arbitrary class must be chosen in the case of a draw. This work uses 1 as the opted class. Finally, we next present a sketch of the proof.

4.1  Sketch of the Proof

The proof of the VC dimension is divided in two parts: the proof of its lower bound and the proof of its upper bound. The flowchart in Figure 3 follows the idea of the proof. To prove the first part, a set of points is defined, such that its cardinality is the intended lower bound. Then it is shown that for each possible class attribution, there is one n-tuple classifier capable of classifying that set of points accordingly. This proves that the n-tuple classifying model can shatter that set of points, concluding this part of the proof. To prove the upper bound, a system of linear inequalities is produced. This system is based on the score measure of equation 3.4. It should represent the classification of a generic set of points into a generic set of classes of same cardinality, higher than the intended upper bound. Next, it is proved that there is no solution to such a system of inequalities. There is thus at least one set of classes in which there is no way to classify a set of points that large, and this set is not shattered. This concludes the proof of the upper bound. Given that the lower and upper bounds are identical, the VC dimension has an exact value.

Figure 3: Proof flowchart for determination of the VC dimension of the basic WiSARD n-tuple classifier.

4.2  Lower Bound

As explained in section 4.1, the proof of the lower bound of the VC dimension is attained by construction. Given an ordered set of points $X = (x_1, x_2, \ldots)$, with every point $x_k$ having an associated feature matrix $Z_k$, for every class vector $y = [y_1, y_2, \ldots]^T$ of the same cardinality as $X$, there must exist an n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1) \in W_{N,n}$, such that $W_{N,n}(Z_k; M_{-1}, M_1) = y_k$ for all $k$. $X$ must be such that its cardinality is equal to the intended lower bound for the VC dimension.

Lemma 1.

The VC dimension, $d_{VC}$, of a model $W_{N,n}$, as defined in section 3.2.1, is bounded below by $N(2^n - 1) + 1$.

Proof.
Let $X$ be an ordered set of input data points defined as
$$X = \left.\begin{matrix} \{0,1\}^n : 0^n : \cdots : 0^n \\ 0^n : \{0,1\}^n : \cdots : 0^n \\ \vdots \\ 0^n : 0^n : \cdots : \{0,1\}^n \end{matrix}\right\} N$$ (4.1)
and $Z$ its corresponding set of feature matrices.

It is worth noting that X is the set of every input point with at most one non-null addressing n-tuple. Its first line refers to all input points whose first addressing n-tuple can be any binary sequence, but the remaining ones must be all-zero n-tuples. Its second line refers to the input points whose second n-tuple can be any sequence of binary values and the remaining ones must be composed only of zeros. The same pattern applies to every other line of X.

The main intuition behind the choice of $X$ lies in the fact that it is the largest set of input points that avoids an undesirable property: it does not have a subset of four points where each of them differs from two of the other points by a single n-tuple (for example, the set composed of 00:00, 01:00, 01:01, and 00:01). Such a set of points, or any other that contains it, cannot be shattered. There is no WiSARD n-tuple classifier that can classify 00:00 and 01:01 as 1 and 01:00 and 00:01 as $-1$.

For clarity, it is assumed, without loss of generality, that $x_1$, the first element of $X$, is $0^n:0^n:\cdots:0^n$, and $Z_1$ is its corresponding feature matrix. This way, the matching class of this input should be $y_1$. To attest that $W_{N,n}$ shatters $X$, one should, for every class vector $y$, be capable of building an n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1) \in W_{N,n}$, such that $W_{N,n}(Z_k; M_{-1}, M_1) = y_k$ for all $k$. In order to do it, this procedure is done in two cases, each one related to a distinct set of class vectors: case I, the class vectors in which $y_1 = 1$, and case II, those in which $y_1 = -1$. In other words, in case I it is attested whether $W_{N,n}$ can realize half of all possible dichotomies of $X$, and in case II whether $W_{N,n}$ can realize the other half. It is worth noting that the union of both cases comprises all $2^{\mathrm{card}(X)}$ possible dichotomies that can be realized on set $X$, and so if $W_{N,n}$ can realize both sets of dichotomies on $X$, then $W_{N,n}$ shatters $X$.

Case I, $y_1 = 1$: Given a class vector $y = [1, y_2, y_3, \ldots]^T$, let $M_{-1}$ and $M_1$ be memory matrices whose elements $m_{c,i,j}$ are given by^1
$$m_{c,i,j} = \begin{cases} 1, & \text{if } t_{k,i} \neq 0^n \text{ addresses the } j\text{th position of } d_{c,i} \text{ and } y_k = c, \\ 0, & \text{otherwise}, \end{cases}$$ (4.2)
and let there be an n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1) \in W_{N,n}$ whose memory matrices are defined according to equation 4.2.

If an input $x_k \in X \setminus \{0^n:0^n:\cdots:0^n\}$ is presented to the classifier, it yields a score vector $s(Z_k) = [1,0]^T$ if $y_k = -1$, or $s(Z_k) = [0,1]^T$ if $y_k = 1$. This happens because only one non-null position is addressed. If $W_{N,n}(\cdot\,; M_{-1}, M_1)$ receives $x_1 = 0^n:0^n:\cdots:0^n$, its score vector is $s(Z_1) = [0,0]^T$, which leads to a draw, making the network opt for class 1 as default. In short, for any class vector $y = [1, y_2, y_3, \ldots]^T$, there exists a basic n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1)$, such that it accurately classifies any $x_k \in X \setminus \{0^n:0^n:\cdots:0^n\}$ as $y_k$ and also classifies $x_1 = 0^n:0^n:\cdots:0^n$ as $y_1 = 1$.

Case II, $y_1 = -1$: Given a class vector $y = [-1, y_2, y_3, \ldots]^T$, let $M_{-1}$ and $M_1$ be memory matrices whose elements $m_{c,i,j}$ are given by
$$m_{c,i,j} = \begin{cases} 1, & \text{if } t_{k,i} \neq 0^n \text{ addresses the } j\text{th position of } d_{c,i} \text{ and } y_k = c, \text{ or if } (c,i,j) = (-1,1,1), \\ 0, & \text{otherwise}, \end{cases}$$ (4.3)
where $(c,i,j) = (-1,1,1)$ represents the memory position of $d_{-1,1}$ that is addressed by $0^n$. Let there also be an n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1) \in W_{N,n}$ whose memory matrices are defined according to equation 4.3.

Equation 4.3 differs from equation 4.2 in a single element. This difference alters the score vector $s$ by adding 1 to $s_{-1}$ if the first n-tuple of an input $x_k$ is $0^n$. In other words, if an input $x_k \in \{\{0,1\}^n:0^n:\cdots:0^n\} \setminus \{0^n:0^n:\cdots:0^n\}$ is presented to the classifier, it yields the same score vectors it did in case I: $s(Z_k) = [1,0]^T$ if $y_k = -1$, or $s(Z_k) = [0,1]^T$ if $y_k = 1$. For the other inputs, the score vector changes as follows. In case I, when the classifier received $x_1 = 0^n:0^n:\cdots:0^n$, it yielded score vector $s(Z_1) = [0,0]^T$, leading to a draw. In case II, no draw happens because the classifier outputs score vector $s(Z_1) = [1,0]^T$, favoring class $-1$. The remaining possible inputs are those of type $x_k \in X \setminus \{\{0,1\}^n:0^n:\cdots:0^n\}$. When subjected to these inputs, the classifier in case I produced score vectors $s(Z_k) = [1,0]^T$ if $y_k = -1$ and $s(Z_k) = [0,1]^T$ if $y_k = 1$. Because the first n-tuple of $x_k$ is $0^n$, the classifier in case II generates $s(Z_k) = [2,0]^T$ if $y_k = -1$, and $s(Z_k) = [1,1]^T$ if $y_k = 1$. The former favors its corresponding class because $s_{-1}(Z_k) > s_1(Z_k)$. In the latter, a tie happens, and as default, the network opts for class 1. In short, for any class vector $y = [-1, y_2, \ldots]^T$, there exists a basic n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1)$, such that (1) it accurately classifies any $x_k \in \{\{0,1\}^n:0^n:\cdots:0^n\} \setminus \{0^n:0^n:\cdots:0^n\}$ as $y_k$, (2) it does the same for $x_k \in X \setminus \{\{0,1\}^n:0^n:\cdots:0^n\}$, and (3) it classifies $x_1 = 0^n:0^n:\cdots:0^n$ as $y_1 = -1$.

In summary, for every class vector $y$, there exists an n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1) \in W_{N,n}$, such that $W_{N,n}(Z_k; M_{-1}, M_1) = y_k$ for all $k$. That is, $W_{N,n}$ realizes all $2^{\mathrm{card}(X)}$ dichotomies of $X$. So $W_{N,n}$ shatters $X$, and the VC dimension of this model is bounded below by $\mathrm{card}(X) = N2^n - (N - 1) = N(2^n - 1) + 1$.

As a graphical representation of this proof, we offer examples for each particular case ($y_1 = 1$ and $y_1 = -1$). The examples describe dichotomies that a $W_{2,2}$ n-tuple classifying model ($N = 2$ and $n = 2$) realizes on a set of input points. Applying $N = 2$ and $n = 2$ to set $X$, displayed in equation 4.1, one gets
$$X_{2,2} = \{00{:}00,\ 01{:}00,\ 10{:}00,\ 11{:}00,\ 00{:}01,\ 00{:}10,\ 00{:}11\}.$$ (4.4)
Two of the $2^7$ dichotomies $W_{2,2}$ realizes on $X_{2,2}$ are presented, one for each particular case. In both chosen dichotomies, 10:00, 00:10, and 00:11 are classified as 1 and 01:00, 11:00, and 00:01 as $-1$. Input point 00:00 (the 2-node 2-tuple representation of input $x_1 = 0^n:0^n:\cdots:0^n$) is classified as $y_1$, which is 1 in case I and $-1$ in case II. These dichotomies are set out in Table 4.
Table 4:
Dichotomies of $X_{2,2}$.
Classified as 1: 10:00, 00:10, 00:11
Classified as $-1$: 01:00, 11:00, 00:01
Classified as $y_1$: 00:00

The dichotomies displayed in Table 4 imply the WiSARD configurations of Figures 4a and 4b, which are produced by equations 4.2 and 4.3, respectively. Because the corresponding class of inputs 10:00, 00:10, and 00:11 is 1, position 10 of $d_{1,1}$ and positions 10 and 11 of $d_{1,2}$ store the value 1 in both figures. Analogously, inputs 01:00, 11:00, and 00:01 are associated with class $-1$, and so positions 01 and 11 of $d_{-1,1}$ and position 01 of $d_{-1,2}$ also store 1 in both figures. Finally, in Figure 4b, the content of position 00 of $d_{-1,1}$ is also 1, due to equation 4.3.

Figure 4: WiSARD configurations.

Five inputs are provided for this example: 00:00, 01:00, 10:00, 00:01, and 00:10. The first one is the 2-node 2-tuple representation of $x_1 = 0^n:0^n:\cdots:0^n$. The classifier in case I (depicted in Figure 4a) outputs score vector $s(Z_1) = [0,0]^T$, and that in case II (displayed in Figure 4b) outputs $s(Z_1) = [1,0]^T$. The generation of those score vectors is illustrated in Figure 5. The second and third inputs are of type $x_k \in \{\{0,1\}^n:0^n:\cdots:0^n\} \setminus \{0^n:0^n:\cdots:0^n\}$, respectively classified as $-1$ and 1. Both classifiers would produce the score vector $s(Z_k) = [1,0]^T$ for 01:00 and $s(Z_k) = [0,1]^T$ for 10:00 (see Figures 6 and 7). The last two examples are inputs of type $x_k \in X \setminus \{\{0,1\}^n:0^n:\cdots:0^n\}$, also respectively classified as $-1$ and 1. In case I, the classification of these inputs leads to the same score vectors yielded from the classification of 01:00 (see Figures 6a and 8a) and 10:00 (see Figures 7a and 9a). In case II, the score vectors generated this way should be $s(Z_k) = [2,0]^T$ for 00:01 and $s(Z_k) = [1,1]^T$ for 00:10, as can be checked in Figures 8b and 9b, respectively.

Figure 5: Classification of 00:00.
Figure 6: Classification of 01:00.
Figure 7: Classification of 10:00.
Figure 8: Classification of 00:01.
Figure 9: Classification of 00:10.
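To complement this worked example, the following brute-force sketch (an illustration added here, not part of the proof) enumerates every $W_{2,2}$ classifier and confirms that all $2^7$ dichotomies of $X_{2,2}$ are realized, in agreement with lemma 1:

    from itertools import product

    N, n = 2, 2
    # Points of X_{2,2}, one integer address per RAM node:
    # 00:00, 01:00, 10:00, 11:00, 00:01, 00:10, 00:11
    X22 = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (0, 2), (0, 3)]

    def classify(x, M_neg, M_pos):
        s_neg = sum(M_neg[i][x[i]] for i in range(N))
        s_pos = sum(M_pos[i][x[i]] for i in range(N))
        return 1 if s_pos >= s_neg else -1            # draws resolved as class 1

    node_contents = list(product((0, 1), repeat=2 ** n))   # 16 possible node states
    dichotomies = set()
    for M_neg in product(node_contents, repeat=N):
        for M_pos in product(node_contents, repeat=N):
            dichotomies.add(tuple(classify(x, M_neg, M_pos) for x in X22))

    print(len(dichotomies) == 2 ** len(X22))          # True: X_{2,2} is shattered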

4.3  Upper Bound

The VC dimension upper bound of an n-tuple classifying model is obtained through a linear algebra approach. In section 4.1, we advanced that a system of linear inequalities should be derived from the score formula presented in equation 3.4. The relation between the values of scores $s_{-1}(Z_k)$ and $s_1(Z_k)$ plays a vital role in determining the suitable class for a given input, and this is the main intuition behind the development of that system of inequalities.

Proposition 1.
An n-tuple classifying model $W_{N,n}$ shatters a set of feature matrices $Z$, $\mathrm{card}(Z) = \ell$, if and only if, given a generic class vector $y \in \{-1,1\}^\ell$, there exists a set of memory matrix elements $m_{c,i,j} \in \{0,1\}$ that solves
$$\sum_{i=1}^{N}\left(1 - \sum_{j=2}^{2^n} z_{k,i,j}\right)(m_{1,i,1} - m_{-1,i,1}) + \sum_{i=1}^{N}\sum_{j=2}^{2^n} z_{k,i,j}\,(m_{1,i,j} - m_{-1,i,j}) = y_k\left(\mu_k - \tfrac{1}{2}\right),$$ (4.5)
for all $k \in \{1,\ldots,\ell\}$ and some $\mu_k \in \{\tfrac{1}{2}, \tfrac{3}{2}, \ldots, \tfrac{2N+1}{2}\}$, where $z_{k,i,j}$ and $y_k$ are defined as in section 3.1.1.
Proof.
Let $\nu_k$ be the difference between scores $s_1(Z_k)$ and $s_{-1}(Z_k)$, that is,
$$s_1(Z_k) - s_{-1}(Z_k) = \nu_k, \qquad \sum_{i=1}^{N}\sum_{j=1}^{2^n} z_{k,i,j}\,(m_{1,i,j} - m_{-1,i,j}) = \nu_k.$$ (4.6)
Because both $z_{k,i,j}$ and $m_{c,i,j}$ can be either 1 or 0, each addend of equation 4.6 can be $-1$, 0, or 1. But it is worth remembering that for any $i$, $z_{k,i,j} = 1$ for exactly one $j$ and is otherwise 0. As a consequence, $\nu_k$ can assume any integer value in the interval $[-N, N]$. If it is negative, the classifier should choose class $-1$. Otherwise, it should choose 1 (even if there is a draw). Also, given the property concerning the elements $z_{k,i,j}$ of $Z_k$, it is straightforward that
$$z_{k,i,q} = 1 - \sum_{\substack{j=1 \\ j \neq q}}^{2^n} z_{k,i,j}.$$ (4.7)
In order to represent $\nu_k$ as a function of label $y_k$, the variable $\mu_k \in \{\tfrac{1}{2}, \tfrac{3}{2}, \ldots, \tfrac{2N+1}{2}\}$ is introduced. This way,
$$\nu_k = y_k\left(\mu_k - \tfrac{1}{2}\right).$$ (4.8)
This way, one gets the difference of the scores as a linear function of $y_k$. This can be checked by verifying that when $y_k = 1$, $\nu_k \in \{0, 1, \ldots, N\}$, that is, any possible nonnegative score difference between discriminators with $N$ memory nodes. It comprises the case of a tie between discriminators ($\nu_k = 0$) up to the case where every node of $D_1$ fires and none of $D_{-1}$ does ($\nu_k = N$). When $y_k = -1$, $\nu_k \in \{-1, -2, \ldots, -N-1\} \supset \{-1, -2, \ldots, -N\}$, that is, any possible negative score difference between discriminators with $N$ memory nodes. It comprises all possible cases where the score of $D_{-1}$ is greater than that of $D_1$, up to the case where every node of $D_{-1}$ fires and none of $D_1$ does ($\nu_k = -N$). Although identity 4.8 allows $\nu_k$ to be $-N-1$, this value does not truly occur. It poses no problem, nevertheless, as every score difference $\nu_k$ can be successfully represented as a linear function of its respective class $y_k$, as intended.
Combining equations 4.6, 4.7 (for $q = 1$ without loss of generality), and 4.8,
$$\sum_{i=1}^{N}\left(1 - \sum_{j=2}^{2^n} z_{k,i,j}\right)(m_{1,i,1} - m_{-1,i,1}) + \sum_{i=1}^{N}\sum_{j=2}^{2^n} z_{k,i,j}\,(m_{1,i,j} - m_{-1,i,j}) = y_k\left(\mu_k - \tfrac{1}{2}\right).$$ (4.9)

So $M_{-1}$ and $M_1$ must be such that equation 4.9 holds true for an n-tuple classifier $W_{N,n}(\cdot\,; M_{-1}, M_1)$ to correctly categorize a feature matrix $Z_k$. As an immediate consequence, if there exist $M_{-1}$ and $M_1$ such that equation 4.9 holds true for all $k \in \{1,\ldots,\ell\}$, then $W_{N,n}(\cdot\,; M_{-1}, M_1)$ accurately assigns every feature matrix $Z_k \in Z$ to its corresponding class $y_k \in y$, because the score difference will agree in sign with $y_k$, regardless of the value of $\mu_k$.

Finally, if for any generic class vector $y \in \{-1,1\}^\ell$ there exist memory matrices that satisfy equation 4.9 for all $k \in \{1,\ldots,\ell\}$, then for each of the $2^\ell$ possible instances of $y$ there is a classifier $W_{N,n}(\cdot\,; M_{-1}, M_1)$ that precisely classifies every $Z_k \in Z$. In other words, model $W_{N,n}$ shatters $Z$.

Next, it is intended to show that if the number of equations of the existing system (see equation 4.5) is larger than a given upper bound, then there are no memory matrices $M_{-1}$ and $M_1$ that satisfy it for every possible class vector $y \in \{-1,1\}^\ell$.

Proposition 2.

There exist no memory matrices $M_{-1}$ and $M_1$ for which the system of equations 4.5 holds true for a generic class vector $y \in \{-1,1\}^\ell$ if the number of equations $\ell > N(2^n - 1) + 1$.

Proof.

The system of equations 4.5 is linear in $m_{c,i,j}$, and so it can be expressed in the form $Ax = b$. According to the Rouché-Capelli theorem (Meyer, 2000), a system of linear equations has a solution if and only if the rank of its coefficient matrix $A$ is equal to the rank of its augmented matrix $[A \mid b]$. Because many columns of the coefficient matrix of system 4.5 can be produced through linear combinations of others, eliminations must be done to determine its rank.

At first, the system of equations has $N2^{n+1}$ variables, and so the rank of its coefficient matrix is at most that same value. The coefficients of $m_{-1,i,j}$ and $m_{1,i,j}$ are, respectively, $-z_{k,i,j}$ and $z_{k,i,j}$ when $j \geq 2$, and $-(1 - \sum_{j=2}^{2^n} z_{k,i,j})$ and $1 - \sum_{j=2}^{2^n} z_{k,i,j}$ when $j = 1$. So there is a linear dependency between the coefficients of $m_{-1,i,j}$ and $m_{1,i,j}$, and then the coefficients of all variables $m_{-1,i,j}$ can be eliminated through a linear combination with those of $m_{1,i,j}$. This way, $N2^n$ columns are linearly dependent on others (one column for every pair $(i,j)$, $1 \leq i \leq N$ and $1 \leq j \leq 2^n$).

Further eliminations can be done. For every $i \geq 2$, the coefficients of $m_{1,i,1}$ (namely $z_{k,i,1} = 1 - \sum_{j=2}^{2^n} z_{k,i,j}$) can be represented as a linear combination of those of $m_{1,1,1}$ ($z_{k,1,1} = 1 - \sum_{j=2}^{2^n} z_{k,1,j}$) and those of $m_{1,1,j}$ ($z_{k,1,j}$) and $m_{1,i,j}$ ($z_{k,i,j}$) for every $j \geq 2$. This can be verified in
$$1 - \sum_{j=2}^{2^n} z_{k,i,j} = \left(1 - \sum_{j=2}^{2^n} z_{k,1,j}\right) + \sum_{j=2}^{2^n} z_{k,1,j} - \sum_{j=2}^{2^n} z_{k,i,j}.$$ (4.10)
This procedure determines that there is one column that is linearly dependent on others for every $i \geq 2$. This way, $N - 1$ columns are linearly dependent on others and can be eliminated.

In summary, two significant elimination processes were made to determine that there are at most $N2^{n+1} - N2^n - (N-1) = N(2^n - 1) + 1$ linearly independent columns. Therefore, when the system of equations 4.5 has more than $N(2^n - 1) + 1$ equations, the rank of its coefficient matrix is at most $N(2^n - 1) + 1$. It is worth noting that no further linear relations could be retrieved, because that would drive the VC dimension below the lower bound determined by lemma 1. In other words, no n-tuple classifier like the one assembled in that lemma could exist, which is clearly impossible.

Because there is no prior linear relation between class vector $y$ and the ordered set of feature matrices $Z$, the rank of the augmented matrix is greater than that of the coefficient matrix if their corresponding system has more than $N(2^n - 1) + 1$ equations. Hence, there is no solution for such a system of linear equations.
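The rank bound at the heart of this proof can also be observed numerically. The sketch below (an illustration with randomly drawn feature matrices, not part of the proof) stacks many equations of the form 4.6 and checks that the coefficient matrix of the memory elements never exceeds rank $N(2^n - 1) + 1$:

    import numpy as np

    rng = np.random.default_rng(42)
    N, n = 3, 2
    cols = 2 ** n

    def random_feature_matrix():
        """Z_k has exactly one 1 per row: each n-tuple addresses one position."""
        Z = np.zeros((N, cols), dtype=int)
        Z[np.arange(N), rng.integers(0, cols, size=N)] = 1
        return Z

    ell = 40                                  # far more equations than the bound
    A = np.zeros((ell, 2 * N * cols), dtype=int)
    for k in range(ell):
        zk = random_feature_matrix().ravel()
        A[k, : N * cols] = zk                 # coefficients of the m_{1,i,j}
        A[k, N * cols:] = -zk                 # coefficients of the m_{-1,i,j}

    bound = N * (2 ** n - 1) + 1              # 10 for N = 3, n = 2
    print(np.linalg.matrix_rank(A) <= bound)  # True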

Due to the results of propositions 1 and 2, it is straightforward to determine an upper bound for the VC dimension of the basic n-tuple classifier:

Lemma 2.

The VC dimension, $d_{VC}$, of a model $W_{N,n}$, as defined in section 3.2.1, is bounded above by $N(2^n - 1) + 1$.

Proof.

There is no solution for the system of equations 4.5 if it has at least $N(2^n - 1) + 2$ equations. This implies that $W_{N,n}$ does not shatter any set of feature matrices $Z$, $\mathrm{card}(Z) = N(2^n - 1) + 2$.

Theorem 1.

The VC dimension, $d_{VC}$, of a model $W_{N,n}$, as defined in section 3.2.1, is given by $N(2^n - 1) + 1$.

Proof.

The VC dimension of $W_{N,n}$ is bounded above and below by $N(2^n - 1) + 1$. Then, it is exactly $N(2^n - 1) + 1$.
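For instance, for the classifier depicted in Figure 1b ($N = 6$, $n = 2$), theorem 1 gives $d_{VC} = 6(2^2 - 1) + 1 = 19$.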

5  The VC Dimension of the WiSARD with Bleaching

The bleaching technique provided an advantage to the n-tuple classifier by mitigating its proneness to saturation. Results of experiments conducted with this architecture indicated a noticeable improvement over what could be achieved by the basic learning model (Grieco et al., 2010; de Souza et al., 2014, 2015; Carneiro et al., 2015; Nascimento et al., 2015; Cardoso et al., 2016). Historically, the n-tuple classifier accuracy tended to worsen if the amount of data to be learned by the system was large enough. The capability of learning from massive data without getting saturated showed that the classifier could achieve higher accuracies (with lower variance) as more data were presented to it. The model, however, lacked a theoretical foundation. A study of its generalization capacity could bring a better comprehension of how the network works and could also provide insights into how it could be further improved.

The proof of the VC dimension of the bleaching n-tuple classifier is divided in two parts, the proof of its lower bound and the proof of its upper bound, as depicted in Figure 10. The lower bound is immediately obtained from the fact that the bleaching n-tuple classifier is a generalization of its basic counterpart. The discriminators of the bleaching n-tuple classifier yield scores that depend on a given threshold β. The proof of the upper bound introduces a new score measure, which does not depend on a bleaching threshold. After it is properly defined, it is important to prove that this score really does reproduce the intended outcome of the class selection criterion of the two-class bleaching n-tuple classifier, as introduced in section 2.2.2.

Figure 10: Proof flowchart for determination of the VC dimension of the bleaching n-tuple classifier.

The use of this score measure leads to a system of linear equations, similar to the one of proposition 1. By a procedure similar to that of proposition 2, one deduces that the system of equations has no solution if it has more equations than the intended upper bound of the VC dimension. Therefore, this upper bound is attained. Again the lower and upper bounds are the same, and so the VC dimension of the bleaching n-tuple classifier has an exact value.

The proof of the VC dimension of the bleaching n-tuple classifying model starts by proving its lower bound:

Corollary 1.

The VC dimension, $d_{VC}$, of a model $B_{N,n}$, as defined in section 3.3.1, is bounded below by $N(2^n - 1) + 1$.

Proof.

The bleaching n-tuple classifier is a generalization of the basic WiSARD, for the former works exactly like the latter if its memory matrices store only binary values. Consequently, $d_{VC}$ is bounded below by $N(2^n - 1) + 1$, the VC dimension of the basic n-tuple classifier, as indicated by theorem 1.

The proof of the upper bound of the VC dimension of the bleaching learning model needs the introduction of a score measure that does not depend on the bleaching threshold β for the reasons exposed in the beginning of this section.

Definition 5.

Let $\Lambda_{-1} = [\lambda_{-1,i,j}]_{N \times 2^n}$ and $\Lambda_1 = [\lambda_{1,i,j}]_{N \times 2^n}$ be matrices with the same dimensions as $M_{-1}$ and $M_1$, respectively. For a given class $c \in \{-1,1\}$, the elements of $\Lambda_c$ are defined as $\lambda_{c,i,j} = -N^{-m_{c,i,j}}$.

Definition 6.
Let $s(Z_k) = [s_{-1}(Z_k), s_1(Z_k)]^T$ be a $\beta$-independent score vector. For a given class $c \in \{-1,1\}$,
$$s_c(Z_k) = \sum_{i=1}^{N}\sum_{j=1}^{2^n} \lambda_{c,i,j}\, z_{k,i,j} = -\sum_{i=1}^{N}\sum_{j=1}^{2^n} N^{-m_{c,i,j}}\, z_{k,i,j}.$$ (5.1)

Corollary 2.
$$B_{N,n}(Z_k; M_{-1}, M_1) = \arg\max_{c \in \{-1,1\}} s_c(Z_k).$$ (5.2)
Proof.

Let $v_{-1,i}$ and $v_{1,i} \in \mathbb{N}$ respectively be the values of the accessed position of the $i$th node of $D_{-1}$ and $D_1$, hereafter denoted as node scores. Suppose, without loss of generality, that $v_{-1,i} \geq v_{-1,j}$ and $v_{1,i} \geq v_{1,j}$ if $i < j$, that is, nodes of both discriminators are sorted so that the highest values of the accessed positions correspond to the ones with lowest indices.

Suppose also, without loss of generality, that $B_{N,n}(Z_k; M_{-1}, M_1) = 1$, that is, 1 was the class chosen by the classifier. According to the mathematical formulation given in section 3.3.1, there is a bleaching threshold $\beta$ that is the smallest one that satisfies $s^\beta_1(Z_k) > s^\beta_{-1}(Z_k)$.

For a bleaching n-tuple classifier to produce such scores, there must exist an $N' \leq N$, such that the learning machine has the following node scores: (1) $v_{1,i} \geq \beta + 1$ and $v_{-1,i} \geq \beta$, for every $i < N'$; (2) $v_{1,N'} \geq \beta + 1$; (3) $v_{-1,N'} = \beta$; and (4) $v_{1,i} = v_{-1,i} \leq \beta$, for every $i > N'$. A classifier with these node scores opts for class 1 (as intended), because $\beta$ is the smallest threshold for which there is a tie break, with $s^\beta_1(Z_k) = N'$ and $s^\beta_{-1}(Z_k) \leq N' - 1$.

Because it was supposed, without loss of generality, that the classifier would select class 1, the $\beta$-independent score $s_1(Z_k)$ must be greater than $s_{-1}(Z_k)$ for the corollary statement to be proven true. As
$$s_1(Z_k) \geq -\sum_{i=N'+1}^{N} N^{-v_{1,i}} - N' N^{-\beta-1},$$ (5.3)
$$s_{-1}(Z_k) < -\sum_{i=N'+1}^{N} N^{-v_{-1,i}} - N^{-\beta},$$ (5.4)
and
$$v_{1,i} = v_{-1,i} \leq \beta, \quad \text{for every } i > N',$$ (5.5)
then
$$s_1(Z_k) - s_{-1}(Z_k) > -N' N^{-\beta-1} + N^{-\beta} \geq -N^{-\beta} + N^{-\beta} = 0.$$ (5.6)
That is, $s_1(Z_k) > s_{-1}(Z_k)$.

Supposing the selected class to be $-1$ would lead to an analogous proof that would result in $s_{-1}(Z_k) > s_1(Z_k)$, as expected. Finally, for a draw scenario, every node score $v_{1,i}$ should be equal to $v_{-1,i}$, for every $i$. This way, the $\beta$-independent scores $s_1(Z_k)$ and $s_{-1}(Z_k)$ would also be equal.

Summarizing, when $s_1(Z_k) > s_{-1}(Z_k)$, the network opts for class 1; if $s_{-1}(Z_k) > s_1(Z_k)$, it chooses $-1$; and if both $\beta$-independent scores are identical, the classifier outputs that there is a draw. Thus, it confirms the identity proposed by this corollary.

The use of $\beta$-independent scores makes the function $B_{N,n}(\cdot\,; M_{-1}, M_1)$ equivalent to $W_{N,n}(\cdot\,; \Lambda_{-1}, \Lambda_1)$ (compare definition 6 and corollary 2 to equations 3.4 and 3.5). Consequently, a system of linear equations with similar characteristics to equation 4.5 can be derived from the relation between the $\beta$-independent scores, as noted at the beginning of this section.
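As an informal cross-check of corollary 2 (an illustration added here, not part of the proof), the sketch below compares the dynamic-threshold bleaching decision with the $\beta$-independent score on randomly drawn node scores; only the values $v_{c,i}$ of the addressed positions matter, so they stand in for full memory matrices:

    from fractions import Fraction
    import random

    random.seed(1)
    N = 3                                     # RAM nodes per discriminator

    def bleaching_decision(v_neg, v_pos):
        """Raise beta until the counts of nodes storing values > beta untie."""
        for beta in range(max(v_neg + v_pos) + 1):
            s_neg = sum(v > beta for v in v_neg)
            s_pos = sum(v > beta for v in v_pos)
            if s_neg != s_pos:
                return 1 if s_pos > s_neg else -1
        return 0                              # unresolved draw

    def beta_free_decision(v_neg, v_pos):
        """Compare s_c = -sum_i N^(-v_{c,i}) exactly, as in definition 6."""
        s_neg = -sum(Fraction(N) ** -v for v in v_neg)
        s_pos = -sum(Fraction(N) ** -v for v in v_pos)
        return 0 if s_neg == s_pos else (1 if s_pos > s_neg else -1)

    trials = [([random.randint(0, 4) for _ in range(N)],
               [random.randint(0, 4) for _ in range(N)]) for _ in range(10000)]
    print(all(bleaching_decision(a, b) == beta_free_decision(a, b)
              for a, b in trials))            # True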

Lemma 3.

The VC dimension, $d_{VC}$, of a model $B_{N,n}$, as defined in section 3.3.1, is bounded above by $N(2^n - 1) + 1$.

Proof.

Analogous to proposition 1, the formulas of the $\beta$-independent score measures $s_{-1}(\cdot)$ and $s_1(\cdot)$ lead to a system of linear equations whose coefficient matrix is the same. The column that represents the right-hand side of the equations is also linearly independent of those of the coefficient matrix. Therefore, such a system lacks a solution if it has at least $N(2^n - 1) + 2$ equations. Then, $B_{N,n}$ does not shatter any set of feature matrices $Z$, $\mathrm{card}(Z) = N(2^n - 1) + 2$.

Theorem 2.

The VC dimension, $d_{VC}$, of a model $B_{N,n}$, as defined in section 3.3.1, is given by $N(2^n - 1) + 1$.

Proof.

The VC dimension of $B_{N,n}$ is bounded above and below by $N(2^n - 1) + 1$. Then it is exactly $N(2^n - 1) + 1$.

6  Discussion

Even an exact VC dimension for WiSARD does not provide a good rule of thumb to define N and n. Empirical results from applications of the WiSARD n-tuple classifier strongly suggest that the data set size could be used to further the analysis of WiSARD learning capacity, even though the model does not require regularization. Another related topic is that of the sparsity of the data set elements.

Calculations of the VC dimensions of WiSARD and bleaching n-tuple classifiers show that their values are the same, indicating that the bleaching technique mitigates saturation and does little to no harm to the network generalization capacity. Their values, however, are higher than what would be expected from previous experimental work. It is worth noting that Bradshaw's (1996) conjecture was very close to the exact value found in this work. For instance, the WiSARD network employed in Carneiro et al. (2017) has discriminators with more than 95 nodes, each one addressed by n-tuples of length 88. The VC dimension of this learning machine is approximately $3 \cdot 10^{28}$, which is far greater than the size of the largest data set it was subjected to (fewer than $1.3 \cdot 10^6$ elements). The determination of the VC dimension of basic and bleaching n-tuple classifiers was performed for dichotomous classifications. Despite the polychotomous nature of part-of-speech tagging, the analysis given here provides insight into the learning capacity of those learning machines in a practical application. Yet, classically in the literature, multiclass classification is often executed through algorithms that use dichotomous classifications, such as multiclass support vector machines (SVMs) (Hsu & Lin, 2002).
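That order of magnitude can be reproduced directly from theorem 2, using the quoted minimum of 95 nodes and tuple length 88 (a back-of-the-envelope check added here for convenience):

    N, n = 95, 88
    d_vc = N * (2 ** n - 1) + 1
    print(float(d_vc))   # about 2.94e+28, in line with the 3 x 10^28 quoted above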

The VC dimensions of both models suggest a similarity between them and other learning machines whose input is transformed into a feature vector, which is affected by the model effective parameters to produce the system output. For instance, an extreme learning machine (ELM; Huang, 2003; Huang, Wang, & Lan, 2011) has pseudo-randomly generated weights that connect its input layer to its hidden one. These weights are analogous to the addressing function introduced in section 3.1.1. In addition, the content of the hidden-layer neurons is affected by a weighted sum, which produces the ELM output. A similar process happens in WiSARD and bleaching n-tuple classifiers, where the feature vector generated by the addressing function is subjected to the system memory matrices in order to yield the network response. A comparison between the VC dimensions of those models also indicates a similarity between them. The VC dimension of ELMs is determined by the number of weights connecting its hidden to its output layer (Liu, Gao, & Li, 2012). In a similar fashion, the number of addressable positions of an n-tuple classifier is tightly related to its VC dimension.

Despite their similarities, the weightless recognition methods differ in some points from their weighted counterparts. The latter rely on optimization methods and update all their weights in a training procedure, while the former change only the memory matrix elements that are addressed when the network is being trained. In this way, WiSARD and bleaching n-tuple classifiers can generalize quite well even if the amount of training data is far smaller than their VC dimensions. The same does not happen in ELMs (Liu et al., 2012). Actually, both weightless learning systems seem to have fewer effective parameters than their VC dimensions imply. They have a pay-only-for-what-you-use policy, where the only positions to be updated are the ones that are revealed as relevant during the training procedure.

One might get a glimpse of the impact of this policy by calculating the VC dimension of a particular subset of WiSARD and bleaching N-node n-tuple classifiers. If a WiSARD network is exposed to a training data set composed of $\omega_1$ observations associated with class 1 and $\omega_{-1}$ with class $-1$, then every node $d_{1,i}$ of discriminator $D_1$ should have at most $\omega_1$ positions storing 1. Analogously, every node $d_{-1,i}$ of $D_{-1}$ should have at most $\omega_{-1}$ positions storing 1. This way, by appending the inequalities
$$\sum_{j=1}^{2^n} m_{c,i,j} \le \omega_c, \quad \forall i \in \{1,\dots,N\} \text{ and } c \in \{-1,1\} \tag{6.1}$$
to the system introduced in proposition 6, one could obtain a capacity measure lower than the calculated VC dimension, which would better depict how that policy affects WiSARD generalization capacity. The bleaching n-tuple classifier has a very characteristic property: if a learning machine of this sort is exposed to that same training data set, then the sum of the contents of every node $d_{c,i}$ of a discriminator $D_c$ is always $\omega_c$. To attain the desired measure, it would be necessary to append the equations
$$\sum_{j=1}^{2^n} m_{c,i,j} = \omega_c, \quad \forall i \in \{1,\dots,N\} \text{ and } c \in \{-1,1\} \tag{6.2}$$
to its equivalent linear system. But because that system relies on the modified memory matrices $\Lambda_{-1}$ and $\Lambda_1$, and not on $M_{-1}$ and $M_1$, appending those equations would introduce a nonlinearity that would make the system considerably harder to solve. The task becomes easier if those equations are replaced by
$$-2^{n}N - 2^{-n}\omega_c \le \sum_{j=1}^{2^n} \lambda_{c,i,j} \le -2^{n} + 1 - N - \omega_c, \quad \forall i \in \{1,\dots,N\} \text{ and } c \in \{-1,1\}. \tag{6.3}$$
However, this might pose another problem: an exact solution is unlikely to be found, so this measure would probably have to be expressed through lower and upper bounds.
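The property behind equation 6.2 is easy to verify numerically: since each training example increments exactly one position of every node, the counters of any node must total the number of examples the discriminator has seen. The snippet below is a small self-contained simulation under that reading; the names, tuple layout, and random data are ours.

```python
import random

def train_counts(examples, tuples):
    """Bleaching-style training: return one counter dictionary per RAM node."""
    nodes = [dict() for _ in tuples]
    for x in examples:
        for node, t in zip(nodes, tuples):
            addr = tuple(x[b] for b in t)
            node[addr] = node.get(addr, 0) + 1
    return nodes

rng = random.Random(42)
input_len, n = 16, 4
tuples = [list(range(i, i + n)) for i in range(0, input_len, n)]  # N = 4 nodes

omega_c = 1000  # number of class-c training observations
examples = [[rng.randint(0, 1) for _ in range(input_len)] for _ in range(omega_c)]

nodes = train_counts(examples, tuples)
assert all(sum(node.values()) == omega_c for node in nodes)
print("every node's contents sum to", omega_c)
```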

7  Conclusion

This letter provides the exact values of the VC dimension of the WiSARD and bleaching n-tuple classifiers. Because these values are identical, we conclude that the bleaching technique enhances the WiSARD model while doing little to no harm to its generalization capability. This work also draws a parallel with other pattern recognition machines, showing similarities and differences between them and the weightless systems. We also argue that the pay-only-for-what-you-use policy of the n-tuple classifiers can be interpreted as an advantage, since it implies that the model's generalization capability should grow with the data set employed for training.

Other selection criteria can be used for the dynamic threshold β of the bleaching n-tuple classifier, allowing tie-breaking policies other than the one introduced in section 2.2.2 and formally defined in section 3.3.1. Studying the generalization capacity of those other policies and investigating which one is best suited for a given task follow almost directly from this research and are proposed as future work. A deeper analysis of the impact of the training data set size on the generalization capacity of WiSARD and bleaching n-tuple classifiers is also suggested as a further improvement to this work. Because of their multidiscriminator nature, the basic and bleaching recognition machines could also benefit from studies of their generalization capacity employing multicategorical extensions of the VC dimension.

Note

1. Equation 4.2 represents the matrices $M_{-1}$ and $M_1$ that define a WiSARD model $W_{N,n}(\cdot\,; M_{-1}, M_1) \in \mathcal{W}_{N,n}$ realizing one particular dichotomy of $X$. It shows some resemblance to the basic n-tuple method training rule, except that in equation 4.2 not all addressed memory positions are set to 1, only those whose addressing tuple is not $0^n$.

Acknowledgments

The authors would like to thank CNPq, FAPERJ, and FINEP Brazilian research agencies. This study was also financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. The authors would also like to thank Nicholas P. Bradshaw, Ph.D., for his contribution and support.

References

Abu-Mostafa, Y. N., Magdon-Ismail, M., & Lin, H.-T. (2012). Learning from data. AMLBook.
Aleksander, I., Thomas, W. V., & Bowden, P. A. (1984). WISARD: A radical step forward in image recognition. Sensor Review, 4, 120–124.
Azhar, H. B., & Dimond, K. (2004). A stochastic search algorithm to optimize an N-tuple classifier by selecting its inputs. In A. Campilho & M. Kamel (Eds.), Lecture Notes in Computer Science: Vol. 3211. Image analysis and recognition (pp. 556–563). Berlin: Springer.
Bledsoe, W. W., & Bisson, C. L. (1962). Improved memory matrices for the n-tuple pattern recognition method. IRE Transactions on Electronic Computers, EC-11(3), 414–415.
Bledsoe, W. W., & Browning, I. (1959). Pattern recognition and reading by machine. In Papers of the Eastern Joint IRE-AIEE-ACM Computer Conference (pp. 225–232). New York: ACM.
Bradshaw, N. P. (1996). An analysis of learning in weightless neural systems. Ph.D. diss., Imperial College London.
Bradshaw, N. P. (1997). The effective VC dimension of the n-tuple classifier. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial Neural Networks—ICANN'97 (pp. 511–516). Lecture Notes in Computer Science, Vol. 1327. Berlin/Heidelberg: Springer.
Bradshaw, N. P., & Aleksander, I. (1996). Improving the generalization of the n-tuple classifier using the effective VC dimension. Electronics Letters, 32(20), 1904–1905.
Cardoso, D. O., Carvalho, D. S., Alves, D. S. F., de Souza, D. F. P., Carneiro, H. C. C., Pedreira, C. E., Lima, P. M. V., & França, F. M. G. (2016). Financial credit analysis via a clustering weightless neural classifier. Neurocomputing, 183, 70–78.
Carneiro, H. C. C., França, F. M. G., & Lima, P. M. V. (2015). Multilingual part-of-speech tagging with weightless neural networks. Neural Networks, 66, 11–21.
Carneiro, H. C. C., Pedreira, C. E., França, F. M. G., & Lima, P. M. V. (2017). A universal multilingual weightless neural network tagger via quantitative linguistics. Neural Networks, 91, 85–101.
Carvalho, D. S., Carneiro, H. C. C., França, F. M. G., & Lima, P. M. V. (2013). B-bleaching: Agile overtraining avoidance in the WiSARD weightless neural classifier. In M. Verleysen (Ed.), Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 515–520). Louvain-la-Neuve, Belgium: i6doc.
de Souza, D. F. P., França, F. M. G., & Lima, P. M. V. (2014). Spatio-temporal pattern classification with Kernel Canvas and WiSARD. In Proceedings of the 2014 Brazilian Conference on Intelligent Systems. Piscataway, NJ: IEEE.
de Souza, D. F. P., França, F. M. G., & Lima, P. M. V. (2015). Real-time music tracking based on a weightless neural network. In Proceedings of the 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems. Piscataway, NJ: IEEE.
Gregorio, M. D. (1997). On the reversibility of multi-discriminator systems (Technical Report 125/97). Pozzuoli, Italy: Istituto di Cibernetica.
Grieco, B. P. A., Lima, P. M. V., Gregorio, M. D., & França, F. M. G. (2010). Producing pattern examples from "mental" images. Neurocomputing, 73(7–9), 1057–1064.
Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425.
Huang, G.-B. (2003). Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2), 274–281.
Huang, G.-B., Wang, D. H., & Lan, Y. (2011). Extreme learning machines: A survey. International Journal of Machine Learning and Cybernetics, 2(2), 107–122.
Jørgensen, T. M., & Linneberg, C. (1999). Theoretical analysis and improved decision criteria for the n-tuple classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4), 336–347.
Linneberg, C., & Jørgensen, T. M. (1999). Cross-validation techniques for n-tuple-based neural networks. In T. Lindblad, M. L. Padgett, & J. M. Kinser (Eds.), Proceedings of the Ninth Workshop on Virtual Intelligence/Dynamic Neural Networks. Bellingham, WA: International Society of Optical Engineering.
Liu, X., Gao, C., & Li, P. (2012). A comparative analysis of support vector machines and extreme learning machines. Neural Networks, 33, 58–66.
Meyer, C. D. (2000). Matrix analysis and applied linear algebra. Philadelphia: Society for Industrial and Applied Mathematics.
Mitchell, R. J., Bishop, J. M., & Minchinton, P. R. (1996). Optimising memory usage in n-tuple neural networks. Mathematics and Computers in Simulation, 40(5–6), 549–563.
Nascimento, D. N., de Carvalho, R. L., Mora-Camino, F., Lima, P. M. V., & França, F. M. G. (2015). A WiSARD-based multi-term memory framework for online tracking of objects. In M. Verleysen (Ed.), Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 19–24). Louvain-la-Neuve, Belgium: i6doc.
Rohwer, R., & Morciniec, M. (1996). A theoretical and experimental account of n-tuple classifier performance. Neural Computation, 8(3), 629–642.
Rohwer, R., & Morciniec, M. (1997). The generalisation cost of RAMnets. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 253–259). Cambridge, MA: MIT Press.
Rohwer, R., & Morciniec, M. (1998). The theoretical and experimental status of the n-tuple classifier. Neural Networks, 11(1), 1–14.
Roy, R. J., & Sherman, J. (1967). Two viewpoints of k-tuple pattern recognition. IEEE Transactions on Systems Science and Cybernetics, 3(2), 117–120.
Steck, G. P. (1962). Stochastic model for the Browning-Bledsoe pattern recognition scheme. IRE Transactions on Electronic Computers, EC-11(2), 274–282.
Stonham, T. J. (1977). Improved Hamming-distance analysis for digital learning networks. Electronics Letters, 13(6), 155.
Tarling, R., & Rohwer, R. (1993). Efficient use of training data in the n-tuple recognition method. Electronics Letters, 29(24), 2093.
Ullmann, J. R. (1969). Experiments with the n-tuple method of pattern recognition. IEEE Transactions on Computers, 18(12), 1135–1137.
Ullmann, J. R. (1971). Reduction of the storage requirements of Bledsoe and Browning's n-tuple method of pattern recognition. Pattern Recognition, 3(3), 297–306.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2), 264–280.
Wickert, I., & França, F. M. G. (2001). AUTOWISARD: Unsupervised modes for the WISARD. In J. Mira & A. Prieto (Eds.), Connectionist models of neurons, learning processes, and artificial intelligence (pp. 435–441). Berlin: Springer.