Image pattern classification is a challenging task due to the large search space of pixel data. Supervised and subsymbolic approaches have proven accurate at learning a problem's classes. However, in the complex image recognition domain, there is a need to investigate learning techniques that allow humans to interpret the learned rules in order to gain insight into the problem. Learning classifier systems (LCSs) are a machine learning technique that has been minimally explored for image classification. This work develops the feature pattern classification system (FPCS) framework by adopting Haar-like features from the image recognition domain for feature extraction. The FPCS integrates Haar-like features with XCS, an accuracy-based LCS. A major contribution of this work is that the developed framework is capable of producing human-interpretable rules. The FPCS achieved 91.1% accuracy on the unseen test set of the MNIST dataset. In addition, the FPCS is capable of autonomously adjusting the rotation angle in unaligned images. This rotation adjustment raised the accuracy of the FPCS to 95%. Although this performance is competitive with equivalent approaches, it is not as accurate as subsymbolic approaches on this dataset. However, the interpretability of the rules produced by the FPCS enabled us to identify the distribution of the learned angles, a normal distribution around the upright position, which would have been very difficult with subsymbolic approaches. The analyzable nature of the FPCS is anticipated to be beneficial in domains such as speed sign recognition, where the underlying reasoning and confidence of recognition need to be human interpretable.
Images provide a rich source of information for artificial agents, from object recognition to the detection of salient patterns. Historically, computer vision has been considered one of the hard applications for machine learning, primarily due to the challenges posed by the high dimensionality of image data (Osuna et al., 1997). While many computer vision problems can be largely solved with modern supervised (as in "with ground truth data labels available"), off-line (as in "the entire dataset is available at once") learning algorithms by generalizing over sufficiently large training sets, visual learning through reinforcement remains a challenging task. Furthermore, it may be necessary to examine the reasoning behind the decisions made and the confidence in those decisions, for example, in speed sign recognition for autonomous cars.
Learning classifier systems (LCSs) have their roots in cognitive systems and are capable of dealing with simultaneous-response classification problems where they can learn general rules for complex multi-class problems. Reinforcement learning is particularly useful in dynamic and unknown scenarios where obtaining correct examples that represent all the situations that the agent may encounter is impractical (Sutton and Barto, 1998). LCSs combine genetic evolutionary operators and reinforcement learning to evolve a population of decision rules. The result is a system that enables agents to successfully learn to operate within unknown, and possibly dynamic, environments (Orriols-Puig et al., 2009; Butz et al., 2004).
This work adopts LCSs for the pattern recognition domain (Kukenys et al., 2011a, 2011b), since the capabilities of LCSs have been minimally explored there. LCSs were applied to handwritten letter classification as early as Frey and Slate (1991). In the almost two decades since, little work has used LCSs in this domain, as the reported results were suboptimal given the technology available at the time. However, advancements in various aspects of machine learning and vision processing techniques are anticipated to lead to significant improvements in the performance of LCSs.
A major issue when adopting LCSs in the image domain is the creation of conditions for rules. Traditional systems use pixel-level information to create their conditions; for example, Frey and Slate (1991) used 16 numerical attributes representing primitive statistical features of the pixel distribution. Conditions based on pixel-level values are low-level and may not provide informative features. Moreover, such conditions do not scale well as the size of the images grows.
The first contribution of this paper is adapting LCSs to the image recognition domain. This is achieved through the development of a framework called the feature pattern classification system (FPCS). To the best of our knowledge, this is the first time that LCSs have been assimilated into the image recognition domain. In order to adjust LCSs for vision pattern recognition tasks, the modules that needed amendment were identified and modified. The second contribution investigates how Haar-like features can be utilized to produce conditions in LCSs. Identifying important features in computer vision and pattern classification applications is extremely difficult due to the sparseness of patterns compared with the total number (or type) of features. We demonstrate how Haar-like features were adjusted to work with LCSs. In addition, we apply the FPCS to online, dynamic situations where new classes of the problem may be introduced into the system, and demonstrate that the FPCS is capable of adapting in such dynamic domains. The last contribution utilizes the flexibility of the framework to include image manipulation in an analyzable format: we attempt to improve the classification rate by enabling the FPCS to autonomously adjust the rotation angle of images. The human-interpretable rules enabled analysis of the results to readily identify the distribution of the learned angles.
The MNIST dataset (LeCun et al., 1998) was chosen as a benchmark for testing the FPCS. It contains images of the real world and thus provides a realistic test problem. It is a competitive benchmark and has been widely used for evaluating the performance of various methods on the handwritten-digit recognition problem, serving as a standard for comparison between different pattern recognition methods. We compare the results of the FPCS on the MNIST dataset against state-of-the-art techniques published in the literature, even where their aim was different, that is, pure classification accuracy rather than an online regime and human-interpretable rules.
The rest of this paper is structured as follows. Section 2 reviews different classification techniques for the pattern recognition domain. Section 3 describes the various components of the FPCS. It provides the details of LCSs and how they were integrated with Haar-like features. Moreover, it explains how LCSs have been adjusted for the image domain. Section 4 provides the details of the benchmark dataset and the performance results of the FPCS in both off-line and online scenarios. Section 5 demonstrates how the FPCS can automatically adjust the rotation angle in unaligned images. Section 6 compares the results of the FPCS to other classification methods on the same dataset. Section 7 provides the discussion and future work, and finally Section 8 concludes the paper.
2 Related Work
Various machine learning techniques have been used in the pattern classification domain. This section briefly introduces relevant techniques and identifies their advantages and disadvantages in the pattern recognition domain.
Random forest classifiers (RFs), or randomized trees, were introduced into the machine learning field by Amit and Geman (1997) and further developed by Breiman (2001). They have been applied to object recognition and classification tasks (Bosch et al., 2007a; Moosmann et al., 2008). RFs are defined as a collection of tree-like structures where each tree represents a classifier. Each tree is determined by the values of a random vector that is sampled independently and identically for all trees. RFs offer a probabilistic output and are capable of sharing features between classes, similar to other multi-class classifiers. These features, together with the robustness of this classification method with respect to noise, led to its application in various supervised classification tasks (Maree et al., 2005; Deselaers et al., 2007). RFs are known to have issues such as lack of generalization and overfitting, although in terms of performance they are comparable to support vector machines (SVMs) in multi-class problems (Bosch et al., 2007b).
SVMs, or kernel methods, are a modular framework that can be applied to different tasks by adjusting the kernel function and base algorithm (Schölkopf and Smola, 2002). They have been extensively applied to the pattern classification domain. One of the main advantages of SVMs is that nonlinear decision boundaries can be learned using the so-called kernel trick (Maji et al., 2008). However, the nonlinear property adds to the complexity of the runtime. Linear SVMs, in contrast, offer fast training and classification, and require substantially less memory than methods using nonlinear kernels, owing to the compact representation of the decision function (Maji et al., 2008). Thus, linear-kernel SVMs have become a popular method and have been applied to several online applications (Zhang et al., 2006). They have also been applied to object recognition (Grauman and Darrell, 2005; Lazebnik et al., 2006). The SVM results on the Caltech and Pascal VOC datasets are among the best known (Varma and Ray, 2007; Bosch et al., 2007b). A major drawback of SVMs is that they can only classify data vectors of fixed length, so they do not suit classification tasks dealing with variable-length data. In addition, SVMs have mainly been used in supervised domains, although a number of researchers have recently tried to adapt SVMs to semi-supervised and unsupervised settings (Zhao et al., 2009).
A large number of applications of neural networks (NNs) to pattern recognition have been developed in the past few years. These applications utilize and extend different types of neural-network architecture including multi-layer perceptron (MLP), radial basis function (RBF), self-organizing map (SOM), shared weight neural networks (LeCun et al., 1990) and probabilistic neural networks (PNNs; Musavi et al., 1994; Romero et al., 1997; Specht, 1990). Among these structures, PNNs have become a popular method for classification in various domains due to their ease of training and sound statistical foundation in Bayesian estimation theory (Mao et al., 2000). However, PNNs have a major issue with respect to determining the size of the network and the locations of pattern layer neurons. In PNNs, the pattern layer includes all training samples, of which many may be redundant. The issue of including the redundant samples leads to large network structures. Large network structures are computationally expensive since the computation required for classifying an unknown pattern is proportional to the size of the network (Mao et al., 2000). Moreover, large network structures tend to provide poor generalization in the case of unseen data (Nigrin, 1993). A number of studies have tried to address this issue, for example, Mao et al. (2000) proposed a mechanism that restricts the network size and utilizes a genetic algorithm to find the smoothing parameters.
Deep belief networks (DBNs) of restricted Boltzmann machines (RBMs) have recently been used in pattern classification. DBNs follow a hierarchical structure in which layers at the bottom of the hierarchy extract simple features and feed them to the higher layers, which then are able to detect complex features. There have been various approaches to learning deep networks (Ranzato et al., 2007; Hinton et al., 2006) and they can benefit from advances in both supervised and unsupervised learning. DBNs have been successfully used to learn high-level structures in a wide variety of domains, including handwritten digits (Larochelle et al., 2007). Although DBNs have successfully been used in controlled environments, utilizing them in realistic situations remains difficult due to high dimensionality of images. Lee et al. (2009) proposed a DBN that uses an unsupervised learning method. They adopted the approach suggested by LeCun et al. (1989) to learn features that are common across all locations in an image. The claim is that their model is capable of handling large images using only a small number of feature detectors.
SVMs and RFs do not suit our task since they are generally restricted to supervised scenarios. In addition, LCSs allow for cooperation between rules without the fixed-length data restriction posed by SVMs (Bull et al., 2007). NNs, especially DBNs, have achieved good performance for image classification in an unsupervised manner; however, the rules produced by NNs cannot easily be interpreted by humans. Moreover, LCSs remove redundancy, which is a known issue in PNNs.
The computational overhead of the evolutionary computation component of LCSs results in slower initial off-line training (Huy et al., 2009). However, LCSs can be configured as online, reinforcement learning systems that adapt to changes in the problem domain relatively quickly. These features make LCSs a suitable choice for expanding the range of techniques applied to image recognition.
3 Feature Pattern Classification System
This section introduces various components of the FPCS including the LCS and its parameters. It also describes how Haar-like features were utilized for creating classifier conditions. In addition, it explains how various components of LCSs, including covering, crossover, mutation, and condition matching, were adjusted to function with Haar-like features.
3.1 Learning Classifier System Concept
An LCS represents an agent acting in an unknown environment via a set of sensors for input and a set of effectors for actions. After observing the current state of the environment, the agent performs an action, and the environment provides a reward (Lanzi et al., 2007).
This work utilizes the XCS formulation of LCSs, proposed by Wilson (1995). XCS uses accuracy-based fitness to learn the problem by forming a complete mapping of states and actions to rewards. In addition, XCS evolves more general classifiers subject to an accuracy criterion. This makes XCS a suitable choice for the pattern detection domain.
XCS has two modes of operation: explore and exploit. The former refers to the training period, where the system explores the environment and learns through examples; the latter refers to situations where the system selects the best rules describing the problem. The following provides more details about these modes.
In the explore mode, the system attempts to obtain information about the environment and describe it by creating decision rules. During the explore mode, the system executes the following actions:
Observes the current state of the environment, s.
Selects classifiers from the classifier population that have conditions matching the state s, to form the match set [M].
Performs covering: for every action ai in the set of all possible actions, if ai is not represented in [M], a random classifier that matches s and advocates ai is generated and added to the population.
Forms a system prediction array with an entry P(ai) for every possible action ai. P(ai) is a fitness-weighted average of the payoff predictions of all classifiers advocating ai. The prediction array provides the system's best estimate of the payoff for each action ai. Algorithm 1 shows how the prediction array is calculated in detail.
Selects an action ai to explore (probabilistically or randomly) and collects all the classifiers in [M] that advocate ai to form the action set [A].
Performs the action ai, recording the reward from the environment, r, and uses r to update the predictions of all classifiers in [A].
When appropriate, runs a genetic algorithm (GA) to introduce new classifiers into the population. The GA has two operators: crossover and mutation. In XCS, two parent classifiers are selected from [A] and two offspring are produced by applying crossover to their conditions, such that both offspring match the currently observed state. For mutation, randomly selected features of a classifier condition are mutated to maintain the diversity of the population.
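As an illustration, the fitness-weighted prediction array in the explore steps above can be sketched as follows (the class and function names are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    action: int        # advocated class label
    prediction: float  # estimated payoff
    fitness: float     # accuracy-based fitness

def prediction_array(match_set, actions):
    """Fitness-weighted average payoff prediction P(ai) for each action ai."""
    P = {}
    for a in actions:
        advocates = [cl for cl in match_set if cl.action == a]
        total_fitness = sum(cl.fitness for cl in advocates)
        if total_fitness > 0:
            P[a] = sum(cl.prediction * cl.fitness
                       for cl in advocates) / total_fitness
        else:
            P[a] = None  # no classifier in [M] advocates this action
    return P
```

Weighting by fitness means accurate classifiers dominate the payoff estimate even when less accurate classifiers advocate the same action.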
Each classifier has a field called numerosity. New classifiers are produced during the explore mode and their numerosity values are set to 1. Basically, when a new classifier is produced, the entire population is checked to see whether the new classifier has a similar condition and action as any existing classifier. If that is the case, then the new classifier is not added to the population and the numerosity value of the existing classifier is increased by one.
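A minimal sketch of this numerosity-based absorption, with an illustrative classifier structure:

```python
from dataclasses import dataclass

@dataclass
class Macroclassifier:
    condition: str   # e.g., a ternary string or an encoded feature list
    action: int
    numerosity: int = 1

def insert_into_population(population, new_cl):
    """Add new_cl to the population, or absorb it into an existing
    classifier with the same condition and action by incrementing
    that classifier's numerosity."""
    for cl in population:
        if cl.condition == new_cl.condition and cl.action == new_cl.action:
            cl.numerosity += 1
            return
    population.append(new_cl)
```

This is why the population counts both micro classifiers (numerosity summed) and macro classifiers (distinct rules), a distinction used in the experiments later.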
XCS may also execute subsumption after the GA generates new classifiers. The subsumption mechanism aggregates overly specific classifiers into matching, genotypically more general rules that can subsume them. In addition, subsumption has a delete method, which removes classifiers that have low fitness values.
In contrast, in the exploit mode, the system does not attempt to create new rules or learn alternative rules. The system exploits its best current prediction and adapts the associated rules based on the environmental interaction.
The environment is assumed to have the Markov property, meaning that performing the same action in the same state will result in the same reward. LCSs have been shown to be robust to small amounts of noise and are often more robust than most machine learning techniques with increasing amounts of noise (Butz, 2006). The generalization property in LCSs allows a single rule to cover more than one state provided that the action-reward mapping is similar (more information about the theoretical aspects of LCSs can be found in Drugowitsch, 2008).
3.2 Image Pattern Classification Approaches
Each classifier has a condition that when satisfied results in the classifier being added to the match set. In order to create conditions when dealing with images, two approaches may be considered. These approaches are described in this section.
3.2.1 Naïve Pixel-Based Conditions
To learn compact and general models, LCSs utilize generalized conditions in the individual classifiers. In simple ternary encoding schemes, generalization is achieved using a special don't care symbol (#). Consider simple binary images with black (0) and white (1) pixels, where every 3 x 3 image can be encoded as a string of nine bits. To distinguish images where the center pixel is white from images where the center pixel is black, two classifiers would be sufficient (see Figure 1).
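The ternary example above can be sketched directly (the encoding and helper names are illustrative):

```python
def matches(condition, state):
    """Ternary matching: '#' is don't care; '0'/'1' must equal the state bit."""
    return all(c == '#' or c == s for c, s in zip(condition, state))

# A 3x3 binary image flattened row by row into nine bits; index 4 is the
# centre pixel. Two maximally general classifiers suffice:
white_centre = '####1####'   # matches any image whose centre pixel is white
black_centre = '####0####'   # matches any image whose centre pixel is black
```

Each rule covers 2^8 = 256 distinct images, which is the generalization the don't care symbol provides.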
Significant image differences at the pixel level are a well known problem in computer vision, and it is commonly tackled using some form of feature extraction. In the next section we show how Haar-like features can be used to enable LCS applications for image classification.
3.2.2 Haar-like Feature Conditions
The classifier conditions must allow generalization while remaining accurate, meaning that a classifier must be as general as possible but not over-general. These properties allow a classifier to offer maximally general learning. We argue that the proposed Haar-like multi-feature conditions exhibit both properties.
Generalization. The symbolic encoding gains generalization through the don't care symbol (#). Haar-like features achieve the same effect by ignoring the image information outside of the feature positions and by thresholding the feature values. An extreme case of generalization (equivalent to all #) can be achieved by setting a threshold on a feature such that every feasible image pattern will match.
Accuracy/Specificity. Every condition can be made more specific by adding more features to it. Essential for ensuring this property is the type-zero Haar-like feature introduced here, which simply returns the sum of the pixel intensities within a rectangular region and thus enables very precise thresholding of individual pixel values if needed. An extreme case of specificity, where no generalization is possible, is a set of type-zero single-pixel features that completely describes a single unique image.
In practice, LCS learning attempts to select a good trade-off between the two extremes, as it has evolutionary pressures (Butz, 2006) for both accuracy and generalization, and the Haar-like multi-feature conditions provide sufficient flexibility for the search along this front, as the experimental results suggest.
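A minimal sketch of such a multi-feature condition, assuming matching means every feature value reaches its threshold (the feature constructors and the matching rule here are illustrative, not the paper's code):

```python
def pixel_sum(x, y, w, h):
    """Type-zero Haar-like feature: sum of pixel intensities in a rectangle."""
    def feature(img):
        return sum(img[r][c] for r in range(y, y + h) for c in range(x, x + w))
    return feature

def two_rect_horizontal(x, y, w, h):
    """Two-rectangle difference feature: left half minus right half."""
    left = pixel_sum(x, y, w // 2, h)
    right = pixel_sum(x + w // 2, y, w // 2, h)
    return lambda img: left(img) - right(img)

def condition_matches(condition, img):
    """A multi-feature condition matches when every feature value reaches
    its threshold; fewer features means a more general classifier."""
    return all(feature(img) >= threshold for feature, threshold in condition)
```

Adding a (feature, threshold) pair can only shrink the set of matching images, which is exactly the specificity knob the evolutionary pressures act on.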
3.3 Adjusting LCSs for Image Patterns
In order to use LCSs with images, several adjustments must be made to the standard XCS implementation. These changes include modifying the covering, crossover, mutation, and condition matching components of the LCSs. The following describes the details of these changes.
When performing covering, a random number of features was generated for the condition, randomly selecting feature type, position, scale, and direction, but setting the threshold to the current value of the feature, ensuring that the condition matched the currently observed state.
During uniform crossover of two classifiers, individual feature conditions are moved between the classifiers with equal probability. Since all the features in both classifiers had to match the observed state in order to be selected to the action set, the resulting children will also match the current instance from the environment.
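A sketch of this uniform crossover on feature lists, assuming a condition is simply a list of features (illustrative, not the paper's implementation):

```python
import random

def uniform_crossover(cond_a, cond_b):
    """Uniform crossover on two feature-list conditions: each individual
    feature moves to either child with equal probability. Because every
    feature in both parents matched the observed state, any redistribution
    of those features also matches it."""
    child_a, child_b = [], []
    for feature in cond_a + cond_b:
        (child_a if random.random() < 0.5 else child_b).append(feature)
    return child_a, child_b
```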
Every property of every feature was allowed to mutate randomly, except for the thresholds, where after mutation each threshold value was adjusted to match the observed state if needed. The idea here is that eventually classifier rules would emerge with thresholds for feature values that cover related groups of problem instances.
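The covering and threshold-repair ideas above can be sketched as follows, assuming a condition is a list of (feature, threshold) pairs and matching requires each feature value to reach its threshold (names are illustrative):

```python
import random

def cover(image, feature_pool, max_features=8):
    """Covering sketch: draw a random number of features and set each
    threshold to the feature's value on the observed image, so the new
    classifier is guaranteed to match the current state."""
    n = random.randint(1, max_features)
    return [(f, f(image)) for f in random.choices(feature_pool, k=n)]

def repair_thresholds(condition, image):
    """After mutation, lower any threshold that exceeds the observed
    feature value so the classifier still matches the current state."""
    return [(f, min(t, f(image))) for f, t in condition]
```

Snapping thresholds to observed feature values is what lets rules gradually settle on thresholds that cover related groups of problem instances.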
3.3.4 Condition Validation
In the cases where mutation moved the feature window to an infeasible region in the image, the offspring was subsumed by the parent classifier by increasing the numerosity of the latter.
4 Evaluating FPCS
This section provides the details of the benchmark dataset and experiments. Several experiments were executed to investigate whether the FPCS approach is feasible and maintains the benefits of LCSs. These benefits include human-interpretable rules, where the value of each rule can be judged from its associated statistical parameters, for example, experience, fitness, and prediction.
We selected the problem of handwritten digit classification to test the proposed FPCS with Haar-like features. The MNIST dataset has been widely used by the research community (LeCun et al., 1998; Lee et al., 2009; Bernard et al., 2009; Jarrett et al., 2009). It contains two separate sets of data for training and testing. The training set includes 60,000 example images of all ten handwritten digits (0-9), collected from approximately 250 individuals. The test set contains 10,000 examples written by a different set of individuals. The examples are presented as 28 x 28 pixel grayscale images (pixel intensity values ranging from 0 to 255), centered around the pixel intensity center of mass. The proposed system does not utilize human-constructed preprocessing of the training data, although preprocessing of the images is known to improve results (LeCun et al., 1998). In many papers, this problem has been considered sufficiently complex and studied thoroughly (Bernard et al., 2009; Larochelle et al., 2007; Lee et al., 2009; Ciresan et al., 2011).
4.2 Implementation Details
We have used an implementation of XCS based on the XCSJava project developed by Butz (2006). The code was adjusted to work with image patterns. The adjustments were mainly performed on the following components.
Six types of Haar-like features were designed: a single-rectangle sum (a type-zero feature that, to our knowledge, is not used in other Haar-related approaches), two-rectangle differences (horizontal and vertical), three-rectangle differences (horizontal and vertical), and a four-rectangle difference feature (see Figure 2).
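All six feature types reduce to sums over rectangles, which can be evaluated in constant time with the standard integral-image technique; a sketch of that standard technique (not the paper's code):

```python
def integral_image(img):
    """ii has one extra zero row/column; ii[r][c] is the sum of img[0:r][0:c]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the rectangle at (x, y) of size w x h in constant time.
    This is exactly the type-zero feature; the two-, three-, and
    four-rectangle features are differences of such sums."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```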
4.2.2 Messy Encoding
Figure 3 shows a histogram of the number of features per classifier from a separate trial in which the classifier feature limit was set higher, indicating that most classifiers had eight or fewer features, with the majority having between three and six. Thus we allowed each classifier a random number of up to eight features, with every feature having to exceed its threshold when applied to the image for the classifier condition to match. During mutation, the feature list in a classifier was allowed to shrink or grow (adding a new feature based on the currently observed example) with a small probability.
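A hedged sketch of this variable-length ("messy") feature-list mutation; the resize probability here is an illustrative value, not the paper's:

```python
import random

def mutate_feature_list(condition, image, feature_pool,
                        p_resize=0.04, max_features=8):
    """With a small probability, the variable-length feature list shrinks
    by one feature, or grows with a new feature whose threshold is set on
    the currently observed example (so the classifier keeps matching)."""
    condition = list(condition)
    if random.random() < p_resize:
        if len(condition) > 1 and random.random() < 0.5:
            condition.pop(random.randrange(len(condition)))
        elif len(condition) < max_features:
            f = random.choice(feature_pool)
            condition.append((f, f(image)))  # threshold = current value
    return condition
```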
4.2.3 XCS Inherited Parameters
The system uses the standard parameter set defined for XCS (Butz, 2006): fitness fall-off rate α; prediction error threshold ε0; fitness exponent ν; learning rate β; threshold for GA application in the action set θGA; experience threshold for classifier deletion θdel; fraction of mean fitness for deletion δ; classifier experience threshold for subsumption θsub; crossover probability χ; mutation probability μ (set significantly higher than in typical XCS applications due to the large variability in the Haar-like features); and tournament selection fraction τ.
4.2.4 Population and Generations
The population size was limited to 60,000 classifiers, and the experiments were run for different numbers of generations, specified separately for each experiment. One generation represents one instance of the problem (one state message). While the population size may seem large for the problem at hand, it is common for LCSs. Figure 4 shows the performance of a partial evaluation of the population after a trial run; it demonstrates that the fittest classifiers are responsible for the large majority of the performance.
All the experiments were repeated 30 times, and the reported numbers are averages, with standard deviations where applicable.
4.3 Measuring FPCS Accuracy
This experiment was designed to demonstrate the recognition performance of the FPCS in an off-line scenario. The FPCS was executed on the training data of the MNIST dataset for 4,000,000 generations, requiring 15-20 hours for each of the 30 runs. Figure 5 shows the behavior of the classification performance on the training data and the relative population size. As Figure 5 shows, the system reaches 80% accuracy relatively quickly (after 500,000 generations) but requires more time to reach 90% accuracy. During this period, the system combines more specific rules into more general and accurate rules (micro classifiers vs. macro classifiers). As can be read from the diagram, the population of rules reaches 42,000 (70% of the 60,000 limit) at its peak but then drops to 33,000 (55% of the limit) by the end of the experiment. This is due to the system forming macro classifiers with numerosity greater than one. Once the rules are learned, the system can classify in real time. We applied the learned rules of this experiment to the MNIST test data; the FPCS achieved a 91.1% overall classification rate on the unseen test set in less than a minute.
4.4 Human-Interpretable Rules
Figure 6 shows example classifiers learned in the previous experiment as feature images. It must be noted that some classifier conditions are intuitively interpretable and target the regions of high contrast where the curves of handwritten digits will consistently pass through, while others are harder to interpret yet are useful to the system due to their collaborative nature.
Figure 7 shows an example of the rules learned by the FPCS. Here Exp represents the experience of the classifiers, that is, the number of times a rule has been used. N is the numerosity, F is the fitness, and P is the classifier's prediction. The rule condition includes HAAR types (Haar-like features described in Figure 2) followed by their position, scale, and direction. The first rule (r1) has been used and generated 21 and 19 times, respectively; it therefore represents a relatively generic and experienced rule. Its high fitness value in combination with high experience and numerosity suggests that this is a general and accurate rule. The next rule, r2, is less experienced and less generic than r1 but still very accurate (F = 0.8101). Its high number of features may be the reason for its low experience, as all thresholds must be reached for a match. Rule r3 is an experienced and relatively generic rule but not very accurate. One can infer that the condition of this rule may be over-general and therefore occasionally matches different digit classes. The last rule, r4, represents an extremely experienced and generic rule, as its high experience is coupled with a high prediction.
4.5 Examining FPCS in an Online Scenario
This experiment was designed to demonstrate the online learning capability of the FPCS. In this experiment, the agent must adapt to dynamic situations where new classes of the problem are introduced to the system at runtime. An instance of the FPCS was executed where initially only two digit classes, 0 and 1, which are learned easily, were introduced to the system. Subsequent digit classes were added to the training every 200,000 generations. Figure 8 shows the performance of the FPCS. The online nature of LCSs enabled the system to partially recover from the performance drops caused by previously unseen classes of examples.
5 Automatic Adjustment of Rotation Angle in Unaligned Images
Although the performance of the FPCS is reasonably good (around 91%), it does not reach that of some other benchmarks (for a comparison with other techniques, refer to Table 1, discussed in Section 6). We hypothesized that, since most people write on a slant, being able to autonomously learn to adapt to an image's angle would improve performance. To test this hypothesis, the angle was modeled as a precondition in the classifiers' rules: when a classifier condition is examined for an image, the image is first rotated by the angle specified in the precondition, and then the extracted feature values are compared with the classifier's rule condition.
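A minimal sketch of such an angle precondition, using a simple nearest-neighbour rotation as a stand-in for whichever rotation routine is actually used (all names here are illustrative):

```python
import math

def rotate(img, degrees):
    """Nearest-neighbour rotation about the image centre (illustrative)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    a = math.radians(degrees)
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # inverse mapping: find the source pixel for each target pixel
            sr = round(cy + (r - cy) * math.cos(a) - (c - cx) * math.sin(a))
            sc = round(cx + (r - cy) * math.sin(a) + (c - cx) * math.cos(a))
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = img[sr][sc]
    return out

def matches_with_angle(classifier, img):
    """Rotate by the classifier's angle precondition, then evaluate its
    multi-feature condition on the rotated image."""
    rotated = rotate(img, classifier["angle"])
    return all(f(rotated) >= t for f, t in classifier["condition"])
```

Because the angle sits in the rule itself rather than in a preprocessing step, the learned angles remain inspectable alongside the rest of the rule, which is what enables the distribution analysis in Section 5.2.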
| Depiction | System | Method | Error rate | Learning |
| --- | --- | --- | --- | --- |
| Subsymbolic | LeCun et al. (1998) | 1-layer NNs | 12% | Supervised |
| Subsymbolic | Larochelle et al. (2007) | Polynomial SVMs | 3.69 ± 0.17% | Supervised |
| Subsymbolic | Lee et al. (2009) | DBNs | 0.80% | Unsupervised |
| Subsymbolic | Ciresan et al. (2011) | Convolutional nets | 0.27 ± 0.02% | Unsupervised |
| Haar | Fleuret and Sahbi (2003) | SVMs | 3.93% | Supervised |
5.1 Automatically Learning Angles of Images
We performed an experiment that enables classifiers to automatically learn the angles of images. When covering to create the match set, the system constructed classifiers with angle preconditions drawn from a bounded range of discrete increments. Mutation and crossover were allowed to act on the precondition, but the precondition was always applied rather than being subject to the match method. The experiment was run for 12,000,000 generations, requiring 10 to 12 days for each of the 30 runs. The number of generations is higher than in the previous experiments because learning the angle increases the complexity of the problem, so the agent requires more time to discover optimal rules; rotating the images prior to classification is also time-consuming. Figure 9 shows the behavior of the FPCS in an off-line scenario when the agent autonomously learned classifiers' angles. The system reached 99% accuracy on the training set after 1,000,000 generations and continued performing at the same level until the end of the experiment (12,000,000 generations in total). The rules produced after 12,000,000 generations achieved 95% accuracy on the MNIST test data (the decrease from 99% training performance indicates a lack of generalization, possibly caused by overfitting).
The original implementation of the FPCS was executed for 12,000,000 generations, so the accuracy of the rotation-enabled FPCS can be compared with the results of the original FPCS. The original FPCS achieved 94% accuracy after 12,000,000 generations. This suggests that the original Haar-like features can cope with rotation and that the additional rotation does improve test performance slightly. In addition, the slight overfitting observed in Figure 9 is likely the result of imprecise rotation of features.
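Haar-like feature conditions of the kind the FPCS evolves can be evaluated cheaply via the standard integral-image trick (Viola and Jones, 2001). A minimal pure-Python sketch, with function names of our own choosing:

```python
def integral_image(img):
    """ii[r][c] = sum of img[0..r][0..c] (inclusive)."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r][c] = row_sum + (ii[r - 1][c] if r > 0 else 0)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1][c0:c1] in O(1) using the integral image."""
    s = ii[r1 - 1][c1 - 1]
    if r0 > 0:
        s -= ii[r0 - 1][c1 - 1]
    if c0 > 0:
        s -= ii[r1 - 1][c0 - 1]
    if r0 > 0 and c0 > 0:
        s += ii[r0 - 1][c0 - 1]
    return s

def haar_two_rect(ii, r, c, h, w):
    """Two-rectangle (left/right) Haar-like feature: left half minus right half."""
    left = rect_sum(ii, r, c, r + h, c + w // 2)
    right = rect_sum(ii, r, c + w // 2, r + h, c + w)
    return left - right
```

Note that the integral image must be recomputed after every rotation of the input, which is one source of the extra training time reported above.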
5.2 Distribution of Classifiers’ Angles
The interpretable rules enable humans to gain insight into the problem by analyzing them. If it is assumed that the majority of the MNIST subjects were right-handed and write left-to-right, then a skew in the distribution of angles could have been expected. However, the results show that this assumption does not hold for the MNIST dataset.
Figure 10 shows the distribution of angles learned by the classifiers after 12,000,000 generations. Our analysis reveals a normal distribution of the learned angles around 0° (upright), which is interesting because the LCS could have selected any angle in the allowed range to rotate the images, so the distribution could have taken any shape, for example, flat. Therefore, the FPCS is capable of automatically learning the distribution of handwritten digit angles.
5.3 Wrapper-Based Approach to Learning Angles
A wrapper-based approach for learning the classifiers' angles was also implemented, with the hope of producing the best classifiers more quickly. In this approach, a wrapper was developed around the classifier rules produced by the original implementation of the FPCS (2,500 rules). Since those rules were the final product of the FPCS, the genetic component of the wrapper that performs mutation and crossover on the rules' conditions was turned off. In addition, angle preconditions (with GA) were added so that the optimum preconditions (angles) could be learned.
We doubled the rules and assigned a 0° (upright) angle to the introduced precondition of half of the classifiers, and normally distributed angles to the rest. The system was run for additional generations on the training set. Using this approach, only 91% accuracy was achieved on the test set. This two-stage wrapper approach might have been more effective if the initial training had been on upright images only. It is considered that the Haar-like feature condition rules can already capture angled patterns in an image, so rotating those rules is not beneficial. Instead, generating features and their associated rotation concurrently produced better results.
5.4 Testing the Impact of the Learning on FPCS
In order to demonstrate how learning angles for classifiers affected the FPCS's ability to recognize digits written on an angle, DigitApp was developed. DigitApp is a human interface that interacts with the rules. It allows users to draw a digit on a frame and uses the classifiers produced by the two versions of the FPCS (the original FPCS and the FPCS that automatically learns angles). It is noted that once the classification has been learned, the rules classify in real time.
Figures 11 and 12 show snapshots of DigitApp. Each instance of DigitApp shows two tables: original and angled. The original table shows the system's digit recognition values when using the original FPCS, and the angled table shows the same values when using the FPCS that learned angles for its classifiers. The first column of each table contains the digits ordered by their confidence values, and the second column lists the confidence values calculated by the systems.
Figure 11 shows how the system recognizes the digit 7 when the digit is written upright. In that position, both systems recognize the digit with high confidence values. The image was then rotated. As Figure 12 shows, the confidence values for both systems dropped. The original FPCS has not been trained to learn angles, and therefore the fall in its confidence value is expected. The drop in the confidence values of the new system (angled) is due to the increase in the complexity of the problem (the system had to learn angles). Despite the drop in confidence values, the FPCS that used active angle learning is more accurate at recognizing the rotated 7 than the original FPCS. In fact, this pattern also holds for the other digits (classes). The ensemble nature of LCSs is evident in the per-class confidence values, which could be beneficial in problem domains where decisions are based on digit recognition, for example, speed sign recognition.
6 Comparison of Various Classification Techniques
Table 1 shows the comparison of the proposed method to the performance of other known systems on the MNIST dataset. According to this table, subsymbolic approaches generally perform well on the MNIST dataset. However, it must be noted that most of these methods are supervised and therefore can only be applied to off-line scenarios. DBNs and convolutional networks are two examples of the subsymbolic methods that have achieved high accuracy and can be employed in unsupervised scenarios. However, these methods do not produce human-interpretable rules, and the knowledge built by the system is not readily available for human interpretation.
The FPCS utilizes reinforcement learning and is therefore suited to online scenarios. The original FPCS and the version that automatically adjusts the rotation angle take 15–20 hours and 10–12 days, respectively, on a single machine for training. Moreover, the FPCS offers human-analyzable rules; none of the other systems listed in the table offers this property. Human-interpretable rules are particularly useful when dealing with more complex problems where in-depth knowledge of the system is necessary for a better understanding of the problem.
7 Discussion and Future Work
This work sought to adapt a technique that has hypothesized benefits to a domain, rather than achieving the highest classification accuracy regardless of other constraints and objectives. It focused on adapting LCSs to the image recognition task, which has several advantages including human-interpretable rules. This feature enabled the interpretation of the statistics and values associated with the classifiers. The statistical values are an inherent attribute of the LCS technique and are extremely useful in advancing the understanding of the problem and learned solutions.
The investigation revealed that LCSs deliver on the promise of forming a generalizing model using human-interpretable rules. The current Haar-like multi-feature approach learns descriptions of patterns that are comparable to other Haar-related approaches, such as the AdaBoost cascade (Viola and Jones, 2001). AdaBoost algorithms create a set of classifiers by maintaining a set of weights over the training set and adjusting the weights after each boosting iteration. They are capable of creating generalized rules. However, the accuracy-diversity dilemma is known in AdaBoost: as the accuracy of two classifiers increases, there is less chance that they will disagree (Dietterich, 2000; Li et al., 2008). This affects the balance between accuracy and diversity, since AdaBoost only yields good generalization performance when there is a balance between the two.
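For contrast with the LCS credit-assignment scheme, the AdaBoost reweighting step described above can be sketched as a standard AdaBoost.M1 round (this is textbook boosting, not code from the cited works):

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost.M1 reweighting round: compute the weighted error of the
    current weak learner, derive its vote weight alpha, up-weight the
    misclassified examples, and renormalize the distribution."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    z = sum(new)
    return alpha, [w / z for w in new]
```

After each round the misclassified examples carry more weight, forcing the next weak learner to disagree with its predecessors; as the weak learners become individually accurate, such disagreement becomes rare, which is exactly the accuracy-diversity tension noted above.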
The ability of the FPCS in an online learning scenario was demonstrated, as it could adapt to newly introduced classes. In online learning methods, the system automatically identifies changes in the environment state, so there is no need for human operators to determine when supervised off-line training must be resumed.
Several studies that have achieved high performance on the MNIST dataset have preprocessed the images so that they benefit from realignment. We studied how the FPCS can benefit from autonomously learning such variation factors. Specifically, we studied learning the angles of images by adding a precondition to the classifier rules and showed that it improved the accuracy of image classification. The implementation of DigitApp demonstrated how the system's recognition of images written on a slant was improved.
The FPCS takes 15–20 hours to train. This increased to 10–12 days when the complexity of the problem was increased by having the system adapt to image angles. If the application domain is known in advance, this slow training is acceptable, since the evolved rules execute very fast (milliseconds). However, in online applications with novelty, for example, learning features in a new disaster zone, this technique would not be appropriate.
Further extensions to the FPCS will enable the system to be adapted for online scenarios, such as real-time speed sign recognition. Furthermore, other image feature types should be explored to determine which can be used most effectively with LCSs. Finally, this foundational work enables future versions to include scaling and translation.
Although the aim of this work was to develop human-interpretable methods and analysis for visual pattern recognition, it is worth considering how to improve performance on the given dataset. A supervised version of the LCS, for example, based on Bernadó-Mansilla and Garrell-Guiu (2003), could be implemented; it would have the advantage of being able to repair incorrect rules toward the known class or cover gaps in knowledge. It would still offer human-interpretable rules, but it would be unlikely to have speed advantages (evolutionary computation is often slower than subsymbolic approaches). Many of the high-performing methods use the 'kernel trick' to alter the feature dimensions, which may prove an interesting avenue for feature manipulation in LCSs.
This work investigated how LCSs can be adjusted to work in the pattern recognition domain. Our investigations show that the LCS technique can be successfully applied to the field of pattern classification, demonstrating novel functionality and promising results. The generalization capability of the LCS, in combination with the messy encoding, enabled the formation of compact, accurate, and general classifiers.
The human-interpretable nature of the production rules, which is anticipated to be required in many real-world domains, was assisted by the flexible encoding. This feature helps humans gain in-depth knowledge of systems. In addition, the FPCS was enabled to automatically adjust the rotation angle in unaligned images; this automatic rotation alignment improved the recognition accuracy of the FPCS on the MNIST dataset. This foundational work enables future versions to include scaling and translation. To further improve the classification rate of the FPCS, more high-level features capable of capturing crucial structures, such as curves and angles, may prove useful. The transfer of knowledge from off-line learning in similar domains to online scenarios will be investigated to leverage the FPCS's online learning abilities while mitigating the long training times.
Note that generations in LCSs differ from those in many other forms of evolutionary computation, where in each generation all examples are evaluated.