## Abstract

A main research direction in the field of evolutionary machine learning is to develop a scalable classifier system to solve high-dimensional problems. Recently, work has begun on autonomously reusing learned building blocks of knowledge to scale from low-dimensional problems to high-dimensional ones. An XCS-based classifier system, known as XCSCFC, has been shown to be scalable, through the addition of expression tree–like code fragments, to a limit beyond standard learning classifier systems. XCSCFC is especially beneficial if the target problem can be divided into a hierarchy of subproblems and each of them is solvable in a bottom-up fashion. However, if the hierarchy of subproblems is too deep, then XCSCFC becomes impractical because of the needed computational time and thus eventually hits a limit in problem size. A limitation in this technique is the lack of a cyclic representation, which is inherent in finite state machines (FSMs). However, the evolution of FSMs is a hard task owing to the combinatorially large number of possible states, connections, and interactions. Usually this requires supervised learning to minimize inappropriate FSMs, which for high-dimensional problems necessitates subsampling or incremental testing. To avoid these constraints, this work introduces a state-machine-based encoding scheme into XCS for the first time, termed XCSSMA. The proposed system has been tested on six complex Boolean problem domains: multiplexer, majority-on, carry, even-parity, count ones, and digital design verification problems. The proposed approach outperforms XCSCFA (an XCS that computes actions) and XCSF (an XCS that computes predictions) in three of the six problem domains, while the performance in others is similar. In addition, XCSSMA evolved, for the first time, compact and human-readable general classifiers (i.e., solving any *n*-bit problems) for the even-parity and carry problem domains, demonstrating its ability to produce scalable solutions using a cyclic representation.

## 1 Introduction

A learning classifier system (LCS) is a rule-based online learning system (Holland et al., 2000; Bull and Kovacs, 2005) that uses an evolutionary mechanism, usually a genetic algorithm (GA) (Goldberg, 1989), to evolve classifier rules and evaluates the utility of the rules using a machine learning technique. The LCS technique can learn high-dimensional problems based on previous experience from low-dimensional problems in the same domain (Iqbal et al., 2014). However, as the problem increases in dimensionality, resulting in increased search space, it becomes difficult to solve because of the needed resources. By explicitly feeding domain knowledge to an LCS, scalability can be achieved, but it adds bias and restricts use in multiple domains (Ioannides and Browne, 2007).

### 1.1 Goals

A finite state machine (FSM) is a mathematical model that may contain cyclic relations and can be used to represent any finite state system (see Section 2.1). The advantage of cyclic representations is that they can map repeated patterns in a compact form. The aim of this work is to introduce a cyclic representation into XCS through a state-machine-based encoding scheme, for the first time, in an attempt to develop a scalable classifier system to solve high-dimensional problems.

The proposed system is tested on the multiplexer, the majority-on, the carry, the even-parity, the count ones, and the digital design verification problem domains. We employ these Boolean problem domains in this paper because they permit testing the scalability of a system against a series of increasingly difficult yet formally related problems. In addition, they are complex problem domains having epistatic, overlapping, niche imbalance, cyclic regularity, and nonpredictive attributes (see Section 4). Each problem domain has a subset of these properties. Such properties can be found in real-world problems, but real-world application is beyond the scope of this paper. The proposed system is expected to be useful in scaling problems that are cyclic in nature, e.g., even-parity problems. However, we also need to test its effectiveness in noncyclic problems, e.g., multiplexer problems, in case the introduced methods hinder performance in such problems. The results are compared with XCSCFA (Iqbal et al., 2013c) and XCSF (Wilson, 2002), which are XCS-based systems where the former computes a classifier’s action and the latter computes a classifier’s prediction.^{1}

### 1.2 Organization

The rest of the paper is organized as follows. Section 2 describes the necessary background in finite state machines and learning classifier systems. In Section 3 the novel implementation of XCS using state machine action is detailed. Section 4 introduces the problem domains and experimental setup used in experimentation for this work. In Section 5 experimental results are presented and compared with XCSCFA and XCSF. Section 6 is a comprehensive discussion explaining the effectiveness of the system resulting from the combination of XCS and FSM. In the last section this work is concluded and the future work is outlined.

## 2 Background

The following sections briefly describe state machines and learning classifier systems as they are directly related to the work presented here, and introduce scalability in evolutionary machine learning techniques.

### 2.1 Finite State Machines

A finite state machine (FSM), also known as a finite state automaton (FSA), is a mathematical model of computation that can be used to model any finite state system. FSMs have been used to design sequential logic circuits as well as algorithms for different computational tasks such as pattern matching, sequence prediction, communication protocols, and parsing (Anderson, 2006). In general, a finite state machine consists of a finite set of states and can be in only one state at any given time, called the *current state*. On receiving an input, the machine can change its current state and/or cause an action or output to take place for any given change. One of the states is labeled *start state*, which is used as the current state in the beginning while processing the input.

There are many state machine modeling techniques, for instance, deterministic finite automata (DFA), nondeterministic finite automata (NFA), Mealy machines, Moore machines, pushdown automata, and Turing machines (Anderson, 2006). The state machine used in the work presented here is the Moore machine (Moore, 1956), because it is simple and suitable for modeling classification problems. A Moore machine is formally defined as a six-tuple *M* = (*Q*, Σ, Δ, δ, λ, *q*_{initial}), where *Q* is a finite set of states, Σ is a finite set of input symbols, Δ is a finite set of output symbols, δ: *Q* × Σ → *Q* is a transition function from a state and an input to a next state, λ: *Q* → Δ is an output function from a state to an output, and *q*_{initial} ∈ *Q* is the start state.

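For example, the state machine shown in Figure 1 has the state set *Q* = {*q*_{0}, *q*_{1}, *q*_{2}}, the input alphabet Σ = {0, 1}, the output alphabet Δ = {0, 1}, and start state = *q*_{0}; its output function λ assigns one output value to each of the three states, and its transition function δ specifies six transitions, one for each pair of state and input symbol, as drawn in the figure. Here a circle denotes a state along with the corresponding output value, an arrow represents a transition for a given input symbol, and the start state is marked with an unconnected input arrow.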

Usually, a Moore machine outputs a string of output symbols on receiving a string of input symbols. For example, if the input is “10100”, then the Moore machine depicted in Figure 1 will produce “11101” as the output string. Figure 2 shows the step-by-step details of processing the input string “10100” into the output string “11101”. In the work presented here, a Moore machine is adapted for classification problems. To use a Moore machine for classification, the value of the last state visited while processing the input string is taken as the only output instead of a whole output string. Therefore the class of input string “10100” will be 1, whereas the string “10101” will belong to class 0 if processed via the machine shown in Figure 1.
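
The difference between the full output string and the single class label can be sketched in a few lines of Python. The two-state parity machine and the names `MooreMachine`, `run`, and `classify` are assumptions made for this illustration; this is not the machine of Figure 1, whose transitions are given only in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MooreMachine:
    """A Moore machine: outputs are attached to states, not to transitions."""
    start: str
    output: dict      # state -> output symbol
    transition: dict  # (state, input symbol) -> next state

    def run(self, inputs: str) -> str:
        """Return the outputs of the states entered after each input symbol."""
        state, produced = self.start, []
        for symbol in inputs:
            state = self.transition[(state, symbol)]
            produced.append(self.output[state])
        return "".join(produced)

    def classify(self, inputs: str) -> str:
        """Class = output of the last state visited (start state if no input)."""
        out = self.run(inputs)
        return out[-1] if out else self.output[self.start]


# Hypothetical even-parity machine: state "even" outputs 1, state "odd" outputs 0.
PARITY = MooreMachine(
    start="even",
    output={"even": "1", "odd": "0"},
    transition={
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd", "0"): "odd", ("odd", "1"): "even",
    },
)

print(PARITY.run("10100"))       # whole output string, one symbol per input
print(PARITY.classify("10100"))  # class label taken from the last state only
```

Because the two states form a cycle, the same machine classifies inputs of any length, which is the property the cyclic representation is intended to exploit.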

In the early 1960s, Fogel et al. (1966) evolved state machines using evolutionary programming to predict symbol sequences. Subsequently, different evolutionary approaches were used to evolve FSMs, such as genetic algorithms (Inagaki, 2002), genetic programming (Benson, 2000), evolutionary strategies (Spears and Gordon-Spears, 2003), and hill climbing (Lucas and Reynolds, 2007). All these evolutionary techniques produce a state machine as a single solution rather than as a set of cooperative rules as in an LCS. The main aim of this study is to introduce a cyclic representation in LCS rather than to develop a competitor for state machine evolution.

### 2.2 Learning Classifier Systems

Traditionally, an LCS represents a rule-based agent that incorporates evolutionary computing and machine learning to solve a given task by interacting with an unknown environment via a set of sensors for input and a set of effectors for actions (Holland et al., 2000; Bull and Kovacs, 2005). After observing the current state of the environment, the agent performs an action, and the environment provides a reward. The goal of an LCS is to evolve a set of classifier rules that collectively solve the problem. The generalization property in LCS allows a single rule to cover more than one environmental state provided that the action-reward mapping is similar. Usually, generalization in an LCS is achieved by using a special “don’t care” symbol (“#”) in classifier conditions, which matches any value of a specified attribute in the vector describing the environmental state. LCS can be applied to a wide range of problems, including data mining, control, modeling, and optimization problems (Bull, 2004; Shafi et al., 2009; Behdad et al., 2012).

#### 2.2.1 XCS

XCS (Wilson, 1995) is a formulation of LCS that uses accuracy-based fitness to learn the problem by forming a complete mapping of states and actions to rewards. In XCS the learning agent evolves a population of classifiers, where each classifier consists of a rule and a set of associated parameters estimating the quality of the rule. Each rule is of the form “if *condition* then *action*,” having two parts: a condition and the corresponding action. Commonly, the condition is represented by a fixed-length bit string defined over the ternary alphabet {0, 1, #}, where “#” is the “don’t care” symbol, which can be either 0 or 1; and the action is represented by a numeric constant. Each classifier has three main parameters: (1) prediction *p*, an estimate of the payoff expected from the environment if its action is executed; (2) prediction error ε, an estimate of the errors between the predicted payoff and the actually received reward; and (3) fitness *F*, an estimate of the classifier’s utility.
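
As a minimal sketch of how a ternary condition is matched against a binary state, the following Python functions may help; the function names are illustrative and not taken from any XCS implementation.

```python
def matches(condition: str, state: str) -> bool:
    """True if every non-'#' position of the condition equals the state bit."""
    if len(condition) != len(state):
        raise ValueError("condition and state must have equal length")
    return all(c == "#" or c == s for c, s in zip(condition, state))


def generality(condition: str) -> int:
    """Number of "don't care" positions; more '#' means a more general rule."""
    return condition.count("#")


print(matches("1#0#", "1111"))  # False: the third position requires 0
print(matches("1#0#", "1100"))  # True
print(generality("1#0#"))       # 2
```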

In the following, XCS operations are concisely described. For a complete description, the interested reader is referred to the original XCS paper by Wilson (1995), and to the algorithmic details by Butz and Wilson (2002).

XCS operates in two modes, explore and exploit. In the explore mode, the agent attempts to obtain information about the environment and describes it by creating the decision rules, using the following steps:

1. Observes the current state *s* of the environment, usually represented by a fixed-length bit string defined over the binary alphabet {0, 1}.
2. Selects classifiers from the classifier population [P] that have conditions matching the state *s*, to form the match set [M].
3. Performs covering: for every action *a*_{i} in the set of all possible actions, if *a*_{i} is not represented in [M], then a random classifier is generated with a given generalization probability such that it matches *s* and advocates *a*_{i}, and added to the set [M] as well as to the population [P].^{2}
4. Forms a system prediction array, *P*(*a*_{i}) for every *a*_{i}, that represents the system’s best estimate of the payoff should the action *a*_{i} be performed in the state *s*. Commonly, *P*(*a*_{i}) is a fitness-weighted average of the payoff predictions of all classifiers advocating *a*_{i}.
5. Selects an action *a* to explore (probabilistically or randomly) and selects all the classifiers in [M] that advocated *a* to form the action set [A].
6. Performs the action *a*, records the reward *r* received from the environment, and uses *r* to update the associated parameters of all classifiers in [A].
7. When appropriate, implements rule discovery by applying an evolutionary mechanism in the action set [A], to introduce new classifiers to the population.
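
The prediction array and action set formation in the steps above can be sketched as follows. The `Classifier` record and the numeric values are hypothetical and chosen only to make the arithmetic easy to follow; parameter updates and rule discovery are not shown.

```python
from collections import namedtuple

# Hypothetical minimal classifier record: condition, action, prediction p, fitness F.
Classifier = namedtuple("Classifier", "condition action prediction fitness")


def prediction_array(match_set):
    """Fitness-weighted average of the predictions for each action in the match set."""
    weighted, total = {}, {}
    for cl in match_set:
        weighted[cl.action] = weighted.get(cl.action, 0.0) + cl.prediction * cl.fitness
        total[cl.action] = total.get(cl.action, 0.0) + cl.fitness
    return {a: weighted[a] / total[a] for a in weighted if total[a] > 0}


def action_set(match_set, action):
    """All matching classifiers that advocate the chosen action."""
    return [cl for cl in match_set if cl.action == action]


match_set = [
    Classifier("1#", 0, 1000.0, 0.75),
    Classifier("##", 0, 500.0, 0.25),
    Classifier("1#", 1, 0.0, 0.5),
]
pa = prediction_array(match_set)
best = max(pa, key=pa.get)  # the choice made in exploit mode
print(pa, best, len(action_set(match_set, best)))
```

In explore mode the action would instead be drawn probabilistically or at random, as stated in the step list.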

Additionally, the explore mode may perform subsumption deletion to subsume more specific classifiers by a general and accurate classifier. Subsumption deletion is a way of biasing the genetic search toward more general but still accurate classifiers (Butz et al., 2004). It also effectively reduces the number of classifier rules in the final population (Kovacs, 1996).

In contrast to the explore mode, in the exploit mode the agent does not attempt to discover new information and simply performs the action with the best predicted payoff. The exploit mode is also used to test learning performance of the agent in the application.

#### 2.2.2 LCS with Rich Encoding Schemes

Various rich encoding schemes have been investigated in LCS to represent high-level knowledge in an attempt to improve the generalization, to obtain compact classifier rules, to reach the optimal performance faster, to generate feature extractors, and to investigate scalability of the learning system (Iqbal et al., 2013c). Most of these schemes have been implemented on Wilson’s XCS, which is a well-tested LCS model.

A genetic programming (GP) (Poli et al., 2008) based rich encoding was used by Ahluwalia and Bull (1999) within a simplified strength-based LCS (Wilson, 1994). They used binary strings to represent the condition and symbolic expression to represent the action of a classifier rule. This GP-based LCS generates filters for feature extraction rather than performing classification directly. The extracted features are used by the *k*-nearest neighbor algorithm to perform classification.

Lanzi extended the fixed-length bit string representation of a classifier condition to a variable-length messy coding in XCSm (Lanzi, 1999). This messy coding of classifier conditions improved the portability of the behaviors learned between different agents in the *Maze4* environment. Then Lanzi and Perrucci (1999) enhanced messy coding to a more complex representation in which LISP symbolic expressions were used to represent a classifier condition in XCSL. In XCSL the subsumption deletion was not used because of the complexity of determining whether a classifier is more general than another one in such an alphabet.

Wilson extended XCS with a computed prediction, calculated from the input *x* matched by the classifier condition and a weight vector *w*, to learn approximations to functions (Wilson, 2002). In the implemented system, known as XCSF, the classifier condition was changed from a ternary alphabet string to a concatenation of interval-based numeric values. In XCSF the prediction of a classifier is usually computed as given in Equation (1):

$$p(x) = w_0 x_0 + \sum_{i=1}^{n} w_i x_i \tag{1}$$

Here *x*_{0} denotes a constant input parameter, and *n* is the length of the input message. The weights are evolved using recursive least squares, as described by Lanzi et al. (2007).
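
A linear computed prediction of this kind can be sketched in Python as below. The function name and the default value of the constant input `x0` are assumptions for illustration, and the recursive least squares weight update is not shown.

```python
def xcsf_prediction(weights, inputs, x0=1.0):
    """Linear prediction w_0*x_0 + sum_i w_i*x_i for one classifier."""
    if len(weights) != len(inputs) + 1:
        raise ValueError("need one weight per input plus one for the constant x0")
    return weights[0] * x0 + sum(w * x for w, x in zip(weights[1:], inputs))


# Two-bit input message (n = 2), so three weights are required.
print(xcsf_prediction([100.0, 2.0, -3.0], [1.0, 0.0]))  # 100*1 + 2*1 - 3*0
```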

Bull and O’Hara (2002) developed XCS-based neuro and neuro-fuzzy classifier systems (named X-NCS and X-NFCS), where a condition-action rule was represented by a small neural network; and the action value of a classifier rule was computed by feedforwarding the environmental state to the neural network. Experimental results indicate that neural network based classifier systems are able to learn various single-step as well as multistep problems (Bull and O’Hara, 2002; Hurst and Bull, 2006).

Lanzi (2003) developed an XCS with stack-based genetic programming where the classifier conditions were represented by mathematical expressions using reverse Polish notation (RPN). The system did not restrict the generation of syntactically incorrect conditions; therefore the search space was unnecessarily large.

Lanzi et al. (2005) used XCSF for the learning of Boolean functions. They showed that XCSF can produce more compact classifier rules as compared with XCS, since the use of computed prediction allows more general solutions (Lanzi et al., 2007). Subsequently, Lanzi and Loiacono (2006) extended XCSF to XCSFNN by computing classifier predictions using neural networks. XCSFNN generally produced more condensed populations than XCSF, and it outperformed XCSF in learning the problems in which the target payoff surfaces were highly nonlinear.

Lanzi and Loiacono (2007) introduced a version of XCS with computed actions, named XCSCA, to be used for problem domains involving a large number of actions. The classifier action was computed using a parametrized function in a supervised fashion. XCSCA borrows the idea of supervised learning from UCS (Bernadó-Mansilla and Garrell-Guiu, 2003), and the idea of action mappings from XCSF (Wilson, 2002). In XCSCA classifiers have no prediction because, as in UCS, there is no incoming reward. Consequently, XCSCA does not produce a complete action map; rather it only evolves the correct output function. The reported experimental results show that XCSCA can evolve accurate and compact representations of various binary functions, which would be difficult to solve using the base XCS technique (Lanzi and Loiacono, 2007).

Wilson (2008) implemented classifier conditions using gene expression programming (GEP). GEP-based conditions captured and showed greater insight into the environment’s regularities than traditional methods, although the evolution of the system was slower. Dam et al. (2008) implemented a UCS-based classifier system by replacing the typically used numeric action in a classifier rule with a neural network. The developed system, known as NLCS, resulted in better generalization, more compact solutions, and the same or better classification accuracy than a numeric action based UCS in the tested problems.

Preen and Bull (2009) investigated the use of discrete dynamical genetic programming within XCS. In the developed system, known as dDGP-XCS, condition-action rules were represented by random Boolean networks with cyclic connections included. The number of nodes in a network was set equal to the sum of the number of input fields for the given task and its outputs, plus one other node to handle matching of the current input. A rule was said to match the current input if the extra “matching” node in the network was in state 1. The reported results show that dDGP-XCS evolved ensembles of dynamical Boolean function networks to solve the 6-bit multiplexer and a number of maze problems. Bull and Preen (2009) also explored the use of cyclic random Boolean networks in YCS (Bull, 2005), which is a simple version of XCS (Wilson, 1995). The number of nodes in a network was set equal to the number of input fields for the given task. A rule was said to match the current input if a fraction of nodes within the given rule were in state 0 or in state 1. In the former case the matched rule advocated action 0 and in the latter case action 1. The reported results show that the developed system solved the 6-bit and 11-bit multiplexer problems by evolving ensembles of dynamical Boolean function networks.

### 2.3 Scalability in Evolutionary Machine Learning

A main research direction in the field of evolutionary machine learning is to develop a scalable system to solve high-dimensional problems. For complex, large-scale problems, the standard monolithic evolutionary machine learning techniques may not find a solution owing to the large search space leading to an intractable problem. In order to solve large-scale problems, feature selection/construction (Xue et al., 2015; Krawiec, 2002) and layered learning (de Garis, 1990; Asada et al., 1996) have been investigated in the evolutionary machine learning community.

Feature selection aims at reducing the dimensionality of the data by removing redundant/irrelevant features in an attempt to improve the scalability of an algorithm in real-world problems (Xue et al., 2015). However, in high-dimensional problems feature selection is a difficult task because of the large search space. Feature construction aims at combining potentially useful features into new features in an attempt to improve the performance of an algorithm (Krawiec, 2002). Krawiec (2004) considered the feature construction task as an optimization problem and used evolutionary computation to effectively search the solution space. A variant of GP was used to represent an individual in the solution space. Also proposed was a coevolutionary methodology to decompose the feature construction task using cooperative coevolution. The reported results show that the proposed methodology proved effective in learning various real-world problems, and exhibits scalability and generalization. Neshatian et al. (2012) proposed a GP-based multiple feature construction technique for classification problems. The experimental results show that the constructed, high-level features improved the classification performance of symbolic learners. Xue et al. (2013) proposed two multiobjective feature selection algorithms using particle swarm optimization (PSO). The experimental results show that the proposed multiobjective algorithm based on the ideas of crowding, mutation, and dominance outperformed the proposed nondominated sorting-based PSO and three well-known evolutionary multiobjective algorithms on the tested data sets. Later, Xue et al. (2014) investigated a combination of PSO and rough set theory for multiobjective feature selection. The reported results show that the proposed method successfully reduced the number of features while achieving similar or better performance than using all features. Tran et al. (2014) proposed a PSO-based feature selection technique to learn high-dimensional classification problems. The reported results show that the proposed technique successfully reduced the number of features and improved the classification accuracy over using all features.

Layered learning is a machine learning paradigm, formally introduced by Stone and Veloso (2000), where the task to be learned is decomposed into a hierarchy of subtask layers. At each layer a subtask is learned separately, commonly in sequence, and the knowledge learned at lower layers is used to learn the subtask at the next higher layer. Gustafson and Hsu (2001) investigated layered learning in GP to solve the keep-away soccer game, where the main task was decomposed into two subtasks. The final population in the bottom task layer was used as the initial population for the top task layer. The layered learning GP approach evolved better solutions faster than standard GP. Jackson and Gibbons (2007) used a two-layered approach in GP to solve even-parity and majority-on problems. The solutions of the bottom layer were reused as parametrized modules to learn the main task in the top layer. The layered learning approach outperformed standard GP, albeit it could not achieve a success rate for the higher-order problems. Hien et al. (2011) implemented layered learning with incremental sampling in GP to solve twelve symbolic regression problems. The proposed system reduced the training time and complexity of the evolved solutions. Hoang et al. (2011) investigated layered learning in tree adjoining grammar guided GP (TAG3P) (Hoai et al., 2006) to solve symbolic regression, even-parity, and the ORDERTREE problems. The developed system, named DTAG3P, produced more structured and scalable solutions than standard GP and TAG3P. However, DTAG3P introduced a number of new parameters into TAG3P.

Building blocks in GP have been analyzed (Smart et al., 2007) and used for simplification of GP trees during the evolutionary process in order to control bloat (Wong and Zhang, 2007; Kinzett et al., 2009). Iqbal et al. (2014) encoded classifier conditions in XCS using GP tree-like code fragments in order to extract building blocks of knowledge from smaller problems and reuse them to learn higher-dimensional problems of the domain. Code fragments act as knowledge extractors, similar to other alternatives such as automatically defined functions (ADFs) (Koza, 1994). Code fragments are more flexible, as ADFs have a predefined structure and a fixed number of arguments. The resulting system, named XCSCFC (XCS with code-fragment conditions), rapidly solved problems of a scale that existing LCS and GP techniques could not, for instance, the 135-bit multiplexer problem. XCSCFC is especially beneficial if the target problem can be divided into a hierarchy of subproblems and each of them is solvable in a bottom-up fashion. However, if the hierarchy of subproblems is too deep, then XCSCFC will become impractical because of the needed computational time, and eventually hit a limit in problem size. We implemented the code fragment encoding scheme in the action of a classifier rule in XCSCFA (XCS with code-fragment actions), which solved various complex Boolean problems but could not evolve solutions for higher-dimensional problems (Iqbal et al., 2012; 2013a; 2013c; 2015).

A limitation in both XCSCFC and XCSCFA is the lack of a cyclic representation to encapsulate the underlying repeated patterns in a problem domain. In principle they could have cycles, but it is difficult to avoid infinite loops in the evolved solutions (Larres et al., 2010). A finite state machine is a cyclic representation, which has the ability to encapsulate repeated patterns in a problem domain and does not get stuck in infinite loops. However, the evolution of FSMs is a hard task because of the combinatorially large number of possible states, connections, and interactions. Usually this requires supervised learning to minimize inappropriate FSMs, which for large-scale problems necessitates subsampling and/or incremental testing. To avoid these constraints, we introduced a state-machine-based encoding scheme into XCS (Iqbal et al., 2013b), which is significantly extended in this paper.

## 3 XCS with Finite State Machine Actions

In the work presented here, the typical static numeric action in XCS is replaced by an FSM in an attempt to develop a scalable learning classifier system to solve high-dimensional Boolean problems.

In the proposed approach of XCS with state machine actions, called XCSSMA, the static binary action is replaced by a Moore state machine (Moore, 1956), retaining the ternary alphabet in the condition of a classifier rule. Each state machine consists of *n* states, where some of the states may be deactivated in order to provide flexibility in terms of the required number of states. Each state *p* is encoded as a four-tuple whose elements are, in order: *m*, the unique state identification number; *v*, the output value of the state *p*; a Boolean flag determining whether the state *p* is activated; and *T*, the transitions from state *p* to a next state for each input symbol.

A state machine is encoded as a string, which is a combination of each state’s encoding. For simplicity, the first state in each machine’s encoding is set to be the start state, but the order of other states does not matter. For example, the state machine shown in Figure 3, where a deactivated state *q*_{1} is represented by a dashed circle, can be encoded as “211100010211001”, where the states are concatenated in the order *q*_{2}, *q*_{0}, and *q*_{1}. Here the set of states is {*q*_{0}, *q*_{1}, *q*_{2}}, the input and output alphabets are both {0, 1}, the start state is *q*_{2}, the output values of *q*_{0}, *q*_{1}, and *q*_{2} are 0, 1, and 1, respectively, and the six transitions are those listed in Table 1. The string “211100010211001”, representing the state machine shown in Figure 3, is explained in Table 1.

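Table 1: Description of state *p* for each encoded state.

| Encoded state *p* | id (*m*) | Output value (*v*) | Activated | First transition in *T* | Second transition in *T* |
|---|---|---|---|---|---|
| 21110 | 2 | 1 | yes | *q*_{1} | *q*_{0} |
| 00102 | 0 | 0 | yes | *q*_{0} | *q*_{2} |
| 11001 | 1 | 1 | no | *q*_{0} | *q*_{1} |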
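
The encoding can be decoded mechanically, as the following Python sketch suggests. The five-digit layout and the example string follow the three encoded states listed in Table 1; reading the two transition digits as the targets for input 0 and input 1, in that order, is an assumption, as is the restriction to single-digit fields. The names `State` and `decode` are illustrative.

```python
from typing import NamedTuple, Tuple


class State(NamedTuple):
    ident: int                 # unique state identification number
    value: int                 # output value of the state
    active: bool               # activation flag
    next_ids: Tuple[int, int]  # assumed order: (target on input 0, target on input 1)


def decode(encoding: str, width: int = 5):
    """Split a machine string into per-state records; the first is the start state."""
    if len(encoding) % width:
        raise ValueError("encoding length must be a multiple of the state width")
    states = []
    for i in range(0, len(encoding), width):
        m, v, flag, t0, t1 = (int(ch) for ch in encoding[i:i + width])
        states.append(State(m, v, bool(flag), (t0, t1)))
    return states


machine = decode("21110" + "00102" + "11001")
start = machine[0]
print(start.ident, [s.ident for s in machine if not s.active])
```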

If a state *q*_{j} has been deactivated and an activated state *q*_{i} has a transition to *q*_{j}, then that transition will be changed to any activated state in the machine, chosen uniformly randomly, as suggested by Spears and Gordon-Spears (2003); for instance, in Figure 3 the transition from *q*_{2} to the deactivated state *q*_{1} will be set to *q*_{0} or *q*_{2}. To avoid the creation of any junk machine in the evolutionary process, the start state will always be activated.

The proposed XCSSMA approach extends standard XCS, described in Section 2.2, in the following aspects: the action value, the covering operation, the rule discovery operation, the procedure comparing equality of two state machine actions, and the subsumption deletion mechanism. The rest of this section describes these extensions.

### 3.1 State Machine Action Value

The action value of a classifier is determined by processing the current input string *s* via the state machine action in the classifier. The processing starts from the start state in the state machine action, and the value of the last state visited is taken as the action value. For example, consider the classifier shown in Figure 4. If the input string *s* is “100101”, then the action value will be 0.

### 3.2 Covering Operation

Covering occurs if an action *a* is missing in the match set [M]. If so, a random classifier is created whose condition matches the current environmental state *s* and contains “#” symbols with probability *P*_{#}. The state machine action is randomly generated until its output is *a*. The covering operation is described in Algorithm 1. Here *n* is the length of the condition in a classifier rule, and *P*_{#} is the probability of the “don’t care” symbol “#” in the condition of the newly created classifier in the covering operation.^{3}
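
Since Algorithm 1 is not reproduced in this text, the following Python sketch only illustrates the idea of covering under stated assumptions: machines are lists of `(id, value, active, next0, next1)` tuples, every state is kept active for simplicity, and the names `run_machine`, `random_machine`, and `cover` are hypothetical.

```python
import random


def run_machine(states, bits):
    """Output value of the last state visited; states[0] is the start state."""
    by_id = {s[0]: s for s in states}
    current = states[0]
    for b in bits:
        current = by_id[current[3 + int(b)]]  # index 3 = input 0, index 4 = input 1
    return current[1]


def random_machine(n_states, rng):
    """Random machine with all states active: (id, value, active, next0, next1)."""
    return [(i, rng.randint(0, 1), 1, rng.randrange(n_states), rng.randrange(n_states))
            for i in range(n_states)]


def cover(state_bits, action, p_hash, n_states, rng):
    """Create a (condition, machine) pair that matches state_bits and outputs `action`."""
    condition = "".join("#" if rng.random() < p_hash else b for b in state_bits)
    while True:  # regenerate the machine until its output is the missing action
        machine = random_machine(n_states, rng)
        if run_machine(machine, state_bits) == action:
            return condition, machine


rng = random.Random(0)
condition, machine = cover("100101", 1, p_hash=0.33, n_states=3, rng=rng)
print(condition, run_machine(machine, "100101"))
```

Each random machine has a substantial chance of producing the requested output, so the regeneration loop is expected to terminate after a few attempts.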

### 3.3 Rule Discovery Operation

In the rule discovery operation, a GA is applied in the action set [A] to produce two offspring. First, two parent classifiers are selected from [A] using tournament selection based on fitness, and the offspring are created from them. Next, the conditions and state machine actions of the offspring are separately crossed over, with probability χ, by applying the two-point crossover operation. As state machines have been encoded as strings of integers, they are crossed over and mutated using the standard GA operations. Note that the start states may be swapped during the crossover operation, but the machines resulting from crossover should not contain duplicate states. The crossover operation is described in Algorithm 2. Here *n* is the length of the condition in a classifier rule; the length of the state machine action is denoted separately.
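
Two-point crossover on string encodings can be sketched as below. How duplicate states are prevented is specified in Algorithm 2, which is not reproduced here; falling back to the unchanged parents when a child would contain duplicate state ids is purely an assumption of this sketch, as are the function names and the five-digit state width.

```python
import random


def two_point_crossover(a: str, b: str, rng: random.Random):
    """Swap the segment between two random cut points of equal-length strings."""
    if len(a) != len(b):
        raise ValueError("parents must have equal length")
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def state_ids(machine: str, width: int = 5):
    """State identification digits, one per encoded state."""
    return [machine[k] for k in range(0, len(machine), width)]


def crossover_actions(a: str, b: str, rng: random.Random, width: int = 5):
    """Cross two machine strings; keep the parents if a child has duplicate ids."""
    child_a, child_b = two_point_crossover(a, b, rng)
    for child in (child_a, child_b):
        ids = state_ids(child, width)
        if len(set(ids)) != len(ids):
            return a, b  # assumed fallback, not taken from the paper
    return child_a, child_b


rng = random.Random(0)
print(crossover_actions("211100010211001", "001121110021001", rng))
```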

After that, each symbol in the conditions of the children resulting from crossover is mutated with probability μ, such that both children match the currently observed state *s*. Then, the state machine actions of the children are mutated with probability μ. Note that state numbers are not mutated in order to avoid duplicate states in a state machine action. The mutation operation is described in Algorithm 3. Here *n* is the length of the condition in a classifier rule, the length of the state machine action is denoted separately, and μ is the mutation probability.
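
The two mutation steps might look as follows in Python. Algorithm 3 is not reproduced in this text, so the details are assumptions: the condition is mutated by toggling a position between “#” and the current state bit (which keeps the rule matching *s*), and each non-id gene of the machine string is resampled with the mutation probability. The function names and the five-digit state width are illustrative.

```python
import random


def mutate_condition(condition, state, mu, rng):
    """Toggle positions between '#' and the state bit, so the rule still matches."""
    out = []
    for c, s in zip(condition, state):
        if rng.random() < mu:
            c = s if c == "#" else "#"
        out.append(c)
    return "".join(out)


def mutate_action(machine, mu, rng, width=5):
    """Resample output values, activation flags, and transition targets; never ids."""
    ids = [machine[k] for k in range(0, len(machine), width)]
    genes = list(machine)
    for k in range(len(genes)):
        field = k % width
        if field == 0 or rng.random() >= mu:
            continue                     # field 0 is the state id and is left alone
        if field in (1, 2):
            genes[k] = rng.choice("01")  # output value or activation flag
        else:
            genes[k] = rng.choice(ids)   # transition target drawn from existing ids
    genes[2] = "1"                       # the start state always stays activated
    return "".join(genes)


rng = random.Random(0)
print(mutate_condition("1#0#", "1100", 0.5, rng))
print(mutate_action("211100010211001", 0.2, rng))
```

Transitions that end up pointing at a deactivated state would still need to be redirected to an activated one, as described earlier in this section; that repair step is omitted here.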

The prediction and prediction error of the offspring are set to the average of the parents’ values, whereas the fitness of the offspring is set to the average of the parents’ values multiplied by the constant *fitnessReduction*, as suggested by Butz and Wilson (2002).

### 3.4 Comparing Two State Machine Actions

If a newly created classifier in the rule discovery operation is not subsumed and there is no classifier equal to it in the population, then it will be added to the population. Two classifiers are considered to be equal if and only if both have the same conditions and genotypically the same state machine in their actions. The state machine *genotype* is its formal encoding as seen in the classifier action, and the *phenotype* is the value that the action computes for a given input. The procedure to compare two state machine actions for equality is given in Algorithm 4, which is parameterized by the length of the state machine action in a classifier rule.

### 3.5 Subsumption Deletion

A classifier *cl*_{1} can subsume another classifier *cl*_{2} if both have the same action and *cl*_{1} is accurate, sufficiently experienced, and more general than *cl*_{2} (Butz and Wilson, 2002). Because many state machine genotypes map to the same phenotype, two rules whose actions behave identically may still be encoded differently, so subsumption deletion is less likely to occur. Subsumption deletion remains possible because the state machine descriptions are matched on a character-by-character basis.
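The subsumption test can be sketched as follows, loosely following the standard XCS conditions for accuracy, experience, and generality. The `Rule` record, the tuple layout of the action genotype, and the threshold values used in the example are all hypothetical; only the genotype-level (syntactic) comparison of actions reflects what is described above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    condition: str            # ternary string over {'0', '1', '#'}
    action: Tuple[int, ...]   # state machine genotype (assumed integer string)
    error: float
    experience: int


def is_more_general(general: str, specific: str) -> bool:
    """True if `general` covers `specific` and has strictly more '#' symbols."""
    if len(general) != len(specific):
        return False
    if general.count('#') <= specific.count('#'):
        return False
    return all(g == '#' or g == s for g, s in zip(general, specific))


def could_subsume(cl: Rule, theta_sub: int, epsilon_0: float) -> bool:
    """A subsumer must be sufficiently experienced and accurate."""
    return cl.experience > theta_sub and cl.error < epsilon_0


def does_subsume(cl1: Rule, cl2: Rule, theta_sub: int, epsilon_0: float) -> bool:
    """Actions are compared genotypically, element by element."""
    return (cl1.action == cl2.action
            and could_subsume(cl1, theta_sub, epsilon_0)
            and is_more_general(cl1.condition, cl2.condition))


general_rule = Rule("1##", (1, 0, 1), error=0.0, experience=50)
specific_rule = Rule("10#", (1, 0, 1), error=0.0, experience=5)
other_genotype = Rule("10#", (1, 1, 0), error=0.0, experience=5)
```

With this syntactic comparison, `other_genotype` is never subsumed by `general_rule`, even if both machines happened to compute the same output.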

With these extensions, we expect the new system XCSSMA can effectively solve high-dimensional problems from different complex Boolean problem domains.

## 4 Experimental Design

The following sections describe the problem domains and experimental setup used in experimentation for this work.

### 4.1 Problem Domains

The problem domains used are the multiplexer, the majority-on, the carry, the even-parity, the count ones, and the digital design verification. These complex Boolean problem domains exhibit, to varying degrees, epistasis, overlapping classifiers, niche imbalance, cyclic regularity, and nonpredictive attributes; they permit testing the scalability of a system against a series of increasingly difficult yet formally related problems.

A multiplexer is an electronic circuit that accepts input strings of length *l* = *k* + 2^{*k*} and gives one output. The value encoded by the first *k* address bits is used to select one of the remaining 2^{*k*} data bits to be given as output. For example, in the 6-bit multiplexer (*k* = 2), if the input is 011011, then the output will be 0, as the first two bits 01 represent the index 1 (in base ten), which selects the second of the four data bits following the address. Multiplexer problems are highly nonlinear and multimodal, and they exhibit epistasis; that is, the importance of the data bits depends on the address bits.
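The multiplexer target function can be written in a few lines. The sketch assumes the usual layout of *k* address bits followed by 2^{*k*} data bits; the function name is hypothetical.

```python
def multiplexer(bits: str, k: int) -> int:
    """Return the data bit selected by the first k address bits."""
    if len(bits) != k + 2 ** k:
        raise ValueError("input length must be k + 2**k")
    address = int(bits[:k], 2)       # address bits read as an unsigned integer
    return int(bits[k + address])    # data bits start right after the address


# The 6-bit example from the text: address 01 selects the second data bit.
out = multiplexer("011011", 2)
```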

In majority-on problems, the output depends on the number of 1s in the input instance. If the number of 1s is greater than the number of 0s, the problem instance is of class 1, otherwise class 0. In the majority-on problem domain, the complete solution consists of strongly overlapping classifiers, so it is difficult to learn. For example, 1##11:1 and 11#1#:1 are two maximally general and accurate classifiers, but they overlap in the 11*11 subspace.^{4}
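The overlap between the two example classifiers can be checked by enumeration. The helper names below are hypothetical; the sketch lists every 5-bit input matched by both conditions.

```python
from itertools import product


def majority_on(bits: str) -> int:
    """Class 1 if the input has more 1s than 0s, otherwise class 0."""
    ones = bits.count('1')
    return 1 if ones > len(bits) - ones else 0


def matches(condition: str, bits: str) -> bool:
    """Ternary matching: '#' matches either bit value."""
    return all(c == '#' or c == b for c, b in zip(condition, bits))


inputs = [''.join(p) for p in product('01', repeat=5)]
# Inputs covered by both maximally general rules from the text.
overlap = [x for x in inputs if matches('1##11', x) and matches('11#1#', x)]
```

Both rules are accurate on everything they match, yet they share part of the input space, which is what makes the complete solution overlapping.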

In a carry problem, two binary numbers of the same length are added. If the result triggers a carry, then the output is 1, otherwise 0. For example, in the case of two 3-bit numbers 101 and 010, the output is 0 (as 101 + 010 = 111 with no carry), whereas for the numbers 110 and 100 the output is 1 (as 110 + 100 = 010 with a carry). As with majority-on problems, the complete solution in the carry problem domain consists of strongly overlapping classifiers, and in addition it is a niche imbalance problem domain. Carry problems contain cyclic regularity in the problem input that is expected to be useful for deciding the answer to the problem in XCSSMA.
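The carry target function reduces to an overflow check on the sum of the two operands. A minimal sketch, with a hypothetical function name and the operands passed as separate bit strings:

```python
def carry(a: str, b: str) -> int:
    """Return 1 if adding two equal-length binary numbers overflows n bits."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    n = len(a)
    return 1 if int(a, 2) + int(b, 2) >= 2 ** n else 0


# The two 3-bit examples from the text.
no_carry = carry("101", "010")    # 5 + 2 = 7 fits in three bits
with_carry = carry("110", "100")  # 6 + 4 = 10 does not fit in three bits
```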

In even-parity problems, the output depends on the number of 1s in the input instance. If the number of 1s is even, the output will be 1, and 0 otherwise. Similar to carry problems, even-parity problems contain cyclic regularity, and using the ternary alphabet based conditions with the static numeric action, no useful generalizations can be made for even-parity problems.

Count ones problems are similar to majority-on problems, but in count ones problems only *k* bits are relevant in an input instance of length *l*, and the remaining *l*−*k* bits represent nonpredictive attributes (Butz, 2006). If the number of 1s in the *k* relevant positions is greater than half *k*, the problem instance is of class 1, otherwise class 0. For example, consider a count ones problem of length *l* = 7 with the first *k* = 5 relevant bits. In this problem, input 1010110 would be class 1, whereas input 1001011 would be class 0. Similar to majority-on problems, the complete solution for a count ones problem consists of overlapping classifiers.
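The count ones target function differs from majority-on only in ignoring the nonpredictive bits. The sketch assumes, as in the example above, that the relevant bits are the first *k* positions; the function name is hypothetical.

```python
def count_ones(bits: str, k: int) -> int:
    """Class 1 if more than half of the first k bits are 1, otherwise class 0."""
    if k > len(bits):
        raise ValueError("k cannot exceed the input length")
    ones = bits[:k].count('1')
    return 1 if 2 * ones > k else 0


# The l = 7, k = 5 examples from the text; the last two bits are ignored.
first = count_ones("1010110", 5)
second = count_ones("1001011", 5)
```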

In digital design verification, a digital design is verified before manufacturing in order to discover as many bugs in the design as possible. A digital design is usually verified in the following two ways: (1) formal verification to prove formal properties of the design, and (2) simulation-based verification to produce tests that exercise the functionality of the design (Ioannides et al., 2011). The design verification (DV) problem used in this work is a simulation-based verification of a digital signal processor called FirePath, at the accumulator stage in the long pipeline of its design. This simulation-based verification of the FirePath signal processor is actually a 7-bit binary classification problem, named DV1, originally introduced by Ioannides et al. (2011). For further details of the FirePath signal processor and the DV1 problem, the interested reader is referred to the original paper by Ioannides et al. (2011). A Boolean function can be represented compactly in the sigma notation by listing each onset row from the truth table of the function (Dandamudi, 2003). For example, the function can be represented in sigma notation as (1, 3). The DV1 problem is denoted by the following sigma notation: (1, 2, 3, 8, 9, 10, 11, 13, 14, 24, 25, 26, 27, 28, 30, 40, 41, 42, 43, 46, 47, 56, 57, 58, 59, 61, 65, 66, 67, 69, 70, 71, 72, 73, 74, 75, 77, 78, 79, 81, 82, 83, 85, 86, 88, 89, 90, 91, 93, 94, 95, 97, 98, 99, 101, 102, 103, 104, 105, 106, 107, 109, 110, 113, 114, 115, 117, 118, 121, 122, 123, 125, 126, 127). As with majority-on and count ones problems, the complete solution for the DV1 problem consists of overlapping classifiers, and in addition it is a niche imbalance problem.
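The sigma notation translates directly into a set-membership test. The sketch below assumes that a truth-table row index is the input read as an unsigned binary number with the leftmost bit most significant; that convention and the helper names are assumptions, while the onset list is copied from the text.

```python
from typing import Callable, FrozenSet

DV1_ONSET: FrozenSet[int] = frozenset({
    1, 2, 3, 8, 9, 10, 11, 13, 14, 24, 25, 26, 27, 28, 30, 40, 41, 42, 43,
    46, 47, 56, 57, 58, 59, 61, 65, 66, 67, 69, 70, 71, 72, 73, 74, 75, 77,
    78, 79, 81, 82, 83, 85, 86, 88, 89, 90, 91, 93, 94, 95, 97, 98, 99, 101,
    102, 103, 104, 105, 106, 107, 109, 110, 113, 114, 115, 117, 118, 121,
    122, 123, 125, 126, 127,
})


def from_sigma(onset: FrozenSet[int], n_bits: int) -> Callable[[str], int]:
    """Build a Boolean function that is 1 exactly on the listed rows."""
    def f(bits: str) -> int:
        if len(bits) != n_bits:
            raise ValueError("unexpected input length")
        return 1 if int(bits, 2) in onset else 0
    return f


dv1 = from_sigma(DV1_ONSET, 7)
# Number of class 1 instances among rows 0 to 7.
ones_in_first_rows = sum(dv1(format(i, '07b')) for i in range(8))
```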

The properties of different problem domains used in this work are summarized in Table 2.

| Problem Domain | Properties |
|---|---|
| Multiplexer | Multimodal and epistatic |
| Majority-on | Overlapping |
| Carry | Overlapping, niche imbalance, and cyclic |
| Even-parity | Cyclic and hard to generalize |
| Count ones | Overlapping and nonpredictive attributes |
| Design verification | Overlapping and niche imbalance |

### 4.2 Experimental Setup

Unless stated otherwise, systems use the following parameter values, commonly used in the literature, as suggested by Butz and Wilson (2002): learning rate ; fitness fall-off rate ; prediction error threshold ; fitness exponent ; threshold for GA application in the action set ; two-point crossover with probability ; mutation probability ; experience threshold for classifier deletion ; fraction of mean fitness for deletion ; classifier experience threshold for subsumption ; probability of “don’t care” symbol in covering ; reduction of the fitness ; and the selection method is tournament selection with tournament size ratio 0.4. The GA subsumption is activated, but the action set subsumption is deactivated, as suggested by Iqbal et al. (2013c).

As all the problems to be experimented in this work are Boolean problems having only two classes, the state machine parameters are input alphabet and output alphabet . For the sake of simplicity, the maximum number of states in a state machine is set as follows: = the length of the input message if it is less than 10, else = 10. In XCSF, the constant input parameter ; and .

The number of training examples is 2 million in all the experiments conducted here, unless stated otherwise. Explore and exploit problem instances are alternated. The reward scheme used is 1,000 for a correct classification and 0 otherwise. All the experiments were repeated 30 times with a known different seed in each run. Each result reported in this work is the average of the 30 runs.

In all graphs presented here, the *x*-axis is the number of problem instances used as training examples, and the *y*-axis is the performance, measured as the percentage of correct classifications during the last 1,000 exploit problem instances. The exception is the 6-bit multiplexer problem, where a 50-point running average is used in order to get more informative performance curves in this small-scale problem. The error bars show the standard deviation over the 30 runs.
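The performance measure is a sliding-window accuracy over exploit trials. A minimal sketch, assuming a hypothetical `ExploitPerformance` helper; the window size is a parameter so that both the 1,000-instance and the 50-point variants fit the same code.

```python
from collections import deque


class ExploitPerformance:
    """Percentage of correct classifications over the most recent exploit trials."""

    def __init__(self, window: int = 1000) -> None:
        # A bounded deque silently drops the oldest outcome once it is full.
        self._outcomes = deque(maxlen=window)

    def record(self, correct: bool) -> None:
        self._outcomes.append(1 if correct else 0)

    def percent_correct(self) -> float:
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)


tracker = ExploitPerformance(window=4)
for outcome in (True, True, False, True):
    tracker.record(outcome)
before = tracker.percent_correct()
tracker.record(False)   # pushes the oldest outcome out of the window
after = tracker.percent_correct()
```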

## 5 Results

In order to test the scalability of XCSSMA, results were compared with XCSCFA (Iqbal et al., 2013c) and XCSF (Lanzi et al., 2005) on the six problem domains: multiplexer, majority-on, carry, even-parity, count ones, and digital design verification.

To analyze the results, we applied an analysis of variance (ANOVA) (Piater et al., 1998) on the performance curves to test whether there was any statistically significant difference with a confidence level of ; and then we applied four post hoc tests, Tukey HSD, Scheffé, Bonferroni, and Holm, to find which, if any, method performed significantly differently than others.^{5}

### 5.1 Multiplexer Problem Domain

The performance of XCSCFA, XCSF, and XCSSMA in learning the multiplexer problems is shown in Figures 5 and 6. The number of classifiers used is 800, 1,500, 2,000, 8,000, and 30,000 for the 6-, 11-, 20-, 37-, and 70-bit multiplexer problems, respectively. The 37-bit multiplexer problem was not solved with , so this was increased to (Figure 6b). For the 70-bit multiplexer problem, was set to , and was set to , as commonly used in the literature (Butz, 2006), and 5 million training examples were used.

The multiplexer is a niche balanced problem domain, and there exists a complete solution for multiplexer problems that does not contain any overlapping classifier rules. So, all three methods successfully solved the 6-, 11-, 20-, 37-, and 70-bit multiplexer problems. However, as the problem size increased, the performance of XCSCFA decreased as compared to the other methods.

The statistical analysis of the results showed that there is no statistically significant difference in learning the 6-bit and 11-bit multiplexer problems using XCSCFA, XCSF, and XCSSMA. In learning the 20-bit multiplexer problem, there is no significant difference between XCSCFA and XCSSMA; however, both of these methods perform significantly better than XCSF. In learning the 37-bit multiplexer problem, XCSCFA performs significantly better than XCSF and XCSSMA, and XCSSMA performs better than XCSF. In learning the 70-bit multiplexer problem, XCSF performs significantly better than XCSCFA and XCSSMA, and XCSSMA performs better than XCSCFA.

In order to compare XCSSMA with an XCS having cyclic representation to compute actions, it can be noted that Preen and Bull (2009) reported approximately 40,000 exploit trials being required with an XCS using discrete dynamical genetic programming (dDGP-XCS) to solve the 6-bit multiplexer problem, that is, considerably more than the 4,000 exploit trials seen in Figure 5a for equivalent parameter settings. Bull and Preen (2009) also explored the use of discrete dynamical genetic programming in YCS (Bull, 2005), and solved the 6-bit and 11-bit multiplexer problems using 20,000 and 100,000 exploit trials, respectively, which is still considerably more than the 4,000 and 20,000 exploit trials seen in Figures 5a and 5b, respectively.

The largest solved multiplexer problem, directly from training data, reported in the literature is the 135-bit multiplexer problem (Urbanowicz and Moore, 2015). However, XCSCFA, XCSF, and XCSSMA could not solve the 135-bit multiplexer problem.

The analysis of evolved classifier rules indicates that XCSSMA could not improve the generalization ability of rules in multiplexer problems. The inherent property of indexing to a certain position according to the address bits in the input problem makes the creation of a state machine difficult in the multiplexer domain. A sample classifier rule for the 20-bit multiplexer problem is shown in Figure 7.^{6} This rule is equivalent to numeric action based XCS rule “0010##1############# : 1”. The states *q*_{0}, *q*_{2}, *q*_{3}, and *q*_{4} are not reachable from the start state *q*_{1} in the FSM action of this rule. The processing of any matched input message ends at the state *q*_{1} that has output value 1; therefore the action value of this rule will be 1. Essentially, this is a single bit action, which is the same as in XCS.

### 5.2 Majority-on Problem Domain

We believe that majority-on problems are hard to learn because the complete solution for a majority-on problem consists of overlapping classifiers. The largest attempted majority-on problem in the literature is a 7-bit problem, by Jackson and Gibbons (2007) using layered learning in genetic programming, and the reported success rate is . The performance of XCSCFA, XCSF, and XCSSMA in learning majority-on problems is shown in Figure 8. The number of classifiers used is 3,000, 5,000, and 7,000 for the 7-, 9-, and 11-bit majority-on problems, respectively.

All three methods successfully solved the 7-bit majority-on problem. When the problem size was increased to 9-bit, XCSF reached approximately performance but could not completely solve the 9-bit majority-on problem, whereas XCSCFA and XCSSMA successfully solved it. When the problem size was further increased to 11-bit, both XCSCFA and XCSF failed to solve it (XCSCFA reached performance, and XCSF reached ), whereas XCSSMA successfully solved it, as shown in Figure 8c.

The statistical analysis of the results showed that there is no statistically significant difference in learning the 7-bit and 9-bit majority-on problems using XCSCFA and XCSSMA; however, both these methods perform significantly better than XCSF. In learning the 11-bit majority-on problem, XCSSMA performs significantly better than XCSCFA and XCSF, and XCSCFA performs better than XCSF.

A sample accurate classifier rule obtained in learning the 7-bit majority-on problem using XCSSMA is shown in Figure 9. This is an interesting rule covering 16 problem instances, 11 of which belong to class 0 and the other five to class 1.

### 5.3 Carry Problem Domain

The complete solution in the carry problem domain consists of overlapping classifiers; in addition it is a niche imbalance domain, which makes it very difficult to learn. The largest solved carry problem, directly from training data, reported in the literature is the 6+6 bit carry problem, by Harding et al. (2010). The performance of XCSCFA, XCSF, and XCSSMA in learning carry problems is shown in Figure 10. The number of classifiers used is 5,000, 7,000, and 9,000 for the 4+4, 5+5, and 6+6 bit carry problems, respectively.

It is observed that all three methods solved the 4+4 bit and 5+5 bit carry problems; however XCSSMA required fewer training instances than the others to reach the 100% performance level. When the problem size was increased to 6+6 bit, XCSCFA reached performance, and XCSF reached performance, but none of them could completely solve the 6+6 bit carry problem, whereas XCSSMA successfully solved it, as shown in Figure 10c. In addition, the obtained solutions in XCSSMA are compact, easily understandable,^{7} and general for any *n*+*n* bit carry problem.

The statistical analysis of the results showed that there is no statistically significant difference in learning the 4+4 bit carry problem using XCSCFA, XCSF, and XCSSMA. In learning the 5+5 bit carry problem, there is no significant difference between XCSCFA and XCSSMA; however, both these methods perform significantly better than XCSF. In learning the 6+6 bit carry problem, XCSSMA performs significantly better than XCSCFA and XCSF, and XCSCFA performs better than XCSF.

One of the classifier rules from the final solution obtained using XCSSMA is shown in Figure 11a. This is a maximally general and accurate classifier covering the whole problem space. The FSM action in this rule is general to solve any *n*+*n* bit carry problem. Note that state *q*_{0} and state *q*_{3} are not active, so there is no transition from any active state to these deactivated states. The state *q*_{2} is active, but not reachable from the start state *q*_{1}. It means only two states, *q*_{1} and *q*_{4}, are the working states in this FSM action. The deactivated and nonreachable states can be removed, in a postprocessing step, to simplify state machine actions in the final solution if needed, as shown in Figure 11b. The solutions found by Harding et al. (2010) were also general, but not as compact and human-readable as the classifier rules obtained using XCSSMA here.

### 5.4 Even-Parity Problem Domain

The performance of XCSCFA, XCSF, and XCSSMA in learning even-parity problems is shown in Figure 12. The number of classifiers used is 2,000, 4,000, and 8,000 for the 7-, 8-, and 9-bit even-parity problems, respectively. The obtained results indicate that XCSSMA solved the 7-, 8-, and 9-bit even-parity problems very quickly, using only a few thousand training instances. XCSCFA successfully solved the 7-, 8-, and 9-bit even-parity problems, whereas XCSF reached approximately performance but could not completely solve the 7-, 8-, and 9-bit even-parity problems.

The largest solved parity problem, directly from training data, reported in the literature is the 24-bit even-parity problem, by Harding et al. (2010). The performance of XCSCFA, XCSF, and XCSSMA in learning the 24-bit even-parity problem is shown in Figure 13. Note that the number of classifiers and the number of training instances (used in XCSCFA and XCSF) were increased to 20,000 and 5 million, respectively. XCSCFA and XCSF could not learn the 24-bit even-parity problem, even using a larger population of classifiers. This is consistent with findings by Butz et al. (2004) on learning bounds. That work noted that the number of rules needs to correspond to the number of required building blocks, which is huge in this problem domain under standard noncyclic representations. However, XCSSMA not only solved the 24-bit even-parity problem but the obtained solutions are compact, easily understandable, and general for any *n*-bit even-parity problem.

The statistical analysis of the results showed that in learning the 7-, 8-, and 9-bit even-parity problems XCSSMA performs significantly better than XCSCFA and XCSF, and XCSCFA performs better than XCSF. In learning the 24-bit even-parity problem, XCSSMA performs better than XCSCFA and XCSF; however, there is no significant difference between XCSCFA and XCSF.

One of the classifier rules from the final solution obtained in learning the 24-bit even-parity problem using XCSSMA is shown in Figure 14a. This is a maximally general and accurate classifier covering the whole problem space. The FSM action in this rule is general to solve any *n*-bit even-parity problem. State *q*_{0} is not active, so there is no transition from any active state to this deactivated state. The states *q*_{1} and *q*_{4} are active but not reachable from the start state *q*_{3}. Thus, only two states, *q*_{2} and *q*_{3}, are the working states in this FSM action. The simplified rule is shown in Figure 14b. The solutions found by Harding et al. (2010) were also general but not as compact and human-readable as the classifier rules obtained using XCSSMA here.
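To illustrate why two working states suffice for even parity at any input length, the sketch below runs a hand-written two-state machine over the input bits and reads the action from the state in which processing ends. It is a hypothetical textbook machine, not the evolved machine of Figure 14; the state names and table layout are assumptions.

```python
from itertools import product

# Reading a 1 toggles the state; reading a 0 keeps it.
TRANSITIONS = {
    ('even', '0'): 'even', ('even', '1'): 'odd',
    ('odd', '0'): 'odd',   ('odd', '1'): 'even',
}
# Output attached to each state; the final state's output is the action value.
OUTPUT = {'even': 1, 'odd': 0}


def run_fsm(bits: str, start: str = 'even') -> int:
    state = start
    for b in bits:
        state = TRANSITIONS[(state, b)]
    return OUTPUT[state]


def even_parity(bits: str) -> int:
    """Reference definition: 1 if the number of 1s is even, otherwise 0."""
    return 1 if bits.count('1') % 2 == 0 else 0


# The same machine agrees with the definition for every input of length 1 to 8.
agrees = all(run_fsm(''.join(p)) == even_parity(''.join(p))
             for n in range(1, 9) for p in product('01', repeat=n))
```

Because the machine loops over the input rather than indexing fixed positions, nothing in it depends on *n*, which is the sense in which a cyclic representation yields a general solution.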

### 5.5 Count Ones Problem Domain

The largest count ones problem reported in the literature is of length *l* = 100 with the first *k* = 7 relevant bits (Butz, 2006). The performance of XCSCFA, XCSF, and XCSSMA in learning the count ones problems is shown in Figure 15. The number of classifiers used is 3,000 and 5,000 for the 7-bit and 9-bit count ones problems, respectively. For the count ones problems, was set to , and was set to , as used by Butz (2006).

XCSCFA and XCSSMA successfully solved the 7-bit count ones problem, whereas XCSF reached approximately performance but could not completely solve the 7-bit count ones problem, as shown in Figure 15a. When the problem size was increased to 9-bit, XCSCFA, XCSF, and XCSSMA reached approximately , , and performance, respectively, but none of them could completely solve the 9-bit count ones problem, as shown in Figure 15b. It is anticipated that the nonpredictive attributes make it hard to evolve a complete solution in the count ones problem domain using the computing action/prediction based methods XCSCFA, XCSF, and XCSSMA.

The statistical analysis of the results showed that in learning the 7-bit count ones problem, there is no significant difference between XCSCFA and XCSSMA; however, both of these methods perform significantly better than XCSF. In learning the 9-bit count ones problem XCSCFA performs significantly better than XCSF and XCSSMA, and XCSSMA performs better than XCSF.

A sample accurate classifier rule obtained in learning the 7-bit count ones problem using XCSSMA is shown in Figure 16. This rule covers 16 problem instances, 15 of which belong to class 0 and one to class 1.

### 5.6 Design Verification Problem Domain

The performance of standard XCSCFA, XCSF, and XCSSMA in learning the DV1 problem is shown in Figure 17. The number of classifiers used is 3,000. XCSF reached approximately 99% performance level but could not completely solve the DV1 problem, whereas XCSCFA and XCSSMA successfully solved it.

The statistical analysis of the results showed that in learning the DV1 problem, there is no significant difference between XCSCFA and XCSSMA; however, both these methods perform significantly better than XCSF.

A sample accurate classifier rule obtained in learning the DV1 problem using XCSSMA is shown in Figure 18. This rule covers 32 problem instances (0 to 7, 16 to 23, 32 to 39, and 48 to 55), of which 29 belong to class 0 and three to class 1.

In summary, XCSSMA performs statistically significantly better than XCSCFA and XCSF in learning high-dimensional problems in the majority-on, carry, and even-parity problem domains, as shown in Table 3. In the other problem domains, we found the performance of XCSSMA comparable to that of XCSCFA and XCSF. It is worth noting that XCSSMA was never the worst-performing method in any problem tested here, as can be seen from the last column in Table 3.

| Problem Domain | Problem Size | Best Performance | Worst Performance |
|---|---|---|---|
| Multiplexer | 6-bit | No significant difference | No significant difference |
| | 11-bit | No significant difference | No significant difference |
| | 20-bit | XCSCFA, XCSSMA | XCSF |
| | 37-bit | XCSCFA | XCSF |
| | 70-bit | XCSF | XCSCFA |
| Majority-on | 7-bit | XCSCFA, XCSSMA | XCSF |
| | 9-bit | XCSCFA, XCSSMA | XCSF |
| | 11-bit | XCSSMA | XCSF |
| Carry | 4+4 bit | No significant difference | No significant difference |
| | 5+5 bit | XCSCFA, XCSSMA | XCSF |
| | 6+6 bit | XCSSMA | XCSF |
| | *n*+*n* bit | Only XCSSMA generated applicable solutions | Only XCSSMA generated applicable solutions |
| Even-parity | 7-bit | XCSSMA | XCSF |
| | 8-bit | XCSSMA | XCSF |
| | 9-bit | XCSSMA | XCSF |
| | 24-bit | XCSSMA | XCSCFA, XCSF |
| | *n*-bit | Only XCSSMA generated applicable solutions | Only XCSSMA generated applicable solutions |
| Count ones | 7-bit | XCSCFA, XCSSMA | XCSF |
| | 9-bit | XCSCFA | XCSF |
| DV1 | 7-bit | XCSCFA, XCSSMA | XCSF |

## 6 Discussion

The developed system, XCSSMA, can be viewed from three different perspectives. First, it evolves a completely general condition where the entire pattern is captured by the FSM action, for example, in even-parity and carry problems. Second, the FSM action really does nothing except provide a [0, 1] action, and the pattern is only captured by the rule conditions, as in multiplexer problems. Third, XCSSMA yields a mix of both, where the pattern is partly captured by the FSM actions and partly by the rule conditions, for example, in majority-on and count-ones problems. It is worth noting that XCSSMA can select any of these three ways automatically without adjustment. The evolved solutions in the first and the second cases are more human-readable than those in the third case because in the latter it is required to interpret both a condition and an FSM action across multiple rules.

An FSM is an abstract model that can represent a finite-state system in a compact form, but the evolution of FSMs is a hard task. Usually, FSMs are evolved using supervised learning, so for high-dimensional problems some form of subsampling or incremental testing is needed (Benson, 2000; Spears and Gordon-Spears, 2003; Lucas and Reynolds, 2007). The online learning, niche-based breeding, and generalization properties of XCS-based systems implicitly provide incremental testing and subsampling of the training data set. Hence, the developed XCSSMA system, as a combination of XCS and FSMs, rapidly evolved the general FSMs for even-parity and carry problems, and a set of FSMs for the other problems experimented on here, where each evolved machine covers a subspace of the problem matched by the corresponding classifier condition.

It is to be noted that if an FSM is used as the action of a classifier rule in place of a static numeric action, the size of the search space increases. This was compensated for by improving the generalization ability of XCS in the majority-on, carry, even-parity, count ones, and DV problem domains. This improvement is obvious in the accurate rules shown in Figures 9, 11, 14, 16, and 18, which match 16 problem instances from the 7-bit majority-on, all the problem instances from the 6+6 bit carry, all the problem instances from the 24-bit even-parity, 16 problem instances from the 7-bit count ones, and 32 problem instances from the DV1 problems, respectively. The generalization to this level is beyond the ability of XCS using ternary alphabet based conditions along with a numeric action. Scaling is shown by (1) the increased dimensionality in a problem domain, and (2) evolving a solution in a smaller-scale problem that generalizes to any *n*-bit high-dimensionality problem in certain problem domains.

The FSM-based action could not improve the generalization beyond numeric action based XCS for the multiplexer domain because the state machines needed for this domain are more complex than those needed for the other domains. The inherent property of indexing to a certain position according to the address bits in the input problem makes the creation of a state machine difficult in the multiplexer domain. In such problem domains, alternative representations such as neural networks (Dam et al., 2008) may be more effective than state machines. It is to be noted that multiplexer problems of up to 135 bits have been solved using XCSCFC (Iqbal et al., 2014), which reuses building blocks of knowledge extracted from smaller problems in the domain, and using ExSTraCS 2.0 (Urbanowicz and Moore, 2015), which utilizes supervised learning as well as the innovative attribute feedback and attribute tracking mechanisms.

Although the XCSSMA system can have a genotypically large number of actions, the actual computed action has only two possible values (0 or 1) for the Boolean problems experimented on here. Therefore, the complexity of XCSSMA in terms of time will not be huge.

## 7 Conclusions

By utilizing a cyclic representation, XCSSMA outperformed existing techniques in learning high-dimensional problems in the majority-on, carry, and even-parity problem domains, that is, problems that contain cyclic regularity when relating input to output. In domains without this cyclic nature, its performance reverted to being similar to that of XCS with a single-bit action. Importantly, XCSSMA evolved, for the first time, compact and human-readable general classifiers (i.e., solving any *n*-bit problems) for the even-parity and carry problem domains. It was also found that nonpredictive attributes make a problem harder to solve using computing action/prediction based classifier systems.

Future work includes designing a mechanism to compare two state machines semantically instead of syntactically, as at present, in order to fully enable the subsumption deletion process in XCSSMA. Further investigation on the effect of redundant states in a state machine action having a large number of states is also desired. It is anticipated that XCSSMA can be extended to learn sequence prediction and multistep tasks, as there are advanced forms of state machines (such as pushdown automata and Turing machines) that have a memory component.

## References

## Notes

^{1} Previously, XCSCFA showed better performance than standard numeric action based XCS in the problem domains used in this paper (Iqbal et al., 2013c).

^{2} If the classifier population size grows larger than the specified limit, then one of the classifier rules has to be deleted so that the new rule can be inserted.

^{3} Although the repeat-until loop in Algorithm 1 may take several iterations before a machine is created that has action value equal to *a*, it will not be an infinite loop.

^{4} Here, * can be 0, 1, or #.

^{5} The statistical tests were conducted using a calculator available on the website http://statistica.mooo.com/.

^{6} The sample rules presented in this work were obtained using only five states in state machine actions to save space and for better readability.

^{7} Assuming the reader understands how state machines work.