## Abstract

A number of representation schemes have been presented for use within learning classifier systems, ranging from binary encodings to artificial neural networks. This paper presents results from an investigation into using a temporally dynamic symbolic representation within the XCSF learning classifier system. In particular, dynamical arithmetic networks are used to represent the traditional condition-action production system rules to solve continuous-valued reinforcement learning problems and to perform symbolic regression, finding competitive performance with traditional genetic programming on a number of composite polynomial tasks. In addition, the network outputs are later repeatedly sampled at varying temporal intervals to perform multistep-ahead predictions of a financial time series.

## 1 Introduction

Traditionally, learning classifier systems (LCS; Holland, 1976) use a ternary encoding to generalise over the environmental inputs and to associate appropriate actions. A number of representations have previously been presented beyond this scheme, however, including real numbers (Wilson, 2000), fuzzy logic (Valenzuela-Rendón, 1991), and artificial neural networks (Bull, 2002b). Temporally dynamic representation schemes within LCS represent a potentially important approach, since such temporal behaviour is viewed as a significant aspect of artificial life, biological systems, and cognition in general (Ashby, 1952).

In this paper, we explore examples of a general dynamical system representation within XCSF (Wilson, 2001)—termed dynamical genetic programming (DGP; Bull, 2009). Traditional tree-based genetic programming (GP; Koza, 1992) has been used within LCS both to calculate the action (Ahluwalia and Bull, 1999) and to represent the condition (e.g., Lanzi and Perrucci, 1999). DGP uses a graph-based representation, each node of which is constantly updated in parallel, and evolved using an open-ended, self-adaptive scheme. We show that XCSF is able to solve a number of computational tasks using this temporally dynamic knowledge representation scheme.

## 2 Related Work

### 2.1 Genetic Programming in Learning Classifier Systems

A significant benefit of symbolic representations is the expressive power to represent complex relationships between the sensory inputs (Mellor, 2005). LISP S-expressions composed of a set of Boolean functions (i.e., AND, OR, and NOT) have been used to represent symbolic classifier conditions in LCS to solve Boolean multiplexer and woods problems (Lanzi and Perrucci, 1999), and to extract useful knowledge in a data mining application (Lanzi, 2001). An analysis of the populations (Lanzi et al., 2008) has subsequently shown an increasing prevalence of subexpressions through the course of evolution as the system constructs the required building blocks to find solutions. However, when logical disjunctions are involved, optimality is unattainable because the symbolic conditions highly overlap, resulting in classifiers sharing their fitness with other classifiers and thereby lowering the fitness values (Lanzi, 2007). Ioannides and Browne (2008) later extended this approach to further include arithmetic functions (i.e., PLUS, MINUS, MULTIPLY, DIVIDE, and POWEROF) as well as domain-specific functions (i.e., VALUEAT and ADDROF) to solve a number of multiplexer problems.

In addition, Lanzi (2003) based classifier conditions on stack-based genetic programming (Perkis, 1994) and solved the 6-bit and 11-bit Multiplexer as well as Woods1 problems. Here, the conditions are linear sequences of tokens, expressed in Reverse Polish Notation, where each token represents either a variable, a constant, or a function. The function set used was composed of Boolean operators (i.e., AND, OR, NOT and EOR) and arithmetic operators (i.e., +, −, >, =).

Ahluwalia and Bull (1999) presented a simple form of LCS which used numerical S-expressions for feature extraction in classification tasks. Here each rule's condition was a binary string indicating whether or not a rule matched for a given feature and the actions were S-expressions which performed a function on the input feature value. More recently, Wilson (2008) has explored the use of a form of gene expression programming (GEP; Ferreira, 2006) within LCS. Here the expressions are composed of arithmetic functions and applied to regression tasks. The conditions are represented as expression trees which are evaluated by assigning the environmental inputs to the tree's terminals, evaluating the tree, and then comparing the result with a predetermined threshold. Whenever the threshold value is exceeded, the rule becomes eligible for use as the output.

Forsyth (1981) with his BEAGLE system was the first to use a purely evolution-based form of LCS (Pittsburgh style, Smith, 1983) to evolve LISP S-expressions for classification tasks. Landau et al. (2001) used a Pittsburgh-LCS in which the rules are represented as directed graphs where the genotypes are tokens of a stack-based language, whose execution builds the labeled graph. Bit strings are used to represent the language tokens and are applied to non-Markov problems. The genotype is translated into a sequence of tokens and then interpreted similarly to a program in a stack-based language with instructions to create the graph's nodes, connections, and labels. Subsequently, the unused conditions and actions in the stack are added to the structure which is then popped from the stack. Tokens are used to specify the matching conditions and executable actions as well as instructions to construct the graph, and to manipulate the stack. The bit strings were later replaced with integer tokens and again applied to non-Markov problems (Landau et al., 2005).

### 2.2 Graph-Based Genetic Programming

Most relevant to the form of GP used herein is the relatively small amount of prior work on graph-based representations. Neural programming (NP; Teller and Veloso, 1996) uses a directed graph of connected nodes, each performing an arbitrary function. Potentially selectable functions include READ, WRITE, and IF-THEN-ELSE, along with standard arithmetic and zero-arity functions. Additionally, complex user defined functions may be used. Significantly, recursive connections are permitted and each node is executed with synchronous parallelism for some number of cycles before an output node's value is taken.

Poli (e.g., Pujol and Poli, 1998) presented a similar scheme wherein the graph is placed over a two-dimensional grid and executes its nodes synchronously in parallel. Connections are directed upward and are only permitted between nodes situated on adjacent rows; however, by including identity functions, connections between nonadjacent layers are possible and thus any parallel distributed program may be represented.

Teller and Veloso (1997) also presented parallel algorithm discovery and orchestration (PADO) which uses an arbitrary directed graph of nodes and an indexed memory. Each node in the graph consists of an action and a branch-decision component, with multiple outgoing branches permitting the various potential flows of control. A stack is used from where each program's inputs are drawn and the results are pushed. The potentially selectable actions are similar to NP and include arithmetic operators, negation, minimum and maximum, and the ability to read from and write to the indexed memory, along with nondeterministic and deterministic branching instructions. The graphs are executed chronologically for a fixed amount of time with each node selecting the next to take control. The output nodes are then averaged, giving additional weighting to the more recent states.

Other examples of graph-based GP typically contain sequentially updating nodes, for example, finite state machines (e.g., Fogel et al., 1965), Cartesian GP (Miller, 1999), genetic network programming (Hirasawa et al., 2001), linear-graph GP (Kantschik and Banzhaf, 2002), and graph structured program evolution (Shirakawa et al., 2007). Schmidt and Lipson (2007) have recently demonstrated a number of benefits from graph encodings over traditional trees, such as reduced bloat and increased computational efficiency.

## 3 XCSF Overview

In each learning cycle, XCSF (Wilson, 2001) generates a match set [M] from the population set [P], composed of all of the classifiers whose conditions match the current environmental input. In the event [M] is empty, covering is used to produce classifiers that match the current environmental state with random actions.

Subsequently, a system prediction is made for each action in [M], based upon the fitness-weighted average of all of the predictions of the classifiers proposing the action. If there are no classifiers in [M] advocating one of the potential system actions, covering is invoked to generate classifiers that both match the current environment state and advocate the relevant action. An action is then selected using the system predictions, typically by alternating exploring (by either roulette wheel or random selection) and exploiting (the best action). In multistep problems, a biased selection strategy is often employed wherein exploration is conducted at probability *p*_{explr}, otherwise exploitation occurs (Lanzi, 1999). An action set [A] is then built composed of all the classifiers in [M] advocating the selected action. Next, the action is executed in the environment and feedback is received in the form of a payoff, *P*.

In a single-step problem, [A] is updated using the current reward. The genetic algorithm (GA; Holland, 1975) is then run in [A] if the average time since the last GA invocation is greater than the threshold value, *θ*_{GA}. When the GA is run, two parent classifiers are chosen (typically by either roulette wheel or tournament selection) based on fitness. Offspring are then produced from the parents, usually by use of crossover and mutation. The offspring then have their payoff, error, and fitness set to the average of their parents’ values. If subsumption is enabled and an offspring is subsumed by either parent, it is not included in [P]; instead, the parent's numerosity is incremented. In a multistep problem, the previous action set [A]_{-1} is updated using a Q-learning (Watkins, 1989) type algorithm and the GA may be run as described above on [A]_{-1} as opposed to [A] for single-step problems. The sequence then loops until it is terminated.
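The fitness-weighted system prediction built from [M] can be sketched as follows; the dictionary field names (`action`, `prediction`, `fitness`) are illustrative, not XCSF's actual data structures:

```python
def system_prediction(match_set, actions):
    """Fitness-weighted average of classifier predictions, per action.

    match_set: list of dicts with (hypothetical) keys
    'action', 'prediction', and 'fitness'.
    Returns None for actions no classifier advocates, in which
    case XCSF would invoke covering.
    """
    pa = {}
    for a in actions:
        advocates = [cl for cl in match_set if cl["action"] == a]
        total_fitness = sum(cl["fitness"] for cl in advocates)
        if total_fitness > 0.0:
            pa[a] = sum(cl["prediction"] * cl["fitness"]
                        for cl in advocates) / total_fitness
        else:
            pa[a] = None  # empty niche: covering would fire here
    return pa
```

During exploitation the action with the highest (non-`None`) entry in this prediction array would then be selected.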

XCSF computes, rather than stores, each classifier's prediction. Each classifier maintains a vector of weights, one per environmental input plus one extra weight corresponding to a constant input, *x*_{0}. This extra weight is set as a constant value and is uniform across all classifiers in the population. That is, each classifier maintains a prediction (*cl.p*) which is calculated as a product of the environmental input (*s*_{t}) and the classifier weight vector (*w*):

$$cl.p(s_t) = cl.\vec{w} \cdot \vec{x}(s_t), \qquad \vec{x}(s_t) = \langle x_0, s_{t,1}, \ldots, s_{t,n} \rangle$$

The weights are then adjusted using a modified delta rule, where $|\vec{x}(s_t)|^2$ is the squared norm of the augmented input vector, *P* is the payoff, and $\eta$ is the correction rate:

$$\Delta w_i = \frac{\eta}{|\vec{x}(s_t)|^2}\big(P - cl.p(s_t)\big)\,x_i(s_t), \qquad w_i \leftarrow w_i + \Delta w_i$$
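XCSF's computed prediction and its normalised delta-rule weight update can be sketched as a minimal routine; the list-based representation and the convention of placing the constant input *x*_{0} first are assumptions of this sketch:

```python
def compute_prediction(weights, state, x0=1.0):
    """cl.p(s) = w . (x0, s): the input is augmented with the constant
    x0 so the first weight acts as an offset term."""
    x = [x0] + list(state)
    return sum(w * xi for w, xi in zip(weights, x))

def update_weights(weights, state, payoff, eta=0.2, x0=1.0):
    """Modified delta rule (NLMS): each weight moves in proportion to
    the prediction error, normalised by the squared input norm."""
    x = [x0] + list(state)
    norm_sq = sum(xi * xi for xi in x)
    error = payoff - compute_prediction(weights, state, x0)
    return [w + (eta / norm_sq) * error * xi
            for w, xi in zip(weights, x)]
```

With the correction rate set to 1.0 a single update drives the prediction exactly onto the payoff for the presented input, which is why smaller rates are used in practice to average over noisy payoffs.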

Giani et al. (1995) provide the first example of an LCS where the prediction is computed for each environment state, that is, the prediction can vary over the condition's domain. There, neural networks were used to compute the prediction values within a Pittsburgh LCS based on a Q-learning strategy. This enables a more accurate, piecewise linear approximation of the payoff (or function), as opposed to the standard piecewise constant approximation, and can also be applied to binary problems such as the Boolean multiplexer and maze environments, resulting in faster convergence to optimality. By computing the prediction, greater systemwide generalisations can also be formed, including within different payoff levels, potentially resulting in a more compact and general rule base. See Wilson (2002) for further details.

## 4 Dynamical Arithmetic Networks

The standard arithmetic operators (shown in Table 1) have become the default operators within genetic programming for regression tasks (Koza, 1992). The functions comprise the basic operational toolset within mathematics for transforming two numbers into a single value; because of this, most forms of genetic programming, whether tree-based (e.g., Koza) or graph-based (e.g., Miller and Thomson, 2000), use two fixed connections (i.e., *K*=2) to each node which act as inputs to be transformed by the receiving node's operator. In dynamical systems, *K*=2 has been identified as the critical regime, with higher connectivity resulting in increasing chaos (e.g., Kauffman, 1993). Significantly, arithmetic operators are unbounded, unlike fuzzy logic for example.

Table 1: The arithmetic function set.

ID | Function | Logic
---|---|---
0 | > | if(x>y) return 1.0; else return 0.0
1 | × | x×y
2 | + | x+y
3 | − | x−y
4 | / | x/y

Therefore, to incorporate an arithmetic dynamical genetic programming scheme within XCSF (hereinafter, aDGP-XCSF; see, e.g., Figure 1), here *K*=2 and each node performs one of the potentially selectable operations (from Table 1) before its state is capped at a minimum of −10,000.0 or a maximum of 10,000.0, which is necessary since the dynamical behaviour of the network could otherwise drive states toward positive or negative infinity. Finally, the introduction of constants can be achieved, as in traditional genetic programming, through the use of ephemeral random constants; however, following Miller and Thomson (2000) and Clegg et al. (2007), here we begin with only one selectable constant, of value 1.0.
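A single synchronous update of such a network, with the Table 1 operators and the ±10,000.0 state cap, might look like the sketch below. The negative-index convention for external inputs follows Section 5; the protected division (returning 1.0 on a zero divisor) is an assumption of this sketch, since the handling of zero divisors is not stated:

```python
OPS = {  # the Table 1 operator set
    0: lambda x, y: 1.0 if x > y else 0.0,       # >
    1: lambda x, y: x * y,                        # ×
    2: lambda x, y: x + y,                        # +
    3: lambda x, y: x - y,                        # −
    4: lambda x, y: x / y if y != 0.0 else 1.0,   # / (protected: assumption)
}
CAP = 10000.0

def step(states, functions, connections, inputs):
    """One synchronous update: every node reads its two connections
    (negative index = external input) and applies its operator; the
    resulting state is capped at +/-10000.0."""
    new = []
    for f, (a, b) in zip(functions, connections):
        x = inputs[-a - 1] if a < 0 else states[a]
        y = inputs[-b - 1] if b < 0 else states[b]
        new.append(max(-CAP, min(CAP, OPS[f](x, y))))
    return new
```

Because every node updates from the previous time step's states, iterating `step` *T* times realises the temporally dynamic behaviour described above.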

Figure 2 illustrates the fraction of nodes changing state over time within a synchronous 13-node network (where the results are an average of 100 randomly constructed networks).

## 5 Arithmetic DGP in XCSF

To use dynamical arithmetic genetic networks as the rules within XCSF, the following scheme is adopted. The population of classifiers is fully initialised at random. Each randomly created network initially consists of *N*_{init} nodes, each maintaining two randomly assigned connections; a connection is assigned to an external input (i.e., an input variable or constant) with 20% probability, and to another node within the network otherwise (80%), thus ensuring a consistent distribution of external connections as the number of nodes increases. In addition, each node is randomly assigned one of the aforementioned operators.
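The random initialisation just described can be sketched as below; the encoding (integer operator IDs, negative integers for external inputs) follows this section, while the function name is illustrative:

```python
import random

def random_network(n_init, n_inputs, n_ops=5, p_external=0.2):
    """Randomly construct a network: each node receives a random
    operator and two connections, external with probability 0.2
    (encoded as negative integers), internal otherwise."""
    functions, connections = [], []
    for _ in range(n_init):
        functions.append(random.randrange(n_ops))
        pair = []
        for _ in range(2):
            if random.random() < p_external:
                pair.append(-random.randrange(1, n_inputs + 1))  # external
            else:
                pair.append(random.randrange(n_init))            # internal
        connections.append(tuple(pair))
    return functions, connections
```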

Node states are initialised at random for the first step of a trial, but thereafter they are not reset between matching cycles. Matching consists of synchronously executing each rule for *T* cycles based on the current input. An extra matching node is also required to enable a network to (potentially) match only specific sets of inputs: if a given network has a value of less than 0.5 on the match node, regardless of the state of its outputs, the rule does not join the match set, [M]. During exploitation, the single rule with the highest prediction multiplied by accuracy is chosen as the system output. During exploration, a single rule is chosen under a prediction-proportionate scheme. Once a rule has been chosen, an action set, [A], is constructed, composed of all other matching rules whose output node states lie within a small predetermined range of the chosen network's output node state. Parameters are then updated as usual and the GA is executed in [A] during exploration. When covering is necessitated, a randomly constructed network is created and then executed for *T* cycles to determine the status of the match node; this procedure is repeated until a network is created that matches the environmental state.
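Construction of [M] under this scheme might be sketched as follows; `execute` stands in for running a rule's network for its own *T* cycles on the current input, and the convention that the match node is the final element of the returned states is an assumption of the sketch:

```python
def build_match_set(population, state, execute):
    """Construct [M]: run each rule's network on the current input and
    admit the rule only if its match node reaches at least 0.5.

    execute(rule, state) is assumed to return the final node states
    after the rule's own T synchronous cycles, match node last."""
    match_set = []
    for rule in population:
        final_states = execute(rule, state)
        if final_states[-1] >= 0.5:
            match_set.append(rule)
    return match_set
```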

Following Preen and Bull (2009), each rule has its own mutation rate, *μ*, stored as a real number and initially seeded uniform randomly in the range [0, 1]. Mutation only is used here, applied to each node's function and connectivity map at rate *μ*. A node's function is represented by an integer which references the appropriate operation to execute upon its received inputs (see Table 1 for the arithmetic operators used). Further, each node's connectivity is represented as a list of two integers, with positive integers referencing inputs to be received from other nodes in the network and negative integers referencing external inputs. Each integer in the list is subjected to mutation on reproduction at the self-adapting rate for that rule. Hence, within the representation, evolution can select different operators for each node within a given network rule, along with its connectivity map. The mutation rate is passed to a rule's offspring, which first adjusts the inherited rate using a Gaussian step, that is, *μ* ← *μ*·e^{N(0,1)}, before mutating the rest of the rule at the resulting rate. This is similar to the approach used in evolution strategies (ES; Schwefel, 1981), where the mutation rate is a locally evolving entity in itself, that is, it adapts during the search process. Self-adaptive mutation not only reduces the number of hand-tunable parameters of the evolutionary algorithm, it has also been shown to improve performance.
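One reading of this self-adaptive scheme is sketched below; the lognormal step *μ* ← *μ*·e^{N(0,1)} and the clamping of *μ* to [0, 1] are assumptions of this sketch:

```python
import math
import random

def self_adapt_and_mutate(genome, mu, n_ops=5, n_conn_targets=10):
    """Self-adaptive mutation sketch: the offspring first rescales its
    inherited rate with a lognormal step (mu <- mu * e^N(0,1)), then
    mutates each function and connection gene at the resulting rate."""
    mu = min(1.0, max(0.0, mu * math.exp(random.gauss(0.0, 1.0))))
    functions, connections = genome
    functions = [random.randrange(n_ops) if random.random() < mu else f
                 for f in functions]
    # Negative integers reference external inputs, non-negative ones
    # other nodes, matching the encoding described in Section 5.
    connections = [tuple(random.randrange(-n_conn_targets, len(functions))
                         if random.random() < mu else c for c in pair)
                   for pair in connections]
    return (functions, connections), mu
```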

Due to the need for a potentially different number of nodes within the rules for a given task, the DGP scheme is also of variable length. Within our system, once the functions and connections have been mutated, a new randomly connected node is either added, or the last added node is removed, with the same probability, *μ*; the latter case only occurs if the network currently consists of more than the initial number of nodes. Subsequently, the parameter *T* (i.e., the number of execution cycles) undergoes mutation. Each rule maintains its own *T* value, which is initially seeded randomly between 1 and 50; thereafter, offspring potentially increment or decrement *T* by 1 at probability *μ*, with *T* remaining bounded between 1 and 50. Thus, DGP is temporally dynamic both in the search process and the representation scheme. Traditional GP can be seen to primarily rely upon recombination to search the space of possible tree sizes, although the standard mutation operator effectively increases or decreases tree size also. Whenever an offspring classifier is created and no changes occur to its network under mutation, the parent's numerosity is increased and its mutation rate is set to that of the offspring.
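The variable-length and *T* mutations might be sketched as follows; treating add versus remove as equally likely once the single probability *μ* fires is one reading of the text:

```python
import random

def mutate_structure(functions, connections, T, mu, n_init,
                     T_min=1, T_max=50):
    """Variable-length mutation sketch: with probability mu, add a new
    randomly connected node or remove the last added one (removal only
    above the initial size); then, with probability mu, increment or
    decrement T by 1, bounded to [T_min, T_max]."""
    if random.random() < mu:
        if random.random() < 0.5:
            functions.append(random.randrange(5))
            connections.append((random.randrange(len(functions)),
                                random.randrange(len(functions))))
        elif len(functions) > n_init:
            functions.pop()
            connections.pop()
    if random.random() < mu:
        T = min(T_max, max(T_min, T + random.choice((-1, 1))))
    return functions, connections, T
```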

Furthermore, since XCSF computes the predicted value of a state–action pairing, each classifier maintains a vector of weights, with as many weights as there are inputs from the environment, plus one extra, *x*_{0}. This extra weight is set as a constant value and is uniform across all classifiers in the population. Each of the input weights is initially set to zero and subsequently adapted to accurately reflect the prediction using a modified delta rule, which provides a correction at each step proportional to the difference between the current and correct prediction, controlled by a correction rate, *η*. Following Wilson (2007), an extra weight is also included which receives as input the classifier's current action. In addition, here each offspring's weight vector is reset upon reproduction. Figure 3 shows an illustration of a rule generated whilst solving the sextic polynomial *f*(*x*)=*x*^{6}+2*x*^{4}+*x*^{2}. The rule has an error of 0 and an accuracy of 1 with an experience of 685, showing that it is a highly accurate rule; its fitness is only 0.118 since fitness is shared among classifiers in the same niche.

## 6 Experimentation

### 6.1 Reinforcement Learning

We begin experimentation using two well-known reinforcement learning problems, the real-multiplexer and the frog problem. The 6-bit real multiplexer problem provides a continuous-input discrete-output task as a first step to understanding the capabilities of aDGP-XCSF; it demonstrates the handling of a multivariate problem and enables the comparison with prior work. The frog problem provides a fully continuous-input and output reinforcement learning task and enables the exploration of the applicability of aDGP-XCSF to continuous reinforcement learning.

#### 6.1.1 6-Bit Real Multiplexer

The Boolean multiplexer problem consists of binary strings of length *l*=*x*+2^{x}, under which the *x* address bits index into the remaining 2^{x} bits, returning the value of the indexed bit. The real multiplexer problem (Wilson, 2000) is an extension of the Boolean multiplexer where the binary strings are replaced with randomly generated real-valued vectors in the range [0,1]. Each value in the vector is then interpreted as 0 if greater than a threshold value, *θ*, else 1. Similar to Wilson (2000), here *θ*=0.5. In this experiment only, the output node is discretised to 0 or 1 depending on its state being greater or less than 0.5. The actions of each network are then used to construct a prediction array as in the standard XCS approach. Each node state is also restricted to the range [−1,1]. Training alternates between explore and exploit trials; in each case, a random example is generated. Under explore trials, an action is chosen at random from within the matching set of rules [M], rules are updated, and the GA may fire. Under exploit trials, the GA is not used.
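A 6-bit real-multiplexer sample generator, following the thresholding described above, might look like the sketch below (the function name is illustrative):

```python
import random

def real_multiplexer(x_bits=2, theta=0.5):
    """Generate one 6-bit real-multiplexer example: a random real
    vector in [0,1] and its correct output. Each value is read as a
    binary digit via the threshold theta (following the text: 0 if
    greater than theta, else 1)."""
    length = x_bits + 2 ** x_bits          # 2 + 4 = 6 for the 6-bit case
    state = [random.random() for _ in range(length)]
    bits = [0 if v > theta else 1 for v in state]
    address = int("".join(map(str, bits[:x_bits])), 2)
    return state, bits[x_bits + address]   # value of the indexed "bit"
```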

From Figure 4a, it can be seen that optimal performance is achieved after approximately 100,000 trials. The learning speed is slower than XCSR (Wilson, 2000); however, here 100% performance is ultimately achieved, whereas with XCSR “performance reaches its maximum [at] approximately 98%” (Wilson, 2000). It is important to note that pure evolution such as that used here will, in general, require longer learning times than those able to exploit a statistical update procedure. However, here the representation brings benefits in terms of inherent memory (see, e.g., Preen and Bull, 2009) along with improved signal to symbol transformation and a number of other benefits, which are shown later.

Figure 4a also shows that prior to reaching optimality, the number of macro-classifiers increases from 500 to 1,000 and the average mutation rate declines from 45% to 3%; after 100,000 trials both values remain stable. Figure 4b shows that the average number of nodes increases marginally from 22 to 22.5, while the average value of *T* remains relatively stable throughout experimentation (30 to 28.5).

#### 6.1.2 Continuous-Action Frog Problem

In the continuous frog problem (Wilson, 2004, 2007), a frog attempts to jump toward a fly located at a distance *d* from the frog, where 0 ≤ *d* ≤ 1. The frog receives a sensory input, *x*(*d*)=1−*d*, before jumping a chosen distance, *a*, and receiving a reward based on its new distance from the fly, as given by:

$$\mathit{reward}(x, a) = \begin{cases} x + a & \text{if } x + a \le 1 \\ 2 - (x + a) & \text{otherwise} \end{cases}$$

In the continuous-action case, the frog may select any real number in the range [0,1] and thus the optimal achievable performance is 100%. Parameters are then updated and the GA executed as usual in [A]. Exploitation functions by selecting from [M] the single rule with the lowest error divided by fitness. The parameters used here follow Wilson (2004, 2007) and Tran et al. (2007), that is, *P*=2000 and *x*_{0}=1, with the remaining parameters at their settings in those studies. Only one output node is required and thus *N*_{init}=3.
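Assuming the piecewise-linear payoff of Wilson's frog problem, the reward can be sketched as:

```python
def frog_reward(x, a):
    """Frog-problem payoff sketch (after Wilson, 2007): with sensory
    input x = 1 - d and jump length a, the payoff peaks at 1.0 when
    the jump lands exactly on the fly (x + a = 1) and falls off
    linearly for under- and over-jumps."""
    return x + a if x + a <= 1.0 else 2.0 - (x + a)
```

For example, a frog at distance *d*=0.7 (so *x*=0.3) jumping *a*=0.7 receives the maximal reward, whereas jumping only *a*=0.2 receives 0.5.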

Wilson (2007) presented a form of XCSF where the action was computed directly as a linear combination of the input state and a vector of action weights, and conducted experimentation on the continuous-action frog problem, selecting the classifier with the highest prediction for exploitation. Tran et al. (2007) subsequently extended this by adapting the action weights to the problem through the use of an ES. In addition to the action weights, a vector of standard deviations is maintained for use as the mutation step size by the ES. During exploration, the ES is applied to each member of [A] to evolve the action weights and standard deviations, where each rule functions as a single parent producing an offspring via mutation; the offspring is then evaluated on the current environment state and its fitness updated and compared with the parent; if the offspring has a higher fitness, it replaces the parent, otherwise, it is discarded. Moreover, the exploration action selection policy was modified from purely random to selecting the action with the highest prediction. After reinforcement updates and running the ES, the GA is invoked using a combination of mixed crossover and mutation. They reported greater than 99% performance after an averaged number of 30,000 trials (*P*= 2,000), which was superior to the performance reported by Wilson (2007). More recently, Ramirez Ruiz et al. (2008) applied a fuzzy-LCS with continuous vector actions, where the GA only evolved the action parts of the fuzzy systems, to the continuous-action frog problem, and achieved a lower error than Q-learning (discretized over 100 elements in *x* and *a*) after 500,000 trials (*P*=200).

Figure 5 shows the performance of aDGP-XCSF on the continuous-valued frog problem. As can be seen, aDGP-XCSF attains greater than 99% performance after approximately 8,000 trials. This is an improvement on previously reported results.

### 6.2 Regression/Function Approximation

To adapt aDGP-XCSF for regression tasks, several modifications are necessary. A trial now consists of an input from a dataset of real numbers, followed by the construction of [M], receiving the correct answer from the dataset, updating all classifiers in [M], and then running the GA in [M]. Performance is measured as the absolute error between the answer and the action from the single network with the lowest error divided by fitness.

Following Stalph and Butz (2010), who found that increasing the number of classifiers reproduced at each GA invocation can increase the learning speed of XCSF in regression tasks, here a new parameter is introduced to control the number of offspring created from the two parents chosen through roulette wheel at each GA invocation, with two offspring per invocation being equal to traditional LCS. As can be seen in the following experiments, this was necessary to increase the amount of search performed, because the number of possible phenotypes represented by aDGP is extremely large. Furthermore, following neural-XCS for regression (Bull and O'Hara, 2002), MAM updating of inexperienced classifiers is disabled. The dataset used in the following experiments consists of 50 equally spaced real-valued numbers in the range [−1,1], and the parameters used are *P*=2000 and *N*_{init}=2, with the remaining parameters as before.
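The modified GA invocation with an increased offspring count might be sketched as follows; the dictionary-based classifier representation and the `mutate` callback are illustrative:

```python
import random

def run_ga(action_set, n_offspring, mutate):
    """GA invocation sketch: two parents are chosen by roulette wheel
    on fitness, then n_offspring children are produced by mutation
    (n_offspring = 2 recovers the traditional LCS setting)."""
    def roulette():
        total = sum(cl["fitness"] for cl in action_set)
        r = random.uniform(0.0, total)
        acc = 0.0
        for cl in action_set:
            acc += cl["fitness"]
            if acc >= r:
                return cl
        return action_set[-1]

    parents = [roulette(), roulette()]
    # Each offspring is a mutated copy of a randomly chosen parent.
    return [mutate(dict(random.choice(parents))) for _ in range(n_offspring)]
```

Raising `n_offspring` increases the amount of local search performed per payoff received, which the following experiments show to be beneficial for aDGP-XCSF.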

#### 6.2.1 Sextic Polynomial

The average absolute error using an unmodified GA (i.e., two offspring per invocation) remains above 0.1 after 500,000 trials, with only one in 10 experiments achieving an error below the target threshold (not shown). In contrast, with increased local search, the average error is reduced below the target after approximately 210,000 trials with 100 offspring created each GA invocation (also not shown), and approximately 125,000 trials with 250 offspring (see Figure 6a). In addition, the average time (in trials) taken to reach an error below the target with 100 offspring (*M*=103,890, *SD*=80,883, *N*=10) is significantly greater than with 250 offspring (*M*=33,520, *SD*=33,620, *N*=10) using a two-sample *t*-test assuming unequal variances, *t*(12)=2.54, *p*=.026, showing that aDGP-XCSF benefits from increased local search. As might be expected, the time to reach the target error is slower than standard XCSF, which does not use pure evolution, requiring around 7,600 trials to achieve an error below the target threshold. Figure 6a shows the average number of macro-classifiers initially declines from 2,000 to 1,450 over the first 50,000 trials before increasing and converging to around 1,800 after approximately 250,000 trials. Over the first 125,000 trials, where the average error is reduced below the target, the average mutation rate (also Figure 6a) declines rapidly from 50% to 6% and the average number of nodes in the networks (Figure 6b) grows from 2 to 22. Furthermore, from Figure 6b, it can be seen that the average value of *T* remains stable around 20 throughout experimentation.

#### 6.2.2 Quintic Polynomial

The quintic polynomial provides slightly less potentially exploitable regularity and modularity than the sextic previously considered (Koza, 1994). Figure 7a shows that the average absolute error of aDGP-XCSF is consistently below the target threshold after 160,000 trials. Again, this is slower than standard XCSF, which reaches the target error after approximately 1,200 trials. Figure 7a shows an initial decline in the average number of macro-classifiers over the first 50,000 trials from 2,000 to 1,400 before converging to around 1,850, similar to the sextic problem. The average mutation rate (Figure 7a) declines from 50% to 8% over the 160,000 trials in which a system error below the target is achieved, stabilising around 5% thereafter. The average number of nodes in the networks (Figure 7b) grows from 2 to 20 after 160,000 trials and continues to grow throughout the experiment, reaching 37 after 500,000 trials. The average value of *T* remains around 25 throughout (see also Figure 7b).

#### 6.2.3 Two-Composite Polynomial

Figure 8 shows the performance of aDGP-XCSF on the two-composite polynomial with the increased offspring count, while Figure 9 shows the performance of tree-based GP (*P*=10,000; MAX_LEN = 1,000; DEPTH = 5; CROSSOVER = 0.9; MUTATION PER NODE = 0.05) on the same problem. From Figure 8a, it can be seen that the average absolute error of aDGP-XCSF is consistently zero after approximately 200,000 trials. In contrast, tree-based GP attains a minimum average absolute error of 0.02, twice the target (i.e., sum of errors, 1.0, divided by dataset size, 50), after 500 generations (Figure 9a). Being generous to tree-based GP and assuming the average [M] size is equivalent to the entire population size (in reality it is closer to half), 125,000 trials correspond to 500 generations (both comprising 250 million evaluations); the average absolute error of aDGP-XCSF after 125,000 trials (*M*=0.006, *SD*=0.0062, *N*=10) is significantly less than that of tree-GP after 500 generations (*M*=0.02, *SD*=0.019, *N*=10) using a two-sample *t*-test assuming unequal variances, *t*(11)=−2.25, *p*=.0456. Standard XCSF reaches the target threshold after approximately 5,400 trials.

The average number of macro-classifiers used by aDGP-XCSF converges to around 1,850 after 250,000 trials while the average mutation rate declines to around 9% (Figure 8a). The average value of *T* utilised by the aDGP-XCSF networks remains around 30 throughout experimentation (Figure 8b). As might be expected (e.g., Schmidt and Lipson, 2007), the average number of nodes in the networks used by graph-based aDGP-XCSF (16 after 125,000 trials; see Figure 8b) is fewer than the average number of nodes used by tree-based GP (35,000 after 500 generations; see Figure 9b).

#### 6.2.4 Four-Composite Polynomial

Figure 10 shows the performance of aDGP-XCSF on the four-composite polynomial with a lower offspring count, while Figure 11 shows the performance of tree-based GP. A lower value is used here because the four-composite problem requires more niching, and larger values cause too much of the niche to be replaced through GA activity, resulting in performance spikes (not shown). From Figure 10a, it can be seen that the average absolute error of aDGP-XCSF is consistently below the target threshold after approximately 80,000 trials. In contrast, tree-based GP attains a minimum average absolute error of 0.02, twice the target (Figure 11a). Again, being generous to tree-based GP and assuming the average [M] size is equivalent to the entire population size, 125,000 trials correspond to 500 generations; the average absolute error of aDGP-XCSF after 125,000 trials (*M*=0.004, *SD*=0.0027, *N*=10) is significantly less than that of tree-GP after 500 generations (*M*=0.0197, *SD*=0.0173, *N*=10) using a two-sample *t*-test assuming unequal variances, *t*(9)=−2.82, *p*=.02. Standard XCSF reaches the target error after approximately 2,100 trials.

The average number of macro-classifiers (Figure 10a) rapidly decreases from 2,000 to 1,450 after 10,000 trials before steadily converging to around 1,750. The average mutation rate (also Figure 10a) declines from 50% to 13% over the first 80,000 trials while solutions are learned and then declines at a slower rate to around 6.5% after 500,000 trials. The average value of *T* utilised by the aDGP-XCSF networks remains around 26 throughout experimentation (Figure 10b). Similar to the two-composite function, the average number of nodes in the networks used by graph-based aDGP-XCSF (12 after 125,000 trials; see Figure 10b) is fewer than the average number of nodes used by tree-based GP (23,000 after 500 generations; see Figure 11b).

Figure 12 shows the matching classifiers on the four-composite polynomial problem where each classifier has an error less than 10% of (i.e., the 15 lowest error matching rules in the population). In addition, the composite function is plotted above, showing that XCSF correctly partitions the input space into four separate niches with distinct matching classifiers.

## 7 Look-Ahead Learning

### 7.1 Anticipatory LCS

Samuel (1959) showed that by generating an internal model of the environment, a system can make predictions about the expected consequences of various sequences of action, that is, it can look ahead. To incorporate future state predictions into LCS, Holland (1990) proposed that, in addition to a condition and action, each classifier also calculates the effect of performing the proposed action. Riolo (1990) extended this to perform such learning without external reinforcement, calculating the next state rather than the next reward, that is, latent learning. Stolzmann (1998) presented a heuristic-driven LCS (ACS) which uses the explicit next-state rule structure to build anticipatory models of the environment, where the accuracy of the rules' predictions is factored into their utility. Through anticipating the consequences of actions with the evolving model, system behaviour can adapt faster. Similarly, YACS (Gerard and Sigaud, 2001) performs the same anticipatory learning; however, it modifies the condition and effect separately with the goal of easing over-specialised conditions. Zatuchna and Bagnall (2005) incorporated memory within an ACS-like approach and found faster convergence to optimality in non-Markov mazes when compared with LCS using explicit memory. Holley et al. (2004) extended ACS to use incomplete information contained in the classifier list as a basis for an abstract world model in which to interact, or dream. They found that the abstract thread (or dream direction) can be used to cycle well-known states, resulting in fewer interactions with the environment to develop a confident model in a simple maze environment. With the goal of discovering new regularities, Gerard et al. (2005) proposed a version which included *don't know* symbols wherein a classifier may anticipate only a few attributes. Since a single classifier describes only a partial view of the next situation, the anticipating unit is composed of the entire LCS instead of a single rule. ACS was later extended to incorporate a GA for generalisation (ACS2), which resulted in improved performance (Butz and Stolzmann, 2002).

LCSs which use rule linkage over succeeding time steps (e.g., Tomlinson, 2001) may also implicitly build predictions of future states when the condition of a linked rule represents the next state. Bull (2002a) cast the internal model-building task as a single-step task within ZCS (Wilson, 1994), where reward is given only if a rule predicts the expected outcome of taking its action under the condition matched. Bull and Hurst (2003) explored a ZCS where each rule is embodied as a neural network with separate output nodes for each condition, action, and anticipation. In addition, O'Hara and Bull (2005) encompassed two neural networks within each XCS classifier (Wilson, 1995) to solve a number of discrete Markov mazes; one network was used to calculate the current matching condition and action, and a second (trained via backpropagation) to produce a description of the anticipated next state. More recently, Bull et al. (2007) found competitive performance using a simple array of perceptrons to provide the anticipation mappings. To date, all work on anticipatory LCS has considered only discrete-valued problems.

### 7.2 Multistep-Ahead Prediction Neural Networks

In addition to the common single-step-ahead prediction task, neural networks have been used to construct *H*-steps-ahead predictions. The various approaches to designing a multistep-ahead prediction (MSP) can be broadly categorised as either iterative or direct. The iterative approach is the oldest technique for MSP and involves iterating a one-step-ahead predictor *H* times, with the output fed back as input to produce the next step's prediction; that is, estimated values are used as inputs instead of actual observations, so a propagation of error is inherent, which may result in low performance. This is particularly significant on long-horizon tasks because the models are tuned with one-step criteria and consequently do not take the temporal behaviour into account appropriately. Typically, recurrent neural networks are used in iterative approaches (e.g., Williams and Zipser, 1989). To correct the propagation of error during training in dynamical supervised learning tasks, the network output is frequently replaced with the corresponding desired response (i.e., the target signal) wherever one is available for the subsequent computation of the network's dynamic behaviour, that is, teacher forcing (Williams and Zipser, 1989).
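The iterative scheme just described can be sketched as follows, assuming only that `model` is some one-step-ahead predictor over a fixed-length window (a hypothetical interface, not a predictor from the paper):

```python
def iterated_forecast(model, history, horizon):
    """Iterative multistep-ahead prediction: apply a one-step-ahead
    predictor H times, feeding each prediction back in as input, so any
    one-step error propagates through the remaining horizon.
    (During training, teacher forcing would substitute the known target
    for the fed-back prediction wherever one is available.)"""
    window = list(history)
    predictions = []
    for _ in range(horizon):
        y = model(window)              # one-step-ahead prediction
        predictions.append(y)
        window = window[1:] + [y]      # estimate replaces an observation
    return predictions
```

For example, a persistence model `lambda w: w[-1]` simply repeats the last observation across the whole horizon, while a linear-trend model extrapolates it step by step.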

Direct approaches include the use of multiple prediction models (i.e., *H* networks) and multiple-input multiple-output (MIMO) models (i.e., a vector of outputs). Estimating *H* prediction models entails much higher functional complexity than an iterated approach. In addition, direct models learned independently induce a conditional independence between the estimators, preventing the technique from considering complex dependencies between the variables and consequently biasing the prediction accuracy (Ben Taieb et al., 2010). Multivariate response prediction uses *H* output nodes per network, "with the goal of preserving, among the predicted values, the stochastic dependency characterising the time series" (Ben Taieb et al., 2010, p. 1950); that is, the relationship between future values is captured. However, MIMO approaches can suffer from too tight a coupling among the outputs (Huang and Lian, 2000) and require considerable training time (Selvaraj et al., 1995).

## 8 Arithmetic DGP-XCSF Look Ahead Learning

The DGP computation of the first step-ahead prediction remains as before, that is, each network is processed for *T* cycles before sampling the match and output nodes. Further predictions are computed by iteratively sampling the output nodes each subsequent *W* cycles, that is, each matching network is processed a total of *T*+(*W*(*H*−1)) cycles. To maintain uniform processing of all networks, after the classifiers have been updated and the GA (potentially) run, each matching network's nodes are reset to the final states after computing the first step prediction (i.e., the state of the network after *T* cycles).
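The sampling schedule can be sketched as below, where `step` stands for one synchronous update of all node states (an illustrative simplification: real aDGP networks carry separate match and output nodes, and the reset happens after the classifier updates and GA):

```python
def multistep_sample(step, state, T, W, H):
    """Sample a dynamical network for an H-steps-ahead forecast: run T
    cycles for the first prediction, then sample every further W cycles,
    i.e. T + W*(H-1) cycles in total.  Also returns the post-T state so
    the network can later be reset to it, as described above."""
    for _ in range(T):
        state = step(state)
    predictions = [state]          # first step-ahead prediction
    reset_state = state            # state restored after the trial
    for _ in range(H - 1):
        for _ in range(W):
            state = step(state)
        predictions.append(state)
    return predictions, reset_state
```

With a toy counter network (`step = lambda s: s + 1`), *T* = 3, *W* = 2, and *H* = 3, the samples fall at cycles 3, 5, and 7, i.e. *T* + *W*(*H* − 1) = 7 cycles in total.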

To perform an *H*-steps-ahead online forecast, each classifier here maintains an *H* × *H* matrix representing the last *H* *H*-step predictions. In addition, the system as a whole stores the previous *H* match sets. Upon receiving the current time step's environmental input, *P*(*t*_{0}), each of the stored match sets performs a single update similar to traditional XCSF, where each classifier in [M]_{-i} is updated based on the absolute error between the current input and the classifier's *i*th-step forecast (see Figure 13 for an example where *H* = 4). In this way, the *t*_{-1} one-step-ahead forecasts are updated, the *t*_{-2} two-step-ahead forecasts are updated, and so on. In addition, the previous time-delayed inputs are embedded to create a fixed-length memory buffer, presenting the inputs simultaneously.
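The rolling scoring of past forecasts can be sketched as follows, with a deque of the last *H* prediction vectors standing in for the stored match sets; `observe` returns the absolute errors that would drive the usual XCSF classifier updates (an illustrative simplification, not the full update):

```python
from collections import deque

class RollingForecasts:
    """Keep the last H H-step prediction vectors; when a new observation
    P(t0) arrives, the vector recorded i steps back is scored on its
    i-th-step forecast, so every horizon gets one update per time step."""

    def __init__(self, H):
        self.H = H
        self.past = deque(maxlen=H)    # most recent vector last

    def record(self, predictions):
        assert len(predictions) == self.H
        self.past.append(list(predictions))

    def observe(self, value):
        # i = 1 is the previous step's one-step-ahead forecast, etc.
        return [abs(value - preds[i - 1])
                for i, preds in enumerate(reversed(self.past), start=1)]
```

For *H* = 2, after recording [1.0, 2.0] and then [3.0, 4.0], observing 5.0 scores the one-step-ahead forecast 3.0 (error 2.0) and the two-step-ahead forecast 2.0 (error 3.0).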

We compare this approach with an XCSF using *H* weight vectors to perform direct learning of future steps. The condition is composed of interval pairs; when the GA is invoked, each offspring's interval is mutated by adding μ × *Random Number*(), where *Random Number*() is a real-valued random number in the range [−1, 1] and μ is drawn from the same self-adaptive mutation procedure used within DGP; no macro-classifiers are used. To provide the supervised update, the past *H* state vectors are also maintained by the system.
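A sketch of the interval mutation just described; `mu` stands for the offspring's self-adapted mutation rate, and the well-formedness repair at the end is an assumption, as the paper does not state how inverted bounds are handled:

```python
import random

def mutate_interval(lower, upper, mu):
    """Perturb each bound of an interval predicate by mu * RandomNumber(),
    where RandomNumber() is a real-valued random number in [-1, 1]."""
    lo = lower + mu * random.uniform(-1.0, 1.0)
    hi = upper + mu * random.uniform(-1.0, 1.0)
    if lo > hi:                      # keep the interval well-formed (assumed)
        lo, hi = hi, lo
    return lo, hi
```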

## 9 Experimentation

The simple moving average (SMA) is the unweighted mean of the previous *N* data points, that is, *SMA* = (*P*_{1} + *P*_{2} + ⋯ + *P*_{N})/*N*, where *P* is the price being averaged and *N* is the number of days in the moving average. Moreover, the SMA can be applied to price proxies such as the *typical price* (i.e., (*P*_{high} + *P*_{low} + *P*_{close})/3) as well as other mathematical technical indicators. The problem with an SMA is that whilst it affords excellent price smoothing, it is a very lagging indicator, with the specified period length (*N*) proportional to the time lag in its signal. Furthermore, each data value counts twice in the calculation: once initially, as the new information is added to the average, and again at the end, when the value is removed in order to make way for the new information. This double counting of data values can result in the average either rising or falling despite the most recent data being contrary.

The exponential moving average (EMA) weights recent prices more heavily, that is, *EMA*_{today} = *K* × *P*_{today} + (1 − *K*) × *EMA*_{yesterday}, where *K* = 2/(*N* + 1), *N* is the number of days in the EMA, *P*_{today} is today's price, and *EMA*_{yesterday} is the EMA of yesterday.
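Under the definitions above, both averages are a few lines of code; seeding the EMA with the first price is a common convention that the source does not specify:

```python
def sma(prices, n):
    """Simple moving average over the last n prices (defined for t >= n-1)."""
    return [sum(prices[t - n + 1:t + 1]) / n
            for t in range(n - 1, len(prices))]

def ema(prices, n):
    """Exponential moving average with K = 2/(N+1), seeded (by assumption)
    with the first price."""
    k = 2.0 / (n + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(k * p + (1 - k) * out[-1])
    return out
```

Either function can equally be fed the typical price, (*P*_{high} + *P*_{low} + *P*_{close})/3, in place of raw closes.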

Beyond the SMA and EMA, there have been numerous research studies within the field of digital signal processing seeking to reduce the lag and improve the smoothness of the averages, for example, the adaptive moving average (AMA; Kaufman, 1995). The central premise of AMA is that in fast trending markets, a smaller *N* should be used to calculate the average to maintain a low lag, and in slower moving markets, a larger value should be used to maintain smoothness.
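Kaufman's AMA realises this premise by scaling the smoothing constant with an efficiency ratio (net price change over total path length across the last *n* bars); the following is a standard formulation sketch, not the paper's implementation, and the seeding and default parameters are assumptions:

```python
def kama(prices, n=10, fast=2, slow=30):
    """Kaufman's adaptive moving average: the efficiency ratio scales the
    smoothing constant between fast and slow EMA constants, so the average
    tracks closely in trends and smooths heavily in noisy markets."""
    fast_sc, slow_sc = 2.0 / (fast + 1), 2.0 / (slow + 1)
    out = list(prices[:n])               # seed with the raw prices (assumed)
    for t in range(n, len(prices)):
        change = abs(prices[t] - prices[t - n])         # net movement
        volatility = sum(abs(prices[i] - prices[i - 1])  # total path length
                         for i in range(t - n + 1, t + 1))
        er = change / volatility if volatility else 0.0
        sc = (er * (fast_sc - slow_sc) + slow_sc) ** 2   # adaptive constant
        out.append(out[-1] + sc * (prices[t] - out[-1]))
    return out
```

On a perfectly trending series the efficiency ratio is 1 and the average follows the price at the fast constant; on a choppy, mean-reverting series it collapses towards the slow constant.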

The target time series for our experiments is the single most liquid financial instrument in the world, trading approximately $1 trillion a day, the Euro/U.S. dollar currency pair (EURUSD). Since financial time series are widely acknowledged as being extremely noisy, instead of forecasting the future values of the series directly, here we use an EMA as a simple smooth price proxy with which to predict the trend. In addition, the EMA is calculated using the typical price instead of the closing price since currency markets are open 24 hr per day and there is therefore no psychological importance to the closing price. Further, different brokers around the world calculate the closing price slightly differently depending on the local time shift.

Figure 14 illustrates the typical price and the 50-day EMA of EURUSD used for experimentation. Initial learning is conducted by looping over the training set (i.e., the first 179 days, in sequence) for 500 iterations. Thereafter, a single pass of the conjoining (98 day) test set is conducted (including classifier updates and GA invocation) to evaluate the performance on unseen data in an online manner. Table 2 describes the training and testing periods. In each of the following experiments, the results are an average of 10 runs.
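The protocol can be sketched as follows; `system.step` is a hypothetical interface standing in for one full aDGP-XCSF trial (matching, prediction, update, and possible GA invocation) that returns the trial's absolute error:

```python
def run_experiment(system, train, test, epochs=500):
    """Loop the training series, in sequence, for `epochs` iterations, then
    make a single online pass over the unseen test series (learning stays
    switched on, as described above); returns the average test error."""
    for _ in range(epochs):
        for x in train:
            system.step(x, learn=True)
    errors = [system.step(x, learn=True) for x in test]  # online evaluation
    return sum(errors) / len(errors)
```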

Table 2: Descriptive statistics of the training and test sets.

| Statistic | Training Set | Test Set |
|---|---|---|
| Mean | 1.32059902 | 1.399067279 |
| Standard error | 0.001389333 | 0.002176917 |
| Median | 1.31706851 | 1.399351445 |
| Standard deviation | 0.018588011 | 0.021550375 |
| Sample variance | 0.000345514 | 0.000464419 |
| Kurtosis | −0.744922569 | −0.824531903 |
| Skewness | 0.21951874 | 0.071625332 |
| Range | 0.07788101 | 0.08572053 |
| Minimum | 1.28736279 | 1.35558296 |
| Maximum | 1.3652438 | 1.44130349 |
| Sum | 236.3872246 | 137.1085933 |
| Count | 179 | 98 |


### 9.1 Two-Step-Ahead Prediction

In each of the following experiments, *P* = 10,000, *N*_{init} = 20, and three ephemeral random constants in the range [−0.001, +0.001] are used. For XCSF, *x*_{0} = 1.0.

From Figure 15a, it can be seen that after 500 iterations of the training set aDGP-XCSF with achieves an average absolute error over a two-step-ahead prediction of . With additional embedded inputs, aDGP-XCSF can learn faster and with more accurate solutions, where (Figure 15b) achieves an equivalent error of (). This is confirmed after a single evaluation of the test set with average error for , and for ().

In comparison, the average absolute errors after 500 training iterations of aDGP-XCSF with and are significantly less than XCSF with (Figure 16a; and respectively). Adding extra memory to XCSF (i.e., ; see Figure 16b) resulted in no statistical difference after 500 training iterations. However, over the test set, XCSF with () is significantly less than XCSF with (), . By comparing XCSF and aDGP-XCSF over the test set, it can be seen that aDGP-XCSF with both and achieve lower errors than XCSF and these are statistically significant ( and ).

### 9.2 Five-Step-Ahead Prediction

To perform the next five-steps-ahead prediction, identical parameters are used, except . From Figure 17a, it can be seen that after 500 iterations of the training set, aDGP-XCSF with achieves an average absolute error over the five steps of . Again, with additional embedded inputs, aDGP-XCSF performance is improved, where (Figure 17b) achieves an equivalent error of (). However, over the test set, there is no statistically significant difference (with achieving and achieving average errors).

The average absolute errors after 500 training iterations of aDGP-XCSF with and are significantly less than XCSF with (Figure 18a; and , respectively). Similar to the two-step-ahead prediction, adding extra memory to XCSF (; Figure 18b) resulted in no statistical difference after 500 training iterations. However, over the test set, XCSF with () is significantly less than XCSF with (), . By comparing XCSF and aDGP-XCSF over the test set, it can be seen that aDGP-XCSF with both and achieve lower errors than XCSF, and these are statistically significant ( and ).

## 10 Conclusions

This paper has explored dynamical genetic programming (DGP), a temporally dynamic graph-based representation, within the XCSF LCS. The DGP syntax presented consists of each node receiving two inputs from an unrestricted topology and then performing an arbitrary function. The representation is evolved under a self-adaptive, open-ended scheme, allowing the topology to grow to any size to meet the demands of the problem space. The collective mechanics of dynamical arithmetic networks have been shown to be exploitable in solving continuous-valued input-output reinforcement learning problems, with performance superior to that previously reported on the frog problem.

The GA was modified to produce an increased number of offspring on each invocation, beyond the traditional two. This modification was found to provide an intensified local search that significantly benefited DGP, which encapsulates a large number of phenotypes. XCSF, utilising rules composed of arithmetic genetic networks, was then shown capable of optimal performance for symbolic regression on a number of polynomial functions, and significantly superior performance and more compact solutions were found on a number of composite polynomial functions when compared with a benchmark tree-based genetic programming scheme.

Finally, it has been shown possible to exploit the collective emergent behaviour of ensembles of dynamical arithmetic networks to perform a multistep-ahead prediction of the EURUSD 50-day EMA, where the average error across the predictions was significantly less than XCSF with interval predicates.

The traditional approach of iterating a one-step prediction typically results in a low-performance model for multistep-ahead prediction. Direct approaches suffer from an inability to capture the stochastic dependency between future predictions, and multiple-input multiple-output models suffer from too tight a coupling between the outputs. Here we have shown that by iterating each matching network, sampling the outputs every *W* cycles, and providing online reinforcement, dynamical genetic networks can provide a multistep-ahead prediction. The symbolic nature of the representation enables the modelling of relationships between sensory inputs, here between the antecedent inputs. In addition, interval conditions assume that patterns will reoccur over the exact same state space and cannot generalise outside of the trained space, whereas symbolic conditions can learn to match based on the shape of the temporal dynamics.

It should be noted that, with the increased capability resulting from more complex representations, the time to convergence can be dramatically slower than with standard XCSF. The performance found by DGP, however, is generally similar to that of MLP-based neural XCSF, yet with an improved signal-to-symbol transformation. Future work will continue to explore and identify problems wherein the additional expressiveness of DGP is most beneficial, such as those requiring memory and temporal prediction.

## References
