## Abstract

A reflex is a simple closed-loop control approach that tries to minimize an error but, by definition, always reacts too late. An adaptive algorithm can use this error to learn a forward model with the help of predictive cues. For example, a driver learns to improve steering by looking ahead so as to avoid last-moment corrections. In order to process complex cues such as the road ahead, deep learning is a natural choice. However, this is usually achieved only indirectly, by employing deep reinforcement learning with its discrete state space. Here, we show how deep learning can be embedded directly into a closed-loop system while preserving its continuous processing. We show specifically how error backpropagation can be achieved in z-space and, in general, how gradient-based approaches can be analyzed in such closed-loop scenarios. The performance of this learning paradigm is demonstrated using a line follower, both in simulation and on a real robot, which shows very fast and continuous learning.

## 1 Introduction

Reinforcement Learning (Sutton & Barto, 1998) has enjoyed a revival in recent years, significantly surpassing human performance in video games (Deng et al., 2009; Guo, Singh, Lee, Lewis, & Wang, 2014). Its success is owed to a combination of variants of Q-learning (Watkins & Dayan, 1992) and deep learning (Rumelhart, Hinton, & Williams, 1986). This approach is powerful because deep learning is able to map large input spaces, such as camera images or pixels of a video game, onto a representation of future rewards or threats, which can then inform an actor to create actions so as to maximize such future rewards. However, its speed of learning is still slow, and its discrete state space limits its applicability to robotics.

Classical control, on the other hand, operates in continuous time (Phillips & Harbor, 2000), which potentially offers solutions to the problems encountered in discrete action spaces. Adaptive approaches in control develop forward models where an adaptive controller learns to minimize an error arising from a fixed feedback controller (e.g., a proportional-integral-derivative (PID) controller), called a “reflex” in biology. This has been shown to work for simple networks (Klopf, 1986; Verschure & Coolen, 1991), where the error signal from the feedback loop successfully trains forward models of predictive (reflex) actions. In particular, there is a rich body of work in the area of movement control and experimental psychology where a subject needs to work against a disturbance—for example, pole balancing with hierarchical sensory predictive control (HSPC) (Maffei, Herreros, Sanchez-Fibla, Friston, & Verschure, 2017) or grasping of different objects (Haruno, Wolpert, & Kawato, 2001). One of the earliest models is feedback error learning (FEL), where the error is simply the control output of the feedback controller itself, which then computes a forward model by using a “distal” or earlier signal such as the impact or a cue (Miyamoto, Kawato, Setoyama, & Suzuki, 1988). However, based on biological anticipatory control, there is mounting evidence that the brain also predicts future perceptual events and monitors its task performance in this way (Maffei et al., 2017; Popa & Ebner, 2018), which we will also honor in this work. Nevertheless, both HSPC (Maffei et al., 2017) and FEL have the drawback that they employ only single-layer networks, where an error signal trains neurons with the help of a heterosynaptic learning rule.

In a more technical context, such a network, also employing sensor predictions, was able to improve the steering actions of a car, where a nonoptimal hard-wired steering is quickly superseded by a forward model based on camera information of the road ahead (Porr & Wörgötter, 2006; Kulvicius, Porr, & Wörgötter, 2007). Such learning is close to one-shot learning in this scenario because at every time step, the error signal from the PID controller is available and adjusts the network (Porr & Wörgötter, 2006). In these learning paradigms, the error signal is summed with the weighted activations of neurons to generate an action command for both the reflex and the learning mechanism (Porr & Wörgötter, 2006). This has the advantage that the error signal also has a behavioral meaning, but the immediate summation of the error with the activations results in a loss of information, which strongly constrains the system and prevents its extension to deeper structures. This is reflected in the dedicated chained architecture of Kulvicius et al. (2007), which shows that the design of the network topology is constrained by the merging of the error signal with the activation. Thus, so far, these fast-learning correlation-based networks could not easily be scaled up to arbitrary deeper structures and consequently had limited scope.

A natural step is to employ deep learning (Rumelhart et al., 1986) instead of a shallow network to learn a forward model. If we directly learn a forward model with the deep network mapping sensor inputs to actions, then we no longer need a discrete action space. This allows potentially much higher learning rates because the error feedback will be continuous as well. In order to achieve this, we need to define a new cost function for our deep network, defined within the closed-loop framework, benchmarking the forward model against a desired output.

In this letter, we present a new approach for the direct use of deep learning in a closed-loop context, where it learns to replace a fixed feedback controller with a forward model. We follow the line of argumentation by Maffei et al. (2017) that anticipatory actions can be controlled by predictive sensory signals and that these are superior to motor anticipation because they can generate those actions based on predictive sensory cues alone. We show in an analytical way how to use the Laplace/z-space to solve backpropagation in a closed-loop system. We then apply the solution first to a simulated line follower and then to a real robot, where a deep network quickly learns to replace a simple fixed PID controller with a forward model.

## 2 The Learning Platform

Hence, the aim of the learning loop is to fend off $D$ before it has disturbed the state of the robot. To that end, this loop receives $D$ via the predictive environment $Q_P$ and in advance of the reflex loop. This provides the learning unit with predictive signals $P_i$, and given its internal parameters $\omega$, a predictive action is generated as $A_P = N(P_i, \omega)$.

During the learning process, $A_P$, combined with $A_R$ and $D z^{-T}$, travels through the reflex loop, and $E_c$ is generated. This error signal provides the deep learner $N$ with minimal instructive feedback. Upon learning, $A_P$ fully combats $D$ on its arrival at the reflex loop (i.e., $D z^{-T}$); hence, the reflex mechanism is no longer evoked and $E_c$ is kept at zero.
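As a minimal sketch of this loop (a toy 1-D example, not the robot implementation: the single-weight linear "network", the sinusoidal disturbance, and the learning rate are all assumptions), a learner that sees $D$ in advance can cancel it so that $E_c$ vanishes:

```python
import numpy as np

# Toy closed-loop learning step: the disturbance D is seen in advance as a
# predictive cue P, and a single-weight linear "network" learns the
# predictive action A_P = N(P, w) that cancels D at the reflex loop.
w = 0.0                       # learner weight (hypothetical initial value)
eta = 0.1                     # learning rate (hypothetical value)
errors = []
for t in range(500):
    P = np.sin(0.1 * t)       # predictive cue: D observed ahead of the loop
    D = P                     # the same disturbance later hits the reflex loop
    A_P = w * P               # predictive action
    E_c = -(D + A_P)          # closed-loop error left over for the reflex
    w -= eta * E_c * (-P)     # gradient step on the squared-error cost
    errors.append(abs(E_c))
# Upon learning, w -> -1, A_P fully combats D, and E_c stays at zero.
```

After a few hundred steps the weight converges to $-1$ and the error stays near zero, mirroring the condition above that the reflex mechanism is no longer evoked.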

## 3 Closed-Loop Dynamics


*Learning* entails the adjustment of the internal parameters $\omega$ of the learning unit so that $E_c$ is kept at zero. To that end, the closed-loop cost function $C_c$ is defined as the square of the magnitude of $E_c$:
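In symbols (a reconstruction consistent with the description above, with $\bar{E}_c$ denoting the complex conjugate of the z-space signal $E_c$):

```latex
C_c = \left| E_c \right|^{2} = E_c \, \bar{E}_c
```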

## 4 Toward Closed-Loop Error Backpropagation

## 5 The Inner Workings of the Deep Learner

Having explored the closed-loop dynamics, we now focus on the inner workings of the learning unit. The latter partial derivative in equations 3.4 and 4.2, termed the network gradient $G_N$, is based purely on the inner configuration of the learning unit, which in this work is a deep neural network (DNN) trained with backpropagation (BP). Given that the network is situated in the closed-loop platform, its dynamics are expressed in z-space.

^{2} $L$ and $I$ denote the total number of hidden layers and the total number of neurons in the $\ell$th layer, respectively. $\omega_{ij}^{\ell}$ denotes the weights of the neurons in the z-domain, which are treated as constants since their rate of change in the time domain is considerably slower. We can formulate the network gradient $G_N$ with respect to specific weights of the network using equation 5.1.
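To make the network gradient concrete, the following sketch computes $G_N = \partial A_P / \partial \omega_{ij}^{\ell}$ for a small fully connected network by standard backpropagation in the time domain (not the authors' z-space C++ implementation; the layer widths and tanh activation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 3, 3, 1]                       # hypothetical layer widths
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(Ws, x):
    """Forward pass; returns the activations of every layer."""
    acts = [x]
    for W in Ws:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

def network_gradient(Ws, x):
    """Backpropagation: dA_P/dW for every layer, output to input."""
    acts = forward(Ws, x)
    grads = []
    delta = 1.0 - acts[-1] ** 2            # derivative of tanh at the output
    for W, a in zip(reversed(Ws), reversed(acts[:-1])):
        grads.append(np.outer(delta, a))   # dA_P/dW for this layer
        delta = (W.T @ delta) * (1.0 - a ** 2)
    return list(reversed(grads))

x = rng.normal(size=4)
grads = network_gradient(Ws, x)            # G_N, one matrix per layer
```

A finite-difference perturbation of any single weight reproduces the backpropagated value, which is the consistency the closed-loop chain rule relies on.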

This concludes the derivation and formulation of our closed-loop deep learning (CLDL) paradigm. It is worth noting that CLDL is an online learning platform where the robot learns while driving and navigating through the environment. This is fundamentally different from conventional offline learning, where an agent is trained first and merely recalls the trained information when in use.

## 6 Results

The performance of our CLDL paradigm is tested using a line follower in simulation and through experiments with a real robot. The learning paradigm was developed into a bespoke low-level C++ external library (Daryanavard & Porr, 2020a). The transfer function of the reflex loop, $T_R$, resulting from equation 3.5, is set to unity for the results.

### 6.1 Real Robot Experiments

#### 6.1.1 Robot Configuration

#### 6.1.2 Closed-Loop Error

#### 6.1.3 Predictors

#### 6.1.4 Filter-Bank

The predictive signals are filtered so as to introduce the correct delay for optimum correlation with the closed-loop error signal. The specifications of these filters depend on environmental parameters and are obtained through a simple experiment. The robot is placed on a straight path with a disturbance ahead (a bend), shown as a star sign in Figures 3A and 3B. The robot is switched on and moves forward with a constant velocity of $V_0 = 5\,\mathrm{cm/s}$ with the steering ability deactivated. The disturbance first appears at position (a) at time $t_0$ and is sensed by the predictor farthest from the robot, $P_{far}$. The disturbance next appears at position (b) and is picked up by the predictor nearest the robot, $P_{near}$, at time $t_1$. Finally, at position (c), the disturbance is sensed by the light sensors, which generate an error signal $E_c$ at time $t_2$. These signals, $P_{far}$, $P_{near}$, and $E_c$, are shown on a timeline in Figure 3C. In order to obtain an optimum correlation between the predictors and the error signal, a maximum delay of $T_{max} = t_2 - t_0$ and a minimum delay of $T_{min} = t_1 - t_0$ are needed. These time delays are determined by the number of samples between the events; given the sampling rate of 33 Hz, we find that $T_{max} = 0.4\,\mathrm{s}$ and $T_{min} = 0.2\,\mathrm{s}$, as shown in Figure 3C. Thus, a bank of five second-order low-pass filters, $F_B$, is designed with damping coefficients of $Q = 0.51$ and impulse responses with peaks ranging from 0.2 to 0.4 seconds.
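The filter bank can be sketched as follows (the discretization is an assumption: with $Q = 0.51 \approx 0.5$ the filters are near critically damped, so each impulse response is approximately $h(t) \propto t\,e^{-\omega_0 t}$, which peaks at $t = 1/\omega_0$):

```python
import numpy as np

# Sketch of the filter bank F_B: five second-order low-pass filters whose
# impulse responses peak between T_min = 0.2 s and T_max = 0.4 s.
fs = 33.0                                  # sampling rate from the text, Hz
t = np.arange(0, 2.0, 1.0 / fs)            # 2 s of impulse response
peaks = np.linspace(0.2, 0.4, 5)           # desired peak times, seconds

bank = []
for T in peaks:
    w0 = 1.0 / T                           # peak of t*exp(-w0*t) is at t = 1/w0
    h = (w0 ** 2) * t * np.exp(-w0 * t)    # near-critically-damped response
    bank.append(h / h.sum())               # normalize to unit DC gain

# Each predictive signal p would then be convolved with every filter:
# filtered = [np.convolve(p, h)[:len(p)] for h in bank]
```

This spreads each predictor over the delay range $T_{min}$ to $T_{max}$, so that some filtered copy of the cue coincides with the arrival of the error signal.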

#### 6.1.5 CLDL Algorithm

Although the line-following task may not require the power of a deep neural network, it serves here purely to benchmark the practicality and flexibility of our CLDL algorithm for use in both shallow (as will follow in the simulations section) and deep neural networks. An increase in the number of hidden layers is often associated with vanishing and exploding gradients that hinder learning and adversely affect the performance of the network (Pascanu, Mikolov, & Bengio, 2013; Bengio, Simard, & Frasconi, 1994; Bengio, Frasconi, & Simard, 1993). The issues that emerge from these gradients are often resolved by deliberate normalization of the weights and inputs, as well as careful manipulation of the computational units (neurons) (Pascanu et al., 2013). In this work, we aim to present a convincing and authentic benchmark without reliance on such manipulation. Thus, we have experimentally arrived at a square-like structure for the deep network (11 neurons in each of 11 hidden layers) that combats the effect of vanishing and exploding gradients with no internal tuning of the weights.

#### 6.1.6 Steering

#### 6.1.7 Reflex Trial

#### 6.1.8 Learning Trials

In a learning trial, the robot navigates using both the reflex and the predictive action of the network, as formulated in equation 6.3. In the context of learning, “success” refers to a condition where the closed-loop error shows a minimum of 75% reduction from its average value during the reflex trials for three consecutive seconds (or 100 samples). Figure 5B shows the error signal during one learning trial with $\eta = 2\cdot10^{-1}$; it shows a strong reduction of the error signal over the first 50 seconds, during which learning is achieved rapidly. The closed-loop error acts as minimal instructive feedback for the deep learner.
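The success condition lends itself to a small helper (a hypothetical function, not the authors' code):

```python
import numpy as np

# "Success": |E_c| stays below 25% of the reflex-trial average for
# 100 consecutive samples (three seconds at 33 Hz).
def success_time(error, reflex_mean, fs=33.0, window=100, reduction=0.75):
    below = np.abs(error) <= (1.0 - reduction) * reflex_mean
    run = 0
    for n, ok in enumerate(below):
        run = run + 1 if ok else 0
        if run >= window:
            return n / fs        # time of success, in seconds
    return None                  # success condition never met

# Example: a decaying error meets the condition; a flat one does not.
decaying = np.exp(-np.arange(400) / 30.0)
assert success_time(decaying, reflex_mean=1.0) is not None
assert success_time(np.ones(400), reflex_mean=1.0) is None
```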

Figure 5C shows the final distribution of the weights in the first layer, assigning different strengths to different predictor signals. This is an 11-by-240 matrix of weights in the first layer, showing the input index $U_i$ on the $x$-axis and the neuron index on the $y$-axis (refer to Figure 4 for the configuration of the input layer). The inputs generated from each row of predictors are organized into blocks separated by vertical lines (refer to Figure 3B for the location of these predictors). The six predictors in each row are filtered by a bank of five filters, which results in 30 inputs and a total of 330 weights in each block. It can be seen that the weight distribution closely follows the positioning of the predictors, with weights assigned to the outermost column of predictors, $P_{6,12,\ldots,42,48}$, having high values (black) and weights assigned to the innermost column of predictors, $P_{1,7,\ldots,37,43}$, having small values (white), allowing for abrupt and subtle steering, respectively.

Figure 5D shows the weight change in each hidden layer. The purpose of this is to closely inspect the contribution of each hidden layer to the overall stability and convergence of the network. All layers show a stable increase in their weight change before converging to their final values. The weight distance changes noticeably over the first 50 seconds, driven by the closed-loop error, but arrives at a stable plateau as the error signal remains at zero.

#### 6.1.9 Tracking

#### 6.1.10 Statistics and Reproducibility

The performance of the deep learner was evaluated with five different random weight initializations using different random seeds `srand(i)`, where $i \in \{0,1,2,3,4\}$. The learning rate was kept constant for these trials at $\eta = 2\cdot10^{-1}$. Figure 7A shows that different random initializations of the weights make no significant difference to the time it takes for the learner to meet the success condition. The learning trial was also repeated with five learning rates, $\eta \in \{2\cdot10^{-3}, 2\cdot10^{-2.5}, 2\cdot10^{-2}, 2\cdot10^{-1.5}, 2\cdot10^{-1}\}$; each experiment was repeated five times for reproducibility. Figure 7B shows the time taken for the robot to meet the success condition in these trials. These data show an exponential decay of the success time as the learning rate is increased.

### 6.2 Simulations with Virtual Robot

A virtual robot was designed using a simulation environment developed using QT5 and coded in C++ (Porr & Daryanavard, 2020). This allowed rapid verification of a variety of algorithm parameters, which show that in the simulated noise-free environment a shallow network is sufficient. Most important, a virtual robot allowed us to statistically infer the success of the learning paradigm through a large number of runs, which would have been impractical using the real robot.

#### 6.2.1 The Virtual Robot

#### 6.2.2 Reflex Error

#### 6.2.3 Predictors

#### 6.2.4 Filter-Bank

The predictors are then filtered using a bank of five second-order low-pass filters ($F_s$), with damping coefficients of $Q = 0.51$ and impulse responses with appropriate delays, their peaks occurring at 0.1 to 0.3 seconds (3 to 10 samples at a sampling rate of 33 Hz), so as to obtain the maximum correlation between the predictors and the error signal. The specifications of these filters were obtained using a simple experiment, as described for the real robot experiments.

#### 6.2.5 The Shallow Learner

A feedforward network composed of fully connected layers was used, in the same way as in Figure 4, but with only two hidden layers and with an output layer consisting of one neuron, as shown in Figure 8C. The eight predictors are filtered as shown in Figure 4, resulting in 40 inputs to the network. The network is therefore configured with 40 input neurons, two hidden layers with 12 and 6 neurons, respectively, and one neuron in the output layer, giving a total of 59 neurons. The performance of the algorithm for deep neural networks was demonstrated in section 6.1; in this section, we show that the algorithm can also be applied to shallow networks and, in particular, that these are sufficient in noise-free simulation environments.
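The neuron bookkeeping above can be checked directly (a plain arithmetic check, not the authors' code):

```python
# Layer sizes of the shallow network: input, two hidden layers, output.
layers = [40, 12, 6, 1]
n_inputs = 8 * 5               # eight predictors x a bank of five filters
assert n_inputs == layers[0]   # 40 inputs to the network
assert sum(layers) == 59       # the total of 59 neurons quoted above
```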

#### 6.2.6 Steering

#### 6.2.7 Reflex Trial

#### 6.2.8 Learning Trial

For the evaluation of the deep learner, a learning trial is defined as one in which the robot navigates using the predictive action, beginning from the start point and lasting the same number of samples as the reflex trials (30 seconds). Figure 9B illustrates the performance of the deep learner, showing the closed-loop error when learning is on ($\eta = 1\cdot10^{-2}$); that is, the robot learns while navigating. The robot exhibits very fast learning ($\approx$2 seconds), after which the error signal is kept at, or close to, zero. Figure 9D shows the Euclidean distance of the weights in each layer from their initial random values. This distance increases gradually from zero to its maximum during the course of one trial. Since the error signal is propagated as a weighted sum of the internal errors, all layers show a similar weight change.
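The per-layer measure in Figure 9D is assumed here to be the Euclidean norm of each layer's deviation from its initial weights; a sketch:

```python
import numpy as np

# Euclidean distance of each layer's weights from their initial values
# (hypothetical helper; layer shapes follow the shallow network above).
def weight_distances(initial, current):
    return [np.linalg.norm(w - w0) for w0, w in zip(initial, current)]

rng = np.random.default_rng(0)
w0 = [rng.normal(size=(12, 40)),   # first hidden layer
      rng.normal(size=(6, 12)),    # second hidden layer
      rng.normal(size=(1, 6))]     # output layer
w1 = [w + 0.1 for w in w0]         # hypothetical uniform drift after learning
d = weight_distances(w0, w1)       # one distance per layer
```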

Moreover, Figure 9C shows the final distribution of the first layer's weights in the form of a normalized grayscale map on completion of learning, as in Figure 5C. This is a 12 (neurons) by 40 (inputs) matrix of weights in the first layer. The inputs generated from each predictor sensor are grouped in blocks separated by vertical lines (see Figure 8A for the positioning of these predictors). Each predictor was filtered with a bank of five filters, resulting in five inputs on the $x$-axis and 60 weights in each block. The weights show an organized distribution, with greater weights (black) assigned to the outer predictors, $P_{2,5}$, and smaller weights (white) assigned to the inner predictors, $P_{4,7,8}$. This facilitates sharper steering for the outer predictors while ensuring smooth steering overall.

#### 6.2.9 Tracking

#### 6.2.10 Statistics and Reproducibility

A set of simulations was carried out with five learning rates, $\eta \in \{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$; each scenario was repeated 10 times. Figure 10A shows the root mean square (RMS) of the error signal for each learning trial, as well as that of the reflex trials for comparison. All learning scenarios show a significantly smaller RMS error than the reflex behavior; the error is reduced from around $9\cdot10^{-2}$ to around $2\cdot10^{-2}$ and lower. There is a gradual decrease in this value as the learning rate is increased. Smaller values of the RMS error indicate a reduction in both the amplitude and the recurrence of the error signal.
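The RMS measure is the standard one; for completeness:

```python
import numpy as np

# Root mean square of an error trace: sensitive to both how large the
# error excursions are and how often they occur.
def rms(x):
    return np.sqrt(np.mean(np.square(x)))

assert np.isclose(rms([3.0, -4.0]), np.sqrt(12.5))  # sqrt((9 + 16) / 2)
```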

## 7 Discussion

In this letter, we have presented a learning algorithm that creates a forward model of a reflex employing a multilayered network. Previous work in this area used shallow (Kulvicius et al., 2007), usually single-layer, networks to learn a forward model (Nakanishi & Schaal, 2004; Porr & Wörgötter, 2006), and it was not possible to employ deeper structures. Model-free RL has been using more complex network structures such as deep learning by combining it with Q-learning, where the network learns to estimate an expected reward (Guo et al., 2014; Bansal, Akametalu, Jiang, Laine, & Tomlin, 2016). At first sight, these look like two competing approaches because both use deep networks with error backpropagation. However, they serve different purposes, as discussed in Dolan and Dayan (2013) and Botvinick and Weinstein (2014), which led to the idea of hierarchical RL, where RL provides a prediction error for an actor, which can then develop forward models.

In deep RL (Guo et al., 2014) and in our algorithm, we employ error backpropagation, a mathematical trick where an error/cost function is expanded with the help of partial derivatives (Rumelhart et al., 1986). This approach is appropriate for open-loop scenarios, but for closed-loop approaches one needs to take into account the endless recursion caused by the closed loop. In order to solve this problem, we have switched to the $z$-domain, in which the recursion turns into simple algebra. A different approach has been taken by long short-term memory (LSTM) networks, where the recursion is unrolled and backpropagation through time is used to calculate the weights (Hochreiter & Schmidhuber, 1997); this is done offline, whereas in our algorithm the weights are calculated while the agent acts in its environment.

Deep learning is generally a slow-learning algorithm, and deep RL tends to be even slower because of the sparsity of the discrete rewards. Purely continuous or sampled continuous systems can be very fast because they have continuous error feedback, so that in terms of behavior, nearly one-shot learning can be achieved (Porr & Wörgötter, 2006). However, this comes at the price that the forward models are learned from simple reflex behaviors, where no sophisticated planning can be achieved. For that reason, combining model-free deep RL with model-based learning, to obtain a slow and a fast system, has been suggested (Botvinick et al., 2019).

Still, our new approach is a deep architecture, and though we have demonstrated it on a line-follower robot, it inherits all the advantages of standard deep learning (Deng et al., 2009), such as convolutional layers and the development of high-level features such as receptive fields. These features can then be used to create much more specific anticipatory actions than the simple single-layer networks used in motor control to date (Maffei et al., 2017).

Forward models play an important role in robotic and biological motor control (Wolpert & Kawato, 1998; Wolpert, Ghahramani, & Flanagan, 2001; Haruno et al., 2001; Nakanishi & Schaal, 2004), where forward models guarantee an optimal trajectory after learning. Our approach offers opportunities to learn more complex forward models with the help of deep networks and then combine them with traditional Q-learning to plan those movements.

In the context of forward models, we should note that our model, like those of Miyamoto et al. (1988), Porr and Wörgötter (2006), and Maffei et al. (2017), learns the forward model for only one situation but would fail when different forward models were required, for example, for manipulating different objects. This has been addressed by the MOSAIC model of Haruno et al. (2001), where multiple pairs of forward and inverse controllers are learned. However, this is beyond the scope of this work.

## Notes

^{1}

For brevity, we omit the complex frequency variable (z).

^{2}

Subscripts refer to the neuron's index, and superscripts refer to the layer containing the neuron or weight.

## Acknowledgments

We offer our gratitude to Jarez Patel for his considerable technical and intellectual input to this work, Dave Anderson for aiding with the motion-capture system, and Bruno Manganelli for his initial contribution to the graphical user interface framework of the physical robot.