## Abstract

Sensorimotor tasks that humans perform are often affected by different sources of uncertainty. Nevertheless, the central nervous system (CNS) can gracefully coordinate our movements. Most learning frameworks rely on the internal model principle, which requires a precise internal representation in the CNS to predict the outcomes of our motor commands. However, learning a perfect internal model in a complex environment over a short period of time is a nontrivial problem. Indeed, achieving proficient motor skills may require years of training for some difficult tasks. Internal models alone may not be adequate to explain the motor adaptation behavior during the early phase of learning. Recent studies investigating the active regulation of motor variability, the presence of suboptimal inference, and model-free learning have challenged some of the traditional viewpoints on the sensorimotor learning mechanism. As a result, it may be necessary to develop a computational framework that can account for these new phenomena. Here, we develop a novel theory of motor learning, based on model-free adaptive optimal control, which can bypass some of the difficulties in existing theories. This new theory is based on our recently developed adaptive dynamic programming (ADP) and robust ADP (RADP) methods and is especially useful for accounting for motor learning behavior when an internal model is inaccurate or unavailable. Our preliminary computational results are in line with experimental observations reported in the literature and can account for some phenomena that are inexplicable using existing models.

## 1 Introduction

Humans develop coordinated movements that allow efficient interaction with the environment. Despite extensive research on the topic, the underlying computational mechanism of sensorimotor control and learning is still largely an open problem (Flash & Hogan, 1985; Uno, Kawato, & Suzuki, 1989; Harris & Wolpert, 1998; Haruno & Wolpert, 2005; Todorov & Jordan, 2002; Todorov, 2004, 2005; Bhushan & Shadmehr, 1999; Shadmehr & Mussa-Ivaldi, 1994; Wolpert & Ghahramani, 2000). Indeed, recent research findings, including model-free learning (Huang, Haith, Mazzoni, & Krakauer, 2011; Haith & Krakauer, 2013), the active regulation of motor variability (Renart & Machens, 2014; Wu, Miyamoto, Castro, Olveczky, & Smith, 2014; Cashaback, McGregor, & Gribble, 2015; Lisberger & Medina, 2015; Pekny, Izawa, & Shadmehr, 2015; Vaswani et al., 2015), and the presence of suboptimal inference (Beck, Ma, Pitkow, Latham, & Pouget, 2012; Bach & Dolan, 2012; Renart & Machens, 2014; Acerbi, Vijayakumar, & Wolpert, 2014), have challenged some of the traditional models of sensorimotor learning, potentially requiring the development of a new computational framework.

Several computational theories have been proposed to account for sensorimotor control and learning (Shadmehr & Mussa-Ivaldi, 2012). One widely accepted conjecture is that the central nervous system (CNS) selects trajectories so as to minimize a cost function (Flash & Hogan, 1985; Uno et al., 1989; Harris & Wolpert, 1998; Haruno & Wolpert, 2005; Todorov & Jordan, 2002; Qian, Jiang, Jiang, & Mazzoni, 2012). This perspective has inspired a number of optimization-based models of motor control over the past three decades. In early work, Flash and Hogan (1985) and Uno et al. (1989) proposed that the CNS coordinates movements by minimizing the time integral of the jerk or torque change. Although simulations under these theories are consistent with experimental results, it is not clear why and how the CNS would minimize these specific types of costs. Being aware of this difficulty, Wolpert and his coworkers (Harris & Wolpert, 1998; van Beers, Baraduc, & Wolpert, 2002; Haruno & Wolpert, 2005) suggested an alternative theory that the goal of the motor system is to minimize the end-point variance caused by signal-dependent noise. Later, Todorov and his colleagues (Todorov & Jordan, 2002; Todorov, 2004, 2005) considered sensorimotor control within the framework of linear quadratic regulator (LQR) and linear quadratic gaussian (LQG) theories and conjectured that the CNS aims to minimize a mixed cost function with components that specify both accuracy and energy costs. Despite the different interpretations of the cost, a common assumption in these frameworks is that the CNS first identifies the system dynamics and then solves the optimization or optimal control problem based on the identified model (Shadmehr & Mussa-Ivaldi, 1994; Wolpert, Ghahramani, & Jordan, 1995; Kawato, 1999; Todorov & Jordan, 2002; Liu & Todorov, 2007; Zhou et al., 2016). Indeed, this identification-based idea has been used extensively to study motor adaptation under external force field perturbations (Shadmehr & Mussa-Ivaldi, 1994; Bhushan & Shadmehr, 1999; Burdet, Osu, Franklin, Milner, & Kawato, 2001; Franklin, Burdet, Osu, Kawato, & Milner, 2003). Although these models can explain many characteristics of motor control, such as approximately straight movement trajectories and bell-shape velocity curves (Morasso, 1981), there is no compelling experimental evidence as to how the CNS manages to generate a perfect internal representation of the environment in a short period of time, especially for complex environments.

Huang et al. (2011) and Haith and Krakauer (2013) proposed a different learning mechanism, known as model-free learning, to explain sensorimotor learning behavior. Some well-known experimentally validated phenomena, such as savings, could be attributed to this learning mechanism. Huang et al. (2011), Huberdeau, Krakauer, and Haith (2015), and Vaswani et al. (2015) studied these experimental results via reinforcement learning (RL) (Sutton & Barto, 2018), a theory in machine learning that studies how an agent iteratively improves its actions based on the observed responses from its interacting environment. The study on RL was originally inspired by the decision-making process in animals and humans (Minsky, 1954). Doya (2000) discussed that certain brain areas can realize the RL and suggested a learning scheme for the neurons based on temporal difference (TD) learning (Sutton, 1988). Izawa, Rane, Donchin, and Shadmehr (2008) used an actor-critic-based optimal learner in which an RL scheme was proposed to directly update the motor command. A possible shortcoming of traditional RL is that discretization and sampling techniques are needed to transform a continuous-time problem into the setting of discrete-time systems with discrete-state-action space, which may be computationally intensive. Moreover, rigorous convergence proofs and stability analysis are usually missing in the related literature.

Another discovery that has challenged the traditional motor learning framework is that the CNS can regulate, and even amplify, motor variability instead of minimizing its effects (Renart & Machens, 2014; Wu et al., 2014; Cashaback et al., 2015; Lisberger & Medina, 2015; Pekny et al., 2015; Vaswani et al., 2015). Wu et al. (2014) and Cashaback et al. (2015) conjectured that this puzzling phenomenon is related to the use of RL in sensorimotor learning. Motor variability facilitates the exploration phase in RL and, as a result, promotes motor learning. The importance of motor variability was also illustrated in Pekny et al. (2015) by showing that the ability to increase motor variability is impaired in patients with Parkinson's disease. Despite these experimental results, there still lacks a convincing theoretical analysis that can justify the need to regulate motor variability.

Finally, it has been reported recently (Beck et al., 2012; Bach & Dolan, 2012; Renart & Machens, 2014; Acerbi et al., 2014) that motor variability, traditionally thought of as a consequence of the internal noise characterized by neural variation in the sensorimotor circuit (Harris & Wolpert, 1998; van Beers, 2007; Faisal, Selen, & Wolpert, 2008; Chaisanguanthum, Shen, & Sabes, 2014; Herzfeld, Vaswani, Marko, & Shadmehr, 2014), can also arise through suboptimal inference. Beck et al. (2012) argued that suboptimal inference, usually caused by modeling errors of the real-world environment, should be the dominant factor in motor variation with factors such as signal-dependent noise having only a limited influence. The presence of such suboptimal inference has also been studied by Acerbi et al. (2014) using Bayesian decision theory. Regardless of these new results, it is still an open problem how to integrate the presence of suboptimal inference into the existing optimal control-based motor learning framework.

In light of the above challenges, here we propose a new sensorimotor learning theory based on adaptive dynamic programming (ADP) (Lewis, Vrabie, & Vamvoudakis, 2012; Vrabie et al., 2013; Lewis & Liu, 2013; Bian, Jiang, & Jiang, 2014, 2016; Bertsekas, 2017; He & Zhong, 2018) and its robust variant (RADP) (Jiang & Jiang, 2013, 2017; Wang, He, & Liu, 2017). ADP and RADP combine ideas from RL and (robust) optimal control theory and have several advantages over existing motor control theories. First, sharing some essential features with RL, ADP, and RADP are data-driven, non-model-based approaches that directly update the control policy without the need to identify the dynamical system. Fundamentally different from traditional RL, ADP aims at developing a stabilizing optimal control policy for discrete-time and continuous-time dynamical systems via online learning and thus is an ideal candidate for studying the model-free learning mechanism in the human sensorimotor system. Second, under our theory, motor variability plays an important role in the sensorimotor learning process. Similar to the exploration noise in RL, the active regulation of motor variability promotes the search for better control strategies in each learning cycle and, as a result, improves the learning performance in terms of accuracy and convergence speed. Moreover, both signal-dependent noise and suboptimal inference (also known as dynamic uncertainty in the nonlinear control literature; see Liu, Jiang, & Hill, 2014; Jiang & Liu, 2018) are taken into account in our model. Hence, our model of learning resolves the apparent inconsistency between existing motor control theories and the experimental observation of the positive impact of motor variability. Third, in contrast to our prior results (Jiang & Jiang, 2014, 2015), the proposed motor learning framework is based on our recently developed continuous-time value iteration (VI) approach (Bian & Jiang, 2016), in which the knowledge of an initial stabilizing control input is no longer required. As a result, the proposed ADP and RADP learning mechanisms can resolve both stability and optimality issues during online learning. Consequently, this new learning theory is more suitable for explaining, for example, model-free learning in unstable environments (Burdet et al., 2001, 2006).

During the writing of this letter, we noticed that Crevecoeur, Scott, and Cluff (2019) have also studied the model-free control mechanism in human sensorimotor systems from the perspective of $H\u221e$ control, where modeling uncertainty and signal-dependent noise are modeled as an unknown disturbance.

## 2 Human Arm Movement Model

We focus on the sensorimotor learning tasks that Harris and Wolpert (1998) and Burdet et al. (2001) considered, in which human subjects make point-to-point reaching movements in the horizontal plane.

We use $f$ to model possible external disturbances (Liu & Todorov, 2007). For example, setting $f=\beta px$ with $\beta >0$ produces the divergent force field (DF) generated by the parallel-link direct drive air-magnet floating manipulandum (PFM) used in Burdet et al. (2001).

Parameters . | Description . | Value . |
---|---|---|

$m$ | Hand mass | 1.3 kg |

$b$ | Viscosity constant | 10 N$\xb7$s$/$m |

$\tau $ | Time constant | 0.05 s |

$c1$ | Noise magnitude | 0.075 |

$c2$ | Noise magnitude | 0.025 |

Parameters . | Description . | Value . |
---|---|---|

$m$ | Hand mass | 1.3 kg |

$b$ | Viscosity constant | 10 N$\xb7$s$/$m |

$\tau $ | Time constant | 0.05 s |

$c1$ | Noise magnitude | 0.075 |

$c2$ | Noise magnitude | 0.025 |

In this letter, in contrast to Liu and Todorov (2007), we develop an iterative algorithm known as VI (Bian & Jiang, 2016, algorithm 1) to approximate $P*$ and $K*$. On the basis of this algorithm, we then give a novel model-free method to learn the optimal control policy without knowing model parameters. First, we give the VI algorithm:

Start with a $P0=P0T>0$. Set $k=0$.

- Repeat the following two steps until convergence:$Pk+1=Pk+\epsilon kATPk+PkA-PkBR-1BTPk+Q,$(2.8)where the step size $\epsilon k>0$ decreases monotonically to 0 and $\u2211k=0\u221e\epsilon k=\u221e$.$Kk+1=R-1BTPk,$(2.9)

Theorem ^{1} guarantees the convergence of the algorithm. The proof of theorem ^{1} is omitted since it is a direct extension of the proof of Bian and Jiang (2016, theorem 3.3).

For sufficiently small $\epsilon 0>0$, we have $limk\u2192\u221ePk=P*$, and $limk\u2192\u221eKk=K*$.

## 3 Model-Free Learning in Human Sensorimotor Systems

In section 2, we briefly reviewed the model-based optimal motor control problem. We have not yet touched on the topic of how the human subject learns the optimal controller when the model parameters are not precisely known.

In this section, we extend the ADP algorithm from Bian and Jiang (2019) to study human biological learning behavior. The learning process considered in this section consists of multiple trials. In each trial, the human subject performs a reaching movement from the starting point to the target. We say a learning trial is successful if the human subject can reach a predefined small neighborhood of the target state successfully. If the human subject hits the boundary of the experimental environment before reaching the target area, this learning trial is terminated and the next learning trial starts.

### 3.1 ADP-Based Model-Free Learning

^{1}

^{2}Note that $Fk$ in algorithm 1 is the advantage matrix, which contains the information of the model parameters:

Algorithm 1 is a direct extension of Bian and Jiang (2019, algorithm 2) to the stochastic environment. The convergence of algorithm 1 is guaranteed in the following theorem. It is straightforward to deduce the proof of theorem ^{2} from Bian and Jiang (2019).

If the conditions in theorem ^{1} hold and there exist $l0>0$ and $\alpha >0$ such that $1l\u2211j=1l\psi j\psi jT>\alpha I$ for all $l>l0$, then $Pk$ and $uk$ obtained from algorithm 1 converge to $P*$ and $u*$, respectively.

The initial input $u0$ in algorithm 1 represents the a priori belief on the optimal control policy. In particular, $u0$ corresponds to an initial control policy obtained from our daily experience, which may be stabilizing or optimal in the absence of external disturbance (such as the divergent force field, DF). However, in the presence of DF, $u0$ may no longer be stabilizing. In this case, the human sensorimotor system will require some form of motor learning. In the end, a new stabilizing optimal control policy with respect to this new environment will be obtained.

Algorithm 1 is an off-policy method (Sutton & Barto, 2018) in the sense that the controller updated by the algorithm—the estimation policy in RL literature (Sutton & Barto, 2018)—is different from the system input used to generate the online data (also known as behavior policy in RL literature). Indeed, the control policy learned from the $k$th iteration in our algorithm is $uk$, while $vk$ is used to generate the online data. An advantage of this difference is that the behavior policy can generate a system trajectory satisfying the persistent excitation (PE) condition on $\psi j$ in theorem ^{2} by including the exploration noise ($\xi 1$ and $\xi 2$ in our case); at the same time, we can still accurately estimate and update the estimation policy.

Note that ${\epsilon k}$ relates to the learning rate of the human subject. Especially, $\epsilon k$ is large at the beginning of the learning phase, meaning the learning mechanism is aggressive and greedy; as the number of learning iterations increases, the learning process slows down, and the human subject tends to be more conservative. Discussions on how the step size affects the learning rate and the savings behavior are given in section 5.

The online learning framework proposed in this section has two unique features that make it an ideal candidate to study human sensorimotor learning behavior. First, different from traditional motor learning models based on RL and optimal control, our learning framework is based on the continuous-time ADP. Similar to other RL methods, ADP is a model-free approach that directly updates the control policy with online data without the need to identify the dynamic model. However, unlike RL, which is mainly devised for discrete environments, ADP can tackle a large number of continuous-time dynamical systems with continuous-state-action space. Moreover, the stability and the robustness of the closed-loop dynamical system can be guaranteed under the ADP framework. Second, the proposed VI-driven learning scheme is also fundamentally different from the PI-based stochastic ADP methods in the literature (Jiang & Jiang, 2014, 2015; Bian et al., 2016). A significant improvement of using VI is that an initial stabilizing control policy is no longer required. This learning framework provides a theoretical justification for the human sensorimotor system to regain both stability and optimality from unstable environments.

### 3.2 Simulation Validation

#### 3.2.1 Divergence Force Field

Once the adaptation to the NF is achieved (i.e., the human subjects have achieved a number of successful trials), the DF is activated. At this stage, subjects practice in the DF. No information is given to the human subjects as to when the force field trials will begin. The trajectories in the first five trials in DF are shown in Figures 1a and 2a. We can easily see that when the human subject is first exposed to the DF, due to the presence of the force field ($f=\beta px$), the variations are amplified by the divergence force, and thus the movement is no longer stable under $u=-K0x$. Indeed, after inspecting the mathematical model of the motor system in the DF, we see that $A-BK0$ has positive eigenvalues.

Note from the movement profile of $px$ in Figure 2a that the divergence in $x$-direction is dependent on the initial moving direction. This initial moving direction is caused by the stochastic disturbance in $ax$ at the starting point of the movement. In other words, if there was no signal-dependent noise in the model, the movement would have always been in the $y$-direction, and hence no divergence in the $x$-direction. Moreover, we observe that compared with Figures 2b and 2c, it takes longer to correct $ax$ from negative to positive, or vice versa. Thus, we can conclude that the signal-dependent noise causes bias in the starting movement direction and eventually leads to the unstable motor behavior.

To test the after-effect, we suddenly remove the DF. The after-effect trials are shown in Figures 1c and 2c. Obviously the trajectories are much closer to the $y$-axis. This is due to the high-gain controller learned in the DF. Here, different from Burdet et al. (2001) and Franklin et al. (2003), we conjecture that during the (at least early phase of) learning process, the CNS, instead of relying on the internal model completely, simply updates the control strategy through online model-free learning. This is because conducting model identification is slow and computationally expensive (Shadmehr & Mussa-Ivaldi, 2012) and thus can provide only limited information to guide motor adaptation in the early phase of learning. On the other hand, visual and motor sensory feedbacks are extremely active during this phase in the motor learning, which in turn provide a large amount of online data to conduct ADP learning. During the later phase of motor learning, a complete internal model has been established, and predictions drawn from the internal model can be incorporated with the visual feedback to provide better estimates of the state.

#### 3.2.2 Velocity-Dependent Force Field

The simulation results are summarized in Figures 4 and 5. Different from the case in DF, the human subject maintains stability throughout the experiment. However, we see from Figures 4a and 5a that the trajectory of the hand is not a straight line and exhibits a large bias to the left-hand side. This bias is caused by the presence of VF. After 50 learning trials, the human subject regains optimality, as the trajectory is approximately a straight line and $vy$ is a bell-shaped curve. The reaching time is also within 0.7 seconds, which is consistent with experimental data (Franklin et al., 2003). This implies that model-free learning also appears in the motor adaptation in VF. Finally, the after-effect is shown in Figures 5c and 4c. Our simulation clearly shows the after-effect in VF, as the hand movement is biased to the opposite side of the VF.

## 4 Robustness to Dynamic Uncertainties

In this section, we depart from the classical optimal control framework (Todorov & Jordan, 2002; Liu & Todorov, 2007) and study the sensorimotor control mechanism from a robust and adaptive optimal control perspective. As we discussed in section 1, motor variability is usually caused by different factors. However, system 2.1 to 2.3 only models the motor variation caused by the signal-dependent noise. As another important source of motor variation, the dynamic uncertainty has not been fully considered.

The dynamic uncertainty could be attributed to the uncertainties in the internal model, especially during the early phase of learning, when the internal model may still be under construction. Dynamic uncertainty is an ideal mathematical representation of this modeling error. Moreover, dynamic uncertainty covers the the fixed model error (Crevecoeur et al., 2019) as a special case. Dynamic uncertainty may also come from model reduction (Scarciotti & Astolfi, 2017). Given that the motor system could be a multidimensional, highly nonlinear system, it is too computationally expensive for the CNS to solve the optimal control policy directly. Finding the optimal control policy for a general control system requires solving a nonlinear partial differential equation known as the Hamilton-Jacobi-Bellman (HJB) equation. Due to the curse of dimensionality (Bellman, 1957), solving the HJB equation for high-order systems is hard, if not impossible. Due to this difficulty, we conjecture that the CNS aims only at finding the optimal control policy for a simplified model, which in turn guarantees robustness to the mismatch between this simplified model and the original nonlinear system. As we show below, the presence of dynamic uncertainty does not compromise the stability of the closed-loop system, provided that a certain small-gain condition (Jiang & Liu, 2018; Liu et al., 2014) is satisfied. Moreover, the optimal controller obtained based on the simplified linear model provides similar transient behavior compared with the experimental data, even in the presence of dynamic uncertainty.

### 4.1 Robust Optimal Control Framework

For simplicity, we assume that $\Delta $ enters the sensorimotor control model through the same input channel as signal-dependent noise. The analysis here can be easily extended to the more general case with unmatched disturbance input (see Jiang & Jiang, 2015, and Bian & Jiang, 2018, for more details).

The challenge we face here is that due to the presence of $\Delta $, the optimal controller derived in previous sections may no longer stabilize this interconnected system. In addition, given that $\u03c2$ is unobservable, learning an optimal sensorimotor controller for the full interconnected system is unrealistic. To overcome these challenges, we introduce a new concept of motor learning: robust optimal learning.

^{3},

Here the objective of $u$ and $\Delta $ is to minimize and maximize $J$, respectively. It is clear that equation 4.2, together with system 2.1, 2.2, and 4.1 forms a zero-sum differential game problem. Denote by $(u\gamma *,\Delta *)$ the pair of the optimal controller and the worst-case disturbance with respect to the performance index, equation 4.2. We say $u\gamma *$ is robust optimal if it not only solves the zero-sum game presented above, but also is robustly stabilizing (with probability one) when the disturbance $\Delta $ is presented. To ensure the stability of the motor system, we conjecture that the CNS aims at developing a robust optimal controller by assigning the sensorimotor gain $\gamma $ properly.

Following the same technique in section 3.1, we can directly adopt algorithm 1 in our robust optimal controller design, except that the input signal now becomes $vk=uk+\Delta +G1(uk+\Delta )\xi 1+G2(uk+\Delta )\xi 2$, and an extra term $\gamma -2Fk,12Fk,12T$ is added in the updating equation of $Pk$ in algorithm 1.

Besides the computational efficiency, an additional benefit of considering a robust optimal controller is that the signal-dependent noise and the disturbance input from the dynamic uncertainty can facilitate motor exploration during the learning phase. Note that if $\Delta $ and the signal-dependent noise do not exist, then $x$ and $vk$ become linearly dependent. As a result, the condition on $\psi j$ in theorem ^{4} is no longer satisfied. In fact, these disturbances play a role similar to that of the exploration noise in RL.

### 4.2 Robust Stability Analysis

In this section, we analyze the stability of the closed-loop system in the presence of signal-dependent noise and dynamic uncertainty. Before giving the robust stability analysis, we first impose the following assumption on the dynamic uncertainty:

Assumption 1 essentially assumes that dynamic uncertainty admits a stochastic linear $L2$ gain less than or equal to $\gamma 0$, with $x$ as input and $\Delta $ as output. Using a small-gain type of theorem, we have the following result:

The proof of theorem ^{4} is provided in the appendix. Although theorem ^{4} requires small $|G1|$ and $|G2|$, this does not necessarily imply that the variance of the signal-dependent noise is small, since the stochastic noise is also dependent on the system input.

Finally, note that the proposed RADP framework also improves our recent results (Bian & Jiang, 2016, 2018, 2019) by considering both signal-dependent noise and dynamic uncertainty in the synthesis of our learning algorithm. This improvement increases the usability of our learning algorithm in practical applications.

### 4.3 Simulation Validation

We first simulate the same sensorimotor learning experiment in DF as in section 3.2, with the same parameters. Note that equations 4.2 and 4.3 are considered here. Simulation results under the RADP design are presented in Figures 7 and 8. To reveal the impacts of dynamic uncertainty and compare with the results in section 3.2, we choose a large $\gamma =10$ here. Note that even under dynamic uncertainty, both stability and optimality can be achieved under RADP learning. The after-effect is also clearly illustrated. In fact, Figures 7 and 8 are quite similar to the results in section 3.2. However, one noticeable difference in Figure 8 is that the system state has larger variation. Furthermore, the end-point variance is much larger compared with the trajectories in Figure 2, clearly due to the disturbance from dynamic uncertainty. This observation confirms our theoretical analysis that the presence of dynamic uncertainties indeed compromises the stability and robustness of the closed-loop system.

To further illustrate the impact of sensorimotor gains, we plot system trajectories after RADP learning under different values of $\gamma $ and $\gamma 0$ in Figure 9. Comparing Figures 9a and 9b (and also Figures 9c and 9d), we observe that for a fixed $\gamma 0$, the smaller $\gamma $ is, the more stable the motor system is. In other words, the robustness of the motor system can be tuned by including the term $\gamma 2|\Delta |2$ in equation 4.2. By symmetry, for a fixed $\gamma $, a smaller $\gamma 0$ leads to a more stable trajectory, and when $\gamma 0$ becomes sufficiently large, the dynamic uncertainty has a large input-output gain, thereby giving rise to instability in the closed-loop system (see Figures 9a, 9c, and 9e). When both $\gamma $ and $\gamma 0$ are large enough, the motor system may exhibit instability (see Figure 9f). These phenomena are in line with the small-gain theory (Jiang & Liu, 2018).

Nevertheless, the increase of state variation promotes the exploration effect, in the sense that the condition on $\psi j$ in theorem ^{2} can be easily satisfied. In Table 2, we calculate the conditional number of $1l\u2211j=1l\psi j\psi jT$ under different $\gamma 0$. Denote $\lambda m$ and $\lambda M$ as the minimum and maximum eigenvalues of $1l\u2211j=1l\psi j\psi jT$, respectively. We simulate the first learning trial in DF for 0.7 s and calculate the conditional number $\lambda M/\lambda m$ for different choices of $\gamma 0$. Note that the exploration noise should be chosen so that the closed-loop system is stable and that $1l\u2211j=1l\psi j\psi jT$ does not exhibit singularity—that is, the conditional number $\lambda M/\lambda m$ should be small. This way, the control policy can be updated using algorithm 1 with high accuracy. We see from Table 2 that by increasing $\gamma 0$, the conditional number of matrix $1l\u2211j=1l\psi j\psi jT$ is reduced. This, together with Figure 9, illustrates that the motor variability should be properly regulated to promote motor learning.

## 5 Model-Free Learning and Savings

Huang et al. (2011) and Haith and Krakauer (2013) have claimed that model-free learning is a key factor behind the savings. In this section, we investigate the relationship between our adaptive (robust) optimal control approach and the learning algorithms developed in the literature (Smith, Ghazizadeh, & Shadmehr, 2006; Zarahn, Weston, Liang, Mazzoni, & Krakauer, 2008; Vaswani et al., 2015) to explain savings.

### 5.1 Learning Rate and Error Sensitivity

A key requirement in our learning algorithm is that $\epsilon 0$ (step size) should not be too large, that is, the learning process cannot be arbitrarily fast. This assumption matches the common sense that the human subject usually cannot dramatically improve her motor performance in a single learning trial. As we illustrated in equation 3.1, our learning algorithm is essentially driven by the TD error. Step size is related to sensitivity to the TD error. Since step size is decreasing in our algorithm, error sensitivity is also decreasing. This is because at the initial phase of learning, $P0$, which represents our prior estimate on $P*$, is far from $P*$. Hence, we have to rely more on the TD error feedback from the environment to adjust our estimate on $P*$. As the trial number increases, $Pk$ becomes closer to $P*$, and the TD error has less contribution to the learning: the human subject is unwilling to adjust the control policy because the motor outcome is already quite satisfactory.

To illustrate the influence of step size, we simulate the first 50 learning trials in the DF. For simplicity, we fix $a=1$ in the simulation. The degree of motor adaptation is measured as $|Kk-K*|$ at the $k$th trial, which represents the difference between the optimal controller and the controller learned from ADP algorithm. Our simulation result is given in Figure 10. Note that when the step size is small, the learning rate is also small. In this case, motor learning is steady yet slow. Especially, the adaptation curve is smooth, and no oscillation is observed. As we increase the step size, the learning rate starts to increase. However, when the step size is too large, the adaptation curve is no longer smooth and monotone. The adaptation error increases during a short period of time and, as a result, leads to a slower convergence speed. This implies that a large step size may compromise learning speed and accuracy. In fact, when the step size is too large, the learning algorithm becomes unstable, and learning fails.

### 5.2 Multirate Model

Next, we simulate the motor adaptation and savings behaviors in VF. The measurement criterion of motor adaptation is still chosen as the one in section 5.1: the adaption error at the $k$th trial is defined as $|Kk-K*|$. The simulation result is given in Figure 11. First, the human subject conducts 50 trials of motor movement in the VF. Figure 11 shows that $Kk$ gradually converges to the optimal control gain. Next, we simulate 5 washout trials in NF. Then the human subject is reexposed to the VF and conducts another 50 learning trials. Note that the adaptation in the second learning phase is faster than in the first learning phase. During the washout trials, the slow state is not fully reset, and the fast state shows a clear overshoot. Moreover, we see that the fast state curve is not smooth due to the large step size. A similar phenomenon appears in the experimental results in Smith et al. (2006) and Zarahn et al. (2008).

Similar to the case in VF, we also study the motor adaptation and savings behaviors in DF. We can see from Figure 12 that the multirate ADP model also recreates savings behavior in DF.

Compared with the existing literature, the proposed framework has a solid theoretical background, as a detailed convergence analysis can be drawn from our ADP theory. Moreover, our model incorporates the kinetic model of the human motor system into the sensorimotor learning algorithm. As a result, our ADP-based learning algorithm provides a unified framework for the human motor learning system.

## 6 Discussion

### 6.1 Summary of the Main Results

In this letter, we have investigated human sensorimotor learning from an adaptive optimal control perspective. In particular, the model we have developed shares several similar features with existing results, such as the presence of model-free learning (Huang et al., 2011; Haith & Krakauer, 2013), an alternative source of motor variability (Beck et al., 2012; Bach & Dolan, 2012; Renart & Machens, 2014; Acerbi et al., 2014), and the fact that actively regulated motor variability promotes sensorimotor learning (Renart & Machens, 2014; Wu et al., 2014; Cashaback et al., 2015; Lisberger & Medina, 2015; Pekny et al., 2015; Vaswani et al., 2015). The key idea behind our learning theory is that a specific model is not required, and the motor control strategy can be iteratively improved directly using data from sensory feedback. This learning scheme is especially useful in the early stage of sensorimotor learning, since during this period, the internal model is still under development and the CNS relies more on sensory feedback to fine-tune the motor commands. We have used the proposed learning framework to study the motor learning experiment in both a divergent force field and a velocity-dependent force field. Our model successfully reproduces the experimental phenomena reported in the work of others (Burdet et al., 2001; Franklin et al., 2003; Zhou et al., 2016). In particular, the model, like human subjects, can regain stability and optimality through ADP learning even in an unstable environment with external perturbation. In addition, we have extended our adaptive optimal controller design to tackle the robust optimal control problem caused by the presence of dynamic uncertainties. Dynamic uncertainties may appear in the human motor system as modeling uncertainties (Beck et al., 2012; Renart & Machens, 2014) and dynamic external disturbance (Jiang & Jiang, 2015). Using the robust optimal controller allows us to analyze the influence of dynamic uncertainties on the stability of human motor systems. As we have shown in the simulation, the motor system can still achieve stability in the presence of dynamic uncertainties, provided that a small sensorimotor gain is assigned by the CNS. Moreover, a larger motor variation has been observed as a result of the disturbance input from dynamic uncertainties. Our model shows that this motor variability can contribute to ADP learning and, as a result, promote the sensorimotor learning. Finally, our simulation suggests that the model-free learning may indeed be linked to the savings phenomenon.

### 6.2 Reinforcement Learning and Adaptive Dynamic Programming

The idea of RL can be traced back to Minsky's PhD dissertation (Minsky, 1954). An essential characteristic of RL is that it provides an efficient way to solve dynamic programming problems without using any modeling information of the underlying system. Due to this advantage, RL has become an ideal candidate to model human decision making and learning behavior (Doya, Samejima, Katagiri, & Kawato, 2002; Dayan & Balleine, 2002; Rangel, Camerer, & Montague, 2008; Glimcher, 2011; Bernacchia, Seo, Lee, & Wang, 2011; Taylor & Ivry, 2014).

Despite the appealing features of RL, it is difficult to show the convergence of the learning scheme or analyze the stability and robustness of the motor system. Moreover, since both the time and the state-action-space are continuous in motor systems, it is not trivial to extend traditional RL techniques to study a sensorimotor control mechanism. Similar to other RL methods, ADP is a non-model-based approach that directly updates the control policy without the need to identify the dynamic model. Different from traditional RL, ADP aims at developing a stabilizing optimal control policy for dynamical systems via online learning. ADP-based optimal control designs for dynamical systems have been investigated by several research groups over the past few years. Compared with the extensive results on RL, the research on ADP, especially for continuous-time dynamical systems, is still underdeveloped. In this letter, we have introduced a novel sensorimotor learning framework built on top of our recent results on continuous-time ADP and RADP methods (Bian & Jiang, 2016, 2018, 2019). These results bypass several obstacles in existing learning algorithms by relaxing the requirement on the initial condition, reducing the computational complexity, and covering a broader class of disturbances.

### 6.3 Sensorimotor Noise Enhances Motor Exploration

It has been conjectured (Harris & Wolpert, 1998; van Beers et al., 2002; Haruno & Wolpert, 2005) over the past decade that the goal of the motor system is to minimize end-point variance caused by signal-dependent noise. Later, Todorov and Jordan (2002) and Todorov (2004, 2005) further explored this idea by using linear optimal control theory based on the LQR or LQG methods. However, several recent experimental results (Wu et al., 2014; Cashaback et al., 2015) suggest that motor variability facilitates motor exploration and, as a result, can increase learning speed. These surprising results have challenged the optimal control viewpoint in the sense that motor variability is not purely an unwanted consequence of sensorimotor noise whose effects should be minimized.

In this letter, we have justified the contribution of motor variability from a robust/adaptive optimal control perspective based on ADP and RADP theory. Indeed, motor variability serves a similar role as exploration noise, which has been proved essential to ADP and RADP learning. To be specific, if motor variability is regulated properly, we can show mathematically that the system will keep improving motor behavior until convergence. Hence, our model can resolve the inconsistency between existing motor control theories and recent experimental discoveries on motor variability. Moreover, our model also shows that the existence of exploration noise does not destabilize the motor system even for learning tasks in an unstable environment.

### 6.4 System Decomposition, Small-Gain Theorem, and Quantification of the Robustness-Optimality Trade-Off

A novel feature of the proposed motor learning theory is the viewpoint of system decomposition. Building an exact mathematical model for biological systems is usually difficult. Furthermore, even if the system model is precisely known, it may be highly nonlinear, and it is generally impossible to solve the DP equation to obtain the optimal controller. In this case, simplified models (nominal motor system) are often preferable (Beck et al., 2012). The mismatch between the simplified model and the original motor system is referred to as dynamic uncertainty. Generally dynamic uncertainty involves unmeasurable state variables and unknown system order. After decomposing the system into an interconnection of the nominal model and dynamic uncertainty, we only need to design a robust optimal control policy using partial-state feedback. To handle the dynamic interaction between two subsystems, the robust gain assignment and small-gain techniques (Jiang & Liu, 2018; Liu et al., 2014) in modern nonlinear control theory are employed in the control design. In this way, we can preserve the near-optimality for the motor system, as well as guarantee the robustness of stability for the overall system.

### 6.5 Comparisons with Other Learning Methods

We note several approaches that also aim at explaining sensorimotor learning mechanisms from a model-free perspective.

Zhou et al. (2016) studied the human motor learning problem using model reference iterative learning control (MRILC). In this framework, a reference model is learned from the data during the initial phase of learning. Then the motor command is updated through the iterative learning algorithm by comparing the different outputs between the reference model and the real-world model. Fundamentally different from the model-free learning presented in this letter, MRILC relies heavily on the knowledge of a reference model, which plays the same role as an internal model. However, it is not clear how the CNS conducts motor learning before establishing the reference model. In addition, the effect of different types of motor variations is still not considered in the MRILC learning scheme. Note that these difficulties do not occur in our learning theory.

An alternative way is to update the motor command directly using error feedback (Herzfeld, Vaswani, Marko, & Shadmehr, 2014; Vaswani et al., 2015; Albert & Shadmehr, 2016). In this model, a difference equation is used to generate the motor prediction. Then, by comparing the predicted motor output with the sensory feedback data, a prediction error is obtained and used to modify the prediction in the next learning trial. By considering different error sensitivities (Herzfeld et al., 2014) and structures (Smith et al., 2006), it is possible to reproduce some experimental phenomena, such as savings, using this model. This model represents the relationship between motor command and prediction error as a static function, and the dynamical system of the kinetic model is ignored. This missing link between optimal control theory (Todorov & Jordan, 2002) and the error-updating model (Herzfeld et al., 2014) raises several open questions, including the convergence problem of the algorithm and the stability issue of the kinetic model. On the other hand, the framework we propose in our letter incorporates the numerical optimization framework into the optimal control design for motor systems. Instead of using the prediction error, our learning model is driven by the TD error, and rigorous convergence analysis has been provided.

## Appendix: Proof of Theorem ^{4}

## Notes

^{1}

For any $x\u2208Rn$, denote $q(x)=[x12,2x1x2,\u2026,2x1xn,x22,2x2x3,\u2026,2xn-1xn,xn2]T.$

^{2}

For any $A=AT\u2208Rn\xd7n$, denote $vech(A)=[a11,a12,\u2026,a1n,a22,a23,\u2026,an-1n,ann]T$, where $aij\u2208R$ is the $(i,j)$th element of matrix $A$.

^{3}

$|\xb7|$ denotes the Euclidean norm for vectors or the induced matrix norm for matrices.

^{4}

A function $f:Rn\u2192R+$ is called proper, if $lim|x|\u2192\u221ef(x)=\u221e$.

## Acknowledgments

This work was partially supported by the U.S. National Science Foundation under grants ECCS-1230040, ECCS-1501044, and EPCN-1903781 to Z.P.J.; and the Wellcome Trust and the Royal Society Noreen Murray Professorship in Neurobiology (to D.M.W.).