Sensorimotor tasks that humans perform are often affected by different sources of uncertainty. Nevertheless, the central nervous system (CNS) can gracefully coordinate our movements. Most learning frameworks rely on the internal model principle, which requires a precise internal representation in the CNS to predict the outcomes of our motor commands. However, learning a perfect internal model in a complex environment over a short period of time is a nontrivial problem. Indeed, achieving proficient motor skills may require years of training for some difficult tasks. Internal models alone may not be adequate to explain the motor adaptation behavior during the early phase of learning. Recent studies investigating the active regulation of motor variability, the presence of suboptimal inference, and model-free learning have challenged some of the traditional viewpoints on the sensorimotor learning mechanism. As a result, it may be necessary to develop a computational framework that can account for these new phenomena. Here, we develop a novel theory of motor learning, based on model-free adaptive optimal control, which can bypass some of the difficulties in existing theories. This new theory is based on our recently developed adaptive dynamic programming (ADP) and robust ADP (RADP) methods and is especially useful for accounting for motor learning behavior when an internal model is inaccurate or unavailable. Our preliminary computational results are in line with experimental observations reported in the literature and can account for some phenomena that are inexplicable using existing models.

Humans develop coordinated movements that allow efficient interaction with the environment. Despite extensive research on the topic, the underlying computational mechanism of sensorimotor control and learning is still largely an open problem (Flash & Hogan, 1985; Uno, Kawato, & Suzuki, 1989; Harris & Wolpert, 1998; Haruno & Wolpert, 2005; Todorov & Jordan, 2002; Todorov, 2004, 2005; Bhushan & Shadmehr, 1999; Shadmehr & Mussa-Ivaldi, 1994; Wolpert & Ghahramani, 2000). Indeed, recent research findings, including model-free learning (Huang, Haith, Mazzoni, & Krakauer, 2011; Haith & Krakauer, 2013), the active regulation of motor variability (Renart & Machens, 2014; Wu, Miyamoto, Castro, Olveczky, & Smith, 2014; Cashaback, McGregor, & Gribble, 2015; Lisberger & Medina, 2015; Pekny, Izawa, & Shadmehr, 2015; Vaswani et al., 2015), and the presence of suboptimal inference (Beck, Ma, Pitkow, Latham, & Pouget, 2012; Bach & Dolan, 2012; Renart & Machens, 2014; Acerbi, Vijayakumar, & Wolpert, 2014), have challenged some of the traditional models of sensorimotor learning, potentially requiring the development of a new computational framework.

Several computational theories have been proposed to account for sensorimotor control and learning (Shadmehr & Mussa-Ivaldi, 2012). One widely accepted conjecture is that the central nervous system (CNS) selects trajectories so as to minimize a cost function (Flash & Hogan, 1985; Uno et al., 1989; Harris & Wolpert, 1998; Haruno & Wolpert, 2005; Todorov & Jordan, 2002; Qian, Jiang, Jiang, & Mazzoni, 2012). This perspective has inspired a number of optimization-based models of motor control over the past three decades. In early work, Flash and Hogan (1985) and Uno et al. (1989) proposed that the CNS coordinates movements by minimizing the time integral of the jerk or torque change. Although simulations under these theories are consistent with experimental results, it is not clear why and how the CNS would minimize these specific types of costs. Being aware of this difficulty, Wolpert and his coworkers (Harris & Wolpert, 1998; van Beers, Baraduc, & Wolpert, 2002; Haruno & Wolpert, 2005) suggested an alternative theory that the goal of the motor system is to minimize the end-point variance caused by signal-dependent noise. Later, Todorov and his colleagues (Todorov & Jordan, 2002; Todorov, 2004, 2005) considered sensorimotor control within the framework of linear quadratic regulator (LQR) and linear quadratic gaussian (LQG) theories and conjectured that the CNS aims to minimize a mixed cost function with components that specify both accuracy and energy costs. Despite the different interpretations of the cost, a common assumption in these frameworks is that the CNS first identifies the system dynamics and then solves the optimization or optimal control problem based on the identified model (Shadmehr & Mussa-Ivaldi, 1994; Wolpert, Ghahramani, & Jordan, 1995; Kawato, 1999; Todorov & Jordan, 2002; Liu & Todorov, 2007; Zhou et al., 2016). Indeed, this identification-based idea has been used extensively to study motor adaptation under external force field perturbations (Shadmehr & Mussa-Ivaldi, 1994; Bhushan & Shadmehr, 1999; Burdet, Osu, Franklin, Milner, & Kawato, 2001; Franklin, Burdet, Osu, Kawato, & Milner, 2003). Although these models can explain many characteristics of motor control, such as approximately straight movement trajectories and bell-shape velocity curves (Morasso, 1981), there is no compelling experimental evidence as to how the CNS manages to generate a perfect internal representation of the environment in a short period of time, especially for complex environments.

Huang et al. (2011) and Haith and Krakauer (2013) proposed a different learning mechanism, known as model-free learning, to explain sensorimotor learning behavior. Some well-known experimentally validated phenomena, such as savings, could be attributed to this learning mechanism. Huang et al. (2011), Huberdeau, Krakauer, and Haith (2015), and Vaswani et al. (2015) studied these experimental results via reinforcement learning (RL) (Sutton & Barto, 2018), a theory in machine learning that studies how an agent iteratively improves its actions based on the observed responses from its interacting environment. The study on RL was originally inspired by the decision-making process in animals and humans (Minsky, 1954). Doya (2000) discussed that certain brain areas can realize the RL and suggested a learning scheme for the neurons based on temporal difference (TD) learning (Sutton, 1988). Izawa, Rane, Donchin, and Shadmehr (2008) used an actor-critic-based optimal learner in which an RL scheme was proposed to directly update the motor command. A possible shortcoming of traditional RL is that discretization and sampling techniques are needed to transform a continuous-time problem into the setting of discrete-time systems with discrete-state-action space, which may be computationally intensive. Moreover, rigorous convergence proofs and stability analysis are usually missing in the related literature.

Another discovery that has challenged the traditional motor learning framework is that the CNS can regulate, and even amplify, motor variability instead of minimizing its effects (Renart & Machens, 2014; Wu et al., 2014; Cashaback et al., 2015; Lisberger & Medina, 2015; Pekny et al., 2015; Vaswani et al., 2015). Wu et al. (2014) and Cashaback et al. (2015) conjectured that this puzzling phenomenon is related to the use of RL in sensorimotor learning. Motor variability facilitates the exploration phase in RL and, as a result, promotes motor learning. The importance of motor variability was also illustrated in Pekny et al. (2015) by showing that the ability to increase motor variability is impaired in patients with Parkinson's disease. Despite these experimental results, there still lacks a convincing theoretical analysis that can justify the need to regulate motor variability.

Finally, it has been reported recently (Beck et al., 2012; Bach & Dolan, 2012; Renart & Machens, 2014; Acerbi et al., 2014) that motor variability, traditionally thought of as a consequence of the internal noise characterized by neural variation in the sensorimotor circuit (Harris & Wolpert, 1998; van Beers, 2007; Faisal, Selen, & Wolpert, 2008; Chaisanguanthum, Shen, & Sabes, 2014; Herzfeld, Vaswani, Marko, & Shadmehr, 2014), can also arise through suboptimal inference. Beck et al. (2012) argued that suboptimal inference, usually caused by modeling errors of the real-world environment, should be the dominant factor in motor variation with factors such as signal-dependent noise having only a limited influence. The presence of such suboptimal inference has also been studied by Acerbi et al. (2014) using Bayesian decision theory. Regardless of these new results, it is still an open problem how to integrate the presence of suboptimal inference into the existing optimal control-based motor learning framework.

In light of the above challenges, here we propose a new sensorimotor learning theory based on adaptive dynamic programming (ADP) (Lewis, Vrabie, & Vamvoudakis, 2012; Vrabie et al., 2013; Lewis & Liu, 2013; Bian, Jiang, & Jiang, 2014, 2016; Bertsekas, 2017; He & Zhong, 2018) and its robust variant (RADP) (Jiang & Jiang, 2013, 2017; Wang, He, & Liu, 2017). ADP and RADP combine ideas from RL and (robust) optimal control theory and have several advantages over existing motor control theories. First, sharing some essential features with RL, ADP, and RADP are data-driven, non-model-based approaches that directly update the control policy without the need to identify the dynamical system. Fundamentally different from traditional RL, ADP aims at developing a stabilizing optimal control policy for discrete-time and continuous-time dynamical systems via online learning and thus is an ideal candidate for studying the model-free learning mechanism in the human sensorimotor system. Second, under our theory, motor variability plays an important role in the sensorimotor learning process. Similar to the exploration noise in RL, the active regulation of motor variability promotes the search for better control strategies in each learning cycle and, as a result, improves the learning performance in terms of accuracy and convergence speed. Moreover, both signal-dependent noise and suboptimal inference (also known as dynamic uncertainty in the nonlinear control literature; see Liu, Jiang, & Hill, 2014; Jiang & Liu, 2018) are taken into account in our model. Hence, our model of learning resolves the apparent inconsistency between existing motor control theories and the experimental observation of the positive impact of motor variability. Third, in contrast to our prior results (Jiang & Jiang, 2014, 2015), the proposed motor learning framework is based on our recently developed continuous-time value iteration (VI) approach (Bian & Jiang, 2016), in which the knowledge of an initial stabilizing control input is no longer required. As a result, the proposed ADP and RADP learning mechanisms can resolve both stability and optimality issues during online learning. Consequently, this new learning theory is more suitable for explaining, for example, model-free learning in unstable environments (Burdet et al., 2001, 2006).

During the writing of this letter, we noticed that Crevecoeur, Scott, and Cluff (2019) have also studied the model-free control mechanism in human sensorimotor systems from the perspective of H control, where modeling uncertainty and signal-dependent noise are modeled as an unknown disturbance.

We focus on the sensorimotor learning tasks that Harris and Wolpert (1998) and Burdet et al. (2001) considered, in which human subjects make point-to-point reaching movements in the horizontal plane.

In our computer experiment, the dynamics of the arm are simplified to a point-mass model as follows:
p˙=v,
(2.1)
mv˙=a-bv+f,
(2.2)
τa˙=u-a+G1uξ1+G2uξ2,
(2.3)
where p=[pxpy]T, v=[vxvy]T, a=[axay]T, and u=[uxuy]T denote the two-dimensional hand position, velocity, actuator state, and control input, respectively; m denotes the mass of the hand; b is the viscosity constant; τ is the time constant; ξ1 and ξ2 are gaussian white noises (Arnold, 1974); and
G1=c10c20andG2=0-c20c1
are gain matrices of the signal-dependent noise (Harris & Wolpert, 1998; Liu & Todorov, 2007).

We use f to model possible external disturbances (Liu & Todorov, 2007). For example, setting f=βpx with β>0 produces the divergent force field (DF) generated by the parallel-link direct drive air-magnet floating manipulandum (PFM) used in Burdet et al. (2001).

To fit this model into the standard optimal control framework, we rewrite system 2.1 to 2.3 with f=0 in the form of a stochastic dynamical system (Arnold, 1974),
dx=Axdt+B(udt+G1udw1+G2udw2),
(2.4)
where w1 and w2 are standard Brownian motions, and
x=pva,A=00100000010000-bm01m0000-bm01m0000-1τ000000-1τ,B=000000001τ001τ.
The model parameters used in our simulations throughout this letter are given in Table 1.
Table 1:
Parameters of the Arm Movement Model.
ParametersDescriptionValue
m Hand mass 1.3 kg 
b Viscosity constant 10 N·s/
τ Time constant 0.05 s 
c1 Noise magnitude 0.075 
c2 Noise magnitude 0.025 
ParametersDescriptionValue
m Hand mass 1.3 kg 
b Viscosity constant 10 N·s/
τ Time constant 0.05 s 
c1 Noise magnitude 0.075 
c2 Noise magnitude 0.025 
Following Todorov and Jordan (2002) and Liu and Todorov (2007), the optimal control problem is formulated as one of finding an optimal controller to minimize the following cost with respect to the nominal system of equation 2.4 without the signal-dependent noise,
J(x(0);u)=0xTQx+uTRudt,
(2.5)
where Q=QT>0 and R=RT>0 are constant weighting matrices.
It is well known that J is minimized under the optimal controller u*=-K*x, where K*=R-1BTP*, with P*=P*T>0 the unique solution to the following algebraic Riccati equation:
ATP+PA-PBR-1BTP+Q=0.
(2.6)
Moreover, infuJ(x;u)=xTP*x.
Note that Q and R represent the trade-off between movement accuracy (Q) and the effort exerted by the human subject to accomplish the task (R). Generally, choosing R with small eigenvalues leads to a high-gain optimal controller. This improves the transient performance, yet the price to be paid is a large control input and higher energy consumption. For illustration, we define R=I2 and Q as
xTQx=0.3xxTQxxx+0.7xyTQyxy,
(2.7)
where xx and xy are the components in x- and y-coordinates of the system state, respectively, and
Qx=Qy=1×1040001×1020001×10-3.

In this letter, in contrast to Liu and Todorov (2007), we develop an iterative algorithm known as VI (Bian & Jiang, 2016, algorithm 1) to approximate P* and K*. On the basis of this algorithm, we then give a novel model-free method to learn the optimal control policy without knowing model parameters. First, we give the VI algorithm:

  1. Start with a P0=P0T>0. Set k=0.

  2. Repeat the following two steps until convergence:
    Pk+1=Pk+εkATPk+PkA-PkBR-1BTPk+Q,
    (2.8)
    Kk+1=R-1BTPk,
    (2.9)
    where the step size εk>0 decreases monotonically to 0 and k=0εk=.

Theorem 1 guarantees the convergence of the algorithm. The proof of theorem 1 is omitted since it is a direct extension of the proof of Bian and Jiang (2016, theorem 3.3).

Theorem 1.

For sufficiently small ε0>0, we have limkPk=P*, and limkKk=K*.

In section 2, we briefly reviewed the model-based optimal motor control problem. We have not yet touched on the topic of how the human subject learns the optimal controller when the model parameters are not precisely known.

In this section, we extend the ADP algorithm from Bian and Jiang (2019) to study human biological learning behavior. The learning process considered in this section consists of multiple trials. In each trial, the human subject performs a reaching movement from the starting point to the target. We say a learning trial is successful if the human subject can reach a predefined small neighborhood of the target state successfully. If the human subject hits the boundary of the experimental environment before reaching the target area, this learning trial is terminated and the next learning trial starts.

3.1  ADP-Based Model-Free Learning

Before giving our online ADP learning algorithm, we introduce an increasing time sequence tj, 0t0<t1<<tl-1<tl in one learning trial, where the movement starts at time 0 and tl is the time when the human subject reaches the target area or hits the boundary of the experimental environment. Over [tj,tj+1], we introduce the following feature vectors,1
ψj=tjtj+1q[xTvkT]Tdt,φj=qT(x)tjtj+1-tjtj+1qT(dx)tjtj+1rkdtT,
where rk(t)=xT(t)Qx(t)+ukT(t)Ruk(t) and vk=uk+G1ukξ1+G2ukξ2.
formula
Now we are ready to give our ADP algorithm (algorithm 1).2 Note that Fk in algorithm 1 is the advantage matrix, which contains the information of the model parameters:
Fk=PkA+ATPk+QPkBBTPkR:=Fk,11Fk,12Fk,12TFk,22.

Algorithm 1 is a direct extension of Bian and Jiang (2019, algorithm 2) to the stochastic environment. The convergence of algorithm 1 is guaranteed in the following theorem. It is straightforward to deduce the proof of theorem 2 from Bian and Jiang (2019).

Theorem 2.

If the conditions in theorem 1 hold and there exist l0>0 and α>0 such that 1lj=1lψjψjT>αI for all l>l0, then Pk and uk obtained from algorithm 1 converge to P* and u*, respectively.

The initial input u0 in algorithm 1 represents the a priori belief on the optimal control policy. In particular, u0 corresponds to an initial control policy obtained from our daily experience, which may be stabilizing or optimal in the absence of external disturbance (such as the divergent force field, DF). However, in the presence of DF, u0 may no longer be stabilizing. In this case, the human sensorimotor system will require some form of motor learning. In the end, a new stabilizing optimal control policy with respect to this new environment will be obtained.

Algorithm 1 is an off-policy method (Sutton & Barto, 2018) in the sense that the controller updated by the algorithm—the estimation policy in RL literature (Sutton & Barto, 2018)—is different from the system input used to generate the online data (also known as behavior policy in RL literature). Indeed, the control policy learned from the kth iteration in our algorithm is uk, while vk is used to generate the online data. An advantage of this difference is that the behavior policy can generate a system trajectory satisfying the persistent excitation (PE) condition on ψj in theorem 2 by including the exploration noise (ξ1 and ξ2 in our case); at the same time, we can still accurately estimate and update the estimation policy.

Note that {εk} relates to the learning rate of the human subject. Especially, εk is large at the beginning of the learning phase, meaning the learning mechanism is aggressive and greedy; as the number of learning iterations increases, the learning process slows down, and the human subject tends to be more conservative. Discussions on how the step size affects the learning rate and the savings behavior are given in section 5.

It is interesting to note that our ADP learning algorithm shares some similarities with TD learning. Indeed, when εk>0 is sufficiently small, we have
εxT(t)(Fk,11-Fk,12F22-1Fk,12T)x(t)=εinfu{(Ax(t)+Bu)TPk+Pk(Ax(t)+Bu)+xT(t)Qx(t)+uTRu}infuxT(t+ε)Pkx(t+ε)-xT(t)Pkx(t)+tt+ε(xTQx+uTRu)ds,
(3.1)
which is consistent with the definition of TD error (Sutton, 1988). Note that this error term represents the difference between P* and Pk, since equation 3.1 reduces to zero when Pk=P*.

The online learning framework proposed in this section has two unique features that make it an ideal candidate to study human sensorimotor learning behavior. First, different from traditional motor learning models based on RL and optimal control, our learning framework is based on the continuous-time ADP. Similar to other RL methods, ADP is a model-free approach that directly updates the control policy with online data without the need to identify the dynamic model. However, unlike RL, which is mainly devised for discrete environments, ADP can tackle a large number of continuous-time dynamical systems with continuous-state-action space. Moreover, the stability and the robustness of the closed-loop dynamical system can be guaranteed under the ADP framework. Second, the proposed VI-driven learning scheme is also fundamentally different from the PI-based stochastic ADP methods in the literature (Jiang & Jiang, 2014, 2015; Bian et al., 2016). A significant improvement of using VI is that an initial stabilizing control policy is no longer required. This learning framework provides a theoretical justification for the human sensorimotor system to regain both stability and optimality from unstable environments.

3.2  Simulation Validation

3.2.1  Divergence Force Field

First, we simulate the motor learning experiment in the divergence force field (DF). In this case, we choose f=βpx in equation 2.1 with β>0 to represent the DF generated by the PFM. Here we pick β=150. Since before conducting the experiment, the human subjects are asked to practice in the NF for a long period, we assume that the human subject has already adapted to this NF; that is, an optimal controller with respect to the NF has been obtained. We denote the control gain matrix with respect to this optimal controller in the NF as K0 and the corresponding performance matrix as P0:
P0=1039.230.0087.600.002.740.000.001949.880.00144.700.004.1887.600.0010.440.000.330.000.00144.700.0016.860.000.502.740.000.330.000.010.000.004.180.000.500.000.02,K0=54.770.006.670.000.230.000.0083.670.0010.000.000.33.

Once the adaptation to the NF is achieved (i.e., the human subjects have achieved a number of successful trials), the DF is activated. At this stage, subjects practice in the DF. No information is given to the human subjects as to when the force field trials will begin. The trajectories in the first five trials in DF are shown in Figures 1a and 2a. We can easily see that when the human subject is first exposed to the DF, due to the presence of the force field (f=βpx), the variations are amplified by the divergence force, and thus the movement is no longer stable under u=-K0x. Indeed, after inspecting the mathematical model of the motor system in the DF, we see that A-BK0 has positive eigenvalues.

Figure 1:

ADP learning in the DF. Five hand paths are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after ADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Figure 1:

ADP learning in the DF. Five hand paths are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after ADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Figure 2:

ADP learning in the DF. Plots show time series of position, velocity, and acceleration in the x- and y-dimension for a reaching movement to a target displayed only in y from the start location. Five sequential trials are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after ADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Figure 2:

ADP learning in the DF. Plots show time series of position, velocity, and acceleration in the x- and y-dimension for a reaching movement to a target displayed only in y from the start location. Five sequential trials are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after ADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Note from the movement profile of px in Figure 2a that the divergence in x-direction is dependent on the initial moving direction. This initial moving direction is caused by the stochastic disturbance in ax at the starting point of the movement. In other words, if there was no signal-dependent noise in the model, the movement would have always been in the y-direction, and hence no divergence in the x-direction. Moreover, we observe that compared with Figures 2b and 2c, it takes longer to correct ax from negative to positive, or vice versa. Thus, we can conclude that the signal-dependent noise causes bias in the starting movement direction and eventually leads to the unstable motor behavior.

Denote the optimal gain matrix in the DF as K*. Starting from K0 and P0, the control gain matrix obtained after 50 learning trials is already very close to K*:
K50=426.430.0028.110.000.780.000.0083.670.0010.000.000.33,K*=426.480.0028.120.000.780.000.0083.670.0010.000.000.33.
The simulation results of the sensorimotor system under this new control policy are given in Figures 1b and 2b. Comparing Figure 1b with Figure 1a, we can see that after learning, the human subject has regained stability in the DF. Indeed, compared with K0, some elements in the first row of K50 are much larger, indicating a higher gain in the x-direction (i.e., the direction of the divergence force). To further illustrate the effect of high-gain feedback, the stiffness adaptation is shown in Figure 3. During the learning process, stiffness increased significantly in the x-direction. Moreover, we see from Figure 2b that at the beginning of the movement, the magnitude of ax due to noise is not negligible compared with Figure 2a. However, the control input derived from the motor learning restrains ax from diverging to infinity and, as a result, achieves stability. An important conclusion drawn from our learning theory and simulation result is that the target of sensorimotor learning is not to simply minimize the effects of sensorimotor noise. In fact, the noise effect is not necessarily small even after ADP learning. Removing the motor variation completely requires a control input with extremely high gain, which is both impractical and unnecessary for the human motor system. Instead, the aim here is to regulate the effects of signal-dependent noise properly, so that the motor system can remain stable and achieve acceptable transient performance.
Figure 3:

Adaptation of stiffness ellipse to the DF. Stiffness ellipse before (green) and after (red) adaptation to the DF.

Figure 3:

Adaptation of stiffness ellipse to the DF. Stiffness ellipse before (green) and after (red) adaptation to the DF.

To test the after-effect, we suddenly remove the DF. The after-effect trials are shown in Figures 1c and 2c. Obviously the trajectories are much closer to the y-axis. This is due to the high-gain controller learned in the DF. Here, different from Burdet et al. (2001) and Franklin et al. (2003), we conjecture that during the (at least early phase of) learning process, the CNS, instead of relying on the internal model completely, simply updates the control strategy through online model-free learning. This is because conducting model identification is slow and computationally expensive (Shadmehr & Mussa-Ivaldi, 2012) and thus can provide only limited information to guide motor adaptation in the early phase of learning. On the other hand, visual and motor sensory feedbacks are extremely active during this phase in the motor learning, which in turn provide a large amount of online data to conduct ADP learning. During the later phase of motor learning, a complete internal model has been established, and predictions drawn from the internal model can be incorporated with the visual feedback to provide better estimates of the state.

3.2.2  Velocity-Dependent Force Field

Next, we simulate the experiment in the velocity-dependent force field (VF). Different from DF, here we have (Franklin et al., 2003)
f=χ13-181813vxvy
in equation 2.2, where χ[2/3,1] is a constant that can be adjusted to the subject's strength. In our simulation, we set χ=0.7.

The simulation results are summarized in Figures 4 and 5. Different from the case in DF, the human subject maintains stability throughout the experiment. However, we see from Figures 4a and 5a that the trajectory of the hand is not a straight line and exhibits a large bias to the left-hand side. This bias is caused by the presence of VF. After 50 learning trials, the human subject regains optimality, as the trajectory is approximately a straight line and vy is a bell-shaped curve. The reaching time is also within 0.7 seconds, which is consistent with experimental data (Franklin et al., 2003). This implies that model-free learning also appears in the motor adaptation in VF. Finally, the after-effect is shown in Figures 5c and 4c. Our simulation clearly shows the after-effect in VF, as the hand movement is biased to the opposite side of the VF.

Figure 4:

ADP learning in the VF. Five hand paths are shown in the VF at different stages of the experiment. (a) First five trials on exposure to the VF. (b) Five trials after ADP learning in the VF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Figure 4:

ADP learning in the VF. Five hand paths are shown in the VF at different stages of the experiment. (a) First five trials on exposure to the VF. (b) Five trials after ADP learning in the VF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Figure 5:

ADP learning in the VF. Five sequential trials are shown in the VF at different stages of the experiment. (a) First five trials on exposure to the VF. (b) Five trials after ADP learning in the VF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied. Format as in Figure 2.

Figure 5:

ADP learning in the VF. Five sequential trials are shown in the VF at different stages of the experiment. (a) First five trials on exposure to the VF. (b) Five trials after ADP learning in the VF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied. Format as in Figure 2.

Finally, note that our simulation results in this section are overall consistent with the experimental results provided by different research groups (Burdet et al., 2001; Franklin et al., 2003; Zhou et al., 2016).

In this section, we depart from the classical optimal control framework (Todorov & Jordan, 2002; Liu & Todorov, 2007) and study the sensorimotor control mechanism from a robust and adaptive optimal control perspective. As we discussed in section 1, motor variability is usually caused by different factors. However, system 2.1 to 2.3 only models the motor variation caused by the signal-dependent noise. As another important source of motor variation, the dynamic uncertainty has not been fully considered.

The dynamic uncertainty could be attributed to the uncertainties in the internal model, especially during the early phase of learning, when the internal model may still be under construction. Dynamic uncertainty is an ideal mathematical representation of this modeling error. Moreover, dynamic uncertainty covers the the fixed model error (Crevecoeur et al., 2019) as a special case. Dynamic uncertainty may also come from model reduction (Scarciotti & Astolfi, 2017). Given that the motor system could be a multidimensional, highly nonlinear system, it is too computationally expensive for the CNS to solve the optimal control policy directly. Finding the optimal control policy for a general control system requires solving a nonlinear partial differential equation known as the Hamilton-Jacobi-Bellman (HJB) equation. Due to the curse of dimensionality (Bellman, 1957), solving the HJB equation for high-order systems is hard, if not impossible. Due to this difficulty, we conjecture that the CNS aims only at finding the optimal control policy for a simplified model, which in turn guarantees robustness to the mismatch between this simplified model and the original nonlinear system. As we show below, the presence of dynamic uncertainty does not compromise the stability of the closed-loop system, provided that a certain small-gain condition (Jiang & Liu, 2018; Liu et al., 2014) is satisfied. Moreover, the optimal controller obtained based on the simplified linear model provides similar transient behavior compared with the experimental data, even in the presence of dynamic uncertainty.

4.1  Robust Optimal Control Framework

To take into account the effect of dynamic uncertainty, we rewrite equation 2.3 as
τda=(u-a+Δ)dt+G1(u+Δ)dw1+G2(u+Δ)dw2,Δ:=Δ(ς,x).
(4.1)
Here Δ and ς are the output and state of the dynamic uncertainty. In general, the dynamic uncertainty is a dynamical system interconnected with the nominal system, equations 2.1 to 2.3 (see Figure 6). In particular, ς is unobservable to the CNS.
Figure 6:

Structure of the sensorimotor system subject to dynamic uncertainty.

Figure 6:

Structure of the sensorimotor system subject to dynamic uncertainty.

For simplicity, we assume that Δ enters the sensorimotor control model through the same input channel as signal-dependent noise. The analysis here can be easily extended to the more general case with unmatched disturbance input (see Jiang & Jiang, 2015, and Bian & Jiang, 2018, for more details).

The challenge we face here is that due to the presence of Δ, the optimal controller derived in previous sections may no longer stabilize this interconnected system. In addition, given that ς is unobservable, learning an optimal sensorimotor controller for the full interconnected system is unrealistic. To overcome these challenges, we introduce a new concept of motor learning: robust optimal learning.

First, to account for the disturbance passed from dynamic uncertainty to the CNS, we introduce an extra term in the quadratic cost (2.5)3,
J(x(0);u,Δ)=0xTQx+uTRu-γ2|Δ|2dt,
(4.2)
where γ is a real number satisfying R<γ2I2. γ is called the “gain” of the nominal system in the sense that it models the disturbance Δ on motor system performance. The concept of gain has already been investigated in the sensorimotor control literature (see Prochazka, 1989, for instance).

Here the objective of u and Δ is to minimize and maximize J, respectively. It is clear that equation 4.2, together with system 2.1, 2.2, and 4.1 forms a zero-sum differential game problem. Denote by (uγ*,Δ*) the pair of the optimal controller and the worst-case disturbance with respect to the performance index, equation 4.2. We say uγ* is robust optimal if it not only solves the zero-sum game presented above, but also is robustly stabilizing (with probability one) when the disturbance Δ is presented. To ensure the stability of the motor system, we conjecture that the CNS aims at developing a robust optimal controller by assigning the sensorimotor gain γ properly.

Following the same technique in section 3.1, we can directly adopt algorithm 1 in our robust optimal controller design, except that the input signal now becomes vk=uk+Δ+G1(uk+Δ)ξ1+G2(uk+Δ)ξ2, and an extra term γ-2Fk,12Fk,12T is added in the updating equation of Pk in algorithm 1.

Besides the computational efficiency, an additional benefit of considering a robust optimal controller is that the signal-dependent noise and the disturbance input from the dynamic uncertainty can facilitate motor exploration during the learning phase. Note that if Δ and the signal-dependent noise do not exist, then x and vk become linearly dependent. As a result, the condition on ψj in theorem 4 is no longer satisfied. In fact, these disturbances play a role similar to that of the exploration noise in RL.

4.2  Robust Stability Analysis

In this section, we analyze the stability of the closed-loop system in the presence of signal-dependent noise and dynamic uncertainty. Before giving the robust stability analysis, we first impose the following assumption on the dynamic uncertainty:

Assumption 1.
The dynamic uncertainty is stochastic input-to-state stable (SISS) (Tang & Başar, 2001), and admits a proper4 stochastic Lyapunov function V0, such that
AV0(ς)γ02|x|2-|Δ|2,
where γ00, and A is the infinitesimal generator (Kushner, 1967).

Assumption 1 essentially assumes that dynamic uncertainty admits a stochastic linear L2 gain less than or equal to γ0, with x as input and Δ as output. Using a small-gain type of theorem, we have the following result:

Theorem 3.

For sufficiently small |G1| and |G2|, we have

  1. System 2.4 with u=u* is globally asymptotically stable with probability one.

  2. There exists γ>0, such that uγ* is robust optimal under assumption 1.

The proof of theorem 4 is provided in the appendix. Although theorem 4 requires small |G1| and |G2|, this does not necessarily imply that the variance of the signal-dependent noise is small, since the stochastic noise is also dependent on the system input.

Finally, note that the proposed RADP framework also improves our recent results (Bian & Jiang, 2016, 2018, 2019) by considering both signal-dependent noise and dynamic uncertainty in the synthesis of our learning algorithm. This improvement increases the usability of our learning algorithm in practical applications.

4.3  Simulation Validation

For illustration, we choose the following model to represent dynamic uncertainty,
Tdς=A0ςdt+D3adw3+D4adw4,Δ=γ0ς,
(4.3)
where T>0, γ0R, ς(0)=[00]T, w3 and w4 are independent Brownian motions, and
A0=-1-10.810.8-1,D3=-0.5110.5,D4=10.5-0.51.
In the simulation, we set T=1 and γ0=0.1. Note that T and γ0 are directly related to the SISS gain of the above dynamic uncertainty.

We first simulate the same sensorimotor learning experiment in DF as in section 3.2, with the same parameters. Note that equations 4.2 and 4.3 are considered here. Simulation results under the RADP design are presented in Figures 7 and 8. To reveal the impacts of dynamic uncertainty and compare with the results in section 3.2, we choose a large γ=10 here. Note that even under dynamic uncertainty, both stability and optimality can be achieved under RADP learning. The after-effect is also clearly illustrated. In fact, Figures 7 and 8 are quite similar to the results in section 3.2. However, one noticeable difference in Figure 8 is that the system state has larger variation. Furthermore, the end-point variance is much larger compared with the trajectories in Figure 2, clearly due to the disturbance from dynamic uncertainty. This observation confirms our theoretical analysis that the presence of dynamic uncertainties indeed compromises the stability and robustness of the closed-loop system.

Figure 7:

RADP learning in the DF. Five hand paths are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after RADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Figure 7:

RADP learning in the DF. Five hand paths are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after RADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied.

Figure 8:

RADP learning in the DF. Five sequential trials are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after RADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied. Format as in Figure 2.

Figure 8:

RADP learning in the DF. Five sequential trials are shown in the DF at different stages of the experiment. (a) First five trials on exposure to the DF. (b) Five trials after RADP learning in the DF is complete. (c) Five sequential trials in the postexposure phase when the NF is reapplied. Format as in Figure 2.

To further illustrate the impact of sensorimotor gains, we plot system trajectories after RADP learning under different values of γ and γ0 in Figure 9. Comparing Figures 9a and 9b (and also Figures 9c and 9d), we observe that for a fixed γ0, the smaller γ is, the more stable the motor system is. In other words, the robustness of the motor system can be tuned by including the term γ2|Δ|2 in equation 4.2. By symmetry, for a fixed γ, a smaller γ0 leads to a more stable trajectory, and when γ0 becomes sufficiently large, the dynamic uncertainty has a large input-output gain, thereby giving rise to instability in the closed-loop system (see Figures 9a, 9c, and 9e). When both γ and γ0 are large enough, the motor system may exhibit instability (see Figure 9f). These phenomena are in line with the small-gain theory (Jiang & Liu, 2018).

Figure 9:

Five hand paths after RADP learning in the DF under different L2 gains, γ and γ0.

Figure 9:

Five hand paths after RADP learning in the DF under different L2 gains, γ and γ0.

Nevertheless, the increase of state variation promotes the exploration effect, in the sense that the condition on ψj in theorem 2 can be easily satisfied. In Table 2, we calculate the conditional number of 1lj=1lψjψjT under different γ0. Denote λm and λM as the minimum and maximum eigenvalues of 1lj=1lψjψjT, respectively. We simulate the first learning trial in DF for 0.7 s and calculate the conditional number λM/λm for different choices of γ0. Note that the exploration noise should be chosen so that the closed-loop system is stable and that 1lj=1lψjψjT does not exhibit singularity—that is, the conditional number λM/λm should be small. This way, the control policy can be updated using algorithm 1 with high accuracy. We see from Table 2 that by increasing γ0, the conditional number of matrix 1lj=1lψjψjT is reduced. This, together with Figure 9, illustrates that the motor variability should be properly regulated to promote motor learning.

Table 2:
Conditional Number of 1lj=1lψjψjT.
γ00.0010.11
λM/λm 1.04×1017 4.45×1016 2.07×1014 
γ00.0010.11
λM/λm 1.04×1017 4.45×1016 2.07×1014 

Huang et al. (2011) and Haith and Krakauer (2013) have claimed that model-free learning is a key factor behind the savings. In this section, we investigate the relationship between our adaptive (robust) optimal control approach and the learning algorithms developed in the literature (Smith, Ghazizadeh, & Shadmehr, 2006; Zarahn, Weston, Liang, Mazzoni, & Krakauer, 2008; Vaswani et al., 2015) to explain savings.

5.1  Learning Rate and Error Sensitivity

A key requirement in our learning algorithm is that ε0 (step size) should not be too large, that is, the learning process cannot be arbitrarily fast. This assumption matches the common sense that the human subject usually cannot dramatically improve her motor performance in a single learning trial. As we illustrated in equation 3.1, our learning algorithm is essentially driven by the TD error. Step size is related to sensitivity to the TD error. Since step size is decreasing in our algorithm, error sensitivity is also decreasing. This is because at the initial phase of learning, P0, which represents our prior estimate on P*, is far from P*. Hence, we have to rely more on the TD error feedback from the environment to adjust our estimate on P*. As the trial number increases, Pk becomes closer to P*, and the TD error has less contribution to the learning: the human subject is unwilling to adjust the control policy because the motor outcome is already quite satisfactory.

To further investigate the relationship between motor learning performance and the updating of step sizes, we test the ADP-based sensorimotor learning behavior under different step size. Denote
εk=akc+b,
where a, b, and c are three positive scalars.

To illustrate the influence of step size, we simulate the first 50 learning trials in the DF. For simplicity, we fix a=1 in the simulation. The degree of motor adaptation is measured as |Kk-K*| at the kth trial, which represents the difference between the optimal controller and the controller learned from ADP algorithm. Our simulation result is given in Figure 10. Note that when the step size is small, the learning rate is also small. In this case, motor learning is steady yet slow. Especially, the adaptation curve is smooth, and no oscillation is observed. As we increase the step size, the learning rate starts to increase. However, when the step size is too large, the adaptation curve is no longer smooth and monotone. The adaptation error increases during a short period of time and, as a result, leads to a slower convergence speed. This implies that a large step size may compromise learning speed and accuracy. In fact, when the step size is too large, the learning algorithm becomes unstable, and learning fails.

Figure 10:

ADP learning under different step sizes. Adaptation (norm of the difference between the actual and optimal control gain matrices) as a function of trial number on the introduction of the DF. The decrease in the cost depends on the step size, which is controlled through parameters b and c.

Figure 10:

ADP learning under different step sizes. Adaptation (norm of the difference between the actual and optimal control gain matrices) as a function of trial number on the introduction of the DF. The decrease in the cost depends on the step size, which is controlled through parameters b and c.

5.2  Multirate Model

Smith et al. (2006), Zarahn et al. (2008), and Vaswani et al. (2015) have suggested that savings is a combined effect of two different states (multirate model): the fast state and the slow state. Both states follow linear updating equations in the following form (Smith et al., 2006; Vaswani et al., 2015):
zf(n+1)=αfzf(n)+βfe(n),zs(n+1)=αszs(n)+βse(n),z(n+1)=zs(n+1)+zf(n+1),βf>βs,
where n is the trial number, zf and zs are the fast and slow states, αf and αs are retention factors, βf and βs are the learning rates, and e is the error signal. It has been conjectured that in a washout phase, due to the small learning rate, the slow state may not return to zero, while the fast state can quickly deadapt and show an overshoot such that the net adaptation is zero. As a result, readpatation shows savings due to the nonzero state of the slow learner. Despite vast supporting experimental evidence, the convergence of the above model is still an open problem, and it is still unclear if the human motor system adopts the linear structure in this format. Moreover, the relationship between the above learning model and the kinetic model of human body remains an open issue.
Here, we propose a multirate model based on our ADP learning method. To be specific, in the updating equation of Pk in algorithm 1, we define two states, Pf (fast state) and Ps (slow state), via
PkfPkf+εjf(Fk,11f-Fk,12f(Fk,22f)-1(Fk,12f)T),PksPks+εjs(Fk,11s-Fk,12s(Fk,22s)-1(Fk,12s)T),Pk=αfPkf+αsPks,Kk=αfKkf+αsKks,
where εjfεjs are step sizes; αs,αf(0,1); αs+αf=1; and Fkf and Fks are solved from algorithm 1, with Pk replaced by Pkf and Pks, respectively. Note that our model, after a simple mathematical manipulation, is consistent with the formulation of the multirate model in the literature. It is easy to see that both Pkf and Pks converge to P*. As a result, the convergence of Pk and Kk in the above learning model is guaranteed.

Next, we simulate the motor adaptation and savings behaviors in VF. The measurement criterion of motor adaptation is still chosen as the one in section 5.1: the adaption error at the kth trial is defined as |Kk-K*|. The simulation result is given in Figure 11. First, the human subject conducts 50 trials of motor movement in the VF. Figure 11 shows that Kk gradually converges to the optimal control gain. Next, we simulate 5 washout trials in NF. Then the human subject is reexposed to the VF and conducts another 50 learning trials. Note that the adaptation in the second learning phase is faster than in the first learning phase. During the washout trials, the slow state is not fully reset, and the fast state shows a clear overshoot. Moreover, we see that the fast state curve is not smooth due to the large step size. A similar phenomenon appears in the experimental results in Smith et al. (2006) and Zarahn et al. (2008).

Figure 11:

Adaptation as a multirate model based on ADP learning in VF. Adaptation over trials in a sequence of trials in VF followed by NF (gray shading) and reexposure to VF. The net adaptation (blue) shows savings, which is due to the slow process (red) retaining memory of the initial exposure.

Figure 11:

Adaptation as a multirate model based on ADP learning in VF. Adaptation over trials in a sequence of trials in VF followed by NF (gray shading) and reexposure to VF. The net adaptation (blue) shows savings, which is due to the slow process (red) retaining memory of the initial exposure.

Similar to the case in VF, we also study the motor adaptation and savings behaviors in DF. We can see from Figure 12 that the multirate ADP model also recreates savings behavior in DF.

Figure 12:

Adaptation as a multirate model based on ADP learning in DF. Adaptation over trials in a sequence of trials in DF followed by NF (gray shading) and reexposure to DF. The net adaptation (blue) shows savings, which is due to the slow process (red) retaining memory of the initial exposure.

Figure 12:

Adaptation as a multirate model based on ADP learning in DF. Adaptation over trials in a sequence of trials in DF followed by NF (gray shading) and reexposure to DF. The net adaptation (blue) shows savings, which is due to the slow process (red) retaining memory of the initial exposure.

Compared with the existing literature, the proposed framework has a solid theoretical background, as a detailed convergence analysis can be drawn from our ADP theory. Moreover, our model incorporates the kinetic model of the human motor system into the sensorimotor learning algorithm. As a result, our ADP-based learning algorithm provides a unified framework for the human motor learning system.

6.1  Summary of the Main Results

In this letter, we have investigated human sensorimotor learning from an adaptive optimal control perspective. In particular, the model we have developed shares several similar features with existing results, such as the presence of model-free learning (Huang et al., 2011; Haith & Krakauer, 2013), an alternative source of motor variability (Beck et al., 2012; Bach & Dolan, 2012; Renart & Machens, 2014; Acerbi et al., 2014), and the fact that actively regulated motor variability promotes sensorimotor learning (Renart & Machens, 2014; Wu et al., 2014; Cashaback et al., 2015; Lisberger & Medina, 2015; Pekny et al., 2015; Vaswani et al., 2015). The key idea behind our learning theory is that a specific model is not required, and the motor control strategy can be iteratively improved directly using data from sensory feedback. This learning scheme is especially useful in the early stage of sensorimotor learning, since during this period, the internal model is still under development and the CNS relies more on sensory feedback to fine-tune the motor commands. We have used the proposed learning framework to study the motor learning experiment in both a divergent force field and a velocity-dependent force field. Our model successfully reproduces the experimental phenomena reported in the work of others (Burdet et al., 2001; Franklin et al., 2003; Zhou et al., 2016). In particular, the model, like human subjects, can regain stability and optimality through ADP learning even in an unstable environment with external perturbation. In addition, we have extended our adaptive optimal controller design to tackle the robust optimal control problem caused by the presence of dynamic uncertainties. Dynamic uncertainties may appear in the human motor system as modeling uncertainties (Beck et al., 2012; Renart & Machens, 2014) and dynamic external disturbance (Jiang & Jiang, 2015). Using the robust optimal controller allows us to analyze the influence of dynamic uncertainties on the stability of human motor systems. As we have shown in the simulation, the motor system can still achieve stability in the presence of dynamic uncertainties, provided that a small sensorimotor gain is assigned by the CNS. Moreover, a larger motor variation has been observed as a result of the disturbance input from dynamic uncertainties. Our model shows that this motor variability can contribute to ADP learning and, as a result, promote the sensorimotor learning. Finally, our simulation suggests that the model-free learning may indeed be linked to the savings phenomenon.

6.2  Reinforcement Learning and Adaptive Dynamic Programming

The idea of RL can be traced back to Minsky's PhD dissertation (Minsky, 1954). An essential characteristic of RL is that it provides an efficient way to solve dynamic programming problems without using any modeling information of the underlying system. Due to this advantage, RL has become an ideal candidate to model human decision making and learning behavior (Doya, Samejima, Katagiri, & Kawato, 2002; Dayan & Balleine, 2002; Rangel, Camerer, & Montague, 2008; Glimcher, 2011; Bernacchia, Seo, Lee, & Wang, 2011; Taylor & Ivry, 2014).

Despite the appealing features of RL, it is difficult to show the convergence of the learning scheme or analyze the stability and robustness of the motor system. Moreover, since both the time and the state-action-space are continuous in motor systems, it is not trivial to extend traditional RL techniques to study a sensorimotor control mechanism. Similar to other RL methods, ADP is a non-model-based approach that directly updates the control policy without the need to identify the dynamic model. Different from traditional RL, ADP aims at developing a stabilizing optimal control policy for dynamical systems via online learning. ADP-based optimal control designs for dynamical systems have been investigated by several research groups over the past few years. Compared with the extensive results on RL, the research on ADP, especially for continuous-time dynamical systems, is still underdeveloped. In this letter, we have introduced a novel sensorimotor learning framework built on top of our recent results on continuous-time ADP and RADP methods (Bian & Jiang, 2016, 2018, 2019). These results bypass several obstacles in existing learning algorithms by relaxing the requirement on the initial condition, reducing the computational complexity, and covering a broader class of disturbances.

6.3  Sensorimotor Noise Enhances Motor Exploration

It has been conjectured (Harris & Wolpert, 1998; van Beers et al., 2002; Haruno & Wolpert, 2005) over the past decade that the goal of the motor system is to minimize end-point variance caused by signal-dependent noise. Later, Todorov and Jordan (2002) and Todorov (2004, 2005) further explored this idea by using linear optimal control theory based on the LQR or LQG methods. However, several recent experimental results (Wu et al., 2014; Cashaback et al., 2015) suggest that motor variability facilitates motor exploration and, as a result, can increase learning speed. These surprising results have challenged the optimal control viewpoint in the sense that motor variability is not purely an unwanted consequence of sensorimotor noise whose effects should be minimized.

In this letter, we have justified the contribution of motor variability from a robust/adaptive optimal control perspective based on ADP and RADP theory. Indeed, motor variability serves a similar role as exploration noise, which has been proved essential to ADP and RADP learning. To be specific, if motor variability is regulated properly, we can show mathematically that the system will keep improving motor behavior until convergence. Hence, our model can resolve the inconsistency between existing motor control theories and recent experimental discoveries on motor variability. Moreover, our model also shows that the existence of exploration noise does not destabilize the motor system even for learning tasks in an unstable environment.

6.4  System Decomposition, Small-Gain Theorem, and Quantification of the Robustness-Optimality Trade-Off

A novel feature of the proposed motor learning theory is the viewpoint of system decomposition. Building an exact mathematical model for biological systems is usually difficult. Furthermore, even if the system model is precisely known, it may be highly nonlinear, and it is generally impossible to solve the DP equation to obtain the optimal controller. In this case, simplified models (nominal motor system) are often preferable (Beck et al., 2012). The mismatch between the simplified model and the original motor system is referred to as dynamic uncertainty. Generally dynamic uncertainty involves unmeasurable state variables and unknown system order. After decomposing the system into an interconnection of the nominal model and dynamic uncertainty, we only need to design a robust optimal control policy using partial-state feedback. To handle the dynamic interaction between two subsystems, the robust gain assignment and small-gain techniques (Jiang & Liu, 2018; Liu et al., 2014) in modern nonlinear control theory are employed in the control design. In this way, we can preserve the near-optimality for the motor system, as well as guarantee the robustness of stability for the overall system.

6.5  Comparisons with Other Learning Methods

We note several approaches that also aim at explaining sensorimotor learning mechanisms from a model-free perspective.

Zhou et al. (2016) studied the human motor learning problem using model reference iterative learning control (MRILC). In this framework, a reference model is learned from the data during the initial phase of learning. Then the motor command is updated through the iterative learning algorithm by comparing the different outputs between the reference model and the real-world model. Fundamentally different from the model-free learning presented in this letter, MRILC relies heavily on the knowledge of a reference model, which plays the same role as an internal model. However, it is not clear how the CNS conducts motor learning before establishing the reference model. In addition, the effect of different types of motor variations is still not considered in the MRILC learning scheme. Note that these difficulties do not occur in our learning theory.

An alternative way is to update the motor command directly using error feedback (Herzfeld, Vaswani, Marko, & Shadmehr, 2014; Vaswani et al., 2015; Albert & Shadmehr, 2016). In this model, a difference equation is used to generate the motor prediction. Then, by comparing the predicted motor output with the sensory feedback data, a prediction error is obtained and used to modify the prediction in the next learning trial. By considering different error sensitivities (Herzfeld et al., 2014) and structures (Smith et al., 2006), it is possible to reproduce some experimental phenomena, such as savings, using this model. This model represents the relationship between motor command and prediction error as a static function, and the dynamical system of the kinetic model is ignored. This missing link between optimal control theory (Todorov & Jordan, 2002) and the error-updating model (Herzfeld et al., 2014) raises several open questions, including the convergence problem of the algorithm and the stability issue of the kinetic model. On the other hand, the framework we propose in our letter incorporates the numerical optimization framework into the optimal control design for motor systems. Instead of using the prediction error, our learning model is driven by the TD error, and rigorous convergence analysis has been provided.

Denote V(x)=xTP*x. Following the definitions of K* and P*, by completing the squares, we have
AV(x)=-xTQ+K*TRK*x+xTK*TΣ(P*)K*x,
where Σ(P)=G1TBTPBG1+G2TBTPBG2. For fixed Q and R matrices, P* and K* are fixed. If |G1| and |G2| are sufficiently small, the second term on the right-hand side of the above equality is dominated by the first term. Then x converges to the origin asymptotically with probability one (Kushner, 1967).
To show uγ is robust optimal, we rewrite equation 2.4 as
dx=Axdt+B((u+Δ)dt+G1(u+Δ)dw1+G2(u+Δ)dw2).
Then from the zero-sum game theory, (uγ*,Δ*) is solved as
uγ*=-R-1BTPγ*x-Kγ*x,Δ*=γ-2BTPγ*x,
where Pγ*=Pγ*T>0 is the solution to
0=ATP+PA-PBR-1-γ-2IBTP+Q.
Note that since R<γ2I, Pγ* indeed uniquely exists. Denote Vγ(x)=xTPγ*x. Following the definitions of Kγ* and Pγ*, by completing the squares, we have
AVγ(x)=-xTQ+Kγ*T(R-Σ(Pγ*))Kγ*x-γ-2xTPγ*BBTPγ*x+2ΔT(BTPγ*-Σ(Pγ*)K*)x+ΔTΣ(Pγ*)Δ-xTQ+Kγ*T(R-2Σ(Pγ*))Kγ*x+ΔT(γ2I+2Σ(Pγ*))Δ-α1|x|2+(γ2+α2)|Δ|2,
where α1 and α2 are real constants. Then, for sufficiently small |G1| and |G2|, we can choose γ2<α1/γ02-α2 such that
AV(x,ς)-δ|x|2-δ|Δ|2
for some δ>0, where V(x,ς):=γ0Vγ(x)+α1V0(ς). Thus, both x and Δ converge to the origin asymptotically with probability one (Kushner, 1967). Then, since the dynamic uncertainty is SISS, ς also converges to the origin asymptotically with probability one.
1

For any xRn, denote q(x)=[x12,2x1x2,,2x1xn,x22,2x2x3,,2xn-1xn,xn2]T.

2

For any A=ATRn×n, denote vech(A)=[a11,a12,,a1n,a22,a23,,an-1n,ann]T, where aijR is the (i,j)th element of matrix A.

3

|·| denotes the Euclidean norm for vectors or the induced matrix norm for matrices.

4

A function f:RnR+ is called proper, if lim|x|f(x)=.

This work was partially supported by the U.S. National Science Foundation under grants ECCS-1230040, ECCS-1501044, and EPCN-1903781 to Z.P.J.; and the Wellcome Trust and the Royal Society Noreen Murray Professorship in Neurobiology (to D.M.W.).

Acerbi
,
L.
,
Vijayakumar
,
S.
, &
Wolpert
,
D. M.
(
2014
).
On the origins of suboptimality in human probabilistic inference
.
PLOS Computational Biology
,
10
(
6
),
e1003661EP
.
Albert
,
S. T.
, &
Shadmehr
,
R.
(
2016
).
The neural feedback response to error as a teaching signal for the motor learning system
.
Journal of Neuroscience
,
36
(
17
),
4832
4845
.
Arnold
,
L.
(
1974
).
Stochastic differential equations: Theory and applications
.
New York
:
Wiley
.
Bach
,
D. R.
, &
Dolan
,
R. J.
(
2012
).
Knowing how much you don't know: A neural organization of uncertainty estimates
.
Nature Reviews Neuroscience
,
13
(
8
),
572
586
.
Beck
,
J. M.
,
Ma
,
W. J.
,
Pitkow
,
X.
,
Latham
,
P. E.
, &
Pouget
,
A.
(
2012
).
Not noisy, just wrong: The role of suboptimal inference in behavioral variability
.
Neuron
,
74
(
1
),
30
39
.
Bellman
,
R. E.
(
1957
).
Dynamic programming
.
Princeton, NJ
:
Princeton University Press
.
Bernacchia
,
A.
,
Seo
,
H.
,
Lee
,
D.
, &
Wang
,
X.-J.
(
2011
).
A reservoir of time constants for memory traces in cortical neurons
.
Nature Neuroscience
,
14
(
3
),
366
372
.
Bertsekas
,
D. P.
(
2017
).
Value and policy iterations in optimal control and adaptive dynamic programming
.
IEEE Transactions on Neural Networks and Learning Systems
,
28
(
3
),
500
509
.
Bhushan
,
N.
, &
Shadmehr
,
R.
(
1999
).
Computational nature of human adaptive control during learning of reaching movements in force fields
.
Biological Cybernetics
,
81
(
1
),
39
60
.
Bian
,
T.
, &
Jiang
,
Z. P.
(
2016
).
Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design
.
Automatica
,
71
,
348
360
.
Bian
,
T.
, &
Jiang
,
Z. P.
(
2018
).
Stochastic and adaptive optimal control of uncertain interconnected systems: A data-driven approach
.
Systems and Control Letters
,
115
(
5
),
48
54
.
Bian
,
T.
, &
Jiang
,
Z. P.
(
2019
).
Reinforcement learning for linear continuous-time systems: An incremental learning approach
.
IEEE/CAA Journal of Automatica Sinica
,
6
(
2
),
433
440
.
Bian
,
T.
,
Jiang
,
Y.
, &
Jiang
,
Z. P.
(
2014
).
Adaptive dynamic programming and optimal control of nonlinear nonaffine systems
.
Automatica
,
50
(
10
),
2624
2632
.
Bian
,
T.
,
Jiang
,
Y.
, &
Jiang
,
Z. P.
(
2016
).
Adaptive dynamic programming for stochastic systems with state and control dependent noise
.
IEEE Transactions on Automatic Control
,
61
(
12
),
4170
4175
.
Burdet
,
E.
,
Osu
,
R.
,
Franklin
,
D. W.
,
Milner
,
T. E.
, &
Kawato
,
M.
(
2001
).
The central nervous system stabilizes unstable dynamics by learning optimal impedance
.
Nature
,
414
(
6862
),
446
449
.
Burdet
,
E.
,
Tee
,
K. P.
,
Mareels
,
I. M. Y.
,
Milner
,
T. E.
,
Chew
,
C.-M.
,
Franklin
,
D. W.
, …
Kawato
,
M.
(
2006
).
Stability and motor adaptation in human arm movements
.
Biological Cybernetics
,
94
(
1
),
20
32
.
Cashaback
,
J. G. A.
,
McGregor
,
H. R.
, &
Gribble
,
P. L.
(
2015
).
The human motor system alters its reaching movement plan for task-irrelevant, positional forces
.
Journal of Neurophysiology
,
113
(
7
),
2137
2149
.
Chaisanguanthum
,
K. S.
,
Shen
,
H. H.
, &
Sabes
,
P. N.
(
2014
).
Motor variability arises from a slow random walk in neural state
.
Journal of Neuroscience
,
34
(
36
),
12071
12080
.
Crevecoeur
,
F.
,
Scott
,
S. H.
, &
Cluff
,
T.
(
2019
).
Robust control in human reaching movements: A model-free strategy to compensate for unpredictable disturbances
.
Journal of Neuroscience
,
39
(
41
),
8135
8148
.
Dayan
,
P.
, &
Balleine
,
B. W.
(
2002
).
Reward, motivation, and reinforcement learning
.
Neuron
,
36
(
2
),
285
298
.
Doya
,
K.
(
2000
).
Reinforcement learning in continuous time and space
.
Neural Computation
,
12
(
1
),
219
245
.
Doya
,
K.
,
Samejima
,
K.
,
Katagiri
,
K.-i.
, &
Kawato
,
M.
(
2002
).
Multiple model-based reinforcement learning
.
Neural Computation
,
14
(
6
),
1347
1369
.
Faisal
,
A. A.
,
Selen
,
L. P. J.
, &
Wolpert
,
D. M.
(
2008
).
Noise in the nervous system
.
Nature Reviews Neuroscience
,
9
(
4
),
292
303
.
Flash
,
T.
, &
Hogan
,
N.
(
1985
).
The coordination of arm movements: An experimentally confirmed mathematical model
.
Journal of Neuroscience
,
5
(
7
),
1688
1703
.
Franklin
,
D. W.
,
Burdet
,
E.
,
Osu
,
R.
,
Kawato
,
M.
, &
Milner
,
T.
(
2003
).
Functional significance of stiffness in adaptation of multijoint arm movements to stable and unstable dynamics
.
Experimental Brain Research
,
151
(
2
),
145
157
.
Glimcher
,
P. W.
(
2011
).
Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis
.
Proceedings of the National Academy of Sciences
,
108
(
Suppl. 3
),
15647
15654
.
Haith
,
A. M.
, &
Krakauer
,
J. W.
(
2013
). Model-based and model-free mechanisms of human motor learning. In
M. J.
Richardson
,
M. A.
Riley
, &
K.
Shockley
(Eds.),
Progress in motor control
(pp.
1
21
).
New York
:
Springer
.
Harris
,
C. M.
, &
Wolpert
,
D. M.
(
1998
).
Signal-dependent noise determines motor planning
.
Nature
,
394
(
6695
),
780
784
.
Haruno
,
M.
, &
Wolpert
,
D. M.
(
2005
).
Optimal control of redundant muscles in step-tracking wrist movements
.
Journal of Neurophysiology
,
94
(
6
),
4244
4255
.
He
,
H.
, &
Zhong
,
X.
(
2018
).
Learning without external reward [research frontier]
.
IEEE Computational Intelligence Magazine
,
13
(
3
),
48
54
.
Herzfeld
,
D. J.
,
Vaswani
,
P. A.
,
Marko
,
M. K.
, &
Shadmehr
,
R.
(
2014
).
A memory of errors in sensorimotor learning
.
Science
,
345
(
6202
),
1349
1353
.
Huang
,
V. S.
,
Haith
,
A.
,
Mazzoni
,
P.
, &
Krakauer
,
J. W.
(
2011
).
Rethinking motor learning and savings in adaptation paradigms: Model-free memory for successful actions combines with internal models
.
Neuron
,
70
(
4
),
787
801
.
Huberdeau
,
D. M.
,
Krakauer
,
J. W.
, &
Haith
,
A. M.
(
2015
).
Dual-process decomposition in human sensorimotor adaptation
.
Current Opinion in Neurobiology
,
33
,
71
77
.
Izawa
,
J.
,
Rane
,
T.
,
Donchin
,
O.
, &
Shadmehr
,
R.
(
2008
).
Motor adaptation as a process of reoptimization
.
Journal of Neuroscience
,
28
(
11
),
2883
2891
.
Jiang
,
Y.
, &
Jiang
,
Z. P.
(
2014
).
Adaptive dynamic programming as a theory of sensorimotor control
.
Biological Cybernetics
,
108
(
4
),
459
473
.
Jiang
,
Y.
, &
Jiang
,
Z. P.
(
2015
).
A robust adaptive dynamic programming principle for sensorimotor control with signal-dependent noise
.
Journal of Systems Science and Complexity
,
28
(
2
),
261
288
.
Jiang
,
Y.
, &
Jiang
,
Z. P.
(
2017
).
Robust Adaptive Dynamic Programming
.
Hoboken, NJ
:
Wiley-IEEE Press
.
Jiang
,
Z. P.
, &
Jiang
,
Y.
(
2013
).
Robust adaptive dynamic programming for linear and nonlinear systems: An overview
.
European Journal of Control
,
19
(
5
),
417
425
.
Jiang
,
Z. P.
, &
Liu
,
T.
(
2018
).
Small-gain theory for stability and control of dynamical networks: A survey
.
Annual Reviews in Control
,
46
,
58
79
.
Kawato
,
M.
(
1999
).
Internal models for motor control and trajectory planning
.
Current Opinion in Neurobiology
,
9
(
6
),
718
727
.
Kushner
,
H. J.
(
1967
).
Stochastic stability and control
.
London
:
Academic Press
.
Lewis
,
F. L.
, &
Liu
,
D.
(
2013
).
Reinforcement learning and approximate dynamic programming for feedback control
.
Hoboken, NJ
:
Wiley
.
Lewis
,
F. L.
,
Vrabie
,
D.
, &
Vamvoudakis
,
K. G.
(
2012
).
Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers
.
IEEE Control Systems
,
32
(
6
),
76
105
.
Lisberger
,
S. G.
, &
Medina
,
J. F.
(
2015
).
How and why neural and motor variation are related
.
Current Opinion in Neurobiology
,
33
,
110
116
.
Liu
,
D.
, &
Todorov
,
E.
(
2007
).
Evidence for the flexible sensorimotor strategies predicted by optimal feedback control
.
Journal of Neuroscience
,
27
(
35
),
9354
9368
.
Liu
,
T.
,
Jiang
,
Z. P.
, &
Hill
,
D. J.
(
2014
).
Nonlinear control of dynamic networks
.
Boca Raton, FL
:
CRC Press
.
Minsky
,
M. L.
(
1954
).
Theory of neural-analog reinforcement systems and its application to the brain model problem
. PhD diss., Princeton University.
Morasso
,
P.
(
1981
).
Spatial control of arm movements
.
Experimental Brain Research
,
42
(
2
),
223
227
.
Pekny
,
S. E.
,
Izawa
,
J.
, &
Shadmehr
,
R.
(
2015
).
Reward-dependent modulation of movement variability
.
Journal of Neuroscience
,
35
(
9
),
4015
4024
.
Prochazka
,
A.
(
1989
).
Sensorimotor gain control: A basic strategy of motor systems
?
Progress in Neurobiology
,
33
(
4
),
281
307
.
Qian
,
N.
,
Jiang
,
Y.
,
Jiang
,
Z. P.
, &
Mazzoni
,
P.
(
2012
).
Movement duration, Fitts's law, and an infinite-horizon optimal feedback control model for biological motor systems
.
Neural Computation
,
25
(
3
),
697
724
.
Rangel
,
A.
,
Camerer
,
C.
, &
Montague
,
P. R.
(
2008
).
A framework for studying the neurobiology of value-based decision making
.
Nature Reviews Neuroscience
,
9
(
7
),
545
556
.
Renart
,
A.
, &
Machens
,
C. K.
(
2014
).
Variability in neural activity and behavior
.
Current Opinion in Neurobiology
,
25
,
211
220
.
Scarciotti
,
G.
, &
Astolfi
,
A.
(
2017
).
Data-driven model reduction by moment matching for linear and nonlinear systems
.
Automatica
,
79
,
340
351
.
Shadmehr
,
R.
, &
Mussa-Ivaldi
,
F. A.
(
1994
).
Adaptive representation of dynamics during learning of a motor task
.
Journal of Neuroscience
,
14
(
5
),
3208
3224
.
Shadmehr
,
R.
, &
Mussa-Ivaldi
,
S.
(
2012
).
Biological learning and control: How the brain builds representations, predicts events, and makes decisions
.
Cambridge, MA
:
MIT Press
.
Smith
,
M. A.
,
Ghazizadeh
,
A.
, &
Shadmehr
,
R.
(
2006
).
Interacting adaptive processes with different timescales underlie short-term motor learning
.
PLOS Biology
,
4
(
6
),
e179
.
Sutton
,
R. S.
(
1988
).
Learning to predict by the methods of temporal differences
.
Machine Learning
,
3
(
1
),
9
44
.
Sutton
,
R. S.
, &
Barto
,
A. G.
(
2018
).
Reinforcement learning: An introduction
(2nd ed.).
Cambridge, MA
:
MIT Press
.
Tang
,
C.
, &
Başar
,
T.
(
2001
). Stochastic stability of singularly perturbed nonlinear systems. In
Proceedings of the 40th IEEE Conference on Decision and Control
(
vol. 1
, pp.
399
404
).
Piscataway, NJ
:
IEEE
.
Taylor
,
J. A.
, &
Ivry
,
R. B.
(
2014
).
Cerebellar and prefrontal cortex contributions to adaptation, strategies, and reinforcement learning
.
Progress in Brain Research
,
210
,
217
253
.
Todorov
,
E.
(
2004
).
Optimality principles in sensorimotor control
.
Nature Neuroscience
,
7
(
9
),
907
915
.
Todorov
,
E.
(
2005
).
Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system
.
Neural Computation
,
17
(
5
),
1084
1108
.
Todorov
,
E.
, &
Jordan
,
M. I.
(
2002
).
Optimal feedback control as a theory of motor coordination
.
Nature Neuroscience
,
5
(
11
),
1226
1235
.
Uno
,
Y.
,
Kawato
,
M.
, &
Suzuki
,
R.
(
1989
).
Formation and control of optimal trajectory in human multijoint arm movement
.
Biological Cybernetics
,
61
(
2
),
89
101
.
van Beers
,
R. J.
(
2007
).
The sources of variability in saccadic eye movements
.
Journal of Neuroscience
,
27
(
33
),
8757
8770
.
van Beers
,
R. J.
,
Baraduc
,
P.
, &
Wolpert
,
D. M.
(
2002
).
Role of uncertainty in sensorimotor control
.
Philosophical Transactions of the Royal Society of London B: Biological Sciences
,
357
(
1424
),
1137
1145
.
Vaswani
,
P. A.
,
Shmuelof
,
L.
,
Haith
,
A. M.
,
Delnicki
,
R. J.
,
Huang
,
V. S.
,
Mazzoni
,
P.
, …
Krakauer
,
J. W.
(
2015
).
Persistent residual errors in motor adaptation tasks: Reversion to baseline and exploratory escape
.
Journal of Neuroscience
,
35
(
17
),
6969
6977
.
Vrabie
,
D.
,
Vamvoudakis
,
K. G.
, &
Lewis
,
F. L.
(
2013
).
Optimal adaptive control and differential games by reinforcement learning principles
.
London
:
Institution of Engineering and Technology
.
Wang
,
D.
,
He
,
H.
, &
Liu
,
D.
(
2017
).
Adaptive critic nonlinear robust control: A survey
.
IEEE Transactions on Cybernetics
,
47
(
10
),
3429
3451
.
Wolpert
,
D. M.
, &
Ghahramani
,
Z.
(
2000
).
Computational principles of movement neuroscience
.
Nature Neuroscience
,
3
,
1212
1217
.
Wolpert
,
D. M.
,
Ghahramani
,
Z.
, &
Jordan
,
M. I.
(
1995
).
An internal model for sensorimotor integration
.
Science
,
269
(
29
),
1880
1882
.
Wu
,
H. G.
,
Miyamoto
,
Y. R.
,
Castro
,
L. N. G.
,
Olveczky
,
B. P.
, &
Smith
,
M. A.
(
2014
).
Temporal structure of motor variability is dynamically regulated and predicts motor learning ability
.
Nature Neuroscience
,
17
(
2
),
312
321
.
Zarahn
,
E.
,
Weston
,
G. D.
,
Liang
,
J.
,
Mazzoni
,
P.
, &
Krakauer
,
J. W.
(
2008
).
Explaining savings for visuomotor adaptation: Linear time-invariant state-space models are not sufficient
.
Journal of Neurophysiology
,
100
(
5
),
2537
2548
.
Zhou
,
S.-H.
,
Fong
,
J.
,
Crocher
,
V.
,
Tan
,
Y.
,
Oetomo
,
D.
, &
Mareels
,
I. M. Y.
(
2016
).
Learning control in robot-assisted rehabilitation of motor skills—A review
.
Journal of Control and Decision
,
3
(
1
),
19
43
.