## Abstract

This letter proposes a new idea for improving learning efficiency in reinforcement learning (RL) with the actor-critic method used as a muscle controller for posture stabilization of the human arm. Actor-critic RL (ACRL) is used in simulations to realize posture control in humans or robots through muscle tension control. However, it requires very high computational costs to acquire a good muscle control policy for desirable postures. For efficient ACRL, we focused on embodiment, which is thought to enable efficient control in the research fields of artificial intelligence and robotics. According to the neurophysiology of motion control obtained from experimental studies on animals and humans, the pedunculopontine tegmental nucleus (PPTn) induces muscle tone suppression, and the midbrain locomotor region (MLR) induces muscle tone promotion. The PPTn and MLR modulate the activation levels of mutually antagonizing muscles, such as flexors and extensors, in a process through which control signals are transmitted from the substantia nigra reticulata to the brain stem. We therefore hypothesized that the PPTn and MLR control muscle tone, that is, the maximum activation levels of mutually antagonizing muscles, using a different sigmoidal function for each muscle. We then introduced antagonism function models (AFMs) of the PPTn and MLR for individual muscles, incorporating this hypothesis into the process that determines the activation level of each muscle based on the output of the actor in ACRL.

ACRL with AFMs, representing the embodiment of muscle tone, successfully achieved posture stabilization for five joint motions of the right arm of a human adult male under gravity at predetermined target angles, and did so at an earlier stage of learning than methods without AFMs. These results suggest that introducing the embodiment of muscle tone can enhance learning efficiency in posture stabilization of humans or humanoid robots.

## 1  Introduction

Humans exist in an environment governed by gravity. Without muscle activity, we cannot stand or perform activities of daily living. How humans control their muscles for posture stabilization and intentional motion is one of the major questions in neurology and robotics. Posture stabilization is particularly critical for understanding the mechanisms of human motion, because human motions start from a determined posture that is stabilized in a gravitational environment. Many researchers have made valuable efforts to understand how multiple muscles are controlled to realize target postures or target motions. Linear feedback gain control methods, such as the proportional-integral-derivative control law, and optimal control algorithms with cost functions have been applied to estimate the activation levels of several muscles for posture stabilization and intentional motion (Rooij, 2011; Kato, Nakahira, Atsumi, & Iwamoto, 2018; Thelen, Anderson, & Delp, 2003). These methods are useful for estimating the activation of multiple muscles in order to hold a target posture, reach a final goal, or follow a predetermined path under a given dynamic environment. However, for the same reasons, they cannot achieve robust motion control under unexpected dynamic environments that differ from the given one.

Reinforcement learning (RL) has recently become attractive as a method that performs action selection by interacting with unknown environments. Among the many methods in RL, the actor-critic model (Barto, 1995), which is presumed to reflect RL in the basal ganglia (Barto, 1995; Doya, 2000b; Morimoto & Doya, 2005), has been used to realize target postures or target motions in expected or unexpected dynamic environments (Kambara, Kim, Sato, & Koike, 2004; Kambara, Kim, Shin, Sato, & Koike, 2009; Iwamoto, Nakahira, Kimpara, Sugiyama, & Min, 2012; Min, Iwamoto, Kakei, & Kimpara, 2018). Kambara et al. (2004) proposed a computational model of feedback-error learning with actor-critic RL (ACRL) for arm posture control and learning. Their model realized posture stabilization of a human hand after learning from 22,500 trials (each trial lasting up to 2 s), but they used a two-link arm model with two joints and six muscles and implemented no gravity effect; this simplification eased the simulation but fell short of realistic motions of the human arm with multiple muscles in dynamic environments, including one governed by gravity. By contrast, Min et al. (2018) proposed a musculoskeletal finite element (FE) model of the human right upper extremity and a muscle control system consisting of ACRL and a muscle synergy control strategy. They successfully reproduced arm posture stabilization in unexpected dynamic environments, in which a weight was suddenly loaded on the hand under gravity, after learning from approximately 700 trials (each trial lasting 2 s). However, because the FE model contained the elastic part of each muscle, including multiple nodes with 6 degrees of freedom and contact definitions between muscles and rigid wrapping shell elements implemented to reproduce the muscular lines of action, the iterative calculation of learning was time-consuming.
In these previous studies on ACRL, very high computational costs, which depend on the biofidelity of the human arm model, were needed to obtain simulation results, including muscle control strategies for arm posture stabilization. Although ACRL can be applied to posture stabilization of computational human body models with multiple muscles or of humanoid robots with muscular structures, efficient ACRL is critical, especially for humanoid robots, to achieve high performance in motion control with an online learning process.

Some studies have recently been conducted to reduce computational costs in RL (Silver et al., 2014; Popov et al., 2018; Andrychowicz et al., 2017). Silver et al. (2014) proposed the deterministic policy gradient (DPG) algorithm for efficient ACRL with continuous actions and applied it to an octopus arm task, the goal of which was to strike a target with any part of an arm consisting of six segments attached to a rotating base. The algorithm realized efficient learning with 50 continuous state variables and 20 action variables, controlling three muscles in each segment as well as rotations of the base. Popov et al. (2018) used the deep deterministic policy gradient algorithm (DDPG) with a model-free Q-learning-based method to design the reward function and realized dexterous manipulations of robot hands in a small number of trials, as intended by the designers. Andrychowicz et al. (2017) proposed DDPG with a method called hindsight experience replay, which increases the teaching signal by using failure experiences for learning, and thereby achieved complicated behaviors of a robot arm in a small number of trials. In these studies (Popov et al., 2018; Andrychowicz et al., 2017), robot arms with 7 or 9 degrees of freedom were used to manipulate objects in the MuJoCo physics engine (Todorov, Erez, & Tassa, 2012), for example, picking up a ball or a Lego brick and moving it to a goal position. Although the controllers using RL and the octopus arm or robot arms interacted in dynamic environments, the methodology for efficient learning focused on the internal control system with RL, corresponding to the brain. By contrast, there is increasing interest in the effects of embodiment on intelligent behavior and cognition (Pfeifer, Lungarella, & Iida, 2007; Hoffmann & Pfeifer, 2012). Hoffmann and Pfeifer (2012) argued that embodiment can improve the cognitive functions of artificial intelligence.
For example, passive dynamic walkers are capable of walking down an incline without any actuation or control. With no motors and no sensors, a walker relying mainly on its leg segment lengths, mass distribution, and foot shape can walk with gravity as its only power source. This indicates that embodiment can achieve efficient control of walking or balancing in dynamic environments. Therefore, to identify an efficient RL method for human arm posture stabilization under gravity, we developed an ACRL model to control the activation of multiple muscles, in which we introduced a musculoskeletal model of a human upper extremity and antagonism function models (AFMs) for embodiment to achieve efficient learning.

## 2  Method

In a previous study, Min et al. (2018) developed a muscle control system that consisted of ACRL and muscle synergy control strategy and reproduced arm posture stabilization using a musculoskeletal FE model. In this study, we also used ACRL for posture stabilization of a computational human arm model under gravity. Figure 1 shows the architecture of ACRL and a musculoskeletal model of the human right upper extremity used in this study.
Figure 1:

Architecture of actor-critic reinforcement learning (ACRL).


### 2.1  Musculoskeletal Model of the Human Right Upper Extremity

In this study, we developed a musculoskeletal model of the right upper extremity of a human adult male using Matlab (MathWorks, U.S.A.), as shown on the right side of Figure 1. The skeletal parts of the upper extremity model were divided into five parts (the scapula, humerus, ulna, radius, and hand), which were modeled as rigid bodies. The inertia properties $I_{xx}$, $I_{yy}$, and $I_{zz}$ and the masses of the rigid bodies simulating the skeletal parts are listed in Table 1. These data were obtained from an FE model of the human body that we developed previously (Iwamoto et al., 2012). The inertia products $I_{xy}$, $I_{yz}$, and $I_{zx}$ were set to 0.0 for all rigid bodies. The elbow joint was modeled using a mechanical joint that represents two elbow motions, flexion-extension and inversion-eversion, whereas the shoulder joint was modeled using a similar mechanical joint that represents three shoulder motions: internal-external rotation, flexion-extension, and inversion-eversion. The musculoskeletal model has 20 muscles: deltoid anterior, deltoid middle, deltoid posterior, teres major, teres minor, supraspinatus, infraspinatus, subscapularis, coracobrachialis, biceps brachii (long head and short head), triceps brachii (long head, lateral head, and medial head), brachialis, brachioradialis, pronator teres, anconeus, supinator, and pronator quadratus. Each muscle was modeled using the Hill-type muscle model, which includes a contractile element and a parallel elastic element, according to Zajac (1989). The muscle activation level $u$ was associated with the force $f_m$ of a muscle $m$ using the following equations:
$f_m = f_{\max}\,(u f_L f_V + f_{PE})\cos\alpha,$
(2.1)
$f_L = \exp\!\left(-\frac{(\bar{l}_m - 1)^2}{S_L}\right),$
(2.2)
$f_V = \begin{cases} 0 & (\bar{v}_m < -1), \\ \dfrac{1 + \bar{v}_m}{1 - \bar{v}_m/A_f} & (-1 \le \bar{v}_m < 0), \\ \dfrac{(B_f - 1) + \bar{v}_m(2 + 2/A_f)B_f}{(B_f - 1) + \bar{v}_m(2 + 2/A_f)} & (0 \le \bar{v}_m), \end{cases}$
(2.3)
$f_{PE} = \begin{cases} 0 & (\bar{l}_m < 1), \\ \dfrac{\exp\!\left(k_{PE}(\bar{l}_m - 1)/e_0\right) - 1}{\exp(k_{PE}) - 1} & (1 \le \bar{l}_m), \end{cases}$
(2.4)
where $\bar{l}_m = l_m/l_{m0}$ and $\bar{v}_m = \dot{l}_m/v_{\max}$ are the normalized length and normalized contractile velocity of a muscle $m$, respectively. The parameters $S_L$, $A_f$, $B_f$, $k_{PE}$, and $e_0$ were set to $S_L = 0.45$, $A_f = 0.25$, $B_f = 1.4$, $k_{PE} = 5.00$, and $e_0 = 0.60$ based on Thelen (2003). $f_{\max}$ (N), $\alpha$ (deg), $l_{m0}$ (m), and $v_{\max}$ (m/s) are the maximum contractile force, pennation angle, optimal fiber length, and maximum contractile velocity, respectively. $f_{\max}$ was determined by $f_{\max} = \sigma_m k g$, where $\sigma_m$ is the physiological cross-sectional area (PCSA) of a muscle $m$, $k = 5.5$ (kg/cm$^2$) is a coefficient according to Gans (1982), and $g = 9.8$ (m/s$^2$) is the gravitational acceleration. $v_{\max}$ was determined by $v_{\max} = 10\,l_{m0}$ according to Thelen (2003). The PCSA of each muscle was determined based on the study by Winters (1990), and $\alpha$ and $l_{m0}$ were determined based on the methods of Murray, Buchanan, and Delp (2000). The PCSA, pennation angle, and optimal fiber length used in the musculoskeletal model of the human arm are listed in Table 2.
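As a concrete illustration, equations 2.1 to 2.4 with the Thelen (2003) constants above can be sketched in Python (the study itself uses Matlab; the function and variable names here are ours):

```python
import math

# Thelen (2003) constants as given in the text
S_L, A_f, B_f, k_PE, e_0 = 0.45, 0.25, 1.4, 5.00, 0.60

def f_L(l_bar):
    """Active force-length relation, eq. 2.2."""
    return math.exp(-(l_bar - 1.0) ** 2 / S_L)

def f_V(v_bar):
    """Force-velocity relation, eq. 2.3."""
    if v_bar < -1.0:
        return 0.0
    if v_bar < 0.0:
        return (1.0 + v_bar) / (1.0 - v_bar / A_f)
    return ((B_f - 1.0) + v_bar * (2.0 + 2.0 / A_f) * B_f) / \
           ((B_f - 1.0) + v_bar * (2.0 + 2.0 / A_f))

def f_PE(l_bar):
    """Passive parallel-elastic force, eq. 2.4."""
    if l_bar < 1.0:
        return 0.0
    return (math.exp(k_PE * (l_bar - 1.0) / e_0) - 1.0) / (math.exp(k_PE) - 1.0)

def muscle_force(u, l_m, v_m, f_max, alpha_deg, l_m0, v_max):
    """Total muscle force f_m, eq. 2.1."""
    l_bar, v_bar = l_m / l_m0, v_m / v_max
    return f_max * (u * f_L(l_bar) * f_V(v_bar) + f_PE(l_bar)) \
        * math.cos(math.radians(alpha_deg))
```

Note that at optimal fiber length and zero velocity ($\bar{l}_m = 1$, $\bar{v}_m = 0$), the active terms reduce to 1 and the passive term to 0, so the force is simply $u\,f_{\max}\cos\alpha$.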
Table 1:
Inertia Properties and Masses of Rigid Bodies Simulating Skeletal Parts of the Human Right Arm Used in This Study.
| Rigid Body | $I_{xx}$ [kg·mm$^2$] | $I_{yy}$ [kg·mm$^2$] | $I_{zz}$ [kg·mm$^2$] | Mass [kg] |
| --- | --- | --- | --- | --- |
| Scapula | 3.432 | 4.166 | 5.097 | 1.27 |
| Humerus | 2.496 | 16.393 | 17.228 | 1.69 |
| Ulna | 0.028 | 0.628 | 0.639 | 0.14 |
| Hand | 0.575 | 1.206 | 1.351 | 0.54 |

Source: Iwamoto et al. (2012).

Table 2:
Parameters of Human Arm Musculoskeletal Model Used in This Study.
| Muscle | PCSA [mm$^2$] | Pennation Angle [deg] | Optimal Fiber Length [mm] |
| --- | --- | --- | --- |
| Deltoid anterior | 546 | 22.0 | 193.5 |
| Deltoid middle | 1000 | 15.0 | 165.1 |
| Deltoid posterior | 469 | 18.0 | 190.5 |
| Teres major | 497 | 16.0 | 121.9 |
| Teres minor | 244 | 24.0 | 104.1 |
| Supraspinatus | 770 | 9.0 | 120.0 |
| Infraspinatus | 1200 | 11.5 | 135.0 |
| Subscapularis | 2000 | 12.9 | 126.0 |
| Coracobrachialis | 167 | 27.0 | 185.4 |
| Biceps brachii long head | 413 | 0.0 | 270.0 |
| Biceps brachii short head | 396 | 2.5 | 230.0 |
| Triceps brachii long head | 800 | 10.0 | 312.4 |
| Triceps brachii lateral head | 1050 | 10.0 | 246.4 |
| Triceps brachii medial head | 610 | 17.0 | 213.4 |
| Brachialis | 948 | 4.0 | 199.0 |
| Pronator teres | 437 | 10.0 | 160.0 |
| Anconeus | 200 | 0.0 | 58.0 |
| Supinator | 395 | 0.0 | 57.0 |

Source: Winters (1990); Murray et al. (2000).

The moment arm of a muscle was determined by the muscle's line of action relative to the joint position and represents the relationship between muscular force and joint motion. The biofidelity of the musculoskeletal model was validated by comparing the moment arm of each muscle predicted by the model with that obtained from experimental test data using human subjects. In this study, we created the lines of action of the 20 muscles by referring to the surface data of the muscles obtained from anatomical models of ZygoteBody (Zygote Media Group, U.S.A.). The moment arm vector $r_m$ of a muscle $m$ was defined as a normal vector from the center of the elbow or shoulder joint to the line of action of the muscular force, as shown in Figure 2a, and the moment $\tau$ around the elbow or shoulder joint was calculated from the contractile force $f_m$ of a muscle $m$ as follows:
$\tau = r_m \times f_m,$
(2.5)
where $\times$ represents the cross product of the vectors. According to the principle of virtual work, the moment arm $|r_m|$ can be calculated using the following equation:
$|r_m| = \Delta l_m / \Delta\theta,$
(2.6)
where $\Delta\theta$ is the differential of the joint angle $\theta$ and $\Delta l_m$ is the differential of the muscle length.
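Equation 2.6 can be checked numerically: given a muscle-length function $l_m(\theta)$, a finite-difference quotient recovers the moment arm. A minimal Python sketch with a hypothetical wrapping geometry (the radius and lengths below are illustrative, not model values):

```python
def moment_arm(muscle_length, theta, d_theta=1e-5):
    """|r_m| = dl_m / dtheta (eq. 2.6), by central finite difference."""
    return abs(muscle_length(theta + d_theta)
               - muscle_length(theta - d_theta)) / (2.0 * d_theta)

# Toy geometry: a muscle wrapping a circular joint capsule of radius r has
# length l(theta) = l0 + r * theta, so the moment arm is exactly r.
r = 0.03  # 30 mm, illustrative
length = lambda theta: 0.25 + r * theta
arm = moment_arm(length, 0.5)  # recovers r = 0.03
```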
Figure 2:

Comparisons of moment arm versus elbow or shoulder flexion angles between model prediction and test data.


The elbow and shoulder joint motions can be calculated using a forward dynamic method that solves the equation of motion of the musculoskeletal model, expressed as the following differential algebraic equation (Nikravesh, 1988):
$\begin{bmatrix} M & P^T & B^T \\ P & 0 & 0 \\ B & 0 & 0 \end{bmatrix}\begin{bmatrix} \ddot{q} \\ \sigma \\ \lambda \end{bmatrix} = \begin{bmatrix} g - b \\ c - 2\alpha^*\dot{\Psi} - \beta^{*2}\Psi \\ \gamma - 2\alpha\dot{\Phi} - \beta^2\Phi \end{bmatrix},$
(2.7)
where $\ddot{q}$ is the generalized acceleration, and the generalized coordinate of rigid body $i$ is $q_i = [x_i^T\ p_i^T]^T$ ($T$: transpose), comprising the position $x_i$ of a body-fixed coordinate system whose axes are the principal axes of inertia and whose origin is the center of gravity of the rigid body, and a posture expression $p_i$ using Euler parameters, which represent rotations with four variables (Nikravesh, 1988). $M$ is the inertia matrix based on the inertial properties of the rigid bodies representing the skeletal parts. $g$ represents the generalized forces, including the muscular forces obtained from equation 2.1 for each muscle $m$ and the gravity force, calculated as the gravitational acceleration multiplied by the mass of each rigid body. $b$ indicates the centrifugal and Coriolis forces. $P$ and $c$ are a coefficient matrix and a constant term obtained from the second-order time derivatives of the constraint conditions on the Euler parameters, whereas $B$ and $\gamma$ are a coefficient matrix and a constant term obtained from the second-order time derivatives of the constraint conditions based on the joint locations and their degrees of freedom. $\sigma$ and $\lambda$ are Lagrange multipliers. $\Psi$ and $\dot{\Psi}$ are the constraint conditions on the Euler parameters and the velocity constraints given by their time derivative, while $\Phi$ and $\dot{\Phi}$ are the joint constraint conditions and their velocity constraints. $\alpha^*$, $\beta^*$, $\alpha$, and $\beta$ are weight coefficients adjusting the specific weight of each constraint condition.
Figure 3:

Normalized gaussian network and base function.


In the forward dynamic calculation, the coefficient matrices $P$ and $B$ on the left side of equation 2.7 and the constant terms $c$ and $\gamma$ on the right side are computed from the generalized coordinates and velocities, $q_t$ and $\dot{q}_t$, at each time $t$, together with each muscular force and the gravity force of each rigid body, with reference to Nikravesh (1988). The generalized acceleration $\ddot{q}_t$ is then obtained by solving equation 2.7 with the mldivide function of Matlab. The joint motions are obtained by sequentially calculating the generalized velocities and coordinates, $\dot{q}_{t+\Delta t}$ and $q_{t+\Delta t}$, at the next time $t + \Delta t$ using ode113, an ordinary differential equation solver of Matlab. In the simulation, the initial generalized coordinates are given by inputting the joint angles at the initial time, and the initial generalized velocities are set to zero for a static situation. Each joint angle is obtained as a Euler angle by calculating the homogeneous transformation matrix between the generalized coordinates $q_t$ of the two rigid bodies connected via the joint.
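The per-step solve of equation 2.7 can be sketched as a hypothetical Python/NumPy analogue of the Matlab mldivide step (matrix shapes and names are ours, and the Baumgarte-style stabilization terms are assumed to be folded into the right-hand-side vectors):

```python
import numpy as np

def generalized_acceleration(M, P, B, g, b, c, gamma):
    """Solve eq. 2.7 for [q_ddot; sigma; lambda] at one time step.

    M: (n, n) inertia matrix; P: (p, n), B: (m, n) constraint Jacobians;
    g - b: (n,) applied minus velocity-dependent forces; c, gamma: (p,), (m,)
    constraint right-hand sides.  np.linalg.solve plays the role of mldivide.
    """
    n, p, m = M.shape[0], P.shape[0], B.shape[0]
    A = np.zeros((n + p + m, n + p + m))
    A[:n, :n] = M
    A[:n, n:n + p] = P.T
    A[:n, n + p:] = B.T
    A[n:n + p, :n] = P
    A[n + p:, :n] = B
    rhs = np.concatenate([g - b, c, gamma])
    sol = np.linalg.solve(A, rhs)
    return sol[:n]  # generalized acceleration q_ddot
```

Integrating $\ddot{q}_t$ to $\dot{q}_{t+\Delta t}$ and $q_{t+\Delta t}$ would then be handed to an ODE solver, as the study does with ode113.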

### 2.2  ACRL Method

In this study, we implemented ACRL, a method based on temporal difference (TD) learning, to acquire a muscle control policy for posture stabilization under unknown environments. A control network, called the actor, and an evaluation network, called the critic, are used in the actor-critic method, as shown in Figure 1. Each network is a three-layer neural network whose input layer consists of the state variable $s(t)$, as shown in Figure 3a. The critic network predicts the value function $V(s(t))$, and the actor network acquires a control policy $a(t)$ that maximizes $V(s(t))$ through learning trials. In this study, the critic and actor networks were implemented using the normalized gaussian network (NGnet) and a continuous-time formulation of RL (Doya, 2000a), because we target posture stabilization of the human arm with multiple degrees of freedom, and the state variable $s(t)$ must be defined in continuous, multidimensional state spaces. In the NGnet, the continuous state space was modeled using a gaussian soft-max network that can generalize the state space by extrapolation, even outside the range of the base functions of a radial basis function network (Shibata & Ito, 1999).

We set a 10-dimensional state space using the difference $d\theta(t)$ between the current angle $\theta(t)$ and target angle $\theta^{trg}$ and the difference $d\dot{\theta}(t)$ between the current angular velocity $\dot{\theta}(t)$ and target angular velocity $\dot{\theta}^{trg}$, as $s(t) = (d\theta_{ELV}(t), d\dot{\theta}_{ELV}(t), d\theta_{ELW}(t), d\dot{\theta}_{ELW}(t), d\theta_{SHU}(t), d\dot{\theta}_{SHU}(t), d\theta_{SHV}(t), d\dot{\theta}_{SHV}(t), d\theta_{SHW}(t), d\dot{\theta}_{SHW}(t))$. ELV, ELW, SHU, SHV, and SHW represent flexion-extension of the elbow joint, inversion-eversion of the elbow joint, internal-external rotation of the shoulder joint, flexion-extension of the shoulder joint, and inversion-eversion of the shoulder joint, respectively. According to an anatomical text (Neumann, 2010), the angle ranges of ELV, ELW, SHU, SHV, and SHW were set to $-135$ to 17 degrees, 0 to 180 degrees, $-120$ to 40 degrees, $-170$ to 50 degrees, and $-90$ to 70 degrees, respectively. Using the NGnet, the state value function $V(s(t))$ in the critic and the action value function $a_m(s(t))$ for the $m$th muscle in the actor are represented as follows:
$V(s(t)) = \sum_{k=1}^{K} w_k^V b_k(s(t)),$
(2.8)
$a_m(s(t)) = \sum_{k=1}^{K} w_k^a b_k(s(t)),$
(2.9)
where $b_k(s(t))$ denotes the base function, represented by the following equations:
$b_k(s(t)) = \frac{B_k(s(t))}{\sum_{l=1}^{K} B_l(s(t))}, \qquad B_k(s(t)) = \exp\!\left(-\sum_{i=1}^{n}\left(\frac{s_i(t) - c_i}{\sigma_{bi}}\right)^2\right),$
(2.10)
where $c_i$ denotes the coordinates ($d\theta$, $d\dot{\theta}$) of the center of the activation function, and $\sigma_{bi}$, $K$, and $n$ represent a constant, the number of base functions, and the number of states in $s(t)$, respectively.
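A small Python sketch of the NGnet of equations 2.8 to 2.10 (names are ours; the centers and widths in the test below are placeholders, not the basis grid used in the study):

```python
import math

def ngnet_bases(s, centers, sigma_b):
    """Normalized gaussian bases b_k(s) of eq. 2.10."""
    raw = [math.exp(-sum(((s_i - c_i) / sig_i) ** 2
                         for s_i, c_i, sig_i in zip(s, c, sigma_b)))
           for c in centers]
    total = sum(raw)            # soft-max normalization over all K bases
    return [B_k / total for B_k in raw]

def critic_value(s, centers, sigma_b, w_V):
    """State value V(s), eq. 2.8; the actor a_m(s) of eq. 2.9 has the same
    form with its own weight vector."""
    return sum(w * b for w, b in zip(w_V, ngnet_bases(s, centers, sigma_b)))
```

Because the bases are normalized to sum to 1, the network output is a convex combination of the weights, which is what allows graceful extrapolation outside the centers.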

In this study, we treated five joint motions of the human arm, comprising elbow joint motions with 2 degrees of freedom and shoulder joint motions with 3 degrees of freedom, and set the neutral angles of ELV, ELW, SHU, SHV, and SHW to $-58$, 54, $-39$, $-36$, and 36 degrees, respectively, by referring to the space attitude reported by NASA (Tengwall et al., 1982), because each muscle is supposed to have its equilibrium length in the space attitude. However, because the musculoskeletal model initially had an ELV angle of $-30$ degrees, the neutral angle of ELV was modified to $-88$ degrees to achieve the space attitude in this study. In addition, the angle difference $d\theta(t)$ between the current and target angles ranged from $-70$ to 70 degrees, and the angular velocity difference $d\dot{\theta}(t)$ between the current and target angular velocities ranged from $-300$ to 300 degrees/s, as shown in Figure 3b. Twelve centers of the activation function, indicated as black circles in Figure 3b, were set along each axis of the angle difference $d\theta(t)$ and angular velocity difference $d\dot{\theta}(t)$, and the number of base functions was set to 144.

In the environment in which the elbow joint motions with 2 degrees of freedom and the shoulder joint motions with 3 degrees of freedom can be performed under gravity using the musculoskeletal model of the right upper extremity developed in Matlab, the agent observes the current state $s(t)$, that is, the angle and angular velocity differences of the five joint motions of the arm, and determines the activation level $u(t)$ input to each muscle of the musculoskeletal model to stabilize the posture at the predetermined target joint angles. The target angles of the five joint motions were determined from the space attitude as $\theta_{ELV}^{trg} = -88.0$, $\theta_{ELW}^{trg} = 54.0$, $\theta_{SHU}^{trg} = -39.0$, $\theta_{SHV}^{trg} = -36.0$, and $\theta_{SHW}^{trg} = 36.0$ degrees, and the target angular velocities of the five joint motions were set to zero for posture stabilization. The agent then obtains the reward $r(t)$ described by equations 2.11 to 2.13 from the environment:
$r(s(t)) = r_p(s(t)) - c\,r_u(t),$
(2.11)
$r_p(s(t)) = \sum_{j \in \{ELV,\,ELW,\,SHU,\,SHV,\,SHW\}} \left[\exp\!\left(-\left(\frac{d\theta_j}{\sigma_r}\right)^2\right) + \exp\!\left(-\left(\frac{d\dot{\theta}_j}{\sigma_r}\right)^2\right)\right],$
(2.12)
$r_u(t) = \sum_{m=1}^{N} u_m(t)^2,$
(2.13)
where $c$, $\sigma_r$, $u_m(t)$, and $N$ denote the weight of $r_u(t)$, a constant, the muscle activation level of the $m$th muscle, and the total number of muscles, respectively. The reward function $r(s(t))$ consists of the first term $r_p(s(t))$, which is set to minimize $d\theta$ and $d\dot{\theta}$ of each joint motion, and the second term $r_u(t)$, which is set to minimize the activation level $u(t)$, according to Kambara et al. (2004) and Min et al. (2018).
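Equations 2.11 to 2.13 can be sketched as follows; the values of $c$ and $\sigma_r$ are placeholders, since this section does not specify them:

```python
import math

def reward(d_theta, d_theta_dot, u, c=0.01, sigma_r=30.0):
    """Reward of eqs. 2.11-2.13.  d_theta, d_theta_dot: per-joint angle and
    angular-velocity differences (degrees); u: per-muscle activation levels.
    c and sigma_r are illustrative placeholder values."""
    r_p = sum(math.exp(-(dt / sigma_r) ** 2) for dt in d_theta) \
        + sum(math.exp(-(dv / sigma_r) ** 2) for dv in d_theta_dot)
    r_u = sum(u_m ** 2 for u_m in u)
    return r_p - c * r_u
```

At the target posture with zero velocity and zero activation, each of the ten gaussian terms equals 1, so the reward reaches its maximum of 10 for the five joint motions.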
The critic network outputs the value function $V(s(t))$ from the current state $s(t)$ using equation 2.8 and learns to drive the prediction error, that is, the TD error $\delta(t)$ described by equation 2.14, to zero,
$\delta(t) = r(s(t)) + \gamma V(s(t+1)) - V(s(t)) = r(s(t)) + \left(1 - \frac{\Delta t}{\tau}\right) V(s(t+1)) - V(s(t)),$
(2.14)
where $\gamma$ denotes the discount factor, which ranges from 0 to 1, and $\tau$ denotes a time constant of evaluation. In calculating the TD error $\delta(t)$ of equation 2.14 with online learning, which updates sequentially at every time step, an approach based on the backward Euler approximation of the time derivative $\dot{V}(s(t))$, using the eligibility trace $e_k(t)$ updated by equation 2.15, is often utilized (Doya, 2000a),
$\dot{e}_k(t) = -\frac{1}{\kappa} e_k(t) + \frac{\partial V(s(t))}{\partial w_k^V},$
(2.15)
where $\kappa$ denotes a time constant of the eligibility trace. The value function $V(s(t))$ is updated by equation 2.16, including the eligibility trace $e_k(t)$,
$\Delta V(s(t)) = \alpha^V \delta(t)\, e_k(t),$
(2.16)
where $\alpha^V$ denotes the learning rate of the critic. The TD error $\delta(t)$ is then calculated using equation 2.14.
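One online critic step (equations 2.14 to 2.16) can be sketched in Python; the numerical constants are placeholders, and the gradient $\partial V/\partial w_k^V = b_k(s(t))$ follows from equation 2.8:

```python
def critic_step(r, V_t, V_next, e, b, w_V, dt=0.01, tau=1.0, kappa=0.1, alpha_V=0.1):
    """One online critic update.  r: reward; V_t, V_next: value at s(t), s(t+1);
    e: eligibility traces; b: bases b_k(s(t)); w_V: critic weights.
    dt, tau, kappa, alpha_V are illustrative placeholder values."""
    delta = r + (1.0 - dt / tau) * V_next - V_t                        # eq. 2.14
    e = [e_k + dt * (-e_k / kappa + b_k) for e_k, b_k in zip(e, b)]    # eq. 2.15
    w_V = [w_k + alpha_V * delta * e_k for w_k, e_k in zip(w_V, e)]    # eq. 2.16
    return delta, e, w_V
```

The forward-Euler update of the trace here is one simple discretization of equation 2.15; the study's exact integration scheme is not specified in this excerpt.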
The actor network outputs the action value function $a_m(s(t))$ for the $m$th muscle from the current state $s(t)$ using equation 2.9 and learns to increase the value function $V(s(t))$, maximizing the expected cumulative reward. In calculating the action value function $a_m(s(t))$ in equation 2.9, the weight $w_k^a$ of the actor value function is updated by equation 2.19, which includes the TD error $\delta(t)$. The activation level $u_m(t)$ of the $m$th muscle is obtained using equation 2.17, which includes the weights $w_k^a$ of the action value function $a_m(s(t))$, according to Min et al. (2018),
$u_m(t) = u_m^{\max}\,\mathrm{sig}\!\left(A\left(\sum_{k=1}^{K} (w_k^a)_m\, b_k(s(t)) + \sigma(s(t))\, n_m(t)\right) - B\right),$
(2.17)
$u_m^{\max} = 1.0, \qquad \sigma(s(t)) = \exp(-0.5\,V(s(t))),$
(2.18)
$\Delta(w_k^a)_m = \alpha^a\,\delta(t)\,n_{IG}(t)\,\sigma(s(t))\,\frac{\partial a_m(s(t))}{\partial (w_k^a)_m},$
(2.19)
$n_{IG}(t) = \nu_I\, n_I(t) + \nu_G\, n_G(t), \qquad \nu_G = \sigma(s(t)), \quad \nu_I = 1 - \nu_G,$
(2.20)
where $u_m^{\max}$ is the maximum value of the activation level of the $m$th muscle, $\mathrm{sig}()$ denotes the sigmoid function, and $A$ and $B$ are constants of the sigmoid function. Moreover, $\alpha^a$ denotes the learning rate of the actor. $n_m(t)$ is a white noise function, randomly drawn for each muscle $m$ from 0 to 1 at every time step to explore the control output. $n_{IG}(t)$ is a white noise function that allocates the weight variation $\Delta(w_k^a)_m$ to $u_m(t)$ of the individual muscles; it was introduced by Min et al. (2018) as equation 2.20, with two white noise functions $n_I(t)$ and $n_G(t)$, to simulate the muscle synergy strategy. The musculoskeletal model of the upper extremity has 20 muscles that control the elbow and shoulder joints. According to Neumann (2010), these muscles were classified into 12 groups based on the innervation of the peripheral nervous system and the role of each muscle. The deltoid anterior was assigned to group 1, the deltoid posterior to group 2, and the deltoid middle and teres minor to group 3, which are related to the axillary nerve. The teres major and subscapularis were assigned to group 4 and the supraspinatus and infraspinatus to group 5, which are related to the subscapular and suprascapular nerves, respectively. The coracobrachialis was assigned to group 6, and the long head and short head of the biceps brachii and the brachialis were assigned to group 8, which are related to the musculocutaneous nerve. The brachioradialis was assigned to group 7, the triceps brachii long head, lateral head, and medial head and the anconeus to group 10, and the supinator to group 11, which are related to the radial nerve. The pronator teres was assigned to group 9 and the pronator quadratus to group 12, which are related to the median nerve. For individual control, $n_I(t)$ was randomly drawn for each muscle $m$ from 0 to 1 for every gaussian base function at every time step, while $n_G(t)$ was randomly drawn for each group from 0 to 1 at every time step. According to Min et al. (2018), we introduced $n_{IG}(t)$ in equations 2.19 and 2.20, where $\nu_I$ and $\nu_G$ indicate the individual control signal and group control signal, respectively, to represent muscle synergy. The learning of synergy between $\nu_I$ and $\nu_G$ is processed under the assumption that the sum of the two components is 1.0. At the initial learning stage, $\nu_G$ and $\nu_I$ start at 1.0 and 0.0, respectively. Then $\nu_G$ decreases while $\nu_I$ increases as learning proceeds.
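Equations 2.17 and 2.18 can be sketched as follows ($A$, $B$, and the noise draw are placeholders; as in the text, the exploration noise is gated by $\sigma(s) = \exp(-0.5\,V(s))$, so it shrinks as the learned value grows):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def muscle_activation(w_a_m, b, V, u_max=1.0, A=1.0, B=0.0):
    """Activation of muscle m, eqs. 2.17-2.18.  w_a_m: actor weights for
    muscle m; b: bases b_k(s(t)); V: current critic value V(s(t)).
    A and B are illustrative placeholder constants."""
    sigma = math.exp(-0.5 * V)              # eq. 2.18: value-gated noise scale
    n_m = random.random()                   # white noise in [0, 1)
    a_m = sum(w * b_k for w, b_k in zip(w_a_m, b))   # eq. 2.9
    return u_max * sigmoid(A * (a_m + sigma * n_m) - B)
```

Early in learning, $V$ is small, $\sigma$ is near 1, and exploration dominates; as $V$ grows, the output approaches the deterministic $\mathrm{sig}(A\,a_m(s) - B)$.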

### 2.3  Antagonism Function Models

ACRL can realize posture stabilization of the human arm under gravity. Indeed, the muscle control system developed by Min et al. (2018) nearly returned to the initial elbow joint angle and held the posture after a weight was placed on the hand. However, the system requires a long computational time to acquire a muscle control policy for posture stabilization, learning over approximately 700 trials. Because they aimed to simulate how a baby acquires a muscle control policy during growth, they classified muscles into four groups in their representation of the actor value function: the brachioradialis was assigned to group 1; the long head and short head of the biceps brachii and the brachialis to group 2; the pronator teres to group 3; and the triceps brachii long head, lateral head, and medial head and the anconeus to group 4, based on the innervation of the peripheral nervous system. However, they did not consider how the flexors or extensors work for posture stabilization at the current joint angle and did not use any functional expression of the flexors or extensors. Mutually antagonizing muscles such as the flexors and extensors have fundamental functions in stabilizing the posture at a target angle. When the joint angle is extended beyond the target angle, the flexors increase their activation level and the extensors decrease theirs to stabilize the posture at the target angle. Conversely, when the joint angle is flexed beyond the target angle, the flexors decrease their activation level and the extensors increase theirs.

Referring to a series of neurophysiological experiments using decerebrate cats, Takakusaki (2017) reported the presence of GABAergic output pathways from the substantia nigra reticulata (SNr) of the basal ganglia to the pedunculopontine tegmental nucleus (PPTn) and the midbrain locomotor region (MLR) in the brain stem, in which the lateral part of the SNr blocks the PPTn-induced muscle tone suppression, whereas the medial part of the SNr suppresses the MLR-induced locomotion or muscle tone promotion. Takakusaki (2017) also suggested that the muscle tone suppression by the PPTn and the muscle tone promotion by the MLR are induced in both flexors and extensors.

By contrast, Doya (2000b) depicted a schematic diagram of the cortico-basal ganglia loop and the possible roles of its components in an RL model (see Figure 4). The neurons in the striatum predict the future reward for the current state and possible actions. The error in the prediction of future reward, that is, TD error, is encoded in the activity of dopamine neurons and is used for the learning of cortico-striatal synapses. Doya (2000b) suggested that one of the candidate actions is selected in the pathway through the SNr and globus pallidus to the thalamus and the cerebral cortex as a result of the competition of predicted future rewards.

Based on these two studies, we hypothesized that both the PPTn and MLR modulate the maximum values of the activation levels of mutually antagonizing muscles, such as the flexors and extensors, adductors and abductors, and invertors and evertors, where the activation levels are signals from the SNr to the brain stem, that is, the output of an actor of ACRL, as shown in Figure 4. Using the maximum value $u_m^{\max}$ of the activation level of the $m$th muscle in equation 2.17, we introduced two types of antagonism function models (AFMs) of the PPTn and MLR for mutually antagonizing muscles, incorporating this hypothesis into the process that determines the activation level of each muscle based on the output of the actor in ACRL.

Figure 4:

A schematic diagram of the cortico-basal ganglia-brain stem pathways in motor function and the possible roles of its components in a reinforcement learning model. This schematic diagram was created by modifying a schematic diagram originally depicted by Doya (2000b).

The first AFM was described by equations 2.21 to 2.40, based on the angle differences of the five joint motions, by referring to anatomical texts (e.g., Neumann, 2010):
$u_{1}\ \text{(Deltoid anterior)}:\quad u_{1}^{\max}=\mathrm{sig}(-0.5\,d\theta_{SHV}+0.5\,d\theta_{SHW}),$
(2.21)
$u_{2}\ \text{(Deltoid middle)}:\quad u_{2}^{\max}=\mathrm{sig}(-0.5\,d\theta_{SHU}),$
(2.22)
$u_{3}\ \text{(Deltoid posterior)}:\quad u_{3}^{\max}=\mathrm{sig}(0.2\,d\theta_{SHV}-0.2\,d\theta_{SHW}),$
(2.23)
$u_{4}\ \text{(Teres major)}:\quad u_{4}^{\max}=\mathrm{sig}(0.5\,d\theta_{SHU}+0.5\,d\theta_{SHW}),$
(2.24)
$u_{5}\ \text{(Teres minor)}:\quad u_{5}^{\max}=\mathrm{sig}(-0.5\,d\theta_{SHW}),$
(2.25)
$u_{6}\ \text{(Supraspinatus)}:\quad u_{6}^{\max}=\mathrm{sig}(-0.2\,d\theta_{SHU}),$
(2.26)
$u_{7}\ \text{(Infraspinatus)}:\quad u_{7}^{\max}=\mathrm{sig}(-0.5\,d\theta_{SHW}),$
(2.27)
$u_{8}\ \text{(Subscapularis)}:\quad u_{8}^{\max}=\mathrm{sig}(0.5\,d\theta_{SHU}+0.5\,d\theta_{SHW}),$
(2.28)
$u_{9}\ \text{(Coracobrachialis)}:\quad u_{9}^{\max}=\mathrm{sig}(-0.2\,d\theta_{SHV}+0.2\,d\theta_{SHW}),$
(2.29)
$u_{10}\ \text{(Biceps brachii long)}:\quad u_{10}^{\max}=\mathrm{sig}(-0.5\,d\theta_{SHV}-0.2\,d\theta_{SHW}-0.5\,d\theta_{ELV}+0.2\,d\theta_{ELW}),$
(2.30)
$u_{11}\ \text{(Biceps brachii short)}:\quad u_{11}^{\max}=\mathrm{sig}(-0.5\,d\theta_{SHV}-0.2\,d\theta_{SHW}-0.5\,d\theta_{ELV}+0.2\,d\theta_{ELW}),$
(2.31)
$u_{12}\ \text{(Triceps brachii long)}:\quad u_{12}^{\max}=\mathrm{sig}(0.5\,d\theta_{ELV}),$
(2.32)
$u_{13}\ \text{(Triceps brachii lateral)}:\quad u_{13}^{\max}=\mathrm{sig}(0.5\,d\theta_{ELV}),$
(2.33)
$u_{14}\ \text{(Triceps brachii medial)}:\quad u_{14}^{\max}=\mathrm{sig}(0.5\,d\theta_{ELV}),$
(2.34)
$u_{15}\ \text{(Brachialis)}:\quad u_{15}^{\max}=\mathrm{sig}(-0.5\,d\theta_{ELV}),$
(2.35)
$u_{16}\ \text{(Brachioradialis)}:\quad u_{16}^{\max}=\mathrm{sig}(-0.5\,d\theta_{ELV}),$
(2.36)
$u_{17}\ \text{(Pronator teres)}:\quad u_{17}^{\max}=\mathrm{sig}(-0.2\,d\theta_{ELV}-0.5\,d\theta_{ELW}),$
(2.37)
$u_{18}\ \text{(Anconeus)}:\quad u_{18}^{\max}=\mathrm{sig}(0.2\,d\theta_{ELV}-0.2\,d\theta_{ELW}),$
(2.38)
$u_{19}\ \text{(Supinator)}:\quad u_{19}^{\max}=\mathrm{sig}(0.5\,d\theta_{ELW}),$
(2.39)
$u_{20}\ \text{(Pronator quadratus)}:\quad u_{20}^{\max}=\mathrm{sig}(-0.5\,d\theta_{ELW}).$
(2.40)
The constants of the sigmoid function $sig()$ were set to 0.5 and 0.2 for the agonist muscles and synergist muscles, respectively. The value of 0.5 was determined based on the volunteer test data on muscle strength and muscle activations of flexors and extensors of the elbow joint motion during the performance of isometric exercise as reported by Yang et al. (2014). The value of 0.2 was determined by considering the ratios of activation levels of synergist muscles to those of agonist muscles obtained from experimental test data using electromyography (Iwamoto et al., 2012).
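To make the gain structure concrete, the first AFM can be sketched in Python. The form of $\mathrm{sig}()$ is an assumption here: we take $\mathrm{sig}(x)=A/(1+e^{-Bx})$ with the $A=1.0$ and $B=-4.0$ listed for equation 2.17 in Table 3, which makes $\mathrm{sig}$ monotonically decreasing; the gain table transcribes only four of equations 2.21 to 2.40, and all identifiers are illustrative.

```python
import math

A, B = 1.0, -4.0  # equation 2.17 constants from Table 3 (assumed shared by sig())

def sig(x):
    # Assumed sigmoid sig(x) = A / (1 + exp(-B*x)); with B = -4.0 it decreases in x.
    return A / (1.0 + math.exp(-B * x))

# Signed gains on the joint-angle differences, transcribed from a few of
# equations 2.21 to 2.40 (0.5 marks agonists, 0.2 marks synergists).
AFM1_GAINS = {
    "deltoid_anterior":     {"SHV": -0.5, "SHW": 0.5},   # eq. 2.21
    "triceps_brachii_long": {"ELV": 0.5},                # eq. 2.32
    "brachialis":           {"ELV": -0.5},               # eq. 2.35
    "pronator_teres":       {"ELV": -0.2, "ELW": -0.5},  # eq. 2.37
}

def u_max_first_afm(muscle, d_theta):
    """Maximum activation level of one muscle from joint-angle differences (degrees)."""
    x = sum(g * d_theta[j] for j, g in AFM1_GAINS[muscle].items())
    return sig(x)

# With the assumed sign conventions, an elbow flexed beyond the target
# (d_theta_ELV < 0) pushes the extensor ceiling above 0.5 and the flexor
# ceiling below 0.5, so the antagonist pair can pull the joint back.
flexed = {"SHV": 0.0, "SHW": 0.0, "ELV": -2.0, "ELW": 0.0}
flexor_cap = u_max_first_afm("brachialis", flexed)            # below 0.5
extensor_cap = u_max_first_afm("triceps_brachii_long", flexed)  # above 0.5
```

Because the two gains in such a pair differ only in sign, the pair of ceilings always sums to $A$, which is the antagonism at the heart of the model.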
The second AFM was described, based on the length rate $\Delta l_{m}=(l_{m}-l_{m0})/l_{m0}$ of each muscle $m$, by the following equation:
$u_{m}^{\max}=\mathrm{sig}(-500.0\,\Delta l_{m}+5.0).$
(2.41)
Here, $l_{m}$ and $l_{m0}$ are the current length and the equilibrium length of each muscle $m$, respectively. In this study, the equilibrium length of each muscle was determined as its length when the right arm took the space attitude. The constants of the sigmoid function $\mathrm{sig}()$ were set to 500.0 and 5.0 so that each muscle is activated quickly when extended beyond its equilibrium length and produces zero force when contracted below its equilibrium length.
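Equation 2.41 can be sketched the same way, under the same assumed form of $\mathrm{sig}(x)=A/(1+e^{-Bx})$ with $A=1.0$, $B=-4.0$ from Table 3. With these constants the transition sits near $\Delta l_{m}=0.01$, so a muscle stretched more than about 1% beyond its equilibrium length receives a ceiling near 1, while a muscle at or below its equilibrium length receives a ceiling near 0, consistent with the quick-activation and zero-force behavior described above. Identifiers are illustrative.

```python
import math

A, B = 1.0, -4.0  # assumed sigmoid constants (equation 2.17 row of Table 3)

def sig(x):
    # Assumed sigmoid sig(x) = A / (1 + exp(-B*x)); decreasing because B < 0.
    return A / (1.0 + math.exp(-B * x))

def u_max_second_afm(l_m, l_m0):
    """Equation 2.41: maximum activation level from the muscle length rate."""
    delta_l = (l_m - l_m0) / l_m0
    return sig(-500.0 * delta_l + 5.0)

# A muscle stretched 2% beyond equilibrium gets a ceiling near 1;
# one shortened 2% below equilibrium gets a ceiling near 0.
stretched = u_max_second_afm(0.102, 0.100)
shortened = u_max_second_afm(0.098, 0.100)
```

The steep gain of 500.0 makes the ceiling act almost as a length-triggered switch, which is why the text describes it as "quick activation."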

### 2.4  Simulation Conditions

We implemented the ACRL algorithm in Python 3.7 to perform parametric simulations of posture stabilization of five joint motions of the human arm under gravity. In this study, the degrees of freedom of the wrist joints of the musculoskeletal model were constrained, and the five joint motion angles were calculated using Matlab. The time step of the calculation was 0.01 s. For robust, model-free RL, the initial joint angles of flexion-extension and inversion-eversion of the elbow joint were determined randomly from $-$110 degrees to $-$10 degrees and from 10 degrees to 90 degrees, respectively, while those of internal-external rotation, flexion-extension, and inversion-eversion of the shoulder joint were determined randomly from $-$60 degrees to $-$20 degrees, from $-$90 degrees to 10 degrees, and from 10 degrees to 60 degrees, respectively. In each trial of learning, the arm motion was calculated in Matlab under gravity using the musculoskeletal model with the determined initial angles. The muscle activation level $u_m(t)$ provided by the actor at time $t$ was input to the corresponding muscle of the musculoskeletal model of the right upper extremity, and the five joint motion angles and the length of each muscle were then calculated in Matlab. Based on the state $s(t)$, consisting of $d\theta$ and $d\dot{\theta}$ of each joint motion obtained from Matlab, the value function $V(s(t))$ and reward function $r(s(t))$ were calculated, and the activation level of each muscle at the next time step $t+1$ was then computed; this cycle was repeated until the predetermined end condition was satisfied. One trial finished at 2.0 s, the termination time of the arm motion simulation, which was defined as the end condition. The learning process was repeated until the predetermined total number of initial angles, set to 300 in this study, was attained.
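The trial structure above can be sketched as a Python skeleton. The actor, critic, and environment objects are hypothetical stand-ins (the real environment is the Matlab musculoskeletal simulation, and the update rules are defined by the earlier equations); only the constants (0.01 s step, 2.0 s trials, 300 trials, and the initial-angle ranges) come from the text.

```python
import random

DT, TRIAL_END, N_TRIALS = 0.01, 2.0, 300  # values from the text

# Initial-angle ranges in degrees per joint motion, from the text.
INIT_RANGES = {
    "SHU": (-60, -20), "SHV": (-90, 10), "SHW": (10, 60),
    "ELV": (-110, -10), "ELW": (10, 90),
}

def run_learning(actor, critic, env):
    """Skeleton of one learning run; actor, critic, and env are stand-ins."""
    n_steps = round(TRIAL_END / DT)  # 200 steps of 0.01 s per 2.0 s trial
    for trial in range(N_TRIALS):
        # Random initial posture for robust, model-free learning.
        angles0 = {j: random.uniform(lo, hi) for j, (lo, hi) in INIT_RANGES.items()}
        state = env.reset(angles0)                 # would call the Matlab arm model
        for _ in range(n_steps):
            u = actor.activations(state)           # 20 muscle activation levels
            next_state, reward = env.step(u)       # joint angles, muscle lengths
            critic.update(state, reward, next_state)  # TD learning of V(s)
            actor.update(state, critic.td_error)      # policy improvement
            state = next_state
```

The fixed 2.0 s horizon means every trial costs the same 200 environment steps, so 300 trials bound the total simulation budget in advance.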

We performed the learning method under four simulation conditions: case 1 used ACRL with the first AFM of equations 2.21 to 2.40 (hereafter called ACRL embod); case 2 used ACRL with the second AFM of equation 2.41 (hereafter called ACRL mlembod); case 3 used ACRL without any AFMs (hereafter called ACRL noembod); and case 4 used the deep deterministic policy gradient (DDPG) algorithm, which builds on the deterministic policy gradient method proposed by Silver et al. (2014) (hereafter called DDPG). In case 4, we used a DDPG implementation of the actor-critic method obtained by modifying a Python code of Morvanzhou (http://morvanzhou.github.io/tutorials/). The learning rates of the actor and critic were set to 0.001, while $\tau$ was set to 0.01. The muscle activation levels of the 20 muscles were determined randomly within the range from 0.0 to 1.0, and the same 10 state variables as in cases 1 through 3 were used; the DDPG algorithm did not include any AFMs.
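In standard DDPG implementations, a $\tau$ of this kind is the soft-update rate of the target networks; a minimal sketch of that update, assuming parameters stored as plain Python lists (names are illustrative, not taken from the Morvanzhou code):

```python
def soft_update(target_params, source_params, tau=0.01):
    """Polyak averaging: target <- (1 - tau) * target + tau * source."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]

# After each learning step the target network trails the learned network slowly,
# which stabilizes the bootstrapped critic targets.
target = soft_update([0.0, 1.0], [1.0, 1.0])   # approximately [0.01, 1.0]
```

With $\tau=0.01$, the target parameters move only 1% of the way toward the learned parameters per step, which is why DDPG's bootstrapped targets change smoothly.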

In each simulation condition, average values of the reward function $r(s(t))$ and the difference $d\theta$ between the current joint angle and the target joint angle for each joint motion were calculated by dividing the summation of each by the number of iterations. In addition, time-history curves of the joint angles of elbow flexion-extension (ELV) and inversion-eversion (ELW) and shoulder internal-external rotation (SHU), flexion-extension (SHV), and inversion-eversion (SHW), and those of the activation levels of the flexors and extensors and the adductors and abductors, were generated. The four simulation conditions were compared to investigate the effectiveness of AFMs for efficient ACRL. The parameters of ACRL used in this study are listed in Table 3.
Figure 5:

Comparisons of average values of reward functions and angle differences of ELV among the four models: ACRL embod, ACRL mlembod, ACRL noembod, and DDPG. (a) Reward functions. (b) Angle differences of ELV. ACRL embod, ACRL with an embodiment using equations 2.21 to 2.40; ACRL mlembod, ACRL with an embodiment using equation 2.41; ACRL noembod, ACRL without any embodiments; DDPG, DDPG algorithm; ELV, elbow flexion-extension.

Table 3:
Parameters of ACRL Used in This Study.

| Symbol | Equation | Value |
| --- | --- | --- |
| $\tau$ | 2.14 | 0.05 |
| $\sigma_{b1,3,5,7,9}$ | 2.10 | 26.5 |
| $\sigma_{b2,4,6,8,10}$ | 2.10 | 163.6 |
| $c$ | 2.11 | 0.01 |
| $\sigma_{r}$ | 2.12 | 100.0 |
| $\kappa$ | 2.15 | 0.05 |
| $\alpha_{V}$ | 2.16 | 0.3 |
| $\alpha_{a}$ | 2.19 | 0.11 |
| $A$ | 2.17 | 1.0 |
| $B$ | 2.17 | $-$4.0 |

## 3  Results

Figure 5 shows the comparisons of the reward functions and angle differences in ELV $d\theta_{ELV}$ from the 1st trial to the 10th trial between the four cases of ACRL embod, ACRL mlembod, ACRL noembod, and DDPG. In all cases using ACRL, the TD errors gradually decreased toward zero over the learning process, and the value functions gradually increased to approximately 6.0 at the 300th trial. In ACRL embod, the reward gradually increased to 8.6, and the value was retained until the 300th trial, while in ACRL mlembod, the reward gradually increased to 8.5 but decreased to 8.2 at the 300th trial. In ACRL noembod, the reward gradually increased to 8.7 but decreased to 8.5 at the 300th trial, while in DDPG, the reward gradually increased to 9.0 but decreased to 8.5 at the 300th trial.

Figure 6:

Comparisons of average values of angle differences of ELW, SHU, SHV, and SHW among the four models: ACRL embod, ACRL mlembod, ACRL noembod, and DDPG. (a) Angle differences of ELW. (b) Angle differences of SHU. (c) Angle differences of SHV. (d) Angle differences of SHW. ELW, elbow inversion-eversion; SHU, shoulder internal-external rotation; SHV, shoulder flexion-extension; SHW, shoulder inversion-eversion.

Figure 5b shows that in ACRL embod, the angle differences of ELV were close to 0 degrees at the 300th trial, while the angle differences were about 9, 36, and 37 degrees at the 300th trial in ACRL mlembod, ACRL noembod, and DDPG, respectively. Figure 6 shows the comparisons of the angle differences of ELW, SHU, SHV, and SHW from the 1st trial to the 10th trial between the four cases. Figure 6a shows that in ACRL mlembod, the angle difference in ELW was $-$2 degrees at the 300th trial, while the angle differences became $-$20, $-$27, and $-$29 degrees in ACRL embod, ACRL noembod, and DDPG, respectively. Figure 6b shows that in ACRL embod, the angle difference of SHU became 12 degrees at the 300th trial, while the angle differences became 25, 30, and 30 degrees in ACRL mlembod, ACRL noembod, and DDPG, respectively. Figure 6c shows that the angle differences of SHV became $-$40, $-$19, $-$31, and $-$29 degrees in ACRL embod, ACRL mlembod, ACRL noembod, and DDPG, respectively. Figure 6d shows that the angle differences of SHW became $-$13, 20, 3, and 3 degrees in ACRL embod, ACRL mlembod, ACRL noembod, and DDPG, respectively.
Figure 7:

Comparisons of time histories of ELV and ELW angles among the four models: ACRL embod, ACRL mlembod, ACRL noembod, and DDPG. (a) ELV angle. (b) ELW angle. ELV, elbow flexion-extension; ELW, elbow inversion-eversion.

Figure 7a shows the comparisons of time histories of the ELV angle at the 1st trial and 300th trial between the four cases. The vertical axis ranges of ELV in Figure 7a and ELW, SHU, SHV, and SHW in Figures 7b, 8, and 9 correspond to the angle ranges of ELV, ELW, SHU, SHV, and SHW defined in section 2.2, respectively. ACRL embod and ACRL mlembod held the joint angle of elbow flexion-extension on the target angle of $-$88 degrees at the 300th trial. ACRL noembod was close to the target angle at the 300th trial, but DDPG did not achieve it. Figure 7b shows the comparisons of time histories of the ELW angle at the 1st and 300th trials between the four cases. ACRL noembod held the joint angle of elbow inversion-eversion on the target angle of 54 degrees at the 1st trial, but it did not achieve the target angle at the 300th trial. ACRL embod and ACRL mlembod tended to achieve the target in the initial period from 0 to 0.2 s at the 300th trial. However, the other cases did not achieve the target angle. Figure 8a shows the comparisons of time histories of the SHU angle at the 1st and 300th trials between the four cases. ACRL mlembod held the joint angle of shoulder internal-external rotation on the target angle of $-$39 degrees at the 300th trial. The other cases did not achieve the target angle. Figure 8b shows the comparisons of time histories of the SHV angle at the 1st and 300th trials between the four cases. ACRL mlembod held the joint angle of shoulder flexion-extension on the target angle of $-$36 degrees at the 300th trial. The other cases did not achieve the target angle. Figure 9 shows the comparisons of time histories of the SHW angle at the 1st and 300th trials between the four cases. ACRL mlembod held the joint angle of shoulder inversion-eversion on the target angle of 36 degrees at the 300th trial. The other cases did not achieve the target angle.
Figure 8:

Comparisons of time histories of SHU and SHV angles among the four models: ACRL embod, ACRL mlembod, ACRL noembod, and DDPG. (a) SHU angle. (b) SHV angle. SHU, shoulder internal-external rotation; SHV, shoulder flexion-extension.

Figure 9:

Comparisons of time histories of SHW angle among the four models: ACRL embod, ACRL mlembod, ACRL noembod, and DDPG. (a) First trial. (b) Three-hundredth trial. SHW, shoulder inversion-eversion.

Figure 10 shows time-history curves of the muscle activation levels of flexors and extensors of the elbow joint and flexors and extensors and adductors and abductors of the shoulder joint at the 300th trial in ACRL embod and ACRL mlembod, which tended to realize posture stabilization at target angles. The biceps brachii long head, brachialis, and brachioradialis are the flexors of the elbow joint, while the triceps brachii long head is an extensor of the elbow joint. The deltoid anterior, coracobrachialis, and biceps brachii long head are flexors of the shoulder joint, while the deltoid posterior is an extensor of the shoulder joint. The teres major and subscapularis are adductors of the shoulder joint, while the deltoid middle and supraspinatus are abductors of the shoulder joint. Figure 7a shows that in ACRL embod, the initial angle of ELV was $-$20 degrees, and ACRL embod held the angle at $-$82 degrees, close to the target angle of $-$88 degrees, at the 300th trial. Activation levels of the brachialis and brachioradialis initially increased to flex the elbow joint to the target angle. However, because the elbow joint angle exceeded the target angle, the triceps brachii long head muscle increased to extend the elbow joint, and the flexors and extensors were then mutually antagonized to hold the posture (see Figure 10a). In ACRL mlembod, the initial angle of ELV was $-$61 degrees, and the angle then approached the target angle with some fluctuations around the target angle at the 300th trial. The flexors and extensors of the elbow joint were mutually antagonized to hold the posture (see Figure 10b). Figure 8a shows that in ACRL mlembod, the initial angle of SHU was $-$24 degrees, and the angle then tended to hold the posture with some fluctuation around the target angle at the 300th trial. Activation levels of adductors and abductors of the shoulder joint were mutually antagonized to hold the posture (see Figure 10b). 
In ACRL embod, the initial angle of SHU was $-$43 degrees, and the adductors of the shoulder joint were activated to achieve the target angle of $-$39 degrees; however, the shoulder joint angle had extensive internal rotations to 10 degrees, and the abductors could not return to the target angle at the 300th trial. Figure 8b shows that in ACRL mlembod, the initial angle of SHV was $-$64 degrees, and the angle then approached the target angle of $-$36 degrees with some fluctuations around the target angle at the 300th trial. Activation levels of flexors and extensors of the shoulder joint were mutually antagonized to hold the posture (see Figure 10b). In ACRL embod, the initial angle of SHV was 5 degrees, and the flexors of the shoulder joint were activated to achieve the target angle of $-$36 degrees; however, the shoulder joint angle had extensive flexion to $-$67 degrees, and the extensors could not return to the target angle at the 300th trial.
Figure 10:

Time histories of muscle activation levels of flexors and extensors of the elbow joint and flexors and extensors and adductors and abductors of the shoulder joint at the 300th trial in ACRL embod and ACRL mlembod. (a) ACRL embod. (b) ACRL mlembod. The biceps brachii long head, brachialis, and brachioradialis are flexors of the elbow joint, while the triceps brachii long head is an extensor of the elbow joint. The deltoid anterior, coracobrachialis, and biceps brachii long head are flexors of the shoulder joint, while the deltoid posterior is an extensor of the shoulder joint. The teres major and subscapularis are adductors of the shoulder joint, while the deltoid middle and supraspinatus are abductors of the shoulder joint. ACRL embod, ACRL with an embodiment using equations 2.21 to 2.40; ACRL mlembod, ACRL with an embodiment using equation 2.41.

## 4  Discussion

In this study, we developed a novel muscle controller in which the ACRL method can produce the muscle activation level of each muscle in a musculoskeletal model of the right upper extremity of a human adult male and acquire a better activation control policy for posture stabilization of the five joint motions of the human arm under gravity. Previous studies (Min et al., 2018; Kambara et al., 2004) successfully obtained activation control policies for posture stabilization of the elbow joint or of both the elbow and shoulder joints under gravity. The control policy obtained by the ACRL model of Min et al. (2018) demonstrated posture stabilization of the flexion-extension of the elbow joint even when a weight was loaded on the hand. However, the computational costs for learning were high: 700 trials were needed. The control policy obtained by the ACRL model of Kambara et al. (2004) also demonstrated posture stabilization of the flexion-extension motions of the elbow and shoulder joints; however, the computational costs for learning were even higher: 22,500 trials were needed.

A key limitation in applying RL to real-world problems, including robot control, is its low computational efficiency: with an insufficient number of trials, learning does not converge and the value of the reward function does not increase. Thus, several countermeasures have been proposed (Silver et al., 2014; Popov et al., 2018; Andrychowicz et al., 2017). As mentioned in section 1, their methodology for efficient learning focuses on the internal control system with RL, which corresponds to the brain. In this study, we instead focused on embodiment, which can efficiently control walking or balancing in dynamic environments, as Hoffmann and Pfeifer (2012) suggested, and introduced two types of AFMs that control muscle tone for mutually antagonizing muscles, such as flexors and extensors, adductors and abductors, and invertors and evertors, into the output of the actor in the actor-critic method as information on the embodiment of a human being. The first AFM, corresponding to ACRL embod, is described based on the angle differences of the five joint motions, while the second AFM, corresponding to ACRL mlembod, is described based on the length rate of each muscle. We compared simulation results between the learning methods with and without AFMs. In ACRL embod, the reward gradually increased to 8.6 and was maintained until the 300th trial; furthermore, the posture of flexion-extension (ELV) of the elbow joint was stabilized at its corresponding target angle at the 300th trial. In ACRL mlembod, the reward gradually increased to 8.5 but decreased to 8.2 at the 300th trial. Nevertheless, the postures of the five joint motions were almost stabilized at the corresponding target angles at the 300th trial, albeit with some fluctuations around them.
In contrast, in the learning methods without any AFMs (ACRL noembod and DDPG), the postures of the five joint motions were not stabilized at the target angles at the 300th trial. These simulation results suggest that the proposed method with AFMs realized posture stabilization at the predetermined target angles of the five joint motions of the human arm at a relatively early period of learning, and that introducing AFMs as an embodiment of muscle tone can stabilize the posture of human musculoskeletal models, as well as the joint motions of humanoid robots with muscular structures under gravity, at efficient learning costs.

The AFMs proposed in this study represent functions of the PPTn and MLR for mutually antagonizing muscles such as flexors and extensors, adductors and abductors, and invertors and evertors, and modulate the maximum values of the activation levels of the mutually antagonizing muscles, in which the activation levels are signals from the SNr, that is, an actor of ACRL. In our proposed models, we hypothesized that the change in each joint angle from its neutral angle, which was determined as the space attitude in this study, can control the activation levels of each muscle. For example, in the case of elbow flexion-extension, a change toward flexion from the neutral angle can increase the activation levels of the extensors and decrease those of the flexors, whereas a change toward extension from the neutral angle can increase the activation levels of the flexors and decrease those of the extensors in posture stabilization; the same holds for the other joint motions.

We implemented this feature in the two types of AFMs corresponding to ACRL embod and ACRL mlembod. In ACRL embod, the posture was stabilized for ELV; however, the postures were not stabilized for ELW, SHU, SHV, and SHW. The reason SHU and SHV were not stabilized at the target angles is probably that the musculoskeletal model did not include sufficient abductors and extensors of the shoulder joint to control the SHU and SHV angles, respectively (see Figure 10a). According to an anatomical text (Neumann, 2010), the trapezius and serratus anterior, in addition to the deltoid middle and supraspinatus, act as abductors of the shoulder joint, and the latissimus dorsi, in addition to the deltoid posterior, acts as an extensor of the shoulder joint. In ACRL mlembod, the postures were almost stabilized for all five joint motions at the 300th trial. Nevertheless, the postures had some fluctuations around the target angles. One reason is that the constant of the sigmoid function in equation 2.41 was set to 500.0 for all muscles, which might not be sufficiently adjusted for individual muscles. In addition, the two AFMs do not include any terms of angular velocity differences or muscle length change velocity. The velocity terms have the potential to affect posture stabilization, particularly motion generation in dynamic environments. Furthermore, the upper extremity model used in this study did not include the acromioclavicular joint (AC joint), sternoclavicular joint (SC joint), or scapulothoracic joint (ST joint), which are necessary for predicting accurate shoulder motions; the absence of these three shoulder joints could affect the posture stabilization of SHU, SHV, and SHW (Neumann, 2010).
Further studies are needed to investigate the effects of the constants in sigmoid functions of the two AFMs and the velocity terms on posture stabilization of the five joint motions and to investigate the effects of the AC, SC, and ST joints, as well as the trunk muscles of the trapezius, serratus anterior, and the latissimus dorsi on posture stabilization of SHU, SHV, and SHW.

In this study, ACRL and DDPG were used as learning methods. Neither ACRL noembod nor DDPG included any AFMs. The simulation results of DDPG were similar to those of ACRL noembod in terms of reward function and all angle differences. However, DDPG had better performance than ACRL noembod for the rate of increase in reward function and the rate of decrease in all angle differences except ELW (see Figures 5 and 6). In addition, the activation level of each muscle predicted in DDPG changed dramatically over the learning process compared with that in ACRL noembod. This suggests that DDPG has the potential for efficient learning of multiple action outputs as described in Silver et al. (2014). Further studies are needed to investigate whether DDPG with AFMs enhances the learning efficiency.

This study has some limitations pertaining to the AFMs and the upper extremity model. First, although we hypothesized that both the PPTn and MLR modulate the maximum values of the activation levels of mutually antagonizing muscles depending on the joint angle differences or the muscle length rate, we did not model how the maximum values of the activation levels change with the joint angular velocity or the muscle length change velocity, owing to a lack of experimental data on the relationship among joint angular velocities, muscle length change velocities, and muscle activation levels. Second, the upper extremity model used in this study does not include the AC, SC, and ST joints or the trunk muscles, which are necessary for accurate shoulder motions.

## 5  Conclusion

In this study, we aimed to reduce the computational costs due to the repeated calculations of learning for posture stabilization of the human arm and to enhance learning efficiency. Thus, we developed a novel motor learning system that could simulate five joint motions, namely, elbow flexion-extension and inversion-eversion and shoulder internal-external rotation, flexion-extension, and inversion-eversion of the human right arm under gravity, using the ACRL algorithm and a musculoskeletal model of the right upper extremity of a human adult male. We also introduced AFMs of the mutually antagonizing muscles as the embodiment of muscle tone into the motor learning system, considering the functions of the PPTn and MLR based on the neurophysiology of the brain stem, and investigated whether the AFMs influenced the learning efficiency in posture stabilization of the five joint motions. We found that ACRL with AFMs enabled the human arm model to stabilize the postures of the five joints even in the relatively early learning period, indicating that the AFMs can generate significant improvements in learning efficiency. Further studies are needed to investigate how introducing the AC, SC, and ST joints and the trunk muscles into the human upper extremity model, adding the velocity terms to the AFMs, or combining DDPG with the AFMs could affect posture stabilization of the human arm and enhance learning efficiency.

## Acknowledgments

We thank Kyuengbo Min of Tokyo Metropolitan Institute of Medical Science for his efforts in the initial development of the methodology on reinforcement learning with the actor-critic model for posture stabilization when he joined Toyota Central R&D Labs. We thank Takahiro Yamaguchi of Toyota Central R&D Labs. for his contribution in the development of the skeletal part of the human model using Matlab. We also thank Editage (www.editage.com) for English-language editing.

## References

Amis
,
A.
,
Dowson
,
D.
, &
Wright
,
V.
(
1979
).
Muscle strengths and musculoskeletal geometry of the upper limb
.
Eng. Med.
,
8
(
1
),
4
48
.
An
,
K.
,
Hui
,
F.
,
Morrey
,
B.
,
Linscheid
,
R.
, &
Chao
,
E.
(
1981
).
Muscles across the elbow joint: A biomechanical analysis
.
J. Biomech.
,
14
(
10
),
659
669
.
Andrychowicz
,
M.
,
Wolski
,
F.
,
Ray
,
A.
,
Schneider
,
J.
,
Fong
,
R.
,
Welinder
,
P.
, …
Zaremba
,
W.
(
2017
). Hindsight experience replay. In
Proceedings of the 31st Conference on Neural Information Processing Systems
(pp.
5048
5058
).
Red Hook, NY
:
Curran
.
Barto, A. G. (
1995
). Adaptive critics and the basal ganglia. In
J. C.
Houk
,
J.
Davis
, &
D.
Beiser
(Eds.),
Models of information processing in the basal ganglia
(pp.
215
232
).
Cambridge, MA
:
MIT Press
.
Doya
,
K.
(
2000a
).
Reinforcement learning in continuous time and space
.
Neural Computation
,
12
(
1
),
219
245
.
Doya
,
K.
(
2000b
).
Complementary roles of basal ganglia and cerebellum in learning and motor control
.
Current Opinion in Neurobiology
,
10
(
6
),
732
739
.
Gans
,
C.
(
1982
).
Fiber architecture and muscle function
.
Exercise Sports Sciences Reviews
,
10
(
1
),
160
207
.
Hoffmann
,
M.
, &
Pfeifer
,
R.
(
2012
).
The implications of embodiment for behavior and cognition: Animal and robotic case studies
. arXiv:1202.0440.
Iwamoto
,
M.
,
Nakahira
,
Y.
,
Kimpara
,
H.
,
Sugiyama
,
T.
, &
Min
,
K.
(
2012
).
Development of a human body finite element model with multiple muscles and their controller for estimating occupant motions and impact responses in frontal crash situations
.
Stapp Car Crash Journal
,
56
,
231
268
.
Kambara
,
H.
,
Kim
,
K.
,
Shin
,
D.
,
Sato
,
M.
, &
Koike
,
Y.
(
2004
). Learning arm's posture control using reinforcement learning and feedback-error-learning.
In Proceedings of the 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society
(pp.
486
489
).
Piscataway, NJ
:
IEEE
.
Kambara
,
H.
,
Kim
,
K.
,
Shin
,
D.
,
Sato
,
M.
, &
Koike
,
Y.
(
2009
).
Learning and generation of goal-directed arm reaching from scratch
.
Neural Networks
,
22
,
348
361
.
Kato
,
D.
,
Nakahira
,
Y.
,
Atsumi
,
N.
, &
Iwamoto
,
M.
(
2018
). Development of human-body model THUMS Version 6 containing muscle controllers and application to injury analysis in frontal collision after brake deceleration. In
Proceedings of the 2018 International IRCOBI Conference on the Biomechanics of Injury
. IRCOBI Council.
Kuechle
,
D.
,
Newman
,
S.
,
Itoi
,
E.
,
Morrey
,
B.
, &
An
,
K.
(
1997
).
Shoulder muscle moment arms during horizontal flexion and elevation
.
J. Shoulder and Elbow Surgery
,
6
(
5
),
429
439
.
Kuechle
,
D.
,
Newman
,
S.
,
Itoi
,
E.
,
Morrey
,
B.
, &
An
,
K.
(
2000
).
The relevance of the moment arm of shoulder muscles with respect to axial rotation of the glenohumeral joint in four positions
.
Clin. Biomech.
,
15
(
5
),
322
329
.
Min
,
K.
,
Iwamoto
,
M.
,
Kakei
,
S.
, &
Kimpara
,
H.
(
2018
).
Muscle synergy-driven robust motion control
.
Neural Computation
,
30
,
1104
1131
.
Morimoto, J., & Doya, K. (2005). Robust reinforcement learning. Neural Computation, 17, 335–359.
Murray, W., Buchanan, T., & Delp, S. (2000). The isometric functional capacity of muscles that cross the elbow. Journal of Biomechanics, 33(8), 943–952.
Murray, W., Buchanan, T., & Delp, S. (2002). Scaling of peak moment arms of elbow muscles with upper extremity bone dimensions. Journal of Biomechanics, 35(1), 19–26.
Murray, W., Delp, S., & Buchanan, T. (1995). Variation of muscle moment arms with elbow and forearm position. Journal of Biomechanics, 28(5), 513–525.
Neumann, D. (2010). Kinesiology of the musculoskeletal system: Foundations for rehabilitation (2nd ed.). Amsterdam: Elsevier.
Nikravesh, P. (1988). Computer-aided analysis of mechanical systems. Prentice Hall.
Pfeifer, R., Lungarella, M., & Iida, F. (2007). Self-organization, embodiment, and biologically inspired robotics. Science, 318(5853), 1088–1093.
Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., … Riedmiller, M. (2018). Data-efficient deep reinforcement learning for dexterous manipulation. In Proceedings of the 31st Conference on Neural Information Processing Systems. Red Hook, NY: Curran.
Rooij, L. (2011). Effect of various pre-crash braking strategies on simulated human kinematic response with varying levels of driver attention. In Proceedings of the Conference on the Enhanced Safety of Vehicles. Washington, DC: National Highway Traffic Safety Administration, U.S. Department of Transportation.
Shibata, K., & Ito, K. (1999). Gauss-sigmoid neural network. In Proceedings of the International Joint Conference on Neural Networks. Piscataway, NJ: IEEE.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
Takakusaki, K. (2017). Functional neuroanatomy for posture and gait control. Journal of Movement Disorders, 10(1), 1–17.
Tengwall, R., Jackson, J., Kimura, T., Komenda, S., , M., & Preuschoft, H. (1982). Current Anthropology, 23(6), 657–666.
Thelen, D. (2003). Adjustment of muscle mechanics model parameters to simulate dynamic contractions in older adults. Journal of Biomechanical Engineering, 125, 70–77.
Thelen, D., Anderson, F., & Delp, S. (2003). Generating dynamic simulations of movement using computed muscle control. Journal of Biomechanics, 36(3), 321–328.
Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 5026–5033). Piscataway, NJ: IEEE.
Winters, J. (1990). Hill-based muscle models: A systems engineering perspective. In J. Winters & S. Woo (Eds.), Multiple muscle systems: Biomechanics and movement organization (pp. 726–749). New York: Springer.
Yang, J., Lee, J., Lee, B., Kim, S., Shin, D., Lee, Y., … Choi, S. (2014). The effects of elbow joint angle changes on elbow flexor and extensor muscle strength and activation. Journal of Physical Therapy Science, 26, 1079–1082.
Zajac, F. (1989). Muscle and tendon: Properties, models, scaling, and application to biomechanics and motor control. Critical Reviews in Biomedical Engineering, 17(4), 359–411.