## Abstract

We propose a neural network model for reinforcement learning to control a robotic manipulator with unknown parameters and dead zones. The model is composed of three networks. The state of the robotic manipulator is predicted by the state network of the model, the action policy is learned by the action network, and the performance index of the action policy is estimated by a critic network. The three networks work together to optimize the performance index based on the reinforcement learning control scheme. The convergence of the learning methods is analyzed. Application of the proposed model on a simulated two-link robotic manipulator demonstrates the effectiveness and the stability of the model.

## 1 Introduction

In recent years, with the increasing demand for automated manipulation and manufacturing in industry, there has been substantial growth in the deployment of robotic manipulators in engineering applications, such as nuclear power plants (Kim, Seo, Lee, Choi, & Moon, 2015), rehabilitation (Ka, Ding, & Cooper, 2016), haptics and virtual reality (Zarenia & Sepehri, 2015), cognition (Azeem, Iqbal, Toivanen, & Samad, 2012; Naveed, Iqbal, & Ur Rehman, 2012), motion assistance (Naitchabane, Hoppenot, & Colle, 2016), target detection (Iqbal, Pasha, Baizid, Khan, & Iqbal, 2013; Iqbal, Pasha, Nabi, Khan, & Iqbal, 2013), and other areas (Iqbal, Rehmansaad, Malik, & Mahmood Tahir, 2014; Baizid, Chellali, Yousnadj, Iqbal, & Bentaleb, 2010). To function as an intelligent system, a robotic manipulator must inevitably deal with the problems of unknown parameters and a nondifferentiable nonlinearity, the dead zone (Tao & Kokotovic, 1994). It has been pointed out that unknown parameters and the dead zone may lead to large steady-state error, poor transient response, large overshoot, and other undesirable behavior (Wang, Yan, Cai, & Zhang, 2006). Wang (2010) developed a recursive algorithm to estimate the unknown manipulator parameters and the payload parameters. Neural network models have also been proposed by robotics researchers as intelligent control algorithms (Min, Lu, Xu, Duan, & Chen, 2017; Wang, Sun, Pan, & Yu, 2017) or for spatial perception (Leitner, Harding, Frank, Förster, & Schmidhuber, 2013). Compared with other intelligent systems, robotic manipulators contain nonlinearities, such as dead zones, that are difficult to analyze and to capture in exact mathematical models (Gao & Selmic, 2006). Owing to their ability to model the dynamics and the nonlinearities of a robotic manipulator, neural networks have been widely used in control design for robotic manipulators in recent years (He, Dong, & Sun, 2015).
Chen, Wen, Liu, and Wang (2014) approximated the uncertain nonlinear dynamics of a multiagent time delay system by radial basis function neural networks. Kim and Lewis (2000) used neural network models to build controllers for a robotic manipulator in the presence of modeling uncertainties and frictional forces. Liu, Kadirkamanathan, and Billings (1999) estimated the unknown nonlinearity of a dynamical system by a novel neural network architecture, referred to as a variable neural network. In Huang and Tan (2012) and Seidl, Lam, Putnam, and Lorenz (1993), nonlinear models of friction and backlash hysteresis were approximated by neural networks.

Besides neural network methods, the theory of adaptive control has drawn much attention in developing control algorithms for robotic manipulators. An adaptive control method was integrated with a sliding mode technique and a neural network to identify unknown parameters (Sun, Pei, Pan, Zhou, & Zhang, 2011; Purwar, Kar, & Jha, 2005). Schindele and Aschemann (2009) considered model-based compensation of nonlinearities for robotic manipulators. Adaptive force/motion tracking control was investigated by Li, Ge, Adams, and Wijesoma (2008) for nonholonomic mobile manipulators with unknown parameters and disturbances under uncertain holonomic constraints. Adaptive optimal control without weight transport was proposed by Chinta and Tweed (2012). A series of algorithms based on intelligent control methods was designed in Tong, Wang, Li, and Chen (2013), Chen, Jiao, Li, and Li (2010), Li, Cao, and Ding (2011), Chen and Jiao (2010), Liu, Tong, Wang, Li, and Chen (2011), and Tang, Liu, and Tong (2014) for nonlinear systems with uncertain functions and a dead zone.

Furthermore, combining adaptive control theory with neural networks has become a popular route to better controller design. Sanner and Slotine (1995) extended stable adaptive tracking controller designs employing neural networks to classes of multivariable mechanical systems, such as robot manipulators. Liu, Chen, Zhang, and Chen (2015) proposed an adaptive neural network controller for dual-arm coordination of a humanoid robot. Chen, Liu, and Wen (2014) proposed adaptive fuzzy neural networks to estimate a nonlinear stochastic system with unknown functions. An adaptive controller based on a higher-order neural network was employed for a class of uncertain multiple-input multiple-output (MIMO) nonlinear systems in Liu, Chen, Wen, and Tong (2011) and Liu, Tang, Tong, Chen, and Li (2015). An adaptive neural network control method was proposed by Liu, Tang, Tong, and Chen (2015) for MIMO nonlinear systems in strict feedback form. Rossomando, Soria, and Carelli (2013) proposed a neural adaptive compensator and a trajectory tracking controller for a unicycle-like mobile robot.

Reinforcement learning (RL) is another approach to tackling these control issues. With the development of representation learning techniques, many RL-based control methods have solved large-scale problems (Schmidhuber, 2015). A typical actor-critic RL structure contains an actor, which applies an action or control policy to the actuator (the environment), and a critic, which assesses the cost-to-go at the current state under the current control policy (Yang & Jagannathan, 2012). The RL output signal is updated via the value of the critic component, and the ultimate goal is for the cost-to-go to converge to its global optimum. A variety of intelligent control algorithms have been developed based on RL to solve control problems (Gosavi, 2014; Sharma & Gopal, 2010; Runa & Sharma, 2015; Lin, 2009). Compared with conventional algorithms, the advantage of RL is that a perfect dynamic model of the controlled system is not required. A critic-actor-based RL neural network control method was proposed by Huang, Naghdy, Du, and Naghdy (2015) to achieve adaptive control during poststroke fine hand motion rehabilitation training. Gu, Holly, Lillicrap, and Levine (2016) discussed a deep neural network that learned to approximate the Q-function via off-policy training to solve complex 3D manipulation tasks. Yahya, Kalakrishnan, Chebotar, and Levine (2016) explored distributed and asynchronous policy learning as a means to achieve generalization and improve training performance on challenging real-world manipulator tasks. RL coupled with adaptive dynamic programming (ADPRL) has also become popular and can be classified (Prokhorov & Wunsch, 1997; Wang, Zhang, & Liu, 2009; Si, Barto, Powell, & Wunsch, 2004) into heuristic dynamic programming (HDP), action-dependent HDP (ADHDP), dual heuristic dynamic programming (DHP), action-dependent DHP (ADDHP), globalized DHP (GDHP), and action-dependent GDHP (ADGDHP).

In this letter, a neural network model for RL is presented to control a rigid robotic manipulator with unknown parameters and a dead zone. The model has three neural networks: a state network, an action network, and a critic network. The state network estimates the state information of the robotic manipulator, the critic network evaluates the performance of the network, and the action network learns a policy to optimize the performance index. The reliability of this network controller is demonstrated by simulations of a two-link robotic manipulator. It should be noted that the proposed model is applicable when the dynamics and the parameters of the robot manipulator are unknown but the output of the robot manipulator can be measured by sensors. Thus, this letter provides an adaptive control policy for a complex dynamical system, without a priori knowledge of the dynamics and the parameters of the system, even in the presence of nonlinearities.

The letter is organized as follows. Section 2 introduces the nonlinear dynamic model of a robotic manipulator. In section 3, we describe the network model and the learning rules in detail. We apply the network model to control a simulated two-link robotic manipulator in section 4, and the simulation demonstrates the effectiveness of the model. The letter concludes in section 5.

## 2 Problem Formulation

## 3 Control Design

### 3.1 Introduction to Reinforcement Learning

RL is a learning paradigm for sequential decision making, solving the delayed credit assignment problem in a wide range of fields in science and engineering (Sutton & Barto, 2017). The idea of RL comes from the phenomenon that humans and other animals learn from experience via reward and punishment for survival and growth (Si, Barto, Powell, & Wunsch, 2012; Werbos, 2009; Lewis & Liu, 2013). RL agents interact with the environment and learn by trial and error. At each time $t$, a state $s_t$ is received by the agent, and an action $a_t$ is selected from the action space $\mathcal{A}$, following a policy $\pi$, which represents the agent's behavior. Meanwhile, the environment transitions to the next state $s_{t+1}$ according to the state transition probability, and a reward $r_t$ based on the reward function is released to the agent (Szepesvari, 2010; Liu, Wei, Wang, Yang, & Li, 2013). In an iterative problem, this process continues until the agent reaches the terminal state and then restarts. The agent aims to find the actions with maximal return from each state, measured, for example, by the discounted cumulative reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, with the discount factor $\gamma \in (0, 1]$.

The state value function $V^{\pi}(s)$ is the expected return when starting in state $s$ and following policy $\pi$. The optimal state value function $V^{*}(s)$ is the maximum expected return starting in state $s$. The value of a state-action pair is measured by the state-action value function $Q^{\pi}(s, a)$, defined as the expected return when starting from the state-action pair $(s, a)$ and following policy $\pi$. The optimal state-action value function $Q^{*}(s, a)$ is the maximal state-action value achievable by any policy for state $s$ and action $a$. When the maximum value is found, the optimal actions are obtained correspondingly.
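These definitions can be made concrete in a few lines of code. The sketch below is an illustrative textbook construction, not the authors' implementation; the function name `discounted_return` is ours:

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward: R_t = sum_k gamma^k * r_{t+k}.

    Computed backward from the last reward, which is both simple
    and numerically stable.
    """
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
```

The state value function then corresponds to the expectation of this quantity over trajectories generated by the policy, and the state-action value function to the same expectation conditioned on the first action.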

Although RL methods can be applied to solve some manipulation tasks for dynamic robotic systems, they suffer from poor scalability in high-dimensional nonlinear systems (Levine & Koltun, 2013). Neural networks offer the ability of universal function approximation in high-dimensional space. Therefore, in this letter, we use neural networks to represent the states of the system, approximate the state-action value function, and learn the policy for optimal control.

### 3.2 Neural Network Model for Reinforcement Learning

To control a robotic manipulator, we introduce a neural network model (see Figure 1), which learns a control policy by RL. The structure of the network model is similar to the actor-critic algorithm and contains three subnetworks: the state network, the critic network, and the action network. The state network predicts the state of the system based on the previous state and the action, the critic network provides an internal estimation of the performance of the network, and the action network searches for a policy to minimize the estimated performance index. Each subnetwork is implemented by a three-layer perceptron network with the same activation function.
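The resulting three-subnetwork structure can be sketched as follows. This is a minimal illustration under assumed details: a tanh activation, uniformly drawn initial weights, and layer sizes chosen for a two-link manipulator with four state variables and two torque commands. The letter's actual networks and update rules are specified in sections 3.2.1 to 3.2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(n_in, n_hidden, n_out):
    """Weights of a three-layer perceptron, randomly initialized."""
    return {"W1": rng.uniform(-0.5, 0.5, (n_hidden, n_in)),
            "W2": rng.uniform(-0.5, 0.5, (n_out, n_hidden))}

def forward(net, x):
    """Hidden layer with tanh activation (assumed), linear output."""
    return net["W2"] @ np.tanh(net["W1"] @ x)

# State network: predicts the next state from the current state and action.
state_net = mlp(4 + 2, 8, 4)
# Critic network: estimates the performance index for a state-action pair.
critic_net = mlp(4 + 2, 10, 1)
# Action network: maps the state to the two torque commands.
action_net = mlp(4, 10, 2)

x = rng.standard_normal(4)   # joint angles and angular speeds
u = rng.standard_normal(2)   # torque command
x_pred = forward(state_net, np.concatenate([x, u]))
J_hat = forward(critic_net, np.concatenate([x, u]))
tau = forward(action_net, x)
```

During learning, the three forward passes above are chained: the action network proposes a torque, the state network predicts its consequence, and the critic scores the pair.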

#### 3.2.1 State Network

#### 3.2.2 Critic Network

#### 3.2.3 Action Network

#### 3.2.4 Weights Convergence Analysis

It is important that the three subnetworks work together and that the whole network remains stable. Following Lyapunov stability theory, we define a Lyapunov function for the model and show that it is always decreasing; hence, the learning algorithm proposed in this letter is convergent. The details of the proof are given in the appendix.
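Schematically, such a discrete-time Lyapunov argument takes the following shape. This is an illustrative form only, written over weight-estimation errors $\tilde{W}_s$, $\tilde{W}_c$, $\tilde{W}_a$ of the three subnetworks; the letter's actual candidate function and error terms are developed in the appendix.

```latex
% Schematic Lyapunov candidate over the weight-estimation errors
% of the state, critic, and action networks (illustrative form only).
V(k) = \operatorname{tr}\!\big(\tilde{W}_s^{\top}\tilde{W}_s\big)
     + \operatorname{tr}\!\big(\tilde{W}_c^{\top}\tilde{W}_c\big)
     + \operatorname{tr}\!\big(\tilde{W}_a^{\top}\tilde{W}_a\big),
\qquad
\Delta V(k) = V(k+1) - V(k) < 0.
```

If $\Delta V(k) < 0$ whenever the errors exceed some bound, the errors are uniformly ultimately bounded, which is the conclusion drawn in the appendix.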

## 4 Simulation

In accordance with engineering needs, the physical parameters of the two-link robotic manipulator and the corresponding values used in the simulation are listed in Table 1.

| Symbol | Parameter | Value |
|---|---|---|
|  | Length of the first link | 0.2 m |
|  | Length of the second link | 0.2 m |
|  | Gravity acceleration | 9.8 m/s² |
|  | Mass of the first link | 2 kg |
|  | Mass of the second link | 1 kg |
|  | Maximum angular speed of the first link | 2 rad/s |
|  | Maximum angular speed of the second link | 2 rad/s |
|  | Maximum torque of the first link | 0.2 N·m |
|  | Maximum torque of the second link | 0.2 N·m |


Through numerical calculation, the existence of can be confirmed for , a result guaranteed by the property shown in section 2. Therefore, the algorithm proposed in this letter, which is based on the state equation of robotic manipulators, is valid. The complete process state is given by , and is the command input to the manipulator.

In the simulation, the dynamics and the parameters of the two-link robotic manipulator are not known to the network model. A dead zone (with threshold value chosen as 1 s) is added to the two-link manipulator to test the control performance of the proposed network model. The network treats the manipulator as a black box and searches by RL for a policy that optimizes the performance index.
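The dead zone itself has a standard form (Tao & Kokotovic, 1994): actuator inputs whose magnitude falls below a threshold produce no output. Below is a minimal sketch of a symmetric, unit-slope dead zone; the simulated manipulator's exact slopes and breakpoints are not restated here.

```python
def dead_zone(u, d):
    """Symmetric dead zone with threshold d and unit slope.

    Inputs with |u| <= d are swallowed; outside the band, the
    output is the input shifted by the threshold.
    """
    if u > d:
        return u - d
    if u < -d:
        return u + d
    return 0.0
```

Because this function is nondifferentiable at the breakpoints and flat inside the band, a controller that ignores it tends to show steady-state error and limit cycling, which is exactly the behavior the black-box RL policy must compensate for.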

On the basis of the discussion above and the features of this two-link robotic manipulator, the parameters for the network model are given in Table 2. The initial weights of the three networks are set randomly within a fixed interval. We trained the network for 500 trials, each lasting 100 seconds. To show the generalization ability of the proposed network model, the initial and final positions of the two-link manipulator are set randomly during training. Throughout the simulation, the angular positions, speeds, and accelerations are kept within bounds matching the physical limits of the manipulator. During the learning trials, the loss functions decrease quickly and stabilize after about 10 seconds (see Figure 3), demonstrating that the learning methods converge.
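The per-step errors that drive the weight updates can be sketched as below. These are assumed forms, a squared state-prediction error and a temporal-difference-style critic error; the letter's exact loss definitions appear in sections 3.2.1 to 3.2.3, and only the discount factor 0.85 is taken from Table 2.

```python
import numpy as np

GAMMA = 0.85  # discount factor (Table 2)

def state_loss(x_pred, x_next):
    """Squared prediction error of the state network."""
    return 0.5 * float(np.sum((x_pred - x_next) ** 2))

def critic_loss(J_prev, J_curr, reward):
    """TD-style critic error: J(t-1) should match r(t) + GAMMA * J(t)."""
    e = J_prev - (reward + GAMMA * J_curr)
    return 0.5 * e ** 2
```

A perfect critic, for which `J_prev` equals `reward + GAMMA * J_curr` at every step, yields zero loss; during training, gradient descent on these errors drives the curves in Figure 3 toward zero.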

| Symbol | Parameter | Value |
|---|---|---|
|  | Discount factor | 0.85 |
|  | Learning rates | 0.1 |
|  | Coefficient matrix in the performance index | 4 × 4 identity matrix |
|  | Coefficient matrix in the performance index | 2 × 2 identity matrix |
|  | Number of input units in the state network | 4 |
|  | Number of hidden units in the state network | 8 |
|  | Number of output units in the state network | 4 |
|  | Number of input units in the critic network | 6 |
|  | Number of hidden units in the critic network | 10 |
|  | Number of output units in the critic network | 1 |
|  | Number of input units in the action network | 4 |
|  | Number of hidden units in the action network | 10 |
|  | Number of output units in the action network | 2 |


After training, the learned network model is used to control the robotic manipulator. For the test experiment, the initial values of the two joint angles are set to 0, and their final values are set as and , respectively. Within 20 seconds, the angular positions of the robotic manipulator are steered to the desired final positions (see Figures 4a and 4b), and the trajectories of the joint angles are quite smooth. The angular speeds of the links increase first and then decrease to zero (see Figures 4c and 4d). The control signals produced by the network reach the maximal admissible torque of the manipulator at the beginning (see Figures 4e and 4f and Table 1), which allows the network to steer the manipulator quickly toward the desired positions. At about halfway to the desired positions, the control signals change sign to reduce the speeds of the links. When the manipulator is close to the desired positions, the control signals become very small, stabilizing the manipulator in the end. The control strategy learned by the network shows that the proposed model is able to solve the control task without a priori knowledge of the robotic manipulator.

## 5 Conclusion

This letter discusses the problem of controlling a class of robotic manipulator systems with unknown parameters and a dead zone. A neural network for reinforcement learning is presented to search for the optimal control policy. The network model comprises three distinct subnetworks: the state network, which predicts the state of the robotic manipulator; the critic network, which approximates the performance index; and the action network, which learns the optimal control policy by optimizing the performance index. Owing to its learning capability, the network model can cope with the unknown dynamics and unknown physical parameters of the robotic manipulator. The convergence of the learning rules of the proposed model is proved using Lyapunov stability theory. The effectiveness of the approach is illustrated by simulation results for a two-link manipulator. Applying the proposed network model to control a robotic manipulator with more than two links will be investigated in the future.

## Appendix: Proof of Weights Convergence

It remains to determine , , , and .

Therefore, is an upper bound of . The conditions for to be negative are , , , , , , .

Under these conditions, is always negative. According to the standard Lyapunov extension theorem (Khalil, 2002), the auxiliary error and the error in the weight updating are uniformly ultimately bounded, and the weights used in this letter are bounded, correspondingly. As a consequence, the algorithm proposed in this letter is convergent.

## Acknowledgments

This work was supported by the National Key Research and Development Program of China (2016YFC0801808). We thank the reviewers for their suggestions on this work.