## Abstract

One key idea behind morphological computation is that many difficulties of a control problem can be absorbed by the morphology of a robot. The performance of the controlled system naturally depends on the control architecture and on the morphology of the robot. Because of this strong coupling, most of the impressive applications in morphological computation typically apply minimalistic control architectures. Ideally, adapting the morphology of the plant and optimizing the control law interact so that finally, optimal physical properties of the system and optimal control laws emerge. As a first step toward this vision, we apply optimal control methods for investigating the power of morphological computation. We use a probabilistic optimal control method to acquire control laws, given the current morphology. We show that by changing the morphology of our robot, control problems can be simplified, resulting in optimal controllers with reduced complexity and higher performance. This concept is evaluated on a compliant four-link model of a humanoid robot, which has to keep balance in the presence of external pushes.

## 1 Introduction

The control of compliant robots is inherently difficult, due to their nonlinear stochastic dynamics. Morphological computation poses the vision that parts of the complexity of a control or learning task can be outsourced to the morphology of the robot [11, 12]. This has been illustrated in many impressive biologically inspired robotic applications. For example, Tedrake et al. [14] showed that the task complexity of learning to walk is drastically reduced by exploiting the dynamics of a passive biped walker. Iida and Pfeifer [8] demonstrated that the stability of a simple four-legged robot exhibits surprisingly robust behavior using a mixture of active and passive joints. For swimming [20] or flying [19] robots, investigations of the morphologies yield a rich behavioral diversity using only a single actuator exploiting the dynamics of the environment.

These approaches typically used minimalistic control architectures, like open loop actuation with sinusoidal drives [8, 20, 19] or a simple linear combination of state-dependent features [14]. The morphology is assumed to absorb much of the computation needed for controlling the robot, and therefore less emphasis is placed on the controller. More sophisticated control architectures were omitted, as these are typically more difficult to fine-tune for a new morphology. Thus, finding a good control law constitutes an individual optimization problem for each morphology, which renders the joint optimization of the morphology and the controller challenging. However, an autonomous system has to encompass both aspects—the morphology and the control architecture—for efficient learning of complex motor skills.

One remarkable exception to the typically used minimalistic control architectures is the theoretical study of the morphological contribution for compliant bodies in Hauser et al. [5]. Those authors have demonstrated in simulations for simple but nonlinear generic models of physical bodies, based on mass-spring systems, that by outsourcing the computation to the physical body, the difficult problem of learning to control a complex body could be reduced to a simple and perspicuous linear learning task. Thus, learning cannot get stuck in local minima of an error function. However, learning the optimal morphology and the optimal control law remains an open problem for many motor control tasks using real robots.

As a first step toward this vision, we propose to directly use optimal control methods for eliminating the dependence between the morphology of the robot and the control architecture. Thus, for evaluating the computational power of the morphology, we will always use the optimal control law connected to this morphology. This allows for a fair comparison of different morphologies and simplifies the search for an optimal morphology.

For determining our optimal control law for a given morphology, we can use one of the many tools provided by stochastic optimal control (SOC) methods, such as those of Todorov and Li [15], Kappen [6], and Toussaint [16]. These methods have been shown to be powerful approaches for movement planning in high-dimensional robotic systems. We will use approximate inference control (AICO) [16], as it is a state-of-the-art planning method for stochastic optimal control tasks. AICO is based on probabilistic inference for motor control. The beauty of this approach is that there is no distinction between sensor and motor, perception and action. We can include a multitude of variables, some of which might represent some features of the state, some of which might represent goals, constraints, or motivations in the future, and some of which might represent future actions or motor signals.

Like all other stochastic optimal control methods, AICO minimizes a cost function, which is in our case given by the quadratic distance to a target state and the energy used in the movement. In order to apply the AICO method to torque-constrained dynamical models, we will briefly explain in Section 2 how to extend the algorithm to systems with control limits.

Quantifying how much computation is done by the morphology of the robot and how much is done by the controller is often a difficult problem. Despite the advances in theoretical models of morphological computation [5] where the computational power of the plant can be assessed in a principled manner, this remains an open problem for many motor control tasks using real robots. Yet, by the use of optimal control laws, the computation done by the morphology can be quantified by investigating the complexity of the controller. Thus, if the controller needs to do a lot of computation, less computation is provided by the morphology in order to fulfill a task.

AICO also provides us with a time-varying linear feedback controller as a policy to generate movement trajectories. We will use the variance of the control gains as a complexity measure of the controller. If the control gains are almost constant in time, the control law is close to linear and does not perform much computation. However, if the control gains are highly varying, the controller needs to do a lot of computation, and therefore, less computation is provided by the morphology.
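This variance-based complexity measure can be sketched as follows (a hedged reading of the measure, not necessarily the paper's exact normalization: the temporal variance is averaged over all entries of the gain matrices **O**_{t}):

```python
import numpy as np

def gain_variance(O):
    """Complexity measure: the temporal variance of the time-varying
    feedback gains O_t, averaged over all gain-matrix entries.
    O has shape (T, D_u, D_x): one gain matrix per time step."""
    return float(np.mean(np.var(O, axis=0)))

# A constant (linear) controller has zero variance ...
O_const = np.tile(np.eye(2), (100, 1, 1))
# ... while a time-varying controller does not.
rng = np.random.default_rng(0)
O_vary = O_const + 0.1 * rng.standard_normal((100, 2, 2))
```

Under this measure, a constant gain schedule scores exactly zero, so any positive value directly quantifies how far the controller is from a plain linear feedback law.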

As different complexity measures are possible, we additionally use the final costs and the total jerk of a movement trajectory for comparison. We illustrate the power of morphological computation combined with optimal control on a dynamical model of a humanoid robot (70 kg, 2 m). The robot is modeled by a four-link pendulum, which has to keep balance in the presence of external pushes. We will show in Section 3 that on changing the morphology of a robot (e.g., the joint friction, the link lengths, or the spring constants), the resulting optimal controllers have reduced complexity. As we will demonstrate, there are optimal values for the physical properties for a given control task, which can only be found by the use of optimal control laws. With naive control architectures, different, suboptimal morphologies would be chosen.

### 1.1 Related Work

We propose to use optimal control methods to eliminate the dependence between the control architecture and the morphology of the robot. These SOC methods, such as those of Todorov and Li [15], Kappen [6], and Toussaint [16], have been shown to be powerful methods for controlling high-dimensional robotic systems. For example, the incremental linear quadratic Gaussian (iLQG) [15] algorithm is one of the most commonly used SOC methods. It uses Taylor expansions of the system dynamics and cost function to convert the nonlinear control problem in a linear dynamics, quadratic costs, and Gaussian noise (LQG) system. The algorithm is iterative—the Taylor expansions are recalculated at the newly estimated optimal trajectory for the LQG system.
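The local LQG approximation that iLQG (and later AICO) relies on can be illustrated with a finite-difference linearization of the dynamics; the toy dynamics `f_lin` below is an illustrative stand-in, not the robot model:

```python
import numpy as np

def linearize(f, x, u, eps=1e-5):
    """Finite-difference Taylor expansion of the dynamics x' = f(x, u)
    around (x, u), as used by iLQG-style methods to build a local LQG
    model. Returns A = df/dx, B = df/du, and the offset
    a = f(x, u) - A x - B u."""
    fx = f(x, u)
    A = np.zeros((len(fx), len(x)))
    B = np.zeros((len(fx), len(u)))
    for i in range(len(x)):
        dx = np.zeros_like(x); dx[i] = eps
        A[:, i] = (f(x + dx, u) - f(x - dx, u)) / (2 * eps)
    for i in range(len(u)):
        du = np.zeros_like(u); du[i] = eps
        B[:, i] = (f(x, u + du) - f(x, u - du)) / (2 * eps)
    a = fx - A @ x - B @ u
    return A, B, a

# Example: for linear dynamics the local expansion is exact.
f_lin = lambda x, u: np.array([x[0] + 0.01 * x[1], x[1] + 0.01 * u[0]])
A, B, a = linearize(f_lin, np.zeros(2), np.zeros(1))
```

In the iterative scheme, this expansion is recomputed around the current trajectory estimate at every time step, which is exactly the re-linearization loop described above.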

An SOC problem can be reformulated as an inference problem in a graphical model [7, 16], which has the nice property that there is no distinction between sensor and motor, perception and action. The unknown quantities (i.e., the states and actions) can be efficiently inferred by using, for example, the AICO algorithm [16]. The graphical model is given by a simple dynamic Bayesian network with states **x**_{t}, actions **u**_{t}, and task variables **z**_{t} (representing the costs) as nodes; see Figure 1. In this graphical model, observations are denoted by shaded nodes, in contrast to the unknown random variables, which are indicated by circles. If beliefs in the graphical model are approximated as Gaussian, the resulting algorithm is very similar to iLQG. Gaussian message passing iteratively reapproximates local costs and transitions as LQG around the current mode of the belief within a time slice. One contrast with iLQG is that AICO uses forward messages instead of a forward rollout to determine the point of a local LQG approximation. Thus, AICO allows us to iterate belief reapproximation within a time slice until convergence. This may lead to faster overall convergence [16].

The dynamic Bayesian network shown in Figure 1 is fully specified by the conditional distributions encoded by the cost function and by the state transition model. As with most SOC methods [6, 15, 16], we assume that the cost function and the state transition model are known. However, for model learning, many types of function approximators can be applied [9, 10, 18]. For learning the cost function, inverse reinforcement learning methods such as those of Abbeel and Ng [1] and Boularias et al. [3] could be used. Since we demonstrate in this article how morphological computation and optimal control can benefit from each other in general, an extension to the more complex learning problems remains for future research.

The original formulation of the AICO method [16] does not consider torque limits, which are important for many robotic experiments as well as for the dynamic balancing experiments we consider in this article. Therefore we extended the algorithm, as is discussed in the next section.

## 2 Probabilistic Inference for Motor Planning

We use the probabilistic planning method AICO [16] as the optimal control method. Most applications of AICO are in the kinematic planning domain. Here, we want to apply AICO to fully dynamic, torque-controlled robot simulations. Therefore, we had to extend the AICO framework with control or torque limits, as is explained in the next sections.

### 2.1 Approximate Inference Control

Let **x**_{t} denote the state and **u**_{t} the control vector at time step *t*. A trajectory τ is defined as a sequence of state-control pairs, τ = 〈**x**_{1:T}, **u**_{1:T−1}〉, where *T* is the length of the trajectory. Each trajectory has associated costs

*L*(τ) = ∑_{t=1}^{T} *c*_{t}(**x**_{t}, **u**_{t}),     (1)

where *c*_{t}(**x**_{t}, **u**_{t}) represents the cost function for a single time step, which is in our case given by the quadratic distance to a target state and the energy used for the movement. Only the cost function *c*_{1:T}(·) and the initial state **x**_{1} are known to the optimal control algorithm. The unknown trajectory τ is the result of the inference process.

To transform the minimization of *L*(τ) into an inference problem, an individual binary random variable *z*_{t} is introduced for each time step. This random variable indicates a reward event; its probability is given by

*P*(*z*_{t} = 1|**x**_{t}, **u**_{t}) ∝ exp(−*c*_{t}(**x**_{t}, **u**_{t})).

AICO now assumes that a reward event *z*_{t} = 1 is observed at every time step; see Figure 1. Given that evidence, AICO calculates the posterior distribution *P*(**x**_{1:T}, **u**_{1:T}|*z*_{1:T} = 1) over trajectories.

We will use the simplest version of AICO, where an extended Kalman smoothing approach is used to estimate the posterior *P*(**x**_{1:T}, **u**_{1:T}|*z*_{1:T} = 1). The extended Kalman smoothing approach uses Taylor expansions to linearize the system and subsequently uses Gaussian messages for belief propagation in a graphical model. Gaussian message passing iteratively reapproximates local costs and transitions as an LQG system around the current mode of the belief within a time slice. Concise derivations of the messages for AICO are given in Appendix 1, as they are used to further extend the algorithm to implement control constraints.

The result of the inference process is a time-varying linear feedback controller of the form **u**_{t} = **o**_{t} + **O**_{t}**x**_{t}, where **O**_{t} is the inferred feedback control gain matrix and **o**_{t} is the linear feedback controller term. For our evaluations we also use a complexity measure that is proportional to the variance of this feedback control law:

*v* = (1/(*D*^{u}*D*^{x})) ∑_{i=1}^{D^{u}} ∑_{j=1}^{D^{x}} Var_{t}[*O*_{t}(*i*, *j*)],     (2)

where the dimension of the states is denoted by *D*^{x} and the dimension of the controls is denoted by *D*^{u}.

AICO is only a local optimization method, and we have to provide an initial solution that is used for the first linearization. We will simply use the first state as the initialization: **x**_{2:T} = **x**_{1}.

If we use AICO with a constant cost and dynamic model for each time step, the algorithm reduces to calculating a linear-quadratic regulator (LQR), which is often used in optimal control. An LQR is the optimal linear feedback controller for a given linear system. In contrast, AICO uses a time-varying linear feedback controller, which may be different for each time step. In our experiments we will compare the two approaches on a dynamic nonlinear balancing task. Thus we evaluate the benefit of using AICO (time-varying linear feedback control) over an LQR (constant linear feedback control).
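The distinction between the two controller classes can be made concrete with the finite-horizon discrete-time Riccati recursion, which yields one gain per time step; far from the horizon end these gains approach the stationary (infinite-horizon LQR) solution. The double-integrator system below is an illustrative stand-in, not the four-link model:

```python
import numpy as np

def lqr_gains(A, B, Q, R, T):
    """Finite-horizon discrete-time LQR via the backward Riccati
    recursion. Returns one gain matrix per time step (u_t = -K_t x_t);
    an infinite-horizon LQR would instead use the single stationary
    gain that the recursion converges to."""
    P = Q.copy()
    Ks = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        Ks.append(K)
    return Ks[::-1]  # gains ordered from t = 0 to t = T-1

# Double-integrator example: gains vary near the end of the horizon,
# but are essentially stationary far away from it.
dt = 0.01
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Ks = lqr_gains(A, B, np.eye(2), np.array([[1.0]]), T=2000)
```

The early gains `Ks[0]` and `Ks[1]` coincide to numerical precision (the stationary LQR regime), while the final gain `Ks[-1]` differs markedly, which is the extra flexibility a time-varying controller such as AICO's can exploit.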

### 2.2 Cost Function for Constrained Systems

To incorporate control limits, we add an additional cost term *c*_{t}^{u}(**u**_{t}) to the cost function in Equation 1. Thus for a trajectory τ we specify the costs

*L*(τ) = ∑_{t=1}^{T} [*c*_{t}(**x**_{t}, **u**_{t}) + *c*_{t}^{u}(**u**_{t})].

We use this additional term to punish controls **u**_{t} that exceed a given bound **u**_{Bt} at time *t*; the term is quadratic in the amount by which the bound is violated (see Appendix 1 for the exact form). As a consequence, the resulting Gaussian distributions that are used to represent the costs *c*_{t} change. These distributions typically have zero mean in the control space, due to the typically used quadratic control costs. In the case of control limits, the mean of the distribution is nonzero. Consequently, the message-passing update equations used by the AICO algorithm also change. The exact message-passing equations of AICO with control limits are presented in Appendix 1.
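A minimal sketch of such a bound-violation penalty, assuming symmetric bounds ±**u**_{max} (matching the ± torque intervals of Section 3.1) and a single scalar `penalty` standing in for the nonzero entries of the diagonal weighting matrix of Appendix 1:

```python
import numpy as np

def limit_cost(u, u_max, penalty=1e3):
    """Quadratic penalty that is active only for control components
    exceeding the bound: components within [-u_max, u_max] contribute
    nothing, mirroring the diagonal weighting matrix whose entries are
    zero for satisfied constraints."""
    excess = np.maximum(np.abs(u) - u_max, 0.0)  # symmetric bounds here
    return float(penalty * np.sum(excess ** 2))

u_max = np.array([70.0, 500.0, 500.0, 250.0])  # Nm, from Section 3.1
```

For example, a 80 Nm ankle torque violates its 70 Nm limit by 10 Nm and is charged `penalty * 10**2`, while any torque vector inside the limits incurs zero extra cost.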

## 3 Experiments

We investigate morphological computation combined with optimal control on dynamic nonlinear balancing tasks [2], where a robot gets pushed with a specific force and has to move so that it maintains balance. The optimal strategy is a nonlinear control law that returns the robot to the upright position. For our evaluations, different initial postures and multiple forces are used, as sketched in Figure 2. We study the morphological computation by changing different physical properties of a four-link model of a humanoid robot. In particular, we investigate the computation done by different friction coefficients and link lengths. Inspired by the work of Hauser et al. [5], we also incorporate springs into our model for analyzing different spring constants.

### 3.1 Setting

We use a four-link robot as a simplistic model of a humanoid (70 kg, 2 m) [2]. The eight-dimensional state **x**_{t} is composed of the ankle, the knee, the hip, and the arm positions and their velocities. Table 1 in Appendix 2 shows the initial velocities (resulting from the force *F*, which always acts at the shoulder of the robot) and the valid joint angle range for the task. In addition to the joint limits, the controls are limited to the intervals [±70, ±500, ±500, ±250]Nm (ankle, knee, hip, and arm). For more detail we refer to Atkeson and Stephens [2].

The movement trajectories are simulated with a time step of 0.5 ms. For the AICO algorithm, we use a planning time step of Δ*t* = 5 ms and a horizon of *T* = 500, which results in a movement duration of 2.5 s. We use a torque-dependent noise model: the controls are multiplied by Gaussian noise ϵ with mean 1 and standard deviation 0.25, that is, we use 25% torque-dependent noise. The noise affects the system dynamics while simulating a trajectory. The experiments are performed for multiple forces sampled from a fixed distribution, as shown in Figure 2.
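The torque-dependent noise model can be sketched directly; `noisy_torque` is an illustrative helper name, and the sample-based check only verifies the stated mean and standard deviation:

```python
import numpy as np

def noisy_torque(u, rng, noise=0.25):
    """Multiply each control by Gaussian noise with mean 1 and standard
    deviation 0.25, i.e., 25% torque-dependent noise: larger torques
    receive proportionally larger disturbances."""
    return u * rng.normal(1.0, noise, size=u.shape)

rng = np.random.default_rng(0)
u = np.array([70.0, 500.0, 500.0, 250.0])  # torque limits from above
samples = np.array([noisy_torque(u, rng) for _ in range(20000)])
```

Because the noise is multiplicative, its standard deviation scales with the commanded torque, which is what makes energy-hungry control strategies intrinsically riskier in this model.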

### 3.2 Complexity Measures

We evaluate the average values of the final costs *L*(τ) in Equation 1, the variance of the time-varying controller gains returned by the AICO approach in Equation 2, and the total jerk of the trajectory (proportional to the squared derivative of the torque). Different complexity measures would be possible. However, the chosen ones are plausible, since the complexity of controlling the robot is reflected by how well the costs *L*(τ) are optimized. As the jerk of a movement tracks the derivatives of the torques, this also seems to be a reliable complexity measure. Finally, the variance of the time-varying controller gains quantifies the complexity of the control law in comparison with linear controllers (no variance).
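The jerk-style measure can be sketched as the summed squared finite-difference derivative of the torque trajectory (one reading of "proportional to the squared derivative of the torque", not necessarily the paper's exact normalization):

```python
import numpy as np

def total_jerk(torques, dt):
    """Jerk-style complexity measure: sum of squared finite-difference
    derivatives of a torque trajectory of shape (T, D_u)."""
    du = np.diff(torques, axis=0) / dt
    return float(np.sum(du ** 2) * dt)

# A smooth torque profile scores lower than a jagged one of the same
# magnitude; 500 steps at dt = 5 ms matches the 2.5 s horizon above.
t = np.linspace(0.0, 2.5, 500)[:, None]
smooth = np.sin(2 * np.pi * t / 2.5)
jagged = np.sign(smooth)
```
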

### 3.3 Friction Induces Morphological Computation

In this experiment, we evaluate the influence of the friction coefficient γ on the quality of the optimal controller for the four-link model. The friction coefficient directly modulates the acceleration of the joints.

Figure 3a shows the resulting final costs for different friction coefficients. In order to illustrate the need for sophisticated control architectures, we compare the optimal control method AICO with an LQR controller and a simple constant linear feedback controller. While AICO calculates the optimal control law for the nonlinear system, the LQR controller linearizes the system at the upright position and subsequently calculates the closed-form solution of the optimal controller for the resulting LQG system. The constant linear feedback controller does not adapt to the morphology of the robot. As we can see from Figure 3a, we cannot determine the optimal morphology with the simple linear controller, while we can recognize a clear minimum with the AICO and LQR controllers.

AICO could find the simplest control law according to the jerk criterion in Figure 3b and was also able to find control laws with considerably decreased costs. In Figure 3c, we illustrate the variance of the time-varying controller gains. The complexity of the resulting controllers is reduced by changing the friction coefficient. The minimum lies in the same range as the minimum of the cost function. Thus, in this case, a simpler controller resulted in better performance because the morphology (the friction) of the robot has simplified the control task. Still, we needed a more complex optimal control algorithm to discover this simpler control law than that given by the LQR.

In Figure 4 we depict the joint and torque trajectories for the hip and the arm joint, where we applied the friction coefficients γ = 0, γ = 12, and γ = 30. With γ = 0, the trajectories overshoot as shown in Figure 4a, whereas in Figure 4c more energy is needed for stabilizing the robot with γ = 30. The trajectories for the optimal friction coefficient γ = 12 (according to the jerk complexity measure) are shown in Figure 4b.

### 3.4 The Influence of the Link Lengths

To show the effects of morphological computation on a more complex variation of morphologies, we consider in this experiment the variation of the shank (i.e., the calf), the thigh, the trunk, and the arm length, **l** = [*l*_{1}, *l*_{2}, *l*_{3}, *l*_{4}]. Initially the link lengths are **l**_{0} = [0.5, 0.5, 1, 1] m and the weights of these links are specified by **m**_{0} = [17.5, 17.5, 27.5, 7.5] kg.

We consider variations of **l** of up to 50% from their initial configuration. Different link lengths of the four-link robot result in different initial velocities [2], which influences the complexity of the motor control problem. For this reason, we consider constant inertias for a fair comparison. Thus, for each link length **l** the masses of the links are given by **m** = **m**_{0}**l**_{0}^{2}/**l**^{2}. To match the physical properties of humans we assume equal lengths for the shank and the thigh link: *l*_{1} = *l*_{2}.
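The mass rescaling rule keeps each link's *m*·*l*² product, and hence a thin-rod-style inertia, constant; a minimal sketch with the initial values from above:

```python
import numpy as np

# Keeping link inertias constant while varying link lengths: an inertia
# that scales as m * l**2 stays fixed if the masses are rescaled as
# m = m0 * l0**2 / l**2 (elementwise), as described in the text.
l0 = np.array([0.5, 0.5, 1.0, 1.0])      # initial link lengths [m]
m0 = np.array([17.5, 17.5, 27.5, 7.5])   # initial link masses [kg]

def rescaled_masses(l):
    return m0 * l0**2 / l**2

m = rescaled_masses(1.2 * l0)  # 20% longer links -> lighter links
```
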

In addition to the link lengths of the thigh, the trunk, and the arm, we still optimize the friction coefficient, resulting in four parameters (*l*_{2}, *l*_{3}, *l*_{4}, and γ) of our morphology. A naive search for optimal parameters as in the previous experiment is intractable in the multidimensional case. Therefore, we apply the nonlinear stochastic optimizer of Hansen et al. [4] as a local search method.
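As a stand-in for the CMA-ES optimizer of Hansen et al. [4], a much-simplified (1+λ) Gaussian local search conveys the idea (no covariance adaptation; the quadratic toy objective replaces the expensive AICO evaluation):

```python
import numpy as np

def local_search(cost, x0, sigma=0.1, iters=200, pop=10, seed=0):
    """Simplified sketch of a stochastic local optimizer: sample Gaussian
    perturbations around the current best parameter vector and keep the
    best sample (a (1+pop) evolution strategy without CMA-ES's
    covariance adaptation)."""
    rng = np.random.default_rng(seed)
    best = np.asarray(x0, dtype=float)
    best_c = cost(best)
    for _ in range(iters):
        cands = best + sigma * rng.standard_normal((pop, len(best)))
        costs = [cost(c) for c in cands]
        i = int(np.argmin(costs))
        if costs[i] < best_c:
            best, best_c = cands[i], costs[i]
    return best, best_c

# Toy objective standing in for the (expensive) AICO evaluation, with a
# minimum placed at the link lengths reported in the text.
quad = lambda p: float(np.sum((p - np.array([0.39, 1.18, 0.5])) ** 2))
p_opt, c_opt = local_search(quad, [0.5, 1.0, 1.0])
```

In the actual experiments each cost evaluation requires planning full trajectories with AICO, which is why a sample-efficient optimizer such as CMA-ES is preferable to this naive sketch.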

The stochastic optimizer locally explores the parameter space (*l*_{2}, *l*_{3}, and *l*_{4}) until convergence. For example, the resulting final cost values for different link lengths are illustrated in Figure 5 when using AICO as the optimal control law. For this experiment, we used a friction coefficient of γ = 12, and we averaged over multiple initial states and multiple forces. The optimizer converged to link lengths of *l*_{2} = 0.39, *l*_{3} = 1.18, and *l*_{4} = 0.5 for the thigh, the torso, and the arm.

For multiple friction coefficients, an evaluation of the presented complexity measures is shown in Figure 6. According to the final costs in Figure 6a, the best controller was found with a friction coefficient of γ = 9. For the jerk criterion in Figure 6b and the variance as complexity measure shown in Figure 6c, the minimum of the complexity measures lies in the same range.

The morphologies found by the optimizer for the AICO and the LQR controller are sketched in Figure 7 for the friction coefficients γ = 0, γ = 9, and γ = 30. The initial configuration of the robot is shown in Figure 7a and in Figure 7e. Interestingly, with increasing values of the friction coefficients, the link lengths change to compensate for the larger motor controls. This is shown in Figure 7b–d for the AICO controller and in Figure 7f–h for the LQR controller. AICO could again find significantly different morphologies from the LQR controller. As is illustrated in the labels of Figure 7, the morphologies found by AICO could produce lower costs than the morphologies found by LQR.

### 3.5 Robot Model with Linear Springs

Inspired by the theoretical analysis of Hauser et al. [5], we also evaluate the computational power of linear springs in our model. The springs are mounted on the four-link model as illustrated in Figure 8. These springs support the control of the robot, as they push it back into its upright position. Thus, the torques applied to the joints are given by the sum of the controls provided by the feedback controller and the spring forces, **u**_{t}^{∗} = **u**_{t} + diag(**k**)**Δl**_{s}. The vector **Δl**_{s} contains the displacements of the four springs, and **k** is a vector containing the four spring constants. Note that due to the calculation of the displacements, the springs act nonlinearly on the joint torques.
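The spring-augmented torque computation is a one-liner; the displacement values below are hypothetical:

```python
import numpy as np

def applied_torque(u, k, dl):
    """Effective joint torques with the supporting springs:
    u* = u + diag(k) @ dl, where dl holds the spring displacements."""
    return u + np.diag(k) @ dl

k = np.array([1.0, 1e3, 1e3, 1.0])        # initial spring constants k0
dl = np.array([0.02, -0.01, 0.03, 0.0])   # hypothetical displacements
```

With the stiff knee and hip springs (*k* = 10³), even centimeter-scale displacements contribute tens of newton-meters, which is why the springs can take over a substantial share of the balancing torques.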

The adjustable parameters of the morphology include the friction coefficient and the four spring constants **k** = [*k*_{1}, *k*_{2}, *k*_{3}, *k*_{4}] for the ankle, the knee, the hip, and the arm joint. Initially the spring constants are set to **k**_{0} = [1, 10^{3}, 10^{3}, 1]. For the optimization, a lower bound of 0.1**k**_{0} and an upper bound of 100**k**_{0} are used.

As in the previous experiment, for different friction coefficients γ, we optimize the spring constants using the stochastic search method of Hansen et al. [4]. The results for the final costs, the total jerk, and the variance of the feedback controller as complexity measures are shown in Figure 9. The benefit of the supporting springs is best illustrated by the final costs (AICO: 109.5; LQR: 122.2) in Figure 9a, which are drastically reduced compared to the final costs using the four-link model without springs (AICO: 146.9; LQR: 246.9) shown in Figure 3a.

The optimal spring constants strongly depend on the friction coefficient used. This is illustrated in Figure 10, where we compare the learned spring constants using AICO and LQR as control methods. With increasing friction, the spring constants increase to compensate for larger control costs.

### 3.6 Concluding Summary

We have shown that by using optimal control methods an optimal morphology could be found for the four-link robot model. This was evaluated on three tasks with increasing complexity: optimizing only the friction coefficient, optimizing the friction coefficient and the link lengths, and finally, optimizing the friction coefficient and the four spring constants. We can now also ask how beneficial it was to optimize the link lengths compared to optimizing the spring constants. This is best reflected by the achieved final costs in Figure 3a, Figure 6a, and Figure 9a, where the most complex task also performs best.

## 4 Conclusion

In this article, we have shown that optimal control and morphological computation are two complementary approaches that can benefit from each other. The search for an optimal morphology is simplified if we can calculate an optimal controller for a given morphology. This calculation can be done by new approaches from probabilistic inference for motor control, such as the approximate inference control algorithm [16]. By the use of optimal control methods, we have shown that for dynamic nonlinear balancing tasks, an appropriate setting of the friction, the link lengths, or the spring constants of the incorporated springs of the compliant robot can simplify the control problem.

We have demonstrated that there are optimal values for the physical properties for a given control task, which can only be found by the use of optimal control laws. With naive control architectures such as the evaluated constant linear feedback controller, which does not adapt to the morphology, different, suboptimal morphologies would be chosen.

In our studies, we assumed that we can always compute an optimal controller given the current morphology. If this is possible, then the morphological computation can be quantified using the proposed method. However, for computing optimal control laws (using, e.g., AICO), we need to know the costs at each time step, as well as the dynamics model. For more complex tasks, these costs and the model are usually unknown. However, as argued in the related work in Section 1.1, for model learning, many types of function approximators can be applied [9, 10, 18]. For learning the cost function, inverse reinforcement learning methods such as those of Abbeel and Ng [1], and Boularias et al. [3] could be used.

In the future, we plan to investigate more complex and more nonlinear tasks. In these cases the benefit of AICO in comparison with LQR controllers should be even more prominent. In the end we are planning to simultaneously evolve walking controllers based on AICO and the morphology of bipedal robots, such as a model of a planar walker.

## Acknowledgments

This article was written with partial support by the European Union projects FP7-248311 (AMARSI) and FP7-506778 (PASCAL2). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## Notes

In this article the immediate cost function is composed of the intrinsic costs and the constraint costs: *c*_{t}(**x**_{t}, **u**_{t}) + *c*_{p}(**x**_{t}, **u**_{t}).


### Appendix 1: Extension of the Approximate Inference Control Method

The original formulation of the AICO method [16] does not consider a linear term for the control costs. However, this is needed to encode torque limits, which are important for our dynamic balancing experiments, and hence we needed to extend the AICO algorithm.

The introduction of a linear term for the control costs yields not only a modified cost function, but also different update equations for the messages, and finally different equations for the optimal feedback controller. For completeness we will first recap the main steps in deriving the AICO method, and will then discuss the modifications to implement control constraints.

#### A1.1 Approximate Inference Control Without Torque Limits

The graphical model in Figure 1 encodes a joint distribution over trajectories and task variables,

*P*(*z*_{1:T}, **x**_{1:T}, **u**_{1:T}) = *P*(**x**_{1}) ∏_{t=1}^{T} *P*(*z*_{t}|**x**_{t}, **u**_{t}) *P*(**u**_{t}|**x**_{t}) ∏_{t=2}^{T} *P*(**x**_{t}|**x**_{t−1}, **u**_{t−1}),

where *P*(**u**_{t}|**x**_{t}) is the state-dependent prior for the controls, the distribution *P*(**x**_{t}|**x**_{t−1}, **u**_{t−1}) is the state transition model, and *P*(**x**_{1}) is the initial state distribution. Here, we assume that the prior of the controls is independent of the states, and thus we will simply write *P*(**u**_{t}|**x**_{t}) = *P*(**u**_{t}) for the rest of this appendix. The time horizon is fixed at *T* time steps. The binary task variable *z*_{t} denotes a reward event; its probability is expressed by *P*(*z*_{t} = 1|**x**_{t}, **u**_{t}) ∝ exp(−*c*_{t}(**x**_{t}, **u**_{t})), where *c*_{t}(**x**_{t}, **u**_{t}) is the immediate cost function^{1} for time step *t*. It expresses a performance criterion (like avoiding a collision, or reaching a goal).

AICO computes the posterior distribution *P*(**x**_{1:T}, **u**_{1:T}|*z*_{1:T} = 1) over trajectories, conditioned on observing a reward (*z*_{t} = 1) at each time step *t*. This posterior can be computed by using message passing in the given graphical model of Figure 1. To simplify the computations, the controls are integrated out, as detailed below.

The nonlinear system dynamics *f*_{Dyn} is approximated by a linear dynamics, quadratic costs, and Gaussian noise (LQG) system, by Taylor expansion [15, 16]:

**x**_{t+1} ≈ **A**_{t}**x**_{t} + **a**_{t} + **B**_{t}**u**_{t}.

Thus, the system is linearized along a given trajectory at every point in time. We will use *f*_{t} as shorthand for *f*_{Dyn}(**x**_{t}, **u**_{t}). Then, the state transition matrix **A**_{t} is given by **A**_{t} = ∂*f*_{t}/∂**x**_{t}, the control matrix **B**_{t} is **B**_{t} = ∂*f*_{t}/∂**u**_{t}, and the linear term reads **a**_{t} = *f*_{t} − **A**_{t}**x**_{t} − **B**_{t}**u**_{t}.

The cost function *c*_{t} is approximated by a quadratic form,

*c*_{t}(**x**_{t}, **u**_{t}) ≈ ½ **x**_{t}^{⊤}**R**_{t}**x**_{t} − **r**_{t}^{⊤}**x**_{t} + ½ **u**_{t}^{⊤}**H**_{t}**u**_{t}.

Note that there is no linear term for the control costs, as we only punish quadratic controls. We can now write *P*(*z*_{t} = 1|**x**_{t}, **u**_{t}) = *P*(*z*_{t} = 1|**x**_{t}) *P*(**u**_{t}) as

*P*(*z*_{t} = 1|**x**_{t}) = N[**x**_{t}|**r**_{t}, **R**_{t}],  *P*(**u**_{t}) = N[**u**_{t}|**0**, **H**_{t}],

where the distributions are given in canonical form. The canonical form of a Gaussian is used because numerical operations such as products or integrals are easier to calculate in this notation. The canonical form is indicated by the square-bracket notation and given by

N[**x**|**a**, **A**] ∝ exp(−½ **x**^{⊤}**A**&thinsp;**x** + **x**^{⊤}**a**).

A Gaussian in normal form can always be transformed into the canonical form by N(**x**|**a**, **A**) = N[**x**|**A**^{−1}**a**, **A**^{−1}]. For more details we refer to the Gaussian identities in Toussaint [17].
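The normal-to-canonical conversion can be checked numerically (a sketch under the stated convention N[x|a, A] ∝ exp(−x⊤Ax/2 + x⊤a); the density helpers are illustrative names):

```python
import numpy as np

def normal_pdf(x, mean, cov):
    """Gaussian density in normal (moment) form N(x | mean, cov)."""
    d = x - mean
    k = len(x)
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d))
                 / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov)))

def canonical_pdf(x, a, A):
    """Gaussian in canonical form N[x | a, A], normalized so that it
    matches the moment form N(x | A^{-1} a, A^{-1})."""
    cov = np.linalg.inv(A)
    return normal_pdf(x, cov @ a, cov)

# N(x | b, B) should equal N[x | B^{-1} b, B^{-1}] at every point.
b = np.array([1.0, -0.5])
B = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.2, 0.7])
```
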

The prior over the controls **u**_{t} is given by the control costs, that is, *P*(**u**_{t}) = N[**u**_{t}|**0**, **H**_{t}]. By integrating out the controls from our system dynamics we get the following state transition probabilities:

*P*(**x**_{t+1}|**x**_{t}, *z*_{t} = 1) = ∫ *P*(**x**_{t+1}|**x**_{t}, **u**_{t}) *P*(**u**_{t}) d**u**_{t} = N(**x**_{t+1}|**A**_{t}**x**_{t} + **a**_{t}, **Q**_{t} + **B**_{t}**H**_{t}^{−1}**B**_{t}^{⊤}),

where **Q**_{t} denotes the covariance of the Gaussian system noise and the integral was evaluated using a reformulation of the propagation rule in Toussaint [17].

As we can see, all distributions in the approximated LQG system in Equation 12 are Gaussian, and thus also all messages are Gaussians and can be calculated analytically. The resulting messages are given in Toussaint [16].
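The control marginalization can be verified by Monte Carlo sampling (toy matrices; the noise covariance `Q` and the scalar control precision `H` are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50000
A = np.array([[1.0, 0.01], [0.0, 1.0]])
B = np.array([[0.0], [0.01]])
Q = 1e-4 * np.eye(2)   # system noise covariance (assumed)
H = np.array([[4.0]])  # quadratic control cost -> control prior precision

# x' = A x + B u + w with u ~ N(0, H^{-1}) and w ~ N(0, Q); at x = 0 the
# marginal covariance of x' should be Q + B H^{-1} B^T.
u = rng.normal(0.0, np.sqrt(1.0 / H[0, 0]), size=(N, 1))
w = rng.multivariate_normal(np.zeros(2), Q, size=N)
samples = u @ B.T + w
cov_mc = np.cov(samples, rowvar=False)
cov_exact = Q + B @ np.linalg.inv(H) @ B.T
```
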

#### A1.2 Approximate Inference Control with Torque Limits

In order to implement torque and joint limits, we introduce an additional cost function *c*_{p}, which punishes the violation of the given constraints. The function *c*_{p} is just added to the current immediate costs. We use separate cost terms for control constraints *c*_{t}^{u} and joint constraints *c*_{t}^{x}, that is, *c*_{p}(**x**_{t}, **u**_{t}) = *c*_{t}^{x} (**x**_{t}) + *c*_{t}^{u}(**u**_{t}). Here, we will only discuss how to implement the function *c*_{t}^{u}(**u**_{t}) for the torque constraints; joint constraints are implemented similarly.

The cost term *c*_{t}^{u} is quadratic in **u** and punishes leaving the valid control limits of **u**. In order to implement the upper bound **u**_{max} for the torques, we use a cost function of the form

*c*_{t}^{u}(**u**_{t}) = ½ **u**_{t}^{⊤}**H**_{t}**u**_{t} + ½ (**u**_{t} − **u**_{max})^{⊤}**H**_{t}^{U}(**u**_{t} − **u**_{max}),

where the matrix **H**_{t} contains the quadratic control costs. The constrained costs are only imposed for the control variable *u*_{i} if the torque value exceeds the upper bound *u*_{max,i}. In order to do so, **H**_{t}^{U} is taken as a diagonal matrix where the *i*th diagonal entry is zero if *u*_{i} ≤ *u*_{max,i} and nonzero otherwise. The lower bound **u**_{min} is implemented likewise, using an individual diagonal matrix **H**_{t}^{L}.

Joint limits can be imposed similarly by using additional terms as costs for *c*_{t}^{x}(**x**_{t}). However, for joint limits, the update equations stay the same, because *P*(*z*_{t} = 1|**x**_{t}) already has a nonzero mean, denoted by **r**_{t} in Equation 9.

Since *c*_{t}^{u}(**u**_{t}) contains a nonzero linear term **h**_{t} (collecting the cross terms **H**_{t}^{U}**u**_{max} and **H**_{t}^{L}**u**_{min}), we get a new linear term in the transition dynamics. The forward and backward messages are the same as in Toussaint [16], except that the control-cost matrix **H**_{t} is replaced by **H**_{t} + **H**_{t}^{U} + **H**_{t}^{L}, and **a**_{t} is replaced by **a**_{t} + **B**_{t}(**H**_{t} + **H**_{t}^{U} + **H**_{t}^{L})^{−1}**h**_{t}.

The belief over the state at time *t* is obtained by multiplying the three canonical messages (the forward message, the backward message, and the task message) and transforming the product to normal form; its mean is given by *b*_{t}(**x**_{t}) = (**S**_{t} + **V**_{t} + **R**_{t})^{−1}(**s**_{t} + **v**_{t} + **r**_{t}). Furthermore, we use a shorthand for the corresponding covariance in Equation 14.

The optimal feedback controller is obtained by conditioning on the state, *P*(**u**_{t}|**x**_{t}) = *P*(**u**_{t}, **x**_{t})/*P*(**x**_{t}). After a reformulation of this conditional, we can obtain an optimal feedback controller of the form **u**_{t} = **o**_{t} + **O**_{t}**x**_{t}, where **O**_{t} is the feedback gain matrix and **o**_{t} the linear controller term.

Similarly to Toussaint [16], we use an iterative message-passing approach where we approximate the nonlinear system by linear dynamics, quadratic costs, and a Gaussian noise (LQG) system at the new mode of the trajectory. In Toussaint [16], this is done by using a learning rate on the current modes of the belief. However, in contrast to Toussaint [16], we also need an estimate of the optimal action **u**_{t} in order to impose the control constraints. Using a learning rate on the control action **u**_{t} turned out to be very ineffective, because feedback is attenuated. For this reason we will instead use a learning rate on the feedback controller.

The complete message-passing algorithm, considering state and control constraints, is listed in Algorithm 1. This is a straightforward implementation of Gaussian message passing in linearized systems, similar to an extended Kalman smoother.

In Toussaint [16] and Rawlik et al. [13], more time-efficient methods are presented, where for each time step the belief is updated until convergence, in contrast to updating all messages and iterating until the intrinsic costs *L*(τ) converge. The computational benefits of such an approach still need to be evaluated for our messages.

### Appendix 2: Four-Link Robot Model Specifications

For the four-link model, the eight-dimensional state is composed of the ankle, the knee, the hip, and the arm positions and velocities. In our experiments, the velocities are instantaneously affected by the applied force *F* [2]. These initial velocities and the valid joint angle ranges are shown in Table 1.

**Table 1.** Initial joint velocities (resulting from the force *F*) and valid joint angle ranges.

| Joint | Initial velocity | Lower bound | Upper bound |
|---|---|---|---|
| Ankle | +1.2 × 10^{−2} *F* | −0.8 | 0.8 |
| Knee | −7.4 × 10^{−2} *F* | −0.05 | 2.5 |
| Hip | +5.1 × 10^{−2} *F* | −2.0 | 0.1 |
| Arm | −4.2 × 10^{−2} *F* | −0.6 | 3.0 |

In addition, we use multiple initial joint angles. The ankle angle is given by ϕ_{1} = −α, the knee angle by ϕ_{2} = −α*f*, the hip angle by ϕ_{3} = α*f*, and the arm joint angle by ϕ_{4} = 2α*f*, where the angle α is sampled from a fixed range and the linear factor *f* is uniformly distributed. Four example initial states are illustrated in Figure 2.

## Author notes

Contact author.

Institute for Theoretical Computer Science, Graz University of Technology, Inffeldgasse 16b/1, Graz, 8010, Austria. E-mail: rueckert@igi.tugraz.at

Intelligent Autonomous Systems, Technische Universität Darmstadt, Hochschulstraße 10, 64289 Darmstadt, Germany. E-mail: neumann@ias.tu-darmstadt.de