## Abstract

Human movement differs from robot control because of its flexibility in unknown environments, robustness to perturbation, and tolerance of unknown parameters and unpredictable variability. We propose a new theory, risk-aware control, in which movement is governed by estimates of risk based on uncertainty about the current state and knowledge of the cost of errors. We demonstrate the existence of a feedback control law that implements risk-aware control and show that this control law can be directly implemented by populations of spiking neurons. Simulated examples of risk-aware control for time-varying cost functions as well as learning of unknown dynamics in a stochastic risky environment are provided.

## 1 Introduction

All movement by humans and animals is performed in a risky environment with mostly unknown and unpredictable dynamics. Awareness, prediction, and avoidance of risk are so fundamental to natural movement that we expect our movement to change when, for instance, we carry a cup of hot coffee, walk near a cliff, carry a baby, or move our arm between glassware on a table. If we observe another person whose movement does not change appropriately in response to risk, we consider him or her to be a risk taker, clumsy, or insensitive to his or her environment. Yet despite the fundamental importance of risk awareness for natural movement, most current theories of human motor control address this only minimally, if at all.

We use the term *risk* to indicate the expected cost of behavior. For example, in certain cases, risk is the product of the probability of error and the cost of error. It depends on the coincidence of high-cost states and uncontrollable or unpredictable components of movement (which might include noise, perturbations, or unknown dynamics). Successful avoidance of high-cost states, or uncontrolled dynamics near low-cost states, does not lead to high risk. The natural response to anticipated risk involves (1) choosing a path that avoids high-risk regions, (2) setting feedback or reflex responses to compensate for unexpected deviations toward high-risk regions, and (3) reducing movement variability and uncertainty.

Although risk is only rarely addressed in theories of human motor control, there is a large literature on risk within the field of optimal control theory. Risk (in the form of a cost function and noise model) is used for planning an optimal trajectory, but execution of that trajectory is then achieved using a fixed feedback controller (Chow, 1976; Davis & Vinter, 1985). This means that during movement, two different perturbations of equal magnitude away from the optimal trajectory will be pushed back toward that trajectory without considering whether there is a difference in the risk caused by those perturbations. Risk-sensitive optimal control (Nagai, 1996; Whittle, 1990) introduces an exponential nonlinearity in the cost function that allows optimization relative to risk-averse or risk-seeking behavior.

But the goal during execution remains control of a precalculated optimal trajectory. This structure results from simplifying assumptions including local linearity, additive gaussian noise, and certainty equivalence, meaning that the system behaves as if its best estimate of the current state is in fact the true current state. These assumptions are problematic as models of human movement. For example, there are many situations in which there is no single desired path (e.g., one might drive anywhere on a wide road), the noise is neither additive nor gaussian (e.g., quantized noise or action-dependent—multiplicative—noise), dynamics are nonlinear (e.g., muscle-like actuators and biological proprioceptors), and certainty equivalence does not hold (e.g., behavior in high-risk environments depends on the amount of sensory uncertainty).

More recently, optimal feedback control theory (Todorov & Jordan, 2002) reestimates the optimal trajectory at each time point, and by so doing, it automatically compensates for perturbations. A perturbation away from the desired trajectory may result in a corrective response toward the goal rather than back to the original trajectory. The computational complexity of continual reoptimization requires many simplifying assumptions, and current implementations do not take uncertainty or complicated cost functions into account. In particular, there is no change in behavior if the current state is uncertain or if there is increased variability or uncertainty in the results of actions. However, a full implementation of optimal feedback control (OFC; perhaps with precalculated perturbation responses) will be risk aware if it incorporates state uncertainty, stochastic dynamics, and arbitrary cost functions or prior beliefs about the value of outcomes.

None of the classical theories accounts for the common observation that muscle tone changes prior to an expected perturbation even if the perturbation has never previously occurred (Bouisset & Do, 2008; Massion, 1992). Reflexes are adjusted based on predicted risk so that corrective movements are preplanned. This human behavior differs from OFC (Todorov & Jordan, 2002) because in OFC, the response to perturbation is computed only after the perturbation has occurred. It can be thought of as “planning for errors” and is an obvious and necessary component of successful human behavior and motor performance.

Awareness, estimation, and responsiveness to risk improve during development. Every parent is familiar with a child placing a cup too close to the edge of a table, gripping a sharp object improperly, or not fixing an untied shoe. When challenged, the child will respond that nothing has gone wrong yet, so there is no reason to worry. But the parent, with greater experience, is aware of the risks, including the unlikely but nonzero probability of a disastrous outcome, and therefore will protect or guide the child to reduce the risk. Children must learn to estimate risk, including the probability of perturbation, the cost of unlikely errors, and the planning and reflexes necessary to reduce costs for rare perturbations or unexpected uncertainty.

Differences in risk sensitivity explain some of the differences in behavior between different humans. Estimation of risk involves estimation of probability of error and cost of error. Two people may differ in their assessment of either of these and therefore will differ in their planning behavior. For instance, differences in estimation of the probability of an unusually large wave (a “rogue wave”) may cause different people to sit nearer or farther from the water at the beach. Differences in the estimated cost of a broken leg may cause different people to choose whether to ski. Estimation is likely to be particularly difficult for rare events, since there will not be enough examples to calculate the true probability of occurrence or cost. Nevertheless, compensation for rare but potentially devastating events is an important element of survival, since fatal events do not permit acquisition of multiple samples for estimation of statistics.

Therefore, any complete understanding of human motor function must take into account risk awareness. In order to do so, we need a mathematical theory that can describe the interaction between risk, uncertainty, planning, and the response to perturbation. We have previously described the framework of likelihood operators (Sanger, 2010a, 2011) and here claim that this framework provides direct support to understanding and investigating risk-aware behavior in humans and animals. Furthermore, this framework has a direct link to neural implementations, so that it may predict not only behavior but also the internal representation and computation underlying that behavior. In the following sections, we describe the mathematical framework and show some examples of applications to understanding human behavior.

## 2 Theory

At least three elements contribute to risk: uncertainty about current state, uncertainty about the effect of actions, and cost of errors.

### 2.1 State Uncertainty

We represent state by a scalar or vector $x$ and the uncertainty about state by a probability distribution $p(x)$ representing our relative belief in each different possible value of the state. $p(x)$ is not the true probability of occurrence of the state, but rather our belief in the probability of occurrence. This belief is usually based on a set of prior measurements *s*, and thus $p(x)$ can be interpreted as $p(x \mid s)$, the conditional probability of state given all prior available information. State uncertainty may be due to random effects such as unmeasured noise in the observations or to deterministic effects such as quantized, band-limited, or intermittent observations.

### 2.2 Action Uncertainty

We represent action by a control variable $u$, and if we choose a particular action at a particular instant of time, it causes a change in state that depends on the choice of action and the current state:

$$\dot{x} = f(x, u) + w, \tag{2.1}$$

where $w$ is unpredictable noise. Since we do not know the state exactly but instead know only an estimate of the probability of state, we can use the Kramers-Moyal expansion (see Gardiner, 1985) to obtain a high-dimensional linear system with equivalent stochastic dynamics,

$$\frac{\partial p(x, t)}{\partial t} = L(u)\, p(x, t), \tag{2.2}$$

where $L(u)$ is a linear operator that depends on the choice of $u$. (A well-known example of such an operator $L$ for certain physical systems is the Fokker-Planck equation, also known as the forward Kolmogorov equation.) If $x$ takes on only a discrete set of values, then $L(u)$ is a matrix. If $x$ can take continuous values, then $L(u)$ is an operator, and we can write equation 2.2 as

$$\frac{\partial p(x, t)}{\partial t} = \int L(x, x'; u)\, p(x', t)\, dx'.$$

It may seem surprising that for any nonlinear system 2.1, the dynamics of $p(x, t)$ are always described by a linear equation of the form of equation 2.2. This fact becomes more intuitive if we consider the deterministic case in which $x$ is known so that $p(x, t)$ is a Dirac delta function. Then the “column” $L(\cdot, x)$ can specify any desired output function $\dot{p}(\cdot)$.

The diagonal elements $L(x, x)$ must always be nonpositive, meaning that probabilities cannot increase themselves; they can only increase due to the “flow” of values from other places, and they can only decrease by “flowing” elsewhere. In the discrete (matrix) case, this means $L_{ii} \le 0$, $L_{ij} \ge 0$ for $i \ne j$, and $\sum_i L_{ij} = 0$.

Note also that $L$ indicates the uncertainty in the effect of action. Even if we have exact knowledge of $x$ before we choose action $u$, the result will usually not permit exact knowledge of $x$ after the action. Mathematically, this means that even if $p(x, t)$ is a Dirac delta function that is nonzero only for a single value of $x$, the result of applying the linear operator $L$ will usually not be a delta function, and so there will be increases in the probability of many different values of $x$.

As an example, consider the “push-left” operator $L$ as the differential operator

$$\frac{\partial p(x, t)}{\partial t} = \frac{\partial p(x, t)}{\partial x}.$$

This is easier to understand in the matrix case. Suppose that $x$ can take on discrete values from $x_1$ to $x_N$. Then $p$ is an $N$-vector, and $L$ is an $N \times N$ matrix. The “push-left” operator as a matrix is

$$L = \begin{bmatrix} -1 & 1 & & \\ & -1 & 1 & \\ & & \ddots & \ddots \\ & & & -1 \end{bmatrix},$$

which means that each element decreases proportional to its own value while increasing proportional to the value to its right, until it is replaced by the value to its right. In general, you can produce many different types of operators this way. Consider a matrix operator that attracts toward a stable central region:

$$L = \begin{bmatrix} -1 & 0 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 & -1 \end{bmatrix}.$$

Discrete-time probability operators $P$ must satisfy $\sum_i P_{ij} = 1$ and $P_{ij} \ge 0$, so the requirements on the continuous-time $L$ operators in these examples are that the column sums must be zero and all off-diagonal elements must be nonnegative.
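These constraints are easy to check numerically. The following Python sketch (an illustrative discretization, not code from the paper) builds a push-left operator, verifies the column-sum and sign conditions, and evolves a density under $\dot{p} = Lp$; the left boundary entry is zeroed here so that probability is conserved rather than leaking off the edge (an implementation choice, not specified in the text):

```python
import numpy as np

N = 5
# "Push-left" operator: -1 on the diagonal, +1 on the superdiagonal,
# so each element drains into its left neighbor.
L = -np.eye(N) + np.eye(N, k=1)
L[0, 0] = 0.0     # left boundary: probability stops rather than leaking out

# Requirements on a continuous-time probability operator:
assert np.allclose(L.sum(axis=0), 0.0)             # column sums are zero
assert np.all(L[~np.eye(N, dtype=bool)] >= 0.0)    # off-diagonals nonnegative
assert np.all(np.diag(L) <= 0.0)                   # diagonal nonpositive

# Evolve p_dot = L p by forward Euler; total probability is conserved exactly
# because the column sums of L are zero.
p = np.zeros(N)
p[-1] = 1.0       # all mass starts at the rightmost state
dt = 0.1
for _ in range(1000):
    p = p + dt * (L @ p)

assert np.isclose(p.sum(), 1.0)
assert p[0] > 0.99   # mass has flowed to the leftmost state
```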

If the underlying state has more than one dimension, then the operators *L* are tensors and oscillatory or chaotic behavior is possible, but the same linear equation describing the stochastic dynamics continues to hold. It is important to realize that such behavior may lead to nonzero curl in the vector field describing the dynamics, and thus such dynamics cannot be the solution of an optimization problem described by a time-invariant cost function. In our case, such behavior would be desirable if the selected dynamics match the desired dynamics or if the time-varying cost function is itself oscillatory. This is a significant difference from standard optimal control (in which solutions must be curl free since they are derived from time-invariant or quasi-static cost functions), but similar to solutions obtained using active inference (Friston, 2011).

### 2.3 Cost of Errors

We represent the relative value of each state $x$ at time $t$ by a value function $v(x, t)$, where low values indicate high cost. We could allow the cost to depend on both the state and the action, as in $v(x, u, t)$, but since the total expected cost depends only on the average cost for each state, we would not gain anything by the added complexity. The total expected value is

$$V(t) = \int v(x, t)\, p(x, t)\, dx.$$

Now suppose we perform action $u$ and ask what the expected change in cost will be due to that action (where we are uncertain about the state we start in and the effect of the action). As Sanger (2010b) showed, we can write

$$\frac{dV}{dt} = \int v(x, t)\, \frac{\partial p(x, t)}{\partial t}\, dx = \int v(x, t)\, L(u)\, p(x, t)\, dx,$$

which we can write more succinctly if $v$ and $p$ are vectors and $L$ is a matrix as

$$\frac{dV}{dt} = v^T L(u)\, p.$$

### 2.4 Feedback Control

Suppose that the control $u$ is really a binary vector with only a single nonzero element that allows us to choose one discrete action at a time. Then we can write

$$\frac{\partial p}{\partial t} = \sum_i u_i L_i p,$$

where $L_i$ is the value of $L(u)$ when the $i$th element of $u$ is 1 and all other elements are 0. The expected cost (value) rate is

$$\frac{dV}{dt} = \sum_i u_i\, v^T L_i p,$$

and we can maximize it by choosing the control $u_i$ for which $v^T L_i p$ is greatest. For each control $u_i$, $v^T L_i p$ is the expected change in value if that control were to be activated, so this rule simply says to choose the control that is most likely to cause the greatest increase in expected value (or greatest decrease in cost): activate the single $u_i$ for which $v^T L_i p$ is maximal.

A more general control model might be the linear superposition (mixture of flow modes)

$$\frac{\partial p}{\partial t} = \sum_i u_i L_i p,$$

where the $u_i$’s are constrained so that the total output power $\sum_i u_i^2 = 1$. Then the optimal solution is to activate each control according to

$$u_i \propto v^T L_i p,$$

so that each control output is proportional to how much its activation will improve the value of the state. The overall structure is shown in Figure 1.
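The behavior of this feedback law can be illustrated with a toy one-dimensional system. In the Python sketch below (grid size, gain, and operators are illustrative assumptions, not taken from the paper), two drift operators push probability left and right, the value function peaks at the center of the grid, and each control rate is set proportional to its expected improvement in value, rectified at zero because firing rates cannot be negative:

```python
import numpy as np

# One-dimensional state on a grid; two "flow mode" operators.
N = 21
x = np.linspace(-1.0, 1.0, N)

L_left = -np.eye(N) + np.eye(N, k=1)    # drains each state into its left neighbor
L_left[0, 0] = 0.0                      # boundary fix: conserve probability
L_right = -np.eye(N) + np.eye(N, k=-1)  # drains each state into its right neighbor
L_right[-1, -1] = 0.0
ops = [L_left, L_right]

v = -x ** 2            # value function: best at the center of the grid
p = np.zeros(N)
p[0] = 1.0             # all probability mass starts at the left edge

dt, gain = 0.05, 5.0
for _ in range(4000):
    # Risk-aware feedback law: each control is activated in proportion to
    # the expected improvement in value v^T L_i p, rectified at zero.
    u = [gain * max(0.0, v @ (L @ p)) for L in ops]
    p = p + dt * sum(ui * (L @ p) for ui, L in zip(u, ops))

assert np.isclose(p.sum(), 1.0)
assert x @ p > -0.35   # the density has been driven toward the high-value region
```

No reference trajectory is specified anywhere: the density simply flows toward the region where the value function is high, and the flow shuts off once neither operator can improve expected value.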

For nonlinear systems such as equation 2.1, certainty equivalence does not hold, so the standard optimal solutions are incorrect. More important, in the absence of certainty equivalence, we need to keep track not just of the state but also of the complete probability of state $p(x, t)$. This is why our formulation is particularly useful. We use the estimated probability $p(x, t)$ as our “state” at each point in time. Because of this, we have the added benefit that the system dynamics can be described as a linear system of the form of equation 2.2, even though the underlying dynamics (see equation 2.1) are nonlinear.

Now, instead of a reference trajectory $x_d(t)$, we use a time-varying value function $v(x, t)$, and instead of the state $x(t)$, we use the time-varying probability density of state $p(x, t)$. The value function specifies not a single desired state but the relative value of every possible state. It is certainly possible to specify a single desired state with a mean-squared cost of error by setting $v(x, t) = -(x - x_d(t))^2$, but this is only one possibility, and using $v(x, t)$ gives much greater flexibility. For instance, $v(x, t)$ could indicate that the goal is to remain on a road, but where exactly on the road does not matter so long as we do not drive off the edge. In this case, $v(x, t)$ might have a constant value of zero on the road but a large negative value (or cost) everywhere else (see Friston & Ao, 2012, for a similar example). There is no need to specify a particular path down the road; all places on the road could have equal value. There is no need to waste energy trying to resist perturbations that do not push us off the road. By using time-varying value functions and state probability densities, we obtain a feedback controller that takes into account state uncertainty, control uncertainty, and a general cost function.

Suppose that for each control $u_i$ there is a cost proportional to the squared magnitude of control $u_i^2$. Then the total rate of change in value is

$$\frac{dV}{dt} = \sum_i \left( u_i\, v^T L_i p - k\, u_i^2 \right),$$

and we can maximize this by setting each control $u_i$ proportional to $v^T L_i p$.

Stability for this class of feedback controller is determined not by a feedback gain but by the properties of the operators $L_i$. For example, an operator for a first-order system will be stable if it generates a flow toward a single stable equilibrium value of the state, and unstable otherwise. An operator for a second-order system may implement a stable oscillation (limit cycle) or convergence to one or more stable equilibria. A linear combination of stable operators will always be stable, and linear combinations that include unstable operators may or may not be stable. When the combination is determined by a value-driven feedback law as in equation 2.20, then if the value function has a peak, operators will be selected that stabilize near that peak. However, if the value function is primarily a valley (a “keep-away” zone), then unstable operators may be chosen if they avoid the valley. In general, specification of a value function with a finite set of maxima will cause selection of operators that stabilize near these maxima.

### 2.5 Reflexes

Suppose we have implemented a risk-aware feedback control law of the form of equation 2.15. What effect does this have on the response to perturbations? It is often easiest to understand this by initially examining a classical feedback controller as in equation 2.21: $u = K(x_d - \hat{x})$. The response to an increase in the reference position $x_d$ is the same as the response to a decrease in the estimated state $\hat{x}$. In other words, a perturbation of the state in the negative direction will lead to compensation in the opposite direction, proportional to the gain constant $K$.

Now consider a risk-aware controller for a scalar state subject to a perturbation that pushes the density to the right, where the effect of $u$ is to push to the left. In this case, the operator $L$ can be written as the differential operator $L = \partial/\partial x$, so that activation of $u$ in the absence of perturbation would cause

$$\frac{\partial p}{\partial t} = u\, \frac{\partial p}{\partial x}.$$

Now we can write the feedback law as $u \propto v^T L p$. The second derivative of $p$ is low at the peaks and high at the valleys. So whenever a peak of $p$ lines up with a high value $v$ or a valley lines up with a low value, $v^T L p$ will be high. In other words, if the current state has high overall value (high values $v$ occur with high probability $p$), $u$ will increase. Conversely, if the current state has low overall value, $u$ will decrease. Since $u$ pushes left while the perturbation pushes right, this means that $u$ will be activated to resist the perturbation only if doing so maintains a high-value state. The amount of activation of $u$ will be proportional to how good the current state is, the size of the perturbation, how certain we are about the current state (high certainty means sharp peaks and broad valleys of $p$), how rapidly the value changes with state, and the effect of activating $u$ on restoring the state.

This demonstrates a very important property of reflex behavior in risk-aware control: *only perturbations that decrease value are resisted.* The purpose is not to stabilize the system but to increase value or prevent decrease in value. If a perturbation causes an unexpected increase in value, you certainly do not want to resist it. For instance, the controller will be very resistant to a perturbation that causes you to fall off a cliff, less resistant to a perturbation that causes you to gently brush a wall, zero for a perturbation that does not affect your progress, and negative (meaning it assists the perturbation) for a perturbation that increases safety or pushes you in the direction you want to go.

Of course, since the value function (, ) is time varying, the feedback control tends to keep you in high-value regions for the current value function. For example, if the purpose is to move slowly along a path toward a final goal and this is implemented by shifting the peak of the value function slowly along that path, then a perturbation that occurs in the middle of the path will result in restoring control toward the current peak of the value function, not toward the final goal. (This is the risk-aware version of equilibrium-point control (Bizzi, Accornero, Chapple, & Hogan, 1984).) Of course, movement could also be achieved by having a constant with a peak at the end point, but this would be appropriate only if the timing of movement or the time of arrival at the end point were unimportant to the success of the task. In general, the formulation proposed here allows a very flexible form of compliant control, in which effort is expended to stabilize movements only when stabilization is important to task performance.

### 2.6 Optimization

Note that this formulation does not solve classical optimal control problems. In optimal control, the cost is specified at some future time $T_f$ and the goal is to find a sequence of controls that achieves the best possible future outcome. Such problems are usually solved using some variation of the Hamilton-Jacobi-Bellman equation (Bellman, 1957), which generally works by estimating the current cost $v(x, t)$ in terms of the final cost $v(x, T_f)$ and the best guess as to the optimal future controls $u(t)$. Here, we assume that the current cost is known. We do not include any estimate of the relative cost of actions, so that all actions $u$ are equally easy to achieve and differ only in their effects on the probability of state $p(x, t)$.

Therefore, the control law in equation 2.15 is just that: a control law, not an optimizer. It is worthwhile to contrast the behavior of the control law used with a cost function to general stochastic optimization. As noted above, the goal of an optimizer is to estimate the current value function $v(x, t)$ in terms of the value at some future time, taking into account the possible cost of movement and in some cases the probability of errors or uncertainty in the effect of actions. In almost all cases, it is assumed that the true state is known exactly, although if it is not, then the value can be identified with the expected value (over $p(x, t)$) of $v(x, t)$. Solving the optimization problem in this way requires knowledge of future uncertainty in both the state and the effects of actions. This is usually solved only for uncertainty due to additive gaussian noise of known variance, which limits its applicability since uncertainty is frequently time varying.

In contrast, the control law here assumes that $v(x, t)$ is known but may have been computed without taking uncertainty into account. Uncertainty is incorporated in the control law through the use of $p(x, t)$ (which represents uncertainty in the current state) and the stochastic dynamics *L* (which represent uncertainty in the effect of actions for each state).

A full implementation of OFC (Todorov & Jordan, 2002) will have risk-aware behavior. For example, if a Bellman-type rule is used to precalculate the future discounted cost function $v(x, t)$ at each point in time, then OFC will force the state to descend the local gradient of $v(x, t)$, which is essentially the same as for the control law proposed here. However, most current implementations of OFC do not take control uncertainty, model uncertainty, or state uncertainty into account, so they may not plan appropriately risk-aware actions.

Furthermore, there is an essential difference in implementation. Risk-aware control suggests that the dynamics be selected in order to minimize the expected cost of perturbations or uncertainty. While this may be done by a feedback rule, it can also be done by changing the dynamics or mechanical properties themselves (e.g., by co-contraction of muscles or selection of a posture that resists perturbations in certain directions). In contrast, OFC depends on high-speed feedback control with instantaneous sensory estimates and actuator responses.

### 2.7 Computational Burden

In standard feedback controllers, state is represented by a vector of variables (usually high-precision floating-point numbers), and implementing the feedback gain requires a matrix multiplication where the size of the matrix is equal to the number of variables. In risk-aware control, state is represented by the complete joint density over all state variables, and the feedback gain requires a separate matrix multiplication for every control variable, where the size of the matrix grows exponentially with the number of state variables. Therefore, the computational burden is significantly higher, and full implementation is untenable for high-dimensional state spaces.

However, the operations required for risk-aware control can be easily parallelized, and the benefit of the higher complexity is that iterative optimization is unnecessary so long as the one-step-ahead cost function is available. Furthermore, response to perturbations is effectively precomputed, providing very rapid response to perturbations whenever they will have a significant effect on task performance.

Approximate methods can be used to speed up the calculations. For instance, the probability densities can be projected onto lower-dimensional spaces, similar to the Galerkin method (Galerkin, 1915) and finite-element approximations. Alternatively, since $v^T L_i p$ is the expected change in value from activating control $u_i$, a neural network approximation could be used to learn the relation between $v$, $p$, and the control $u_i$ that increases value. The matrix operators $L_i$ are usually very close to diagonal, so the matrix multiply operation can be approximated by a convolution in many cases.

### 2.8 Spiking Distributed Neural Networks

If risk-aware control is a model for biological movement under risk, it is interesting to speculate whether a neural-like implementation exists or even whether a neural-like implementation might provide increased efficiency compared with standard computational hardware. In this section, we explore whether populations of rate-coded (Poisson) spiking neurons can be used to represent dynamics under the risk-aware control framework.

The overall framework of risk-aware control maps nicely onto distributed neural representations. Because state is represented as a probability density rather than a value, it is most naturally expressed as a distributed representation in which different neurons code for the probability that the state is in different conditions (Ma, Beck, Latham, & Pouget, 2006; Pouget, Dayan, & Zemel, 2000; Sanger, 1998). The state itself is implicitly coded by which neurons are firing the most (indicating the highest probability of being in a particular state). A similar consideration holds for representation of the value function $v$. The matrix $L$ is represented by the connections between the neural populations representing $p$ and $\dot{p}$. The value of many different states can be simultaneously represented. Once information is located in distributed codes, calculation of simple optimization quantities such as $v^T L_i p$ becomes straightforward and automatic. Therefore, the neural circuitry of the motor system may be particularly well adapted to risk-aware control and spiking neural algorithms that implement it.

Recall that each control $u_i$ determines the relative contribution of the component $L_i$ of the dynamics. Suppose that we adjust the magnitude of $L_i$ so that each $u_i$ is between zero and one. Then we can achieve the same level of control by setting $u_i = 1$ at each time step with probability proportional to $v^T L_i p$. This provides an implementation in spiking neurons if we assume that each $u_i$ is a motor neuron and that it fires with appropriate probability. In this case, each dynamic controller $L_i$ is either fully on or fully off. But if each $u_i$ is a Poisson-distributed random variable with rate $\lambda_i$, then we have

$$E\!\left[\frac{\partial p}{\partial t}\right] = \sum_i \lambda_i L_i p,$$

so the average behavior is equivalent to a flexible mixture of dynamics $\sum_i \lambda_i L_i$. If there are enough neurons in a population, the average behavior will be a good approximation to the actual behavior. In this way, a group of rate-coded neurons can specify a flexible mixture of dynamics, and the rate of each neuron can be chosen so that control behaves in a risk-aware manner.

This is not the only possible way to map the firing of a neural population onto dynamics. It is possible that particular patterns of neural firing are associated with specific dynamics instead of the simple additive model in equation 2.19. In this case, we would need to consider the general form $\partial p / \partial t = L(u)\, p$, where *u* is the full pattern of neural firing, and recognize that *L* depends on the details of the pattern, including spike timing and spike coincidence. The advantage of the additive model is that each neuron can “decide” whether to fire using only local information, whereas in the full pattern model, the decision depends on knowledge and control of all other neurons in the population. Response to and selection of specific patterns of cortical activity seems to be one of the features of the basal ganglia (Mink, 1996), so perhaps the basal ganglia would be responsible for identifying the correct dynamics based on cortical population patterns.

### 2.9 Learning

There are many aspects to learning that must be considered. An important issue is the sensory representation itself. For instance, if $x$ takes only a set of discrete values $x_1, \ldots, x_N$, then a set of *N* neurons can represent $p(x)$ if each neuron *i* fires with a rate proportional to $p(x_i)$. Distributed coding of the probability density of state becomes very costly if there are more than a few relevant dimensions of the state or many possible values of the state. Therefore, learning efficient representations that reduce dimensionality or efficiently encode data in individual dimensions is very important for storage and to improve learning. For instance, $p(x)$ can be represented by a low-dimensional vector of coefficients $p_i = \int b_i(x)\, p(x)\, dx$ for tuning curves $b_i(x)$. In this case, the *i*th neuron fires with rate $p_i$. The value function $v$ may have a population representation, and for simplicity, we will assume that it uses the same tuning curves as $p$. We will not discuss sensory representation learning here but will focus instead on learning the sensorimotor transformations that are at the heart of neural control.

The main thing that must be learned for motor control is the effect of individual neurons on the dynamics. This is described by the operator *L*, which for our purposes is a matrix. We will continue to assume that *L* connects $p$ and $\dot{p}$, but we note that for appropriately chosen alternative bases $b_i$, there may be a different linear transformation that connects the low-dimensional representation of $p$ to the low-dimensional representation of $\dot{p}$.

One way to learn *L* is to observe $p$ and $\dot{p}$ and use a standard Hebbian-like algorithm, such as the Widrow-Hoff least-mean-squares (LMS) algorithm (Widrow & Hoff, 1960), to learn the mapping. This is very easy because of the distributed nature of the representation of $p$. As noted above, *L* is linear, and therefore a linear network suffices to learn it. A simple matrix learning algorithm similar to LMS is

$$\hat{L} \leftarrow \hat{L} + \eta\, (\dot{p} - \hat{L} p)\, p^T,$$

where $p$ and $\dot{p}$ are to be interpreted as vectors of firing rates of the neural populations, $\hat{L} p$ is the predicted value of $\dot{p}$, $\dot{p} - \hat{L} p$ is the prediction error, and $\eta$ is the learning rate.
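The outer-product update is easy to simulate. In this hypothetical sketch, a known “true” operator generates noiseless observations $\dot{p} = Lp$, and the LMS-style rule recovers it from random state densities (the operator, learning rate, and sampling scheme are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
L_true = -np.eye(N) + np.eye(N, k=1)   # unknown dynamics to be learned
L_true[0, 0] = 0.0

L_hat = np.zeros((N, N))               # prior: the action does nothing
eta = 0.5                              # learning rate

for _ in range(10000):
    p = rng.random(N)
    p /= p.sum()                       # a random observed state density
    p_dot = L_true @ p                 # observed change in the density
    err = p_dot - L_hat @ p            # prediction error
    L_hat += eta * np.outer(err, p)    # LMS / Widrow-Hoff outer-product update

assert np.allclose(L_hat, L_true, atol=0.05)
```

Note how the zero initialization matches the prior described below: before practice, the estimated effect of the control is that it does nothing, and the estimate sharpens only for densities resembling those actually visited.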

A different operator $L_u$ has to be learned for each control variable $u$ so that the correct control can be chosen. This requires considerable storage, but the subsequent calculation of $v^T L_u p$ is then quite fast. The estimate of $L_u$ will improve with practice. Initially it will be zero, which means that our prior assumption about the effect of $u$ is that it does nothing. Then, as we exert the control under different conditions, we gradually learn the results and refine the estimate of $L_u$. This is equivalent to learning an internal model, and it will be most accurate near values of the state that have been visited often.

In order to control the system, the value function $v$ is substituted for the observed change in state $\dot{p}$, and rather than using $L_u$ to predict $\dot{p}$, we instead use it to predict the resulting change in value $v^T L_u p$ that would result from activating $u$ in state $p$. Since this will result in reflex responses to perturbations, those reflex responses will be gradually modified as the dynamics $L_u$ are learned. It is interesting to speculate whether particular patterns of control early in learning reflect this process. For instance, if the $L_u$ matrix early in learning indicates significant uncertainty about the effect of control, stabilization against costly perturbations might require much higher activation levels than later in learning. This would be an explanation for high muscle tone and co-contraction during early phases of learning: they represent the appropriate response to uncertainty about the true dynamics (Osu et al., 2002; Thoroughman & Shadmehr, 1999).

## 3 Examples

The following simplified examples illustrate the use of the theory. We do not perform head-to-head comparisons against existing algorithms because we do not expect to outperform them. In particular, optimal control and optimal feedback control are, by construction, optimal with respect to their assumptions. Risk-aware control is an alternative formulation that may have applicability in a wider range of stochastic adaptive optimization problems, but it is primarily intended as a computational model of some aspects of biological motor control under risk.

Driving provides a good example of risk-aware control. During driving, the cost function is immediately visible since the road itself defines the safe and unsafe regions. Since the road can curve or obstacles can occur, the cost function is time varying, and the control system must respond to its variations. Although this is an optimization task since the vehicle must be steered to achieve minimum cost, it is not a Bellman-type optimization problem since the cost function is immediately visible and does not have to be derived from future reward. It is therefore a one-step look-ahead problem equivalent to stochastic feedback control.

In the driving simulations, the state is the lateral position $x$ of the vehicle, and the $L_i$ operator for the $i$th neuron implements the function

$$L_i p = -f_i \frac{\partial p}{\partial x} + \frac{g_i^2}{2} \frac{\partial^2 p}{\partial x^2},$$

where $f_i$ is the force exerted by neuron $i$ and $g_i$ is the standard deviation of the force. In Figure 2 the vertical axis is the position $x$, each vertical slice of the road is the cost function at that time point (shown in yellow), and the overlaid colors show the algorithm's estimated probability density at that time (red is high probability, blue is low).
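A drift-diffusion operator of this form can be sketched with finite differences on a grid. The grid size, force, and noise values below are illustrative assumptions; the check is that probability mass drifts in the direction of the applied force:

```python
import numpy as np

def drift_diffusion_operator(f, g, n=101, dx=0.1):
    """Finite-difference matrix for L p = -f dp/dx + (g**2 / 2) d2p/dx2."""
    L = np.zeros((n, n))
    for k in range(1, n - 1):
        # Central differences: advection (drift f) plus diffusion (noise g)
        L[k, k - 1] = f / (2 * dx) + g**2 / (2 * dx**2)
        L[k, k] = -(g**2) / dx**2
        L[k, k + 1] = -f / (2 * dx) + g**2 / (2 * dx**2)
    return L

x = np.linspace(-5, 5, 101)
p = np.exp(-x**2 / 0.5)
p /= p.sum()                      # initial density centered at x = 0
L = drift_diffusion_operator(f=1.0, g=0.5)

dt = 0.002
for _ in range(500):              # integrate p_dot = L p to t = 1
    p = p + dt * (L @ p)

mean_x = float((x * p).sum() / p.sum())   # mass drifts toward x ~ f * t
```

The density translates at the drift rate $f$ while spreading at a rate set by $g$, which is exactly the behavior the population of superimposed operators exploits.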

Figure 3 shows the effect of uncertainty. In this case, the road has two lanes. Driving on the lanes has the lowest cost, driving in the center divider has higher cost, and driving off the edges has the highest cost. In the low-noise case, the correct answer is to drive on one lane or the other. But in the high-noise case, the correct answer is to drive in the center divider in order to avoid falling off either edge of the road. Risk-aware control solves this optimization problem using its feedback law rather than a standard optimizer. In other words, feedback control in which each neural firing rate is proportional to the predicted change in value $v^{\mathsf T} L_i p$ automatically causes the change of behavior from driving in one lane to driving on the divider as the noise increases. The neurons and control system are the same as for Figure 2. The only difference here is the bimodal cost function due to the two possibilities on the road. Note that most linear optimization algorithms or feedback controllers would automatically drive in the center divider even in the low-noise case, because averaging the two lane solutions yields the divider.
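The lane-versus-divider switch can be illustrated numerically. In this sketch the cost values and noise levels are our illustrative assumptions, not taken from the figures; the position minimizing expected cost under Gaussian position noise moves from a lane to the center divider as the noise grows:

```python
import numpy as np

x = np.linspace(-4, 4, 801)
# Road spans |x| <= 2: zero-cost lanes, moderate-cost center divider at
# |x| < 0.5, and a high cost for driving off either edge.
cost = np.where(np.abs(x) > 2.0, 10.0,
                np.where(np.abs(x) < 0.5, 1.0, 0.0))

def expected_cost(m, sigma):
    # Risk at mean position m: cost averaged over Gaussian position noise
    w = np.exp(-(x - m) ** 2 / (2 * sigma ** 2))
    return float((cost * w).sum() / w.sum())

positions = np.linspace(-2, 2, 401)
best_low = positions[np.argmin([expected_cost(m, 0.2) for m in positions])]
best_high = positions[np.argmin([expected_cost(m, 1.2) for m in positions])]
# Low noise: the minimum lies inside a lane; high noise: on the divider.
```

With low noise the minimizer sits well inside a lane, far from both divider and edge; with high noise the symmetric off-road penalties dominate and the minimizer moves to the divider at $x = 0$.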

Figure 4 shows the effect of sudden changes in the value function as well as the effect of perturbations. As in the previous figures, 200 neurons attempt to keep the vehicle on the road. Here, the value of *L _{i}* for each neuron was not assumed to be known in advance but was learned from examples (using equation 2.33) prior to the test shown in the figure. In this case, the road is much narrower than the variability, so the vehicle is frequently off the road. Nevertheless, risk-aware control causes the vehicle to track the curve in the road that occurs halfway through the trial. In Figure 4b, the spikes are shown, and we can see a change in the population of spiking neurons that reflects the movement of the vehicle as it follows the road. Three-fourths of the way through, a perturbation suddenly pulls the vehicle off the road. The algorithm detects the deviation from optimal cost and rapidly corrects. This is seen in the “reflex” firing of a large subpopulation of the neurons that pushes the vehicle back onto the road and holds it there.

## 4 Conclusion

Robots are designed to move in predictable and controlled environments. They often fail when placed in new, unknown, or varying environments. Yet humans adapt rapidly and safely to environments they have never seen before. A human is rarely injured the first time on a trampoline, skiing, or fencing. We know to take appropriate precautions, and we know automatically how to move in order to protect ourselves in new and uncertain situations.

Risk-aware control provides a new way of thinking about the relationship between the cost and execution of motor actions. The risk is determined by the interaction of cost, variability, and uncertainty. Risk is always taken into account, whether it is known during movement planning or whether it arises suddenly and unexpectedly. The sources of uncertainty include variability in the effect of actions, external perturbations, and lack of knowledge of the current state.

Risk-aware control is a feedback controller based on current estimates of the value of states. It does not solve long-term Bellman-type optimization problems. Bellman optimization has been used successfully in noisy systems, and iterative solutions that include obstacle avoidance have been proposed (Hamilton & Wolpert, 2002). Risk-aware control does not solve such problems, but it also does not require iterative or annealed solvers. One solution for long-term optimization using risk-aware control is to use a Bellman-type optimization as an initial step, so that the cost function reflects future discounted reward. Another solution is to derive stochastic operators *L* that predict the future change in state $p$, although the uncertainty of these predictions would necessarily increase and linear superposition is not guaranteed. But even without such extensions, risk-aware control solves the short-term optimization problem, and aspects of its behavior would previously have been considered to represent the properties of an optimizing controller rather than a feedback controller.

The representation of risk-aware control has several novel elements that distinguish it from standard feedback control. Rather than use scalar variables for the state, it uses functions that indicate the probability of state. This allows consideration of multiple possibilities at all times, and it maps naturally to distributed neural representations. Instead of a single reference trajectory, risk-aware control uses a time-varying cost function. This allows ambiguous or “don’t care” conditions. For example, where on the road you drive may not matter so long as you stay on the road. It also allows for multiple safe regions, perhaps separated by unsafe regions, and it allows for representation of relative cost or value of different possible paths.

The controller can be implemented in a number of different ways, but it is most natural to use the superimposed dynamic controllers described here (Sanger, 2010a, 2010b). The choice of dynamics not only automatically implements the unperturbed control path but also specifies the reflex behavior that will correct for potential perturbations. The controller includes an estimate of the likelihood of perturbations (embedded in the *L* operator), and this estimate guides the choice of control to ensure that expected perturbations are minimized. There is thus a very close link between risk-aware feedback control and tunable reflexes. As a model of biological control, the superimposed dynamic controllers reflect the fact that the force generated by muscles depends on their length and velocity and that the force resulting from activation of any particular motor neuron depends on many different factors, not all of them predictable. In standard feedback control, state affects behavior only through the feedback loop, whereas in operator superposition, state can also affect behavior by modulating the effects of output commands.

It is interesting to contrast risk-aware control with optimal feedback control (Todorov & Jordan, 2002). OFC instantaneously calculates the optimal trajectory from the current estimated state, based on a cost function that specifies the value of future states. Because of the complexity of the equations, this is typically done in a deterministic environment and has difficulty taking into account uncertainty and perturbations. OFC requires high computational power, and there is no known neural implementation. But the greatest difference is that in risk-aware control, the dynamics are preplanned, so that the response to perturbation is determined before the perturbation occurs. OFC does not plan for unexpected perturbations but instead responds to perturbations only after they happen. The effect of the two different controllers may be the same, since both are capable of responding optimally to perturbations in the presence of known cost functions. In particular, both will resist perturbation only in dimensions that significantly affect task performance. Risk-aware control implements combinations of reflexes that anticipate possible future perturbation or uncertainty. This allows risk-aware control to take advantage of the natural properties of muscles in order to alter stiffness and viscosity ahead of time by cocontracting or otherwise predictively changing the impedance. But because OFC can solve Bellman-type optimization problems, it can generate much more complex responses that are associated with higher-order task goals. This avoids the potential for failure of risk-aware control that can occur when the short-term value of states is a poor predictor of long-term value, either because it is incorrect or because there is insufficient knowledge to estimate current value. It seems likely that some combination of preplanned anticipatory reflexes and online reoptimization is used to generate flexible human motor behavior.

It is also interesting to compare risk-aware control with active inference. The feedback law chooses each control $u_i$ to be the inner product of the value function $v$ and the predicted change in state $L_i p$. This can be interpreted as trying to select commands that make the predicted change in state most closely match the value function. Therefore the value function can be interpreted as the desired change in state, which closely matches the concept behind active inference. Furthermore, in the special case where the set of controllers is such that

$L_i p$ and $L_j p$ are orthogonal for $i \neq j$, are normalized, and span the full space, then the predicted response to the feedback control is

$$\dot p = \sum_i (v^{\mathsf T} L_i p)\, L_i p = v,$$

so the learning rule 2.33 is equivalent to

$$\hat L \leftarrow \hat L + \gamma\,(\dot p - v)\,p^{\mathsf T},$$

which shows that learning reaches equilibrium when the value function (the desired, or predicted, change in probability) is equal to the actual change in probability. This is again very similar to the principle in active inference, in which learning of the dynamics (forward model) depends on the difference between the predicted and actual outcomes. Thus, active inference generalizes some of the ideas presented here, and it allows the results here to be cast as an inference problem, in which the goal is to match the predicted to the actual change in density over states. It also provides an alternative neural mechanism (Shipp, Adams, & Friston, 2013).
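The orthonormal special case is easy to verify numerically. In the sketch below, orthonormal vectors $q_i$ stand in for the normalized $L_i p$ directions; superimposing them with gains $u_i = v \cdot q_i$ reproduces the value function exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
# Orthonormal columns q_i standing in for the normalized L_i p directions
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
v = rng.normal(size=n)        # value function
u = Q.T @ v                   # feedback gains u_i = v . q_i
p_dot_pred = Q @ u            # predicted change: sum_i u_i q_i
err = float(np.max(np.abs(p_dot_pred - v)))   # zero up to rounding
```

Because the $q_i$ form a complete orthonormal basis, the superposition is exactly the projection of $v$ onto the full space, so the predicted change in density equals $v$.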

Risk-aware control allows for safe behavior in an unpredictable environment. An important prediction of the theory is that behavior will be modified by perceived risk even if failure has not yet been experienced. This is particularly important for failures with permanent or life-threatening consequences; such failures must be avoided, and this avoidance cannot occur by learning through experience. It is interesting to speculate that the natural variability in human movement may be a way that our bodies allow us to experience perturbations that the environment has not yet applied or that might occur only very rarely. This would allow us to learn the effect of perturbations while not having to experience them in a truly dangerous environment. This might be particularly important for infant learning; although the infant is in a completely safe (and generally unperturbed) environment, it is nevertheless important for him or her to learn how to handle the less predictable natural environment that he or she will soon occupy.

Awareness of risk guides all of our actions, and this is essential for our survival in an unpredictable and potentially hostile environment. Flexibility in the face of a changing environment is characteristic of humans but not robots. Risk-aware control provides a computational model for a nonlinear stochastic feedback controller with a neural implementation that mimics the flexibility and responsiveness of human motor behavior under uncertainty.

## Acknowledgments

Support for this project was provided by the National Institute of Neurological Disorders and Stroke (NS069214) and the James S. McDonnell Foundation.