Abstract

Human movement differs from robot control because of its flexibility in unknown environments, robustness to perturbation, and tolerance of unknown parameters and unpredictable variability. We propose a new theory, risk-aware control, in which movement is governed by estimates of risk based on uncertainty about the current state and knowledge of the cost of errors. We demonstrate the existence of a feedback control law that implements risk-aware control and show that this control law can be directly implemented by populations of spiking neurons. Simulated examples of risk-aware control for time-varying cost functions as well as learning of unknown dynamics in a stochastic risky environment are provided.

1  Introduction

All movement by humans and animals is performed in a risky environment with mostly unknown and unpredictable dynamics. Awareness, prediction, and avoidance of risk are so fundamental to natural movement that we expect our movement to change when, for instance, we carry a cup of hot coffee, walk near a cliff, carry a baby, or move our arm between glassware on a table. If we observe another person whose movement does not change appropriately in response to risk, we consider him or her to be a risk taker, clumsy, or insensitive to his or her environment. Yet despite the fundamental importance of risk awareness for natural movement, most current theories of human motor control address this only minimally, if at all.

We use the term risk to indicate the expected cost of behavior. For example, in certain cases, risk is the product of the probability of error and the cost of error. It depends on the coincidence of high-cost states and uncontrollable or unpredictable components of movement (which might include noise, perturbations, or unknown dynamics). Neither successful avoidance of high-cost states nor uncontrolled dynamics near low-cost states leads to high risk. Natural response to anticipated risk involves (1) choosing a path that avoids high-risk regions, (2) setting feedback or reflex responses to compensate for unexpected deviations toward high-risk regions, and (3) reducing movement variability and uncertainty.

Although risk is only rarely addressed in theories of human motor control, there is a large literature on risk within the field of optimal control theory. Risk (in the form of a cost function and noise model) is used for planning an optimal trajectory, but execution of that trajectory is then achieved using a fixed feedback controller (Chow, 1976; Davis & Vinter, 1985). This means that during movement, two different perturbations of equal magnitude away from the optimal trajectory will be pushed back toward that trajectory without considering whether there is a difference in the risk caused by those perturbations. Risk-sensitive optimal control (Nagai, 1996; Whittle, 1990) introduces an exponential nonlinearity in the cost function that allows optimization relative to risk-averse or risk-seeking behavior.

But the goal during execution remains control of a precalculated optimal trajectory. This structure results from simplifying assumptions including local linearity, additive gaussian noise, and certainty equivalence, meaning that the system behaves as if its best estimate of the current state is in fact the true current state. These assumptions are problematic as models of human movement. For example, there are many situations in which there is no single desired path (e.g., one might drive anywhere on a wide road), the noise is neither additive nor gaussian (e.g., quantized noise or action-dependent—multiplicative—noise), dynamics are nonlinear (e.g., muscle-like actuators and biological proprioceptors), and certainty equivalence does not hold (e.g., behavior in high-risk environments depends on the amount of sensory uncertainty).

More recently, optimal feedback control theory (Todorov & Jordan, 2002) reestimates the optimal trajectory at each time point, and by so doing, it automatically compensates for perturbations. A perturbation away from the desired trajectory may result in a corrective response toward the goal rather than back to the original trajectory. The computational complexity of continual reoptimization requires many simplifying assumptions, and current implementations do not take uncertainty or complicated cost functions into account. In particular, there is no change in behavior if the current state is uncertain or if there is increased variability or uncertainty in the results of actions. However, a full implementation of optimal feedback control (OFC; perhaps with precalculated perturbation responses) will be risk aware if it incorporates state uncertainty, stochastic dynamics, and arbitrary cost functions or prior beliefs about the value of outcomes.

None of the classical theories accounts for the common observation that muscle tone changes prior to an expected perturbation even if the perturbation has never previously occurred (Bouisset & Do, 2008; Massion, 1992). Reflexes are adjusted based on predicted risk so that corrective movements are preplanned. This human behavior differs from OFC (Todorov & Jordan, 2002) because in OFC, the response to perturbation is computed only after the perturbation has occurred. It can be thought of as “planning for errors” and is an obvious and necessary component of successful human behavior and motor performance.

Awareness, estimation, and responsiveness to risk improve during development. Every parent is familiar with a child placing a cup too close to the edge of a table, gripping a sharp object improperly, or not fixing an untied shoe. When challenged, the child will respond that nothing has gone wrong yet, so there is no reason to worry. But the parent, with greater experience, is aware of the risks, including the unlikely but nonzero probability of a disastrous outcome, and therefore will protect or guide the child to reduce the risk. Children must learn to estimate risk, including the probability of perturbation, the cost of unlikely errors, and the planning and reflexes necessary to reduce costs for rare perturbations or unexpected uncertainty.

Differences in risk sensitivity explain some of the differences in behavior between different humans. Estimation of risk involves estimation of probability of error and cost of error. Two people may differ in their assessment of either of these and therefore will differ in their planning behavior. For instance, differences in estimation of the probability of an unusually large wave (a “rogue wave”) may cause different people to sit nearer or farther from the water at the beach. Differences in the estimated cost of a broken leg may cause different people to choose whether to ski. Estimation is likely to be particularly difficult for rare events, since there will not be enough examples to calculate the true probability of occurrence or cost. Nevertheless, compensation for rare but potentially devastating events is an important element of survival, since fatal events do not permit acquisition of multiple samples for estimation of statistics.

Therefore, any complete understanding of human motor function must take risk awareness into account. In order to do so, we need a mathematical theory that can describe the interaction between risk, uncertainty, planning, and the response to perturbation. We have previously described the framework of likelihood operators (Sanger, 2010a, 2011), and here we claim that this framework provides direct support for understanding and investigating risk-aware behavior in humans and animals. Furthermore, this framework has a direct link to neural implementations, so that it may predict not only behavior but also the internal representation and computation underlying that behavior. In the following sections, we describe the mathematical framework and show some examples of applications to understanding human behavior.

2  Theory

At least three elements contribute to risk: uncertainty about current state, uncertainty about the effect of actions, and cost of errors.

2.1  State Uncertainty

We represent state by a scalar or vector x and the uncertainty about state by a probability distribution p(x) representing our relative belief in each different possible value of the state. p(x) is not the true probability of occurrence of the state, but rather our belief in the probability of occurrence. This belief is usually based on a set of prior measurements y, and thus p(x) can be interpreted as p(x | y), the conditional probability of state given all prior available information. State uncertainty may be due to random effects such as unmeasured noise in the observations or to deterministic effects such as quantized, band-limited, or intermittent observations.

2.2  Action Uncertainty

We represent possible actions by a vector u, and if we choose a particular action at a particular instant of time, it causes a change in state that depends on the choice of action and the current state:
$$\dot{x} = f(x, u). \qquad (2.1)$$
Since we do not know the state exactly but instead know only an estimate of the probability of state, we can use the Kramers-Moyal expansion (see Gardiner, 1985) to obtain a high-dimensional linear system with equivalent stochastic dynamics,
$$\dot{p} = L(u)\, p, \qquad (2.2)$$
where L(u) is a linear operator that depends on the choice of u. (A well-known example of such an operator L for certain physical systems is the Fokker-Planck equation, also known as the forward Kolmogorov equation.) If x takes on only a discrete set of values, then L(u) is a matrix. If x can take continuous values, then L(u) is an operator, and we can write equation 2.2 as
$$\frac{\partial p(x, t)}{\partial t} = \int L(x, x', u)\, p(x', t)\, dx'. \qquad (2.3)$$
It may seem surprising that for any nonlinear system 2.1, the dynamics of p are always described by a linear equation of the form of equation 2.2. This fact becomes more intuitive if we consider the deterministic case in which x is known, so that p(x, t) is a Dirac delta function. Then the “column” L(x, x′, u) can specify any desired output function for the time derivative of p.
In order for equation 2.2 to preserve the fundamental property of probabilities (∫ p(x, t) dx = 1), the sum of each column of L must be zero. In the continuous case,
$$\int L(x, x', u)\, dx = 0 \quad \text{for all } x'. \qquad (2.4)$$
In order to preserve nonnegative probabilities, we also must have nonnegative off-diagonal elements, so L(x, x′, u) ≥ 0 whenever x ≠ x′. Therefore, the diagonal elements L(x, x, u) must always be nonpositive, meaning that probabilities cannot increase themselves; they can only increase due to the “flow” of values from other places, and they can only decrease by “flowing” elsewhere. In the discrete (matrix) case, this means L_ij ≥ 0 for i ≠ j, and L_ii ≤ 0.

Note also that L indicates the uncertainty in the effect of action. Even if we have exact knowledge of x before we choose action u, the result will usually not permit exact knowledge of x after the action. Mathematically, this means that even if p(x, t) is a Dirac delta function that is nonzero only for a single value of x, the result of applying the linear operator L will usually not be a delta function, and so there will be increases in the probability of many different values of x.

As an example, consider the action that pushes a scalar state x to the left at a rate of k units per second. Then the probability of any value of x after moving for time Δt looks like the probability of x + kΔt before. We can write this as
$$p(x, t + \Delta t) = p(x + k\,\Delta t,\; t), \qquad (2.5)$$
or as a differential equation,
$$\frac{\partial p(x, t)}{\partial t} = k\, \frac{\partial p(x, t)}{\partial x}, \qquad (2.6)$$
which we can write as ∂p/∂t = Lp if we identify L as the differential operator:
$$L = k\, \frac{\partial}{\partial x}. \qquad (2.7)$$
This is easier to understand in the matrix case. Suppose that x can take on the discrete values x_1, …, x_N. Then p is an N-vector, and L is an N × N matrix. The “push-left” operator as a matrix is
$$L = k \begin{bmatrix} -1 & 1 & & \\ & -1 & 1 & \\ & & \ddots & \ddots \\ & & & -1 \end{bmatrix}, \qquad (2.8)$$
which means that each element decreases proportional to its own value while increasing proportional to the value to its right, until it is replaced by the value to its right. In general, you can produce many different types of operators this way. Consider a matrix operator that attracts toward a stable central region:
$$L = k \begin{bmatrix} -1 & 0 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 & -1 \end{bmatrix}. \qquad (2.9)$$
Discrete-time probability operators P (for example, P = I + LΔt) must satisfy column sums of 1 and nonnegative entries, so the requirements on the L operators in these examples are that the column sums must be zero and all off-diagonal elements must be nonnegative.
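As a concrete illustration of these requirements, the following sketch (Python/NumPy; the grid size, gain, and the specific operators are illustrative choices rather than values taken from this article) constructs a push-left operator in the form of equation 2.8 and a center-attracting operator in the spirit of equation 2.9, checks the zero-column-sum and sign conditions, and propagates a density with forward-Euler steps of equation 2.2.

```python
import numpy as np

def push_left(N, k=1.0):
    """Push-left operator as in eq. 2.8: each bin drains into the bin on its left."""
    L = np.zeros((N, N))
    for j in range(1, N):              # column j sends its mass to row j - 1
        L[j - 1, j] = k
        L[j, j] = -k
    return L

def attract_center(N, k=1.0):
    """Operator whose flow carries mass from both edges toward the central bin."""
    c = N // 2
    L = np.zeros((N, N))
    for j in range(N):
        if j < c:
            L[j + 1, j], L[j, j] = k, -k      # left half: push right toward center
        elif j > c:
            L[j - 1, j], L[j, j] = k, -k      # right half: push left toward center
    return L

N = 11
for L in (push_left(N), attract_center(N)):
    assert np.allclose(L.sum(axis=0), 0.0)          # columns sum to zero (eq. 2.4)
    assert np.all(L - np.diag(np.diag(L)) >= 0.0)   # off-diagonal elements nonnegative

# Forward-Euler propagation of p under the attracting operator (eq. 2.2).
p = np.ones(N) / N
L = attract_center(N)
dt = 0.1
for _ in range(200):
    p = p + dt * (L @ p)

print(p.round(3))   # total probability is preserved; mass collects in the central bin
```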

If the underlying state has more than one dimension, then the operators L are tensors and oscillatory or chaotic behavior is possible, but the same linear equation describing the stochastic dynamics continues to hold. It is important to realize that such behavior may lead to nonzero curl in the vector field describing the dynamics, and thus such dynamics cannot be the solution of an optimization problem described by a time-invariant cost function. In our case, such behavior would be desirable if the selected dynamics match the desired dynamics or if the time-varying cost function is itself oscillatory. This is a significant difference from standard optimal control (in which solutions must be curl free since they are derived from time-invariant or quasi-static cost functions), but it is similar to solutions obtained using active inference (Friston, 2011).

2.3  Cost of Errors

We represent the cost of errors using a value function v(x, t), where more negative values of v mean that a state has a high cost or is undesirable and positive values of v indicate that a state is desirable or rewarded in some way. v(x, t) represents our belief in the cost of each state x, not necessarily the true cost (which might be difficult or impossible to estimate for rare states). Since we are uncertain of our current state, we know only p(x, t), so the expected cost is the scalar value
$$\bar{v}(t) = \int v(x, t)\, p(x, t)\, dx. \qquad (2.10)$$
We could potentially represent our uncertainty about cost using some other probability distribution over cost, p(v | x), but since the total expected cost depends only on the average cost for each state, we would not gain anything by the added complexity.
Now, we choose a particular action u and ask what the expected change in cost will be due to that action (where we are uncertain about the state we start in and the effect of the action). As Sanger (2010b) showed, we can write
$$\frac{d\bar{v}}{dt} = \frac{d}{dt}\int v(x)\, p(x, t)\, dx \qquad (2.11)$$
$$= \int v(x)\, \frac{\partial p(x, t)}{\partial t}\, dx \qquad (2.12)$$
$$= \int v(x)\, \big[L(u)\, p\big](x, t)\, dx, \qquad (2.13)$$
which we can write more succinctly, if v and p are vectors and L is a matrix, as
$$\dot{\bar{v}} = v^T L(u)\, p. \qquad (2.14)$$

2.4  Feedback Control

This suggests that if we choose u to maximize the expected value rate v^T L(u) p at each point in time, then this is the best way to reduce cost or increase reward. Clearly this does not address the problems that are resolved with optimal control theory, in which the expected reward over future states is optimized. For now, we consider only the one-step-ahead optimization (or we assume that the cost function already includes the expected discounted future value of each state), and we set
$$u = \arg\max_{u}\; v^T L(u)\, p. \qquad (2.15)$$
For example, suppose that u is really a binary vector with only a single nonzero element that allows us to choose one discrete action at a time. Then we can write
$$L(u) = \sum_i u_i\, L_i, \qquad (2.16)$$
where L_i is the value of L(u) when the ith element of u is 1 and all other elements are 0. The expected cost (value) rate is
$$\dot{\bar{v}} = \sum_i u_i\, v^T L_i\, p, \qquad (2.17)$$
and we can maximize it by choosing the control u_i for which v^T L_i p is greatest. For each control u_i, v^T L_i p is the expected change in value if that control were to be activated, so this rule simply says to choose the control that is most likely to cause the greatest increase in expected value (or greatest decrease in cost).
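A minimal sketch of this selection rule (Python/NumPy; the quadratic value function, the two shift operators, and the current belief are arbitrary choices made only for illustration): compute v^T L_i p for each candidate control and activate the one with the largest expected increase in value.

```python
import numpy as np

N = 11
x = np.arange(N, dtype=float)

def shift_operator(N, direction):
    """direction = -1 pushes mass toward lower states, +1 toward higher states."""
    L = np.zeros((N, N))
    for j in range(N):
        i = j + direction
        if 0 <= i < N:
            L[i, j], L[j, j] = 1.0, -1.0
    return L

L_ops = [shift_operator(N, -1), shift_operator(N, +1)]   # "push left", "push right"

v = -(x - 8.0) ** 2            # value function: the best states are near x = 8
p = np.zeros(N)
p[2] = 1.0                     # current belief: the state is almost surely x = 2

rates = [float(v @ (L @ p)) for L in L_ops]    # expected value rate v^T L_i p (eq. 2.17)
best = int(np.argmax(rates))
print(rates, "-> choose", ["push left", "push right"][best])   # chooses "push right"
```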
In general, maximization of the expected value rate can be achieved by finding values of u for which
$$\frac{\partial}{\partial u}\Big(v^T L(u)\, p\Big) = 0. \qquad (2.18)$$
A more general control model might be the linear superposition (mixture of flow modes),
$$\dot{p} = \sum_i u_i\, L_i\, p, \qquad (2.19)$$
where the u_i's are constrained so that the total output power satisfies Σ_i u_i² ≤ 1. Then the optimal solution is to activate each control according to
$$u_i \propto v^T L_i\, p, \qquad (2.20)$$
so that each control output is proportional to how much its activation will improve the value of the state. The overall structure is shown in Figure 1.
Figure 1:

Structure of the feedback controller. The overall structure is similar to a classical feedback controller, but the control law is u_i ∝ v^T L_i p rather than the standard linear proportional law u = K(x_d − x̂).


If we use equation 2.20 in a control law, this is a feedback controller, but not in the usual sense. In standard feedback control, there is a single desired state or a reference trajectory x_d(t), and the goal is to minimize the distance (usually mean-squared distance) from the desired state at each point in time. This works for linear systems because of the principle of “certainty equivalence”: for linear systems, the optimal control law can be derived knowing only the expected value of the state, and it does not depend on whether one is uncertain of the value of x. The error is given by the difference between the desired state x_d and the expected value x̂ of the true state x, and the control law is usually a linear function of the difference:
$$u = K\,(x_d - \hat{x}). \qquad (2.21)$$

For nonlinear systems such as equation 2.1, certainty equivalence does not hold, so the standard optimal solutions are incorrect. More important, in the absence of certainty equivalence, we need to keep track not just of the state x but also of the complete probability of state p(x, t). This is why our formulation is particularly useful. We use the estimated probability p(x, t) as our “state” at each point in time. Because of this, we have the added benefit that the system dynamics can be described as a linear system (equation 2.2), even though the underlying dynamics (see equation 2.1) are nonlinear.

Now, instead of a reference trajectory x_d(t), we use a time-varying value function v(x, t), and instead of the state x, we use the time-varying probability density of state p(x, t). The value function specifies not a single desired state but the relative value of every possible state. It is certainly possible to specify a single desired state with a mean-squared cost of error by setting v(x, t) = −(x − x_d(t))², but this is only one possibility, and using v(x, t) gives much greater flexibility. For instance, v(x, t) could indicate that the goal is to remain on a road, but where exactly on the road does not matter so long as we do not drive off the edge. In this case, v(x, t) might have a constant value of zero on the road but a large negative value (or cost) everywhere else (see Friston & Ao, 2012, for a similar example). There is no need to specify a particular path down the road; all places on the road could have equal value. There is no need to waste energy trying to resist perturbations that do not push us off the road. By using time-varying value functions and state probability densities, we obtain a feedback controller that takes into account state uncertainty, control uncertainty, and a general cost function.

In this discussion, we have not taken control cost into account, but for certain types of control cost, this is a straightforward extension. For example, suppose that associated with each control u_i there is a cost proportional to the squared magnitude of control, λ_i u_i². Then the total rate of change in value is
$$\dot{\bar{v}} = \sum_i u_i\, v^T L_i\, p \;-\; \sum_i \lambda_i u_i^2, \qquad (2.22)$$
and we can maximize this by setting each control u_i proportional to v^T L_i p, with gain 1/(2λ_i).
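The pieces above can be assembled into a small closed-loop sketch of equations 2.19, 2.20, and 2.22 (Python/NumPy; the road-shaped value function, the two shift operators, the gain 1/(2λ), and the Euler integration are all illustrative assumptions rather than the simulations reported later in this article): each control channel is driven in proportion to v^T L_i p, and the belief density evolves under the resulting mixture of flow operators.

```python
import numpy as np

N, dt, lam, steps = 41, 0.02, 0.5, 2000
x = np.linspace(-1.0, 1.0, N)

def shift_operator(N, direction, k=1.0):
    L = np.zeros((N, N))
    for j in range(N):
        i = j + direction
        if 0 <= i < N:
            L[i, j], L[j, j] = k, -k
    return L

L_ops = [shift_operator(N, -1), shift_operator(N, +1)]   # push left / push right

# Value function: zero on the "road" (|x| < 0.3), large cost off the road.
v = np.where(np.abs(x) < 0.3, 0.0, -10.0)

# Initial belief: a Gaussian bump centered off the right edge of the road.
p = np.exp(-0.5 * ((x - 0.45) / 0.15) ** 2)
p /= p.sum()

for _ in range(steps):
    # Each control is proportional to its expected improvement in value (eq. 2.20),
    # with the quadratic control cost of eq. 2.22 setting the gain 1/(2*lam).
    # Negative drives are clamped at zero (a channel cannot run in reverse).
    u = [max(0.0, float(v @ (L @ p)) / (2.0 * lam)) for L in L_ops]
    L_mix = sum(ui * L for ui, L in zip(u, L_ops))     # mixture of flow modes (eq. 2.19)
    p = np.clip(p + dt * (L_mix @ p), 0.0, None)
    p /= p.sum()                                       # guard against Euler round-off

print("belief mass on the road:", round(float(p[np.abs(x) < 0.3].sum()), 3))
```

Because the value function is flat on the road, the control signals vanish once the belief sits safely inside it; only density near the road edges generates corrective drive, which is exactly the compliant behavior described above.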

Stability for this class of feedback controller is determined not by a feedback gain but by the properties of the operators Li. For example, an operator for a first-order system will be stable if it generates a flow toward a single stable equilibrium value of the state, and unstable otherwise. An operator for a second-order system may implement a stable oscillation (limit cycle) or convergence to one or more stable equilibria. Linear combination of stable operators will always be stable, and linear combinations that include unstable operators may or may not be stable. When the combination is determined by a value-driven feedback law as in equation 2.20, then if the value function has a peak, operators will be selected that stabilize near that peak. However, if the value function is primarily a valley (a “keep-away” zone), then unstable operators may be chosen if they avoid the valley. In general, specification of a value function with a finite set of maxima will cause selection of operators that stabilize near these maxima.

2.5  Reflexes

Suppose we have implemented a risk-aware feedback control law of the form of equation 2.15. What effect does this have on the response to perturbations? It is often easiest to understand this by initially examining a classical feedback controller as in equation 2.21: u = K(x_d − x̂). The response to an increase in the reference position x_d is the same as the response to a decrease in the estimated state x̂. In other words, a perturbation of the state in the negative direction will lead to compensation in the opposite direction, proportional to the gain constant K.

The same is true for the nonlinear feedback control law u_i ∝ v^T L_i p, except that here, the value function v(x, t) determines the effective gain. Consider the scalar feedback controller from equation 2.20. A small rightward perturbation in x can be described as
$$\frac{\partial p(x, t)}{\partial t}\bigg|_{\mathrm{pert}} = -\epsilon\, \frac{\partial p(x, t)}{\partial x}. \qquad (2.23)$$
(To see this, note that everywhere p has positive slope, it will decrease, and everywhere it has negative slope, it will increase. The overall effect is to make the value of p at each point look a little more like the value to its left, so p will appear to slide to its right.) The effect of the rightward perturbation on the feedback will be
$$\Delta u = v^T L \left(-\epsilon\, \frac{\partial p}{\partial x}\right). \qquad (2.24)$$
Now suppose that the effect of activating u is to push x to the left. In this case, the operator L can be written as L = k ∂/∂x, so that activation of u in the absence of perturbation would cause
$$\frac{\partial p(x, t)}{\partial t} = u\, k\, \frac{\partial p(x, t)}{\partial x}. \qquad (2.25)$$
Now we can write
$$\Delta u = -\epsilon\, k \int v(x, t)\, \frac{\partial^2 p(x, t)}{\partial x^2}\, dx. \qquad (2.26)$$
The second derivative of p is negative at the peaks of p and positive at the valleys. So whenever a peak of p lines up with a high value of v or a valley lines up with a low value of v, Δu will be large and positive. In other words, if the current state has high overall value (high values of v occur with high probability p), u will increase. Conversely, if the current state has low overall value, u will decrease. Since u pushes left while the perturbation pushes right, this means that u will be activated to resist the perturbation only if doing so maintains a high-value state. The amount of activation of u will be proportional to how good the current state is, the size of the perturbation, how certain we are about the current state (high certainty means sharp peaks and broad valleys of p), how rapidly the value changes with state, and the effect of activating u on restoring the state.
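A small numerical check of equation 2.26 (Python/NumPy; the Gaussian belief and the two Gaussian value functions are illustrative, not taken from this article): the same rightward perturbation produces a positive response of the left-pushing control when the value peak coincides with the current belief, and a negative (assisting) response when the value peak lies in the direction of the perturbation.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 601)
dx = x[1] - x[0]
eps, k = 0.1, 1.0            # rightward perturbation size and push-left operator gain

def reflex_response(v, p):
    """Delta-u of eq. 2.26: -eps * k * integral of v(x) p''(x) dx."""
    p = p / (p.sum() * dx)                      # normalize the belief density
    p2 = np.gradient(np.gradient(p, dx), dx)    # numerical second derivative of p
    return float(-eps * k * (v * p2).sum() * dx)

p = np.exp(-0.5 * (x / 0.3) ** 2)               # belief sharply peaked at x = 0

v_here = np.exp(-0.5 * (x / 0.5) ** 2)            # value peaks at the current state
v_ahead = np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)   # value peaks to the right of it

print("value peak at current state   -> resist (positive):", reflex_response(v_here, p))
print("value peak in pert. direction -> assist (negative):", reflex_response(v_ahead, p))
```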

This demonstrates a very important property of reflex behavior in risk-aware control: only perturbations that decrease value are resisted. The purpose is not to stabilize the system but to increase value or prevent decrease in value. If a perturbation causes an unexpected increase in value, you certainly do not want to resist it. For instance, the controller will be very resistant to a perturbation that causes you to fall off a cliff, less resistant to a perturbation that causes you to gently brush a wall, zero for a perturbation that does not affect your progress, and negative (meaning it assists the perturbation) for a perturbation that increases safety or pushes you in the direction you want to go.

Of course, since the value function v(x, t) is time varying, the feedback control tends to keep you in high-value regions for the current value function. For example, if the purpose is to move slowly along a path toward a final goal and this is implemented by shifting the peak of the value function slowly along that path, then a perturbation that occurs in the middle of the path will result in restoring control toward the current peak of the value function, not toward the final goal. (This is the risk-aware version of equilibrium-point control (Bizzi, Accornero, Chapple, & Hogan, 1984).) Of course, movement could also be achieved by having a constant value function v(x) with a peak at the end point, but this would be appropriate only if the timing of movement or the time of arrival at the end point were unimportant to the success of the task. In general, the formulation proposed here allows a very flexible form of compliant control, in which effort is expended to stabilize movements only when stabilization is important to task performance.

2.6  Optimization

Note that this formulation does not solve classical optimal control problems. In optimal control, the cost is specified at some future time T_f, and the goal is to find a sequence of controls that achieves the best possible future outcome. Such problems are usually solved using some variation of the Hamilton-Jacobi-Bellman equation (Bellman, 1957), which generally works by estimating the current cost-to-go v(x, t) in terms of the final cost and the best guess as to the optimal future controls u(t). Here, we assume that the current cost is known. We do not include any estimate of the relative cost of actions, so that all actions u are equally easy to achieve and differ only in their effects on the probability of state p.

The basic elements of optimal control can be stated in terms of risk-aware nonlinear control laws by noting that we can write the expected change in value produced by a control using the inner product,
$$\dot{\bar{v}} = \langle v,\; L(u)\, p \rangle, \qquad (2.27)$$
since the value function v and the change in probability of state L(u)p are both either vectors or functions (in an inner-product space). As noted above, the goal is then to choose the sequence of controls in order to maximize the increase in value at every time step. Therefore we want
$$u(t) = \arg\max_{u}\; \langle v(\cdot, t),\; L(u)\, p(\cdot, t) \rangle \qquad (2.28)$$
at all times t. This is a simplified version of the Pontryagin maximum principle (Vincent & Grantham, 1999), in which the optimal solution occurs when the state is orthogonal to the costate at all times. In our formulation, v represents the costate (cost), but we assume that this is already known at each point in time.

Therefore, the control law in equation 2.15 is just that: a control law, not an optimizer. It is worthwhile to contrast the behavior of this control law, driven by a cost function, with general stochastic optimization. As noted above, the goal of an optimizer is to estimate the current value function v(x, t) in terms of the value at some future time, taking into account the possible cost of movement and in some cases the probability of errors or uncertainty in the effect of actions u. In almost all cases, it is assumed that the true state x is known exactly, although if it is not, then the value can be identified with the expected value (over p(x, t)) of v(x). Solving the optimization problem in this way requires knowledge of future uncertainty in both state and the effects of actions. This is usually solved only for uncertainty due to additive gaussian noise of known variance, which limits its applicability since uncertainty is frequently time varying.

In contrast, the control law here assumes that v(x, t) is known but may have been computed without taking uncertainty into account. Uncertainty is incorporated in the control law through the use of p(x, t) (which represents uncertainty in the current state) and the stochastic dynamics L (which represent uncertainty in the effect of actions for each state).

A full implementation of OFC (Todorov & Jordan, 2002) will have risk-aware behavior. For example, if a Bellman-type rule is used to precalculate the future discounted cost function at each point in time, then OFC will force the state to descend the local gradient of that cost function, which is essentially the same as for the control law proposed here. However, most current implementations of OFC do not take control uncertainty, model uncertainty, or state uncertainty into account, so they may not plan appropriately risk-aware actions.

Furthermore, there is an essential difference in implementation. Risk-aware control suggests that the dynamics be selected in order to minimize the expected cost of perturbations or uncertainty. While this may be done by a feedback rule, it can also be done by changing the dynamics or mechanical properties themselves (e.g., by co-contraction of muscles or selection of a posture that resists perturbations in certain directions). In contrast, OFC depends on high-speed feedback control with instantaneous sensory estimates and actuator responses.

2.7  Computational Burden

In standard feedback controllers, state is represented by a vector of variables (usually high-precision floating-point numbers), and implementing the feedback gain requires a matrix multiplication where the size of the matrix is determined by the number of variables. In risk-aware control, state is represented by the complete joint density over all state variables, and the feedback gain requires a separate matrix multiplication for every control variable, where the size of the matrix is exponential in the number of state variables. Therefore, the computational burden is significantly higher, and full implementation is untenable for a high-dimensional state space.

However, the operations required for risk-aware control can be easily parallelized, and the benefit of the higher complexity is that iterative optimization is unnecessary so long as the one-step-ahead cost function is available. Furthermore, response to perturbations is effectively precomputed, providing very rapid response to perturbations whenever they will have a significant effect on task performance.

Approximate methods can be used to speed up the calculations. For instance, the probability densities can be projected onto lower-dimensional spaces, similar to the Galerkin method (Galerkin, 1915) and finite-element approximations. Alternatively, since v^T L_i p is the expected change in value from activating control u_i, a neural network approximation could be used to learn the relation between v, p, and the u_i that increases value. The matrix operators L_i are usually very close to diagonal, so the matrix multiply operation can be approximated by a convolution in many cases.
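For example, a generator whose rows all apply the same near-diagonal stencil can be applied with a short convolution instead of a dense matrix multiply. The sketch below (Python/NumPy, with an arbitrary three-point stencil chosen only for illustration) confirms that the two computations agree.

```python
import numpy as np

N = 200
p = np.random.default_rng(0).random(N)
p /= p.sum()

# A banded generator applying the same stencil on every row; the stencil entries
# sum to zero, so interior columns of the matrix satisfy the zero-column-sum rule.
stencil = np.array([0.4, -1.0, 0.6])    # sub-diagonal, diagonal, super-diagonal

L = (np.diag(np.full(N - 1, stencil[0]), -1)
     + np.diag(np.full(N, stencil[1]), 0)
     + np.diag(np.full(N - 1, stencil[2]), 1))

dense = L @ p                                        # O(N^2) matrix multiply
conv = np.convolve(p, stencil[::-1], mode="same")    # O(N) convolution

print(float(np.max(np.abs(dense - conv))))   # ~0: both zero-pad beyond the edges
```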

2.8  Spiking Distributed Neural Networks

If risk-aware control is a model for biological movement under risk, it is interesting to speculate whether a neural-like implementation exists or even whether a neural-like implementation might provide increased efficiency compared with standard computational hardware. In this section, we explore whether populations of rate-coded (Poisson) spiking neurons can be used to represent dynamics under the risk-aware control framework.

The overall framework of risk-aware control maps nicely onto distributed neural representations. Because state is represented as a probability density rather than a value, it is most naturally expressed as a distributed representation in which different neurons code for the probability that the state is in different conditions (Ma, Beck, Latham, & Pouget, 2006; Pouget, Dayan, & Zemel, 2000; Sanger, 1998). The state itself is implicitly coded by which neurons are firing the most (indicating the highest probability of being in a particular state). A similar consideration holds for representation of the value function v. The matrix L is represented by the connections between the population coding p and the population coding its rate of change. The value of many different states can be simultaneously represented. Once information is located in distributed codes, calculation of simple optimization routines such as maximization of v^T L_i p becomes straightforward and automatic. Therefore, the neural circuitry of the motor system may be particularly well adapted to risk-aware control and spiking neural algorithms that implement it.

To make this precise, consider the linear superposition model of the dynamics given by equation 2.19. In this model, the value of u_i determines the relative contribution of the component L_i of the dynamics. Suppose that we adjust the magnitude of L_i so that each v^T L_i p is between zero and one. Then we can achieve the same level of control by setting u_i = 1 at each time step with probability proportional to v^T L_i p. This provides an implementation in spiking neurons if we assume that each u_i is a motor neuron and that it fires with appropriate probability. In this case, each dynamic controller L_i is either fully on or fully off. But if each u_i is a Poisson-distributed random variable with rate λ_i proportional to v^T L_i p, then we have
$$E[\dot{p}] = E\Big[\sum_i u_i\, L_i\, p\Big] \qquad (2.29)$$
$$= \sum_i E[u_i]\, L_i\, p \qquad (2.30)$$
$$= \sum_i \lambda_i\, L_i\, p, \qquad (2.31)$$
so the average behavior is equivalent to a flexible mixture of dynamics Σ_i λ_i L_i. If there are enough neurons in a population, the average behavior will be a good approximation to the actual behavior. In this way, a group of rate-coded neurons can specify a flexible mixture of dynamics, and the rate of each neuron can be chosen so that control behaves in a risk-aware manner.
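The following sketch (Python/NumPy; the number of neurons, the randomly chosen operator strengths, and the normalization of the firing rates are illustrative assumptions) compares this spiking implementation, in which each neuron fires as a Bernoulli event with probability proportional to v^T L_i p, against the deterministic rate mixture of equation 2.19; both drive the belief toward the peak of the value function.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_neurons, dt, steps = 31, 200, 0.01, 400
x = np.linspace(-1.0, 1.0, N)
v = -x ** 2                                  # value function: prefer states near x = 0

def shift_operator(N, direction, k):
    L = np.zeros((N, N))
    for j in range(N):
        i = j + direction
        if 0 <= i < N:
            L[i, j], L[j, j] = k, -k
    return L

# Each neuron owns a push-left or push-right operator of random strength.
dirs = rng.choice([-1, 1], size=n_neurons)
gains = rng.uniform(0.1, 1.0, size=n_neurons)
L_ops = [shift_operator(N, d, k) for d, k in zip(dirs, gains)]

def simulate(spiking):
    p = np.zeros(N)
    p[3] = 1.0                               # belief starts far to the left
    for _ in range(steps):
        rates = np.array([max(0.0, float(v @ (L @ p))) for L in L_ops])
        rates = rates / (rates.max() + 1e-12)        # scale rates into [0, 1]
        if spiking:
            u = rng.random(n_neurons) < rates        # Bernoulli "spikes", u in {0, 1}
        else:
            u = rates                                # deterministic mixture weights
        dp = sum(ui * (L @ p) for ui, L in zip(u, L_ops) if ui > 0)
        p = np.clip(p + dt * dp, 0.0, None)
        p /= p.sum()
    return p

p_spike, p_rate = simulate(True), simulate(False)
print("mean state, spiking neurons:", round(float(x @ p_spike), 3))
print("mean state, rate mixture:   ", round(float(x @ p_rate), 3))   # both end near x = 0
```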

This is not the only possible way to map the firing of a neural population onto dynamics. It is possible that particular patterns of neural firing are associated with specific dynamics instead of the simple additive model in equation 2.19. In this case, we would need to consider the general form ṗ = L(u)p, where u is the full pattern of neural firing, and recognize that L depends on the details of the pattern, including spike timing and spike coincidence. The advantage of the additive model is that each neuron can “decide” whether to fire using only local information, whereas in the full pattern model, the decision depends on knowledge and control of all other neurons in the population. Response to and selection of specific patterns of cortical activity seems to be one of the features of basal ganglia (Mink, 1996), so perhaps the basal ganglia would be responsible for identifying the correct dynamics based on cortical population patterns.

2.9  Learning

There are many aspects to learning that must be considered. An important issue is the sensory representation itself. For instance, if x takes only a set of discrete values x_1, …, x_N, then a set of N neurons can represent p if each neuron i fires with a rate proportional to p(x_i). Distributed coding of the probability density of state becomes very costly if there are more than a few relevant dimensions of the state or many possible values of the state. Therefore, learning efficient representations that reduce dimensionality or efficiently encode data in individual dimensions is very important for storage and to improve learning. For instance, p can be represented by a low-dimensional vector of coefficients r, where r_i = ∫ f_i(x) p(x) dx for tuning curves f_i(x). In this case, the ith neuron fires with rate r_i. v may have a population representation, and for simplicity, I will assume that v uses the same tuning curves as p. I will not discuss sensory representation learning here but will focus instead on learning the sensory-motor transformations that are at the heart of neural control.
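As a sketch of such a representation (Python/NumPy; the Gaussian tuning curves, their spacing, and the least-squares decode are illustrative choices, not part of the formulation above), a density can be encoded as the vector of rates r_i = ∫ f_i(x) p(x) dx and then approximately recovered from those rates.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
dx = x[1] - x[0]

# Gaussian tuning curves f_i(x) with evenly spaced centers.
centers = np.linspace(-1.0, 1.0, 21)
F = np.exp(-0.5 * ((x[None, :] - centers[:, None]) / 0.12) ** 2)    # 21 x 201

# A bimodal belief density over the state.
p = np.exp(-0.5 * ((x + 0.4) / 0.12) ** 2) + 0.5 * np.exp(-0.5 * ((x - 0.5) / 0.12) ** 2)
p /= p.sum() * dx

rates = F @ p * dx       # r_i = integral of f_i(x) p(x) dx: the population code for p

# Linear decode: model p as a combination of the same tuning curves and solve for it.
coeffs = np.linalg.lstsq(F @ F.T * dx, rates, rcond=None)[0]
p_hat = F.T @ coeffs

print("number of rates:", rates.size)
print("L1 reconstruction error:", round(float(np.abs(p - p_hat).sum() * dx), 4))
```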

The main thing that must be learned for motor control is the effect of individual neurons on the dynamics. This is described by the operator L, which for our purposes is a matrix. We will continue to assume that L connects ṗ and p, but we note that for appropriately chosen alternative bases f_i, there may be a different linear transformation that connects the low-dimensional representation of ṗ to the low-dimensional representation of p.

Since ṗ = Lp, the easiest way to learn L is to observe p and ṗ and use a standard Hebbian-like algorithm, such as the Widrow-Hoff least-mean-squares (LMS) algorithm (Widrow & Hoff, 1960), to learn the mapping. This is very easy because of the distributed nature of the representation of p. As noted above, L is linear, and therefore a linear network suffices to learn it. A simple matrix learning algorithm similar to LMS is
$$\hat{\dot{p}} = \hat{L}\, p, \qquad (2.32)$$
$$\Delta\hat{L} = \eta\, (\dot{p} - \hat{L}\, p)\, p^T, \qquad (2.33)$$
where p and ṗ are to be interpreted as vectors of firing rates of the neural populations. Here L̂p is the predicted value of ṗ, (ṗ − L̂p) is the prediction error, and η is the learning rate.
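A sketch of this learning rule (Python/NumPy) under an assumed “true” operator, here the push-left matrix of equation 2.8; the use of delta-function training beliefs is a simplification so that each trial updates one column of the estimate directly, and any family of beliefs that covers the state space would serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(0)
N, eta = 21, 0.5

def push_left(N, k=1.0):
    L = np.zeros((N, N))
    for j in range(1, N):
        L[j - 1, j], L[j, j] = k, -k
    return L

L_true = push_left(N)        # the (unknown) effect of activating this control
L_hat = np.zeros((N, N))     # prior: the control does nothing

for trial in range(2000):
    p = np.zeros(N)
    p[rng.integers(N)] = 1.0                     # a delta-function belief at a random state
    p_dot = L_true @ p                           # observed change in the density
    pred = L_hat @ p                             # predicted change (eq. 2.32)
    L_hat += eta * np.outer(p_dot - pred, p)     # LMS update (eq. 2.33)

print("largest error in the learned operator:", float(np.abs(L_true - L_hat).max()))
```

As the text notes, the estimate is accurate only for states that have been visited; columns of the estimate corresponding to unvisited states would remain at their prior value of zero.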

L_u has to be learned for each control variable u so that the correct control can be chosen. This requires considerable storage, but the subsequent calculation of v^T L_u p is then quite fast. The estimate of L_u will improve with practice. Initially it will be zero, which means that our prior assumption about the effect of u is that it does nothing. Then, as we exert the control under different conditions x, we gradually learn the results and refine the estimate of L_u. This is equivalent to learning an internal model, and it will be most accurate near values of the state that have been visited often.

In order to control the system, the value function v is substituted for the observed change in state ṗ, and rather than using L_u to predict ṗ, we instead use it to predict the change in value v^T L_u p that would result from activating u in state p. Since this will result in reflex responses to perturbations, those reflex responses will be gradually modified as the dynamics L are learned. It is interesting to speculate whether particular patterns of control early in learning reflect this process. For instance, if the L matrix early in learning indicates significant uncertainty about the effect of control, stabilization against costly perturbations might require much higher activation levels than later in learning. This would be an explanation for high muscle tone and co-contraction during early phases of learning: they represent the appropriate response to uncertainty about the true dynamics (Osu et al., 2002; Thoroughman & Shadmehr, 1999).

3  Examples

The following simplified examples are meant to illustrate use of the theory. We do not perform head-to-head comparisons against existing algorithms because we do not expect to outperform existing algorithms. In particular, optimal control and optimal feedback control are in fact optimal with respect to their assumptions. Risk-aware control is an alternative formulation that may have applicability in a wider range of stochastic adaptive optimization problems, but it is primarily intended as a computational model of some aspects of biological motor control under risk.

Driving provides a good example of risk-aware control. During driving, the cost function is immediately visible since the road itself defines the safe and unsafe regions. Since the road can curve or obstacles can occur, the cost function is time varying, and the control system must respond to its variations. Although this is an optimization task since the vehicle must be steered to achieve minimum cost, it is not a Bellman-type optimization problem since the cost function is immediately visible and does not have to be derived from future reward. It is therefore a one-step look-ahead problem equivalent to stochastic feedback control.

Figure 2 shows an example of following a curved road. All points on the road have equal value, and therefore the vehicle can drive anywhere on the road. Since the vehicle has inertia, it tends to drift toward the side of the road before corrective control is applied. The controller uses 200 neurons, half of which push the vehicle to the left, the other half to the right. Each neuron exerts a brief force pulse, and all neurons have different randomly chosen strengths. In addition, noise is added to each force pulse so that the result of firing a neuron is only partly predictable. Therefore, the Li operator for the ith neuron implements the function
$$\ddot{x} = f_i + g_i\, w(t), \qquad (3.1)$$
where f_i is the force exerted by neuron i, g_i is the standard deviation of the force, and w(t) is unit-variance white noise. In Figure 2 the vertical axis is the state x, each vertical slice of the road is the cost function at that time point (shown in yellow), and the overlaid colors show the algorithm's estimated probability density at that time (red is high probability, blue is low).
Figure 2:

Simulation of risk-aware control for driving on a curved road. State is the vertical axis, and time increases to the right. Cost function is a single-lane road shown in yellow. The value of p(x, t) that reflects the controller and the estimate of state is shown as a colored overlay heat map, with lower regions in blue and higher regions in red. Note that the controller remains on the road but does not attempt to stay in the center.


Figure 3 shows the effect of uncertainty. In this case, the road has two lanes. Driving on the lanes has the lowest cost, driving in the center divider has higher cost, and driving off the edges has the highest cost. In the low-noise case, the correct answer is to drive on one lane or the other. But in the high-noise case, the correct answer is to drive in the center divider in order to avoid falling off either edge of the road. Risk-aware control solves this optimization problem using its feedback law rather than a standard optimizer. In other words, feedback control in which each neural firing rate is proportional to v^T L_i p automatically causes the change of behavior from driving in one lane to driving on the divider as the noise increases. The neurons and control system are the same as for Figure 2. The only difference here is the bimodal cost function due to the two possibilities on the road. Note that most linear optimization algorithms or feedback controllers would automatically drive in the center divider, even in the low-noise case.
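The lane-versus-divider effect can be seen directly in the expected value v^T p. The sketch below (Python/NumPy; the two-lane value profile and the Gaussian beliefs standing in for the low- and high-noise conditions are illustrative) evaluates the expected value of beliefs centered at each position and reports the best position for a narrow and a broad belief; the feedback law of equation 2.20 climbs this same expected-value landscape, so it produces the corresponding switch in behavior.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 401)
dx = x[1] - x[0]

# Two-lane road: the lanes have the lowest cost, the center divider is worse,
# and driving off either edge is worst of all.
v = np.full_like(x, -10.0)                        # off the road
v[np.abs(x) < 0.6] = -2.0                         # center divider (lanes overwritten below)
v[(np.abs(x) > 0.2) & (np.abs(x) < 0.6)] = 0.0    # the two lanes

def best_position(sigma):
    """Center whose Gaussian belief of width sigma maximizes the expected value v^T p."""
    expected = []
    for c in x:
        p = np.exp(-0.5 * ((x - c) / sigma) ** 2)
        p /= p.sum() * dx
        expected.append(float((v * p).sum() * dx))
    return float(x[int(np.argmax(expected))])

print(f"low uncertainty  -> best position x = {best_position(0.05):+.2f}  (middle of a lane)")
print(f"high uncertainty -> best position x = {best_position(0.40):+.2f}  (the center divider)")
```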

Figure 3:

Simulation of risk-aware control for driving on a two-lane road. State is the left-right axis, and time increases into the page. Cost function is a two-lane road shown in yellow and green, and the same at all points in time. The surface shows the value of p(x, t) that reflects the controller and the estimate of state. (a) Controller drives on one side of the road in the low-noise condition. (b) Controller drives in the center divider in the high-noise condition.


Figure 4 shows the effect of sudden changes in the value function as well as the effect of perturbations. As in the previous figures, 200 neurons attempt to keep the vehicle on the road. Here, the value of Li for each neuron was not assumed to be known in advance, but was learned from examples (using equation 2.33) prior to performance of the test shown in the figure. In this case, the road is much narrower than the variability, so the vehicle is frequently off the road. Nevertheless, risk-aware control causes the vehicle to track the curve in the road that occurs halfway. In Figure 4b, the spikes are shown, and we can see a change in the population of spiking neurons that reflects the movement of the vehicle to follow the road. At three-fourths of the way to the end, a perturbation suddenly pulls the vehicle off the road. The algorithm detects the deviation from optimal cost and rapidly corrects. This is seen in the “reflex” firing of a large subpopulation of the neurons that pushes the vehicle back on the road and holds it there.

Figure 4:

Simulation of risk-aware control following learning. (a) State is the vertical axis, and time increases to the right. Cost function is a yellow road that requires a sudden shift. The colored underlay shows the value of p(x, t) that reflects the controller and the estimate of state. There is an external perturbation that occurs at three-fourths of the maximum time. (b) Neural control. Time increases to the right, and each row shows the firing times (as black dots) of a different neuron. To follow the turn in the road, some neurons turn off, while others turn on. At the time of the perturbation, the neural response reflects the perturbation and causes the reflex corrective movement. The L_i matrix for each neuron was learned from examples prior to its use for control.


4  Conclusion

Robots are designed to move in predictable and controlled environments. They often fail when placed in new, unknown, or varying environments. Yet humans adapt rapidly and safely to environments they have never seen before. A human is rarely injured the first time on a trampoline, skiing, or fencing. We know to take appropriate precautions, and we know automatically how to move in order to protect ourselves in new and uncertain situations.

Risk-aware control provides a new way of thinking about the relationship between the cost and execution of motor actions. The risk is determined by the interaction of cost, variability, and uncertainty. Risk is always taken into account, whether it is known during movement planning or whether it arises suddenly and unexpectedly. The sources of uncertainty include variability in the effect of actions, external perturbations, and lack of knowledge of the current state.

Risk-aware control is a feedback controller based on current estimates of the value of states. It does not solve long-term Bellman-type optimization problems. Bellman optimization has been used successfully in noisy systems, and iterative solutions that include obstacle avoidance have been proposed (Hamilton & Wolpert, 2002). Risk-aware control does not solve such problems, but it also does not require iterative or annealed solvers. One solution for long-term optimization using risk-aware control is to use a Bellman-type optimization as an initial step, so that the cost function reflects future discounted reward. Another solution is to derive stochastic operators L that predict the future change in state p, although the uncertainty of these predictions would necessarily increase and linear superposition is not guaranteed. But even without such extensions, risk-aware control solves the short-term optimization problem, and aspects of its behavior would previously have been considered to represent the properties of an optimizing controller rather than a feedback controller.

The representation of risk-aware control has several novel elements that distinguish it from standard feedback control. Rather than use scalar variables for the state, it uses functions that indicate the probability of state. This allows consideration of multiple possibilities at all times, and it maps naturally to distributed neural representations. Instead of a single reference trajectory, risk-aware control uses a time-varying cost function. This allows ambiguous or “don’t care” conditions. For example, where on the road you drive may not matter so long as you stay on the road. It also allows for multiple safe regions, perhaps separated by unsafe regions, and it allows for representation of relative cost or value of different possible paths.

The controller can be implemented in a number of different ways, but it is most natural to use the superimposed dynamic controllers described here (Sanger, 2010a, 2010b). The choice of dynamics not only automatically implements the unperturbed control path but also specifies the reflex behavior that will correct for potential perturbations. The controller includes an estimate of the likelihood of perturbations (embedded in the L operator), and this estimate guides the choice of control to ensure that expected perturbations are minimized. There is thus a very close link between risk-aware feedback control and tunable reflexes. As a model of biological control, the superimposed dynamic controllers reflect the fact that the force generated by muscles depends on their length and velocity and that the force resulting from activation of any particular motor neuron depends on many different factors, not all of them predictable. In standard feedback control, state affects behavior only through the feedback loop, whereas in operator superposition, state can also affect behavior by modulating the effects of output commands.

It is interesting to contrast risk-aware control with optimal feedback control (Todorov & Jordan, 2002). OFC instantaneously calculates the optimal trajectory from the current estimated state, based on a cost function that specifies the value of future states. Because of the complexity of the equations, this is typically done in a deterministic environment and has difficulty taking into account uncertainty and perturbations. OFC requires high computational power, and there is no known neural implementation. But the greatest difference is that in risk-aware control, the dynamics are preplanned, so that the response to perturbation is determined before the perturbation occurs. OFC does not plan for unexpected perturbations but instead responds to perturbations only after they happen. The effect of the two different controllers may be the same, since both are capable of responding optimally to perturbations in the presence of known cost functions. In particular, both will resist perturbation only in dimensions that significantly affect task performance. Risk-aware control implements combinations of reflexes that anticipate possible future perturbation or uncertainty. This allows risk-aware control to take advantage of the natural properties of muscles in order to alter stiffness and viscosity ahead of time by cocontracting or otherwise predictively changing the impedance. But because OFC can solve Bellman-type optimization problems, it can generate much more complex responses that are associated with higher-order task goals. This avoids the potential for failure of risk-aware control that can occur when the short-term value of states is a poor predictor of long-term value, either because it is incorrect or because there is insufficient knowledge to estimate current value. It seems likely that some combination of preplanned anticipatory reflexes and online reoptimization is used to generate flexible human motor behavior.

The ideas presented here are closely related to concepts developed within the framework of active inference (Friston, 2011; Friston & Ao, 2012). In active inference, movement is specified by the probability density over desired states, and motor commands are issued that minimize the difference between the desired (predicted) probability and the actual (estimated) probability. The feedback control law in our formulation (see equation 2.20) sets each control variable u_i to be the inner product of the value function v and the predicted change in state L_i p. This can be interpreted as trying to select commands that make the predicted change in state most closely match the value function. Therefore the value function can be interpreted as the desired change in state, which closely matches the concept behind active inference. Furthermore, in the special case where the set of controllers is such that L_i p and L_j p are orthogonal for i ≠ j, are normalized, and span the full space, then the predicted response to the feedback control is
$$\dot{p} = \sum_i \big(v^T L_i\, p\big)\, L_i\, p = v, \qquad (4.1)$$
so the learning rule 2.33 is equivalent to
$$\Delta\hat{L} = \eta\, (\dot{p} - v)\, p^T, \qquad (4.2)$$
which shows that learning reaches equilibrium when the value function (or desired (predicted) change in probability) is equal to the actual change in probability. This is again very similar to the principle in active inference, in which learning of the dynamics (forward model) depends on the difference between the predicted and actual outcomes. Thus, active inference generalizes some of the ideas presented here, and it allows the results here to be cast as an inference problem, in which the goal is to match the predicted to the actual change in density over states. It also provides an alternative neural mechanism (Shipp, Adams, & Friston, 2013).
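A short numerical check of equation 4.1 (Python/NumPy; the operators below are generic rank-one maps constructed only so that the vectors L_i p form an orthonormal basis, and they ignore the probability-flow constraints of section 2.2, since the point here is just the projection identity): when the L_i p are orthonormal and span the space, the feedback law of equation 2.20 reproduces v exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6

p = rng.random(N)
p /= p.sum()

# Choose an orthonormal basis Q and define L_i as the rank-one map sending p to Q[:, i],
# so that the vectors L_i @ p are orthonormal and span the whole space.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
L_ops = [np.outer(Q[:, i], p) / float(p @ p) for i in range(N)]

v = rng.standard_normal(N)                        # an arbitrary value function

u = np.array([float(v @ (L @ p)) for L in L_ops])     # feedback law of eq. 2.20
p_dot_pred = sum(ui * (L @ p) for ui, L in zip(u, L_ops))

print(np.allclose(p_dot_pred, v))    # True: the predicted change in p equals v (eq. 4.1)
```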

Risk-aware control allows for safe behavior in an unpredictable environment. An important prediction of the theory is that behavior will be modified by perceived risk even if failure has not yet been experienced. This is particularly important for failures with permanent or life-threatening consequences; such failures must be avoided, and this avoidance cannot occur by learning through experience. It is interesting to speculate that the natural variability in human movement may be a way that our bodies allow us to experience perturbations that the environment has not yet applied or that might occur only very rarely. This would allow us to learn the effect of perturbations while not having to experience them in a truly dangerous environment. This might be particularly important for infant learning; although the infant is in a completely safe (and generally unperturbed) environment, it is nevertheless important for him or her to learn how to handle the less predictable natural environment that he or she will soon occupy.

Awareness of risk guides all of our actions, and this is essential for our survival in an unpredictable and potentially hostile environment. Flexibility in the face of a changing environment is characteristic of humans but not robots. Risk-aware control provides a computational model for a nonlinear stochastic feedback controller with a neural implementation that mimics the flexibility and responsiveness of human motor behavior under uncertainty.

Acknowledgments

Support for this project was provided by the National Institute of Neurological Disorders and Stroke (NS069214) and the James S. McDonnell Foundation.

References

Bellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Bizzi, E., Accornero, N., Chapple, W., & Hogan, N. (1984). Posture control and trajectory formation during arm movement. J. Neurosci., 4, 2738–2744.
Bouisset, S., & Do, M. (2008). Posture, dynamic stability, and voluntary movement. Neurophysiol. Clin., 38(6), 345–362.
Chow, G. P. (1976). Analysis and control of dynamic economic systems. New York: Wiley.
Davis, M., & Vinter, R. (1985). Stochastic modelling and control. London: Chapman and Hall.
Friston, K. J. (2011). What is optimal about motor control? Neuron, 72(3), 488–498.
Friston, K. J., & Ao, P. (2012). Free energy, value, and attractors. Comput. Math. Methods Med., 2012, 937860.
Galerkin, B. (1915). On electrical circuits for the approximate solution of the Laplace equation. Vestnik Inzh., 19, 897–908.
Gardiner, C. (1985). Stochastic methods (4th ed.). Berlin: Springer.
Hamilton, A., & Wolpert, D. (2002). Controlling the statistics of action: Obstacle avoidance. J. Neurophys., 87, 2434–2440.
Ma, W. J., Beck, J. M., Latham, P. E., & Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11), 1432–1438.
Massion, J. (1992). Movement, posture and equilibrium: Interaction and coordination. Prog. Neurobiol., 38(1), 35–56.
Mink, J. W. (1996). The basal ganglia: Focused selection and inhibition of competing motor programs. Progress in Neurobiology, 50(4), 381–425.
Nagai, H. (1996). Bellman equations of risk sensitive control. SIAM J. Control Optim., 34, 74–101.
Osu, R., Franklin, D. W., Kato, H., Gomi, H., Domen, K., Yoshioka, T., & Kawato, M. (2002). Short- and long-term changes in joint co-contraction associated with motor learning as revealed from surface EMG. Journal of Neurophysiology, 88(2), 991–1004.
Pouget, A., Dayan, P., & Zemel, R. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1(2), 125–132.
Sanger, T. D. (1998). Probability density methods for smooth function approximation and learning in populations of tuned spiking neurons. Neural Computation, 10(6), 1567–1586.
Sanger, T. D. (2010a). Neuro-mechanical control using differential stochastic operators. In Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE (pp. 4494–4497). Piscataway, NJ: IEEE.
Sanger, T. D. (2010b). Controlling variability. Journal of Motor Behavior, 42(6), 401–407.
Sanger, T. D. (2011). Distributed control of uncertain systems using superpositions of linear operators. Neural Comput., 23(8), 1911–1934.
Shipp, S., Adams, R. A., & Friston, K. J. (2013). Reflections on agranular architecture: Predictive coding in the motor cortex. Trends Neurosci., 36(12), 706–716.
Thoroughman, K. A., & Shadmehr, R. (1999). Electromyographic correlates of learning an internal model of reaching movements. Journal of Neuroscience, 19(19), 8573–8588.
Todorov, E., & Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nat. Neurosci., 5, 1226–1235.
Vincent, T. L., & Grantham, W. J. (1999). Nonlinear and optimal control systems. New York: Wiley.
Whittle, P. (1990). Risk-sensitive optimal control. New York: Wiley.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. Belvoir, VA: Defense Technical Information Center.