## Abstract

This article describes a process theory based on active inference and belief propagation. Starting from the premise that all neuronal processing (and action selection) can be explained by maximizing Bayesian model evidence—or minimizing variational free energy—we ask whether neuronal responses can be described as a gradient descent on variational free energy. Using a standard (Markov decision process) generative model, we derive the neuronal dynamics implicit in this description and reproduce a remarkable range of well-characterized neuronal phenomena. These include repetition suppression, mismatch negativity, violation responses, place-cell activity, phase precession, theta sequences, theta-gamma coupling, evidence accumulation, race-to-bound dynamics, and transfer of dopamine responses. Furthermore, the (approximately Bayes’ optimal) behavior prescribed by these dynamics has a degree of face validity, providing a formal explanation for reward seeking, context learning, and epistemic foraging. Technically, the fact that a gradient descent appears to be a valid description of neuronal activity means that variational free energy is a Lyapunov function for neuronal dynamics, which therefore conform to Hamilton’s principle of least action.

## 1 Introduction

There has been a paradigm shift in the cognitive neurosciences over the past decade toward the Bayesian brain and predictive coding (Ballard, Hinton, & Sejnowski, 1983; Rao & Ballard, 1999; Knill & Pouget, 2004; Yuille & Kersten, 2006; De Bruin & Michael, 2015). At the same time, there has been a resurgence of enactivism, emphasizing the embodied aspect of perception (O’Regan & Noë, 2001; Friston, Mattout, & Kilner, 2011; Ballard, Kit, Rothkopf, & Sullivan, 2013; Clark, 2013; Seth, 2013; Barrett & Simmons, 2015; Pezzulo, Rigoli, & Friston, 2015). Even in consciousness research and philosophy, related ideas are finding traction (Clark, 2013; Hohwy, 2013, 2014). Many of these developments have informed (and have been informed by) a variational principle of least free energy (Friston, Kilner, & Harrison, 2006; Friston, 2012), namely, active (Bayesian) inference.

However, the enthusiasm for Bayesian theories of brain function is accompanied by an understandable skepticism about their usefulness, particularly in furnishing testable process theories (Bowers & Davis, 2012). Indeed, one could argue that many current normative theories fail to provide detailed and physiologically plausible predictions about the processes that might implement them. And when they do, their connection with a normative or variational principle is often obscure. In this work, we show that process theories can be derived in a relatively straightforward way from variational principles. The level of detail we consider is fairly coarse; however, the explanatory scope of the resulting process theory is remarkable—and provides an integrative (and simplifying) perspective on many phenomena that are studied in systems neuroscience. The aim of this article is to describe the basic ideas and illustrate the emergent processes using simulations of neuronal responses. We anticipate revisiting some issues in depth: in particular, a companion paper focuses on learning and the emergence of habits as a natural consequence of observing one’s own behavior (Friston et al., 2016).

This article has three sections. The first describes active inference, combining earlier formulations of planning as inference (Botvinick & Toussaint, 2012; Friston et al., 2014) with Bayesian model averaging (FitzGerald, Dolan, & Friston, 2014) and learning (FitzGerald, Dolan, & Friston, 2015). Importantly, action (i.e., policy selection), perception (i.e., state estimation), and learning (i.e., reinforcement learning) all minimize the same quantity: variational free energy. This refinement of previous schemes considers an explicit representation of past and future states, conditioned on competing policies. This leads to Bayesian belief updates that are informed by beliefs about the future (prediction) and context learning that is informed by beliefs about the past (postdiction). Technically, these updates implement a form of Bayesian smoothing, with explicit representations of states over time, which include future (i.e., counterfactual) states. Furthermore, the implicit variational updates have some biological plausibility in the sense that they eschew neuronally implausible computations. For example, expectations about future states are sigmoid functions of linear mixtures of the preceding and subsequent states. An alternative parameterization, which did not appeal to explicit representations over time, would require recursive matrix multiplication, for which no neuronally plausible implementation has been proposed. Under this belief parameterization, learning is mediated by classical associative (synaptic) plasticity. The remaining sections use simulations of foraging in a radial maze to illustrate some key aspects of inference and learning, respectively.

The inference section describes the behavioral and neuronal correlates of belief updating during inference or planning, with an emphasis on electrophysiological correlates and the encoding of precision by dopamine. It illustrates a number of phenomena that are ubiquitous in empirical studies. These include repetition suppression (de Gardelle, Waszczuk, Egner, & Summerfield, 2013), violation and omission responses (Bendixen, SanMiguel, & Schroger, 2012), and neuronal responses that are characteristic of the hippocampus, namely, place cell activity (Moser, Rowland, & Moser, 2015), theta-gamma coupling, theta sequences, and phase precession (Burgess, Barry, & O’Keefe, 2007; Lisman & Redish, 2009). We also touch on dynamics seen in parietal and prefrontal cortex, such as evidence accumulation and race-to-bound or threshold (Huk & Shadlen, 2005; Gold & Shadlen, 2007; Hunt et al., 2012; Solway & Botvinick, 2012; de Lafuente, Jazayeri, & Shadlen, 2015; FitzGerald, Moran, Friston, & Dolan, 2015; Latimer, Yates, Meister, Huk, & Pillow, 2015).

The final section considers context learning and illustrates the transfer of dopamine responses to conditioned stimuli, as agents become familiar with experimental contingencies (Fiorillo, Tobler, & Schultz, 2003). We conclude with a brief demonstration of epistemic foraging. The aim of these simulations is to illustrate how all of the phenomena emerge from a single imperative (to minimize free energy) and how they contextualize each other.

## 2 Active Inference and Learning

This section provides a brief overview of active inference that builds on our previous treatments of Markov decision processes. Specifically, it introduces a parameterization of posterior beliefs about the past and future that makes state estimation (i.e., belief updating) biologically plausible. (A slightly fuller version of this material can be found in Friston et al., 2016.) Active inference is based on the premise that everything minimizes variational free energy (Friston, 2013). This leads to some surprisingly simple update rules for action, perception, policy selection, learning, and the encoding of uncertainty or its complement, precision. Although some of the intervening formalism looks complicated, what comes out at the end are update rules that will be familiar to many readers (e.g., integrate-and-fire dynamics with sigmoid activation functions and plasticity with associative and decay terms). This means that the underlying theory can be tied to neuronal processes in a fairly straightforward way. Furthermore, the formalism accommodates a number of established normative approaches, thereby providing an integrative framework.

In principle, the scheme described in this section can be applied to any paradigm or choice behavior. Indeed, earlier versions have been used to model waiting games (Friston et al., 2013), the urn task and evidence accumulation (FitzGerald, Schwartenbeck, Moutoussis, Dolan, & Friston, 2015), trust games from behavioral economics (Moutoussis, Trujillo-Barreto, El-Deredy, Dolan, & Friston, 2014; Schwartenbeck, FitzGerald, Mathys, Dolan, Kronbichler et al., 2015), addictive behavior (Schwartenbeck, FitzGerald, Mathys, Dolan, Wurst et al., 2015), two-step maze tasks (Friston, Rigoli et al., 2015), and engineering benchmarks such as the mountain car problem (Friston, Adams, & Montague, 2012). It has also been used in the setting of computational fMRI (Schwartenbeck, FitzGerald, Mathys, Dolan, & Friston, 2015).

In brief, active inference separates the problems of optimizing action and perception by assuming that action fulfills predictions based on inferred states of the world. Optimal predictions are therefore based on (sensory) evidence that is evaluated using a generative model of (observed) outcomes. This allows one to frame behavior as fulfilling optimistic predictions, where the optimism is prescribed by prior preferences or goals (Friston et al., 2014). In other words, action realizes predictions that are biased toward preferred outcomes. More specifically, the generative model entails beliefs about future states and policies, where policies that lead to preferred outcomes are more likely. This enables action to realize the next (proximal) outcome predicted by the policy that leads to (distal) goals. This behavior emerges when action and inference maximize the evidence or marginal likelihood of the model generating predictions. Note that action is prescribed by predictions of the next outcome and is not itself part of the inference process. This separation of action and perceptual inference or state estimation can be understood by associating action with peripheral reflexes in the motor system that fulfill top-down motor predictions about how we move (Feldman, 2009; Adams, Shipp, & Friston, 2013).

The models considered in this article include states of the world in the past and the future. This enables agents to select policies that will maximize model evidence in the future by minimizing expected free energy. Furthermore, it enables learning about contingencies based on state transitions that are inferred retrospectively. We will see that this leads to a Bayes-optimal arbitration between epistemic (explorative) and pragmatic (exploitative) behavior that is formally related to several established ideas (e.g., the infomax principle, Bayesian surprise, the value of information, artificial curiosity, and expected utility theory).

We start by describing the generative model on which predictions and actions are based. We then describe how action is specified by beliefs about states of the world under different policies. The section concludes by considering the optimization of these beliefs through Bayesian belief updating and implicit neuronal processing.

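The parameters of categorical distributions over discrete states $s \in \{0, 1\}$ are denoted by column vectors of expectations $\mathbf{s} \in [0, 1]$, where the $\sim$ notation denotes sequences of variables over time, for example, $\tilde{s} = (s_1, \ldots, s_T)$. The entropy of a probability distribution $P(s) = \Pr(S = s)$ is denoted by $H(S) = H[P(s)] = E_P[-\ln P(s)]$, while the relative entropy or Kullback-Leibler (KL) divergence is denoted by $D[Q(s) \,\|\, P(s)] = E_Q[\ln Q(s) - \ln P(s)]$. Inner and outer products are indicated by $A \cdot B = A^T B$ and $A \otimes B = A B^T$, respectively. We use a hat notation to denote (natural) logarithms. Finally, $P(o \mid s) = \mathrm{Cat}(\mathbf{A})$ implies $\Pr(o = i \mid s = j) = \mathbf{A}_{ij}$. Definitions of the variables referred to are in Table 1.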

| Expression | Description |
|---|---|
| | Outcomes, their posterior expectations and logarithms |
| | Sequences of outcomes until the current time point |
| | Hidden states and their posterior expectations and logarithms, conditioned on each policy |
| | Sequences of hidden states until the end of the current trial |
| | Policies specifying action sequences, their posterior expectations, and logarithms |
| | Action or control variables |
| | Likelihood matrix mapping from hidden states to outcomes and its expected logarithm |
| | Transition probability for hidden states under each action prescribed by a policy at a particular time and its logarithm |
| | Prior expectation of the hidden state at the beginning of each trial |
| | Logarithm of prior preference over outcomes or utility |
| | Variational free energy for each policy |
| | Expected free energy for each policy |
| | Bayesian model average of hidden states over policies |
| | The vector encoding the entropy or ambiguity over outcomes for each hidden state |
| | Expected outcome probabilities for each hidden state and their expected logarithms |


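Active inference rests on the tuple $(O, P, Q, R, S, T, U)$:

- A finite set of outcomes $O$
- A finite set of control states or actions $U$
- A finite set of hidden states $S$
- A finite set of time-sensitive policies $T$
- A generative process $R(\tilde{o}, \tilde{s}, \tilde{u})$ that generates probabilistic outcomes $o \in O$ from (hidden) states $s \in S$ and action $u \in U$
- A generative model $P(\tilde{o}, \tilde{s}, \pi, \eta)$ with parameters $\eta$, over outcomes, states, and policies $\pi \in T$, where $\pi$ returns a sequence of actions $u_t = \pi(t)$
- An approximate posterior $Q(\tilde{s}, \pi, \eta)$ over states, policies, and parameters with expectations $(\mathbf{s}_0^\pi, \ldots, \mathbf{s}_T^\pi, \boldsymbol{\pi}, \boldsymbol{\eta})$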

The generative process describes transitions among states in the world that generate observed outcomes. These states are referred to as hidden because they cannot be observed directly. Their transitions depend on action, which depends on posterior beliefs about the next state. In turn, these beliefs are formed using a generative model of how observations are generated. The generative model describes what the agent believes about the world, where beliefs about hidden states and policies are encoded by expectations. Note the distinction between actions (that are part of the generative process in the world) and policies (that are part of the generative model of an agent). This distinction allows actions to be specified by beliefs about policies, effectively converting an optimal control problem into an optimal inference problem (Attias, 2003; Botvinick & Toussaint, 2012).

### 2.1 The Generative Model

The generative model is at the heart of (active) Bayesian inference. In simple terms, the generative model is just a way of formalizing beliefs about the way outcomes are caused. Usually a generative model is specified in terms of the likelihood of each outcome, given their causes and the prior probability of those causes. Inference then corresponds to inverting the model, which means computing the posterior probability of (unknown or hidden) causes, given observed outcomes. In approximate Bayesian inference, this entails optimizing an approximate posterior so that it minimizes variational free energy. In other words, the difficult problem of exact Bayesian inference is converted into an easy optimization problem, where the approximate posterior minimizes a (variational free energy) functional of observed outcomes, under a given generative model. We will see later that when variational free energy is minimized, it approximates the (negative) log evidence or marginal likelihood of the outcomes, namely, the probability of the outcomes under the generative model.

In this model, observations depend only on the current state, while state transitions depend on a policy or sequence of actions. This sequential policy is sampled from a Gibbs distribution or softmax function of expected free energy, with inverse temperature or precision $\gamma$. Here $\mathbf{G}$ is the free energy expected under each policy (see below). The role of the model parameters will be unpacked later, when we consider model inversion.
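As a minimal sketch of this prior over policies, the following treats the expected free energy of each policy as given and applies a softmax with a precision parameter. The function names and the numerical values are illustrative assumptions, not part of the scheme described in the article.

```python
import math

def softmax(x):
    """Normalized exponential, shifted by max(x) for numerical stability."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def policy_prior(G, gamma):
    """Prior over policies: softmax of negative expected free energy G,
    scaled by the precision (inverse temperature) gamma."""
    return softmax([-gamma * g for g in G])

# Hypothetical expected free energies for three policies (lower is better).
G = [2.0, 1.0, 3.0]
low_precision = policy_prior(G, 0.5)    # flatter distribution over policies
high_precision = policy_prior(G, 4.0)   # concentrates on the best policy
```

With low precision the agent is relatively indifferent among policies; with high precision the policy with the smallest expected free energy dominates.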

Note that the policy is a random variable that has to be inferred. In other words, the agent entertains competing hypotheses or models of its behavior in terms of policies. This contrasts with standard formulations in which a single state-action policy returns an action as a function of each state, $u = \pi(s)$, as opposed to time, $u_t = \pi(t)$. Furthermore, the approximate posterior is parameterized in terms of expected states under each policy. In other words, we assume that the agent keeps a separate record of expected states, in the past and future, for each allowable policy.

The predictions that guide action are based on a Bayesian model average of these policy-specific states. This means that expectations about policies (and their precision) also have to be optimized. All the posterior probabilities over model parameters, including the initial state, are Dirichlet distributions (FitzGerald, Dolan et al., 2015). The sufficient statistics of these distributions are concentration parameters that can be regarded as the number of occurrences encountered in the past. In what follows, we first describe how actions are selected, given beliefs about the hidden state of the world and the policies currently being pursued. We then turn to the more difficult problem of optimizing the beliefs on which action is based.

### 2.2 Behavior, Action, and Reflexes

This specification of action is considered reflexive by analogy to motor reflexes that minimize the discrepancy between proprioceptive signals (i.e., primary afferents) and descending motor commands or predictions. Heuristically, action realizes expected outcomes by minimizing the expected outcome prediction error (Adams et al., 2013). Expectations about the next outcome therefore enslave behavior. If we regard competing policies as models of behavior, the predicted outcome is formally equivalent to a Bayesian model average of outcomes, under posterior beliefs about policies (last equality in equation 2.3).

For simplicity, we assume the agent has learned the consequences of action. More complete schemes would incorporate learning the consequences of action by analogy with learning transitions among hidden states.
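The reflexive specification of action can be illustrated with a small sketch: the predicted next outcome is computed from a Bayesian model average of policy-specific states, and the action whose consequences diverge least from that prediction is selected. The two-state example, the matrix layout (`A[outcome][state]`, `B[action][next_state][state]`), and all function names are assumptions made for illustration.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def kl(q, p, eps=1e-16):
    """KL divergence D[q || p] between two categorical distributions."""
    return sum(qi * (math.log(qi + eps) - math.log(pi + eps))
               for qi, pi in zip(q, p))

def model_average(policy_probs, states_by_policy):
    """Bayesian model average of policy-specific state expectations."""
    n = len(states_by_policy[0])
    return [sum(p * s[i] for p, s in zip(policy_probs, states_by_policy))
            for i in range(n)]

def select_action(A, B, s_now, s_next):
    """Pick the action whose predicted outcome is closest to the
    outcome predicted by the model-averaged next state."""
    o_pred = matvec(A, s_next)
    costs = [kl(o_pred, matvec(A, matvec(B_u, s_now))) for B_u in B]
    best = min(range(len(B)), key=lambda u: costs[u])
    return best, costs

A = [[0.9, 0.1], [0.1, 0.9]]                 # likelihood A[outcome][state]
B = [[[1.0, 0.0], [0.0, 1.0]],               # action 0: stay
     [[0.0, 1.0], [1.0, 0.0]]]               # action 1: switch state
policy_probs = [0.2, 0.8]                    # posterior over two policies
s_now = model_average(policy_probs, [[1.0, 0.0], [1.0, 0.0]])
s_next = model_average(policy_probs, [[1.0, 0.0], [0.0, 1.0]])
action, costs = select_action(A, B, s_now, s_next)
```

Because the more probable policy expects a change of state, the switching action yields the smaller outcome prediction error and is selected.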

Having specified action selection in terms of expected outcomes, we now consider how these expectations are optimized. In active inference, there are no stimulus-response links found in conventional formulations: choices or actions are separated from inference in the same way that peripheral reflexes are separated from processing in the central nervous system. This means all behavior rests on optimizing beliefs or expectations about the next state of the world. These expectations furnish predictions of the next outcome that action simply fulfills. Following action, a new observation becomes available, and the perception-action cycle starts again.

### 2.3 Free Energy and Expected Free Energy

Because KL divergences cannot be less than zero, the penultimate equality in equation 2.4 means that free energy is minimized when the approximate posterior becomes the true posterior. At this point, the free energy becomes the negative log evidence for the generative model (Beal, 2003). This means that minimizing free energy is equivalent to maximizing model evidence, which is equivalent to minimizing the complexity of accurate explanations for observed outcomes (the last equality in equation 2.4).
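These identities can be checked numerically for a toy model with one discrete hidden state and a single observed outcome. The sketch below is an illustration under assumed numbers (a flat prior and an arbitrary likelihood); it is not drawn from the article's simulations.

```python
import math

def kl(q, p):
    """KL divergence D[q || p] between two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def free_energy(q, prior, lik):
    """F = E_q[ln q(s) - ln P(o, s)] for one observed outcome o,
    where lik[s] = P(o | s) and prior[s] = P(s)."""
    return sum(qi * (math.log(qi) - math.log(pi * li))
               for qi, pi, li in zip(q, prior, lik) if qi > 0)

prior = [0.5, 0.5]          # assumed prior over two hidden states
lik = [0.9, 0.2]            # assumed likelihood of the observed outcome
evidence = sum(p * l for p, l in zip(prior, lik))
posterior = [p * l / evidence for p, l in zip(prior, lik)]

q = [0.6, 0.4]              # an arbitrary approximate posterior
F = free_energy(q, prior, lik)

# Divergence from the true posterior minus log evidence.
divergence_form = kl(q, posterior) - math.log(evidence)

# Complexity minus accuracy.
complexity = kl(q, prior)
accuracy = sum(qi * math.log(li) for qi, li in zip(q, lik))

# Free energy at its minimum, where q equals the true posterior.
F_min = free_energy(posterior, prior, lik)
```

Both rearrangements return the same value of `F`, and `F_min` equals the negative log evidence, which is the lower bound on free energy.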

In the expected free energy, the relative entropy becomes the mutual information between hidden states and the outcomes they cause (and vice versa), while the log evidence becomes the log evidence expected under predicted outcomes. By associating the log-prior over outcomes with utility or prior preferences, $U(o_\tau) = \ln P(o_\tau)$, the expected free energy can also be expressed in terms of epistemic and extrinsic value (the penultimate equality in equation 2.5). This means that extrinsic value is the (log) evidence for a generative model expected under a particular policy. In other words, because our model of the world entails prior preferences, any outcomes that provide evidence for our model (and implicit preferences) have pragmatic or extrinsic value. In practice, utilities are defined only to within an additive constant, such that the prior probability of an outcome is a softmax function of utility: $P(o_\tau) = \sigma(U(o_\tau))$. This means prior preferences depend only on utility differences and are inherently context sensitive (Rigoli, Friston, & Dolan, 2016).

Epistemic value is the expected information gain (i.e., mutual information) afforded to hidden states by future outcomes and vice-versa.^{1} We will see below that epistemic value can be thought of as driving curiosity and novelty-seeking behavior, by which we resolve uncertainty and ignorance. A final rearrangement shows that complexity becomes expected cost—namely, the KL divergence between the posterior predictions and prior preferences—while accuracy becomes the accuracy expected under predicted outcomes (i.e., negative ambiguity). This last equality in equation 2.5 shows how expected free energy can be evaluated relatively easily; it is just the divergence between the predicted and preferred outcomes plus the ambiguity (i.e., entropy) expected under predicted states.
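The evaluation of expected free energy as risk plus ambiguity can be sketched for a single future time point. In the following illustration, the likelihood matrix, the prior preferences, and the predicted states under two hypothetical policies are all assumed values; the function name is likewise an assumption.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def expected_free_energy(A, s, C):
    """Risk plus ambiguity for predicted states s.
    A[o][s] is the likelihood, C[o] the prior preference over outcomes."""
    o = matvec(A, s)                                   # predicted outcomes
    # Risk: KL divergence between predicted and preferred outcomes.
    risk = sum(oi * math.log(oi / ci) for oi, ci in zip(o, C) if oi > 0)
    # Ambiguity: entropy of outcomes given each state, averaged over states.
    H = [-sum(row[j] * math.log(row[j]) for row in A if row[j] > 0)
         for j in range(len(s))]
    ambiguity = sum(sj * hj for sj, hj in zip(s, H))
    return risk + ambiguity, risk, ambiguity

A = [[0.9, 0.1], [0.1, 0.9]]    # assumed likelihood
C = [0.8, 0.2]                  # assumed preferences: outcome 0 is preferred

# Policy a is predicted to end in state 0, policy b in state 1.
G_a, risk_a, amb_a = expected_free_energy(A, [1.0, 0.0], C)
G_b, risk_b, amb_b = expected_free_energy(A, [0.0, 1.0], C)
```

Here the two policies are equally ambiguous, so the difference in expected free energy is due entirely to risk: the policy leading to the preferred outcome has the lower value.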

In summary, expected free energy is defined in relation to prior beliefs about future outcomes. These define the expected cost or complexity and complete the generative model. It is these priors that lend inference and action a purposeful or goal-directed aspect because they represent preferences or goals. These preferences define agents in terms of characteristic states they expect to occupy and, through action, tend to frequent.

There are several interpretations of expected free energy that appeal to and contextualize established constructs. For example, maximizing epistemic value is equivalent to maximizing (expected) Bayesian surprise (Schmidhuber, 1991; Itti & Baldi, 2009), where Bayesian surprise is the KL divergence between posterior and prior beliefs. This can also be interpreted in terms of the principle of maximum mutual information or minimum redundancy (Barlow, 1961; Linsker, 1990; Olshausen & Field, 1996; Laughlin, 2001). This is because epistemic value is the mutual information between hidden states and observations: $I(S, O) = H(S) - H(S \mid O)$. In other words, it reports the reduction in uncertainty about hidden states afforded by observations. Because the KL divergence or information gain cannot be less than zero, it disappears when the (predictive) posterior beliefs are not informed by new observations. Heuristically, this means that epistemic policies will search out observations that resolve uncertainty about the state of the world (e.g., foraging to locate prey or fixating on an informative part of a face, such as the eyes or mouth). However, when there is no posterior uncertainty and the agent is confident about the state of the world, there can be no further information gain, and epistemic value will be the same for all policies, allowing preferences to dictate action.
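The behavior of epistemic value in these limiting cases can be illustrated with a short sketch. It computes the mutual information between states and outcomes as the entropy of predicted outcomes minus the expected entropy of outcomes given states; the likelihood matrices and state expectations are assumed values chosen for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def epistemic_value(A, s):
    """Mutual information between hidden states and outcomes under
    predicted states s and likelihood A[outcome][state]."""
    n_outcomes, n_states = len(A), len(s)
    o = [sum(A[i][j] * s[j] for j in range(n_states))
         for i in range(n_outcomes)]
    expected_ambiguity = sum(
        s[j] * entropy([A[i][j] for i in range(n_outcomes)])
        for j in range(n_states))
    return entropy(o) - expected_ambiguity

informative = [[0.9, 0.1], [0.1, 0.9]]   # outcomes discriminate states
flat = [[0.5, 0.5], [0.5, 0.5]]          # outcomes carry no information

gain_uncertain = epistemic_value(informative, [0.5, 0.5])
gain_certain = epistemic_value(informative, [1.0, 0.0])
gain_flat = epistemic_value(flat, [0.5, 0.5])
```

Information gain is positive only when the agent is uncertain about the state and outcomes can resolve that uncertainty; it vanishes when the state is already known or when outcomes are uninformative.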

Conversely, with no preferences (i.e., all outcomes are deemed equally likely), the most likely policies maximize uncertainty over outcomes (i.e., keeping all options open), in accord with the maximum entropy principle (Jaynes, 1957), while minimizing the entropy of outcomes, given the state. Heuristically, this means agents will try to avoid uninformative (low-entropy) outcomes (e.g., closing one’s eyes) while avoiding states that produce ambiguous (high-entropy) outcomes (e.g., a noisy discotheque) (Schwartenbeck, FitzGerald, Dolan, & Friston, 2013). This resolution of uncertainty is closely related to satisfying artificial curiosity (Schmidhuber, 1991; Still & Precup, 2012) and speaks to the value of information (Howard, 1966). It is also referred to as intrinsic value (see Barto, Singh, & Chentanez, 2004, for a discussion of intrinsically motivated learning). In one sense, epistemic value can be regarded as the drive for novelty-seeking behavior (Wittmann, Daw, Seymour, & Dolan, 2008; Krebs, Schott, Schütze, & Düzel, 2009; Schwartenbeck et al., 2013), in which we anticipate uncertainty that can be resolved (e.g., opening a birthday present; see also Barto, Mirolli, & Baldassarre, 2013).

The expected complexity or cost is exactly the same quantity minimized in risk-sensitive or KL control (Klyubin, Polani, & Nehaniv, 2005; van den Broek, Wiegerinck, & Kappen, 2010), and underpins related variational formulations of bounded rationality based on complexity costs (Braun, Ortega, Theodorou, & Schaal, 2011; Ortega & Braun, 2013). In other words, minimizing expected complexity renders behavior risk-sensitive, while maximizing expected accuracy induces ambiguity-sensitive behavior. In short, expected free energy covers nearly all measures that have been proposed to explain adaptive behavior, and has each as a special case.

In summary, the above formalism suggests that expected free energy can be carved in two complementary ways. First, it can be decomposed into a mixture of epistemic and extrinsic value, promoting explorative (novelty-seeking) and exploitative (reward-seeking) behavior, respectively (Friston, Rigoli et al., 2015). Equivalently, minimizing expected free energy can be formulated as minimizing a mixture of expected cost or risk and ambiguity. This completes our description of free energy. We now turn to belief updating that is based on minimizing free energy under the generative model we have described.

### 2.4 Belief Updating and Belief Propagation

*Inference* means optimizing expectations about hidden states (policies and precision), while *learning* refers to optimizing model parameters. This optimization entails finding the sufficient statistics of posterior beliefs that minimize variational free energy (equation 2.7; see appendix A). For notational simplicity, we have used $\mathbf{B}_\tau^\pi = \mathbf{B}(\pi(\tau))$, $\mathbf{B}_0^\pi \mathbf{s}_0^\pi = \mathbf{D}$, $\boldsymbol{\gamma} = 1/\boldsymbol{\beta}$, and $\boldsymbol{\pi}_0 = \sigma(-\boldsymbol{\gamma} \cdot \mathbf{G})$. Usually one would iterate the equalities in equation 2.7 until convergence. However, we can also obtain the solution in a robust and biologically more plausible fashion using a gradient descent on free energy (equation 2.8; see appendixes B and C). This converts the discrete updates of equation 2.7 into dynamics for inference that minimize state and precision prediction errors, $\boldsymbol{\varepsilon}_\tau^\pi$ and $\varepsilon^\gamma$, where these prediction errors are free energy gradients.

Solving these equations produces posterior expectations that minimize free energy to provide Bayesian estimates of hidden variables. This means that expectations change over several timescales: a fast timescale that updates posterior beliefs about hidden states after each observation (to minimize free energy over peristimulus time) and a slower timescale that updates posterior beliefs as new observations are sampled (to mediate evidence accumulation over observations; see also Penny, Zeidman, & Burgess, 2013). Finally, at the end of each sequence of observations (i.e., trial of observation epochs), the expected (concentration) parameters are updated to mediate learning over trials (FitzGerald, Dolan, & Friston, 2015). These updates are remarkably simple and have intuitive (neurobiological) interpretations.

### 2.5 Belief Updating and Neuronal Dynamics

Updating hidden states corresponds to state estimation, under each policy. Because beliefs about the current state are informed by expectations about past and future states, this scheme has the form of a Bayesian smoother that combines (empirical) prior expectations about hidden states with the likelihood of the current observation (Kass & Steffey, 1989). However, the scheme does not use conventional forward and backward sweeps (Penny et al., 2013; Pezzulo, Rigoli, & Chersi, 2013), because all future and past states are encoded explicitly. In other words, representations always refer to the same hidden state at the same time in relation to the start of the trial, not in relation to the current time. This may seem counterintuitive, but this form of spatiotemporal (place and time) encoding finesses belief updating considerably and, as we will see later, has a degree of plausibility in relation to empirical findings.

The formulation in equation 2.8 is important because it describes dynamics that can be related to neuronal processes. In other words, we move a variational Bayesian scheme toward a process theory that can predict neuronal responses during state estimation and action selection (e.g., Solway & Botvinick, 2012). This process theory associates the expected probability of a state with the probability of a neuron (or population) firing and the logarithm of this probability with postsynaptic membrane potential. This fits comfortably with theoretical proposals and empirical work on the accumulation of evidence (Kira, Yang, & Shadlen, 2015) and the neuronal encoding of probabilities (Deneve, 2008), while rendering the softmax function a (sigmoid) activation function that converts membrane potentials to firing rates. The postsynaptic depolarization caused by afferent input can now be interpreted in terms of free energy gradients (i.e., state prediction errors) that are linear mixtures of firing rates in other neurons (or populations). These prediction errors play the role of postsynaptic currents, which drive changes in membrane potential and subsequent firing rates. This means that when there are no prediction errors, postsynaptic currents disappear and depolarizations (and firing rates) converge to the free energy minimum. Note that the above expressions imply a self-inhibition because prediction errors decrease when log expectations increase.
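A deliberately simplified sketch of these dynamics is given below. It considers a single time point, so the empirical prior from past and future states is replaced by a fixed prior; the log expectation plays the role of membrane potential, the prediction error plays the role of postsynaptic current, and the softmax converts potentials into firing rates. The step size, number of steps, and numerical values are assumptions.

```python
import math

def softmax(x):
    """Normalized exponential, shifted by max(x) for numerical stability."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def infer_state(log_prior, log_lik, dt=0.25, n_steps=64):
    """Euler integration of dv/dt = eps, with s = softmax(v).
    eps = log_prior + log_lik - v is the state prediction error."""
    v = list(log_prior)                  # log expectation ("depolarization")
    errors = []
    for _ in range(n_steps):
        # The -v term is the self-inhibition noted in the text.
        eps = [lp + ll - vi for lp, ll, vi in zip(log_prior, log_lik, v)]
        errors.append(sum(abs(e) for e in eps))
        v = [vi + dt * e for vi, e in zip(v, eps)]
    return softmax(v), errors

prior = [0.5, 0.5]                       # assumed prior over two states
lik = [0.9, 0.2]                         # assumed likelihood of the outcome
s, errors = infer_state([math.log(p) for p in prior],
                        [math.log(l) for l in lik])

# Exact Bayesian posterior for comparison.
z = sum(p * l for p, l in zip(prior, lik))
exact = [p * l / z for p, l in zip(prior, lik)]
```

As the prediction error decays toward zero, the expectation `s` converges to the exact posterior, which is the free energy minimum for this toy problem.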

Technically, replacing the explicit solutions, equation 2.7, with a gradient ascent, equation 2.8, is exactly the same generalization of variational Bayes found in variational Laplace (Friston et al., 2007), namely, a generalized coordinate descent. This is nice, because it means one can think about process theories for variational treatments of Markov decision processes as formally similar to equivalent process theories for state-space models, such as predictive coding (Rao & Ballard, 1999; Bastos et al., 2012). There are some finer, neurobiologically plausible details of the dynamics of expectations about hidden states that we will consider elsewhere. For example, the modulation by implies activity-dependent (e.g., NMDA-R dependent) depolarization that enforces an excitation-inhibition balance (see appendix B).

### 2.6 Action Selection, Precision, and Dopamine

The policy updates are just a softmax function of their log probability, which has two components: the free energy based on past outcomes and the expected free energy based on preferences about future outcomes. In other words, prior beliefs about policies in the generative model are supplemented or informed by the free energy based on outcomes. Policy selection also entails the optimization of expected uncertainty or precision. This is expressed above in terms of the temperature (inverse precision), which encodes posterior beliefs about precision: $\boldsymbol{\gamma} = 1/\boldsymbol{\beta}$.

Interestingly, the updates for temperature are determined by the difference between the expected free energy under posterior and prior beliefs about policies, that is, the prediction error based on expected free energy. This endorses the notion of reward prediction errors as an update signal that the brain might use, in the sense that if posterior beliefs based on current observations reduce the expected free energy, relative to prior beliefs, then precision will increase (FitzGerald, Dolan et al., 2015). This can be related to dopamine discharges that have been interpreted in terms of changes in expected reward (Schultz & Dickinson, 2000; Fiorillo et al., 2003) and marginal utility (Stauffer, Lak, & Schultz, 2014). We have previously considered the intimate (monotonic) relationship between expected precision and expected utility in this context (see Friston et al., 2014, for a fuller discussion). The role of the neuromodulator dopamine in encoding precision is also consistent with its multiplicative effect in equation 2.7, to nuance the selection among competing policies (Fiorillo et al., 2003; Frank, Scheres, & Sherman, 2007; Humphries, Wood, & Gurney, 2009; Humphries, Khamassi, & Gurney, 2012; Solway & Botvinick, 2012; Mannella & Baldassarre, 2015). We will return to this later.
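The interplay between policy expectations and precision can be sketched as a damped fixed-point iteration. The sketch assumes that the posterior temperature equals its prior value plus the difference in expected free energy under posterior and prior policy beliefs; the damping factor, iteration count, variable names, and numerical values are illustrative assumptions rather than the scheme used in the article's simulations.

```python
import math

def softmax(x):
    """Normalized exponential, shifted by max(x) for numerical stability."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def update_policies(F, G, beta_prior=1.0, n_iter=32, step=0.5):
    """Jointly update policy expectations and precision (gamma = 1 / beta).
    F: free energy of past outcomes per policy.
    G: expected free energy per policy."""
    beta = beta_prior
    for _ in range(n_iter):
        gamma = 1.0 / beta
        pi_prior = softmax([-gamma * g for g in G])
        pi_post = softmax([-f - gamma * g for f, g in zip(F, G)])
        # Precision prediction error: expected G under posterior minus prior.
        delta = sum((p - p0) * g for p, p0, g in zip(pi_post, pi_prior, G))
        beta += step * (beta_prior + delta - beta)    # damped update
    return pi_post, pi_prior, 1.0 / beta

G = [1.0, 2.0]                                  # policy 0 has lower expected free energy
# Evidence favors the policy with low expected free energy: precision rises.
pi_up, prior_up, gamma_up = update_policies([0.0, 2.0], G)
# Evidence favors the policy with high expected free energy: precision falls.
pi_down, prior_down, gamma_down = update_policies([2.0, 0.0], G)
```

In the first case, observations confirm the policy that was already expected to be good, so confidence in policy selection increases; in the second, they contradict it, and precision decreases.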

### 2.7 Learning and Associative Plasticity

Finally, the updates for the parameters bear a marked resemblance to classical Hebbian plasticity (Abbott & Nelson, 2000). The parameter updates for state transitions comprise two terms: an associative term that is a digamma function of the accumulated coincidence of past (postsynaptic) and current (presynaptic) states (or observations under hidden causes) and a decay term that reduces each connection as the total afferent connectivity increases. The associative and decay terms are strictly increasing but saturating functions of the concentration parameters. Note that the updates for the connectivity parameters accumulate coincidences over time, because parameters are time invariant (in contrast to states that change over time). Furthermore, the parameters encoding state transitions have associative terms that are modulated by policy expectations.

In addition to learning contingencies through the parameters of the transition matrices, the vectors encoding beliefs about initial states accumulate evidence by simply counting the number of times an initial state occurs. In other words, if a particular state is encountered frequently, it will come to dominate posterior expectations. This mediates context learning in terms of the initial state. In practice, the parameters are updated at the end of each trial or sequence of observations. This ensures that learning benefits from postdicted states, after ambiguity has been resolved through epistemic behavior. For example, the agent can learn about the initial state even if the initial cues were completely ambiguous.
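The counting scheme described above can be sketched as follows. The initial counts of 0.125, the two-context layout, and the function names are hypothetical choices for illustration. The posterior over the initial state is taken as given (after postdiction at the end of a trial).

```python
def update_initial_state_counts(d, posterior_initial_state):
    """Add the end-of-trial posterior over initial states to the counts."""
    return [di + qi for di, qi in zip(d, posterior_initial_state)]

def expected_prior(d):
    """Prior expectation over initial states: normalized counts."""
    total = sum(d)
    return [di / total for di in d]

# Hypothetical small initial counts for two contexts (an assumption)
d = [0.125, 0.125]

# Eight trials in which the first context is (postdictively) inferred with certainty
for _ in range(8):
    d = update_initial_state_counts(d, [1.0, 0.0])

D = expected_prior(d)
```

With small initial counts, a handful of consistent trials is enough for the frequently encountered context to dominate prior expectations.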

Collectively, the updates above constitute a formal description of perception and learning. In what follows, we will associate electrophysiological responses with depolarization (i.e., state prediction error) driving changes in neuronal activity. For simplicity, we recover this from the rate of change of the associated expectation (see equation 2.8).

### 2.8 Summary

By assuming a generic (Markovian) form for the generative model, it is fairly easy to derive Bayesian updates that clarify the relationships among perception, policy selection, precision, and action and how these quantities shape beliefs about hidden states of the world and subsequent behavior. In brief, the agent first infers the hidden states under each model or policy that it entertains. It then evaluates the evidence for each policy based on prior beliefs or preferences about future outcomes. Once the precision of (or confidence in) beliefs about policies has been optimized, these beliefs are used to form a Bayesian model average of the next outcome, which is realized through action. The anatomy of the implicit message passing is not inconsistent with functional anatomy in the brain (see Friston et al., 2014, and Figures 1 and 2). Figure 1 reproduces the (solutions to) belief updating and assigns them to plausible brain structures. Figure 2 rehearses the belief updating in terms of the implicit computations. This functional anatomy rests on reciprocal message passing among expected policies (e.g., in the striatum) and expected precision (e.g., in the substantia nigra). Expectations about policies depend on expected outcomes and states of the world, for example, in the prefrontal cortex (Mushiake, Saito, Sakamoto, Itoyama, & Tanji, 2006) and hippocampus (Pezzulo, van der Meer, Lansink, & Pennartz, 2014). Crucially, this scheme entails reciprocal interactions between the prefrontal cortex and basal ganglia (Botvinick & An, 2009), in particular, selection of expected motor outcomes by the basal ganglia (Mannella & Baldassarre, 2015).

In this scheme, the scope and depth of the policy search is exhaustive, in the sense that all policies entertained by an agent are encoded explicitly and all hidden states over the sequence of actions entailed by policy are continuously updated. This may sound like an overcomplete representation of policies; however, this sort of architecture is implicit in salience maps in the brain (Santangelo, 2015; Zelinsky & Bisley, 2015). This is because a salience map represents the value (e.g., epistemic value or Bayesian surprise) of all possible actions (e.g., saccadic eye movements), from which the best action is selected: see Mirza, Adams, Mathys, and Friston (2016) for a simulation of saccadic searches and scene construction using the current scheme. In the simulations below, each policy comprises two actions, whereas in Mirza et al. (2016), we used just a single action: each policy specified where to look next. In the next section, we use equation 2.8 to simulate neuronal responses and show that many familiar electrophysiological phenomena emerge.

## 3 Simulations of Inference

This section considers inference using simulations of foraging in a maze. Its aim is to illustrate belief updating as a process theory for commonly observed electrophysiological and behavioral responses. We first describe the simulation setup and then establish the construct validity of the scheme in terms of simulated electrophysiological responses. The simulations involve searching for rewards in a T-maze. This T-maze contains primary rewards such as food and cues that are not rewarding per se but disclose the location of rewards. The basic structure of this problem can be translated to any number of scenarios (e.g., saccadic eye movements to visual targets). The simulations use the same setup as in Friston et al. (2015) and are as simple as possible while illustrating some fairly complicated behaviors. This example can also be interpreted in terms of responses elicited in reinforcement learning paradigms by unconditioned (US) and conditioned (CS) stimuli. Strictly speaking, our paradigm is instrumental, and the cue is a discriminative stimulus; however, we retain the Pavlovian nomenclature when relating precision updates to dopaminergic discharges.

### 3.1 The Setup

An agent, such as a rat, starts in the center of a T-maze, where either the right or left arm is baited with a reward (US). The lower arm contains a discriminative cue (CS) that tells the animal whether the reward is in the upper right or left arm. Crucially, the agent can make only two moves. Furthermore, the agent cannot leave the baited arms after they are entered. This means that the optimal behavior is to first go to the lower arm to find where the reward is located and then retrieve the reward at the cued location.

In terms of a Markov decision process, there are four control states that correspond to visiting, or sampling, the four locations (the center and three arms). For simplicity, we assume that each control state takes the agent to the associated location, as opposed to moving in a particular direction from the current location. This is analogous to place-based navigation strategies mediated by the hippocampus (e.g., Moser, Kropff, & Moser, 2008). There are eight hidden states (four locations by two contexts) and seven possible outcomes. The outcomes correspond to being in the center of the maze plus the (two) outcomes at each of the (three) arms that are determined by the context (the right or left arm is more rewarding).
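The bookkeeping of this state-space can be made concrete with a short enumeration. The string labels are illustrative names chosen here; only the counts (four control states, eight hidden states, seven outcomes) come from the description above.

```python
from itertools import product

# Four locations; each control state takes the agent directly to one of them
locations = ["center", "left arm", "right arm", "lower arm"]
control_states = list(locations)

# Two contexts: which upper arm is more rewarding
contexts = ["reward on right", "reward on left"]

# Hidden states are the factorial combination of location and context
hidden_states = list(product(locations, contexts))

# One outcome at the center, plus two context-dependent outcomes at each arm
outcomes = (
    ["center"]
    + ["left arm: reward", "left arm: no reward"]
    + ["right arm: reward", "right arm: no reward"]
    + ["lower arm: cue right", "lower arm: cue left"]
)
```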

Having specified the state-space, it is now necessary to specify the matrices encoding contingencies. These are shown in Figure 3, where the likelihood matrix maps from hidden states to outcomes, delivering an ambiguous cue at the center (first) location and a definitive cue at the lower (fourth) location. The remaining locations provide a reward with probability depending on the context. The transition matrices encode action-specific transitions, with the exception of the baited (second and third) locations, which are absorbing hidden states that the agent cannot leave.

In general treatments, we would consider learning contingencies by updating the prior concentration parameters of the transition matrices, but we will assume the agent knows (i.e., has very precise beliefs about) the contingencies. This corresponds to making the prior concentration parameters very large. Conversely, we will use small values of the concentration parameters for the initial state to enable context learning. A vector of prior preferences encodes the utility of outcomes. Here, the (relative) utilities of a rewarding and unrewarding outcome were 3 and −3, respectively (and zero otherwise). This means that the agent expects to be rewarded $\exp(3) \approx 20$ times more than experiencing a neutral outcome. Note that utility is always relative because the probabilities over outcomes must sum to one. As noted above, this means the prior preferences are a softmax function of utility. Associating utility with log probabilities is important because it endows utility with the same measure as information, namely, nats (i.e., units of information or entropy based on natural logarithms). This highlights the close connection between value and information (Howard, 1966).
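The link between utilities and prior preferences can be checked numerically. This sketch assumes three representative outcomes and takes the utility of the unrewarding outcome to be −3 (opposite in sign to the rewarding one, which is an assumption here). The dictionary keys are illustrative labels.

```python
import math

def softmax(x):
    m = max(x)
    e = [math.exp(xi - m) for xi in x]
    z = sum(e)
    return [ei / z for ei in e]

# Utilities in nats; the negative sign for the unrewarding outcome is an assumption
utilities = {"rewarding": 3.0, "unrewarding": -3.0, "neutral": 0.0}

labels = list(utilities)
preferences = dict(zip(labels, softmax([utilities[k] for k in labels])))

# Because utilities are log probabilities, ratios of preferences depend
# only on differences in utility
ratio = preferences["rewarding"] / preferences["neutral"]
```

Here `ratio` equals `math.exp(3)`, roughly 20, regardless of how many other outcomes enter the softmax, because the normalizing constant cancels.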

Having specified the state-space and contingencies, one can solve the belief updating equations in equation 2.8 to simulate behavior. Prior beliefs about the initial state were initialized to a small value for the central location for each context and zero otherwise. These concentration parameters can be regarded as the number of times each state, transition, or policy has been encountered in previous trials.

Figure 4 summarizes simulated behavioral and physiological responses over 32 successive trials using a format that will be used in subsequent figures. Each trial comprises two actions following an initial outcome. The first panel shows the initial states on each trial (as colored circles) and subsequent policy selection (in image format) over the 10 policies considered. These correspond to staying at the center and then moving to each of the four possible locations (policies 1–4; ending in the center, left, right, or lower arm), moving to the left or right arm and staying there (policies 5 and 6), or moving to the lower arm and then to each of the four locations (policies 7–10). The second panel reports the final outcomes (encoded by colored circles) and performance. Performance is reported in terms of preferred (i.e., utility of) outcomes, summed over time (black bars) and reaction times (cyan dots). Note that because utilities are log probabilities, they are always negative, and the best outcome is zero. The reaction times here are based on the actual processing time in the simulations (using the MATLAB *tic-toc* facility) and are shown after normalization to a mean of zero and standard deviation of one.
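The 10 policies can be written out explicitly as pairs of successive locations. The tuple representation and the constant names are illustrative choices. The ordering follows the description above.

```python
CENTER, LEFT, RIGHT, LOWER = "center", "left", "right", "lower"
locations = [CENTER, LEFT, RIGHT, LOWER]

# Each policy is a sequence of two actions, where an action names the location sampled
policies = (
    [(CENTER, second) for second in locations]    # policies 1-4: stay, then move
    + [(LEFT, LEFT), (RIGHT, RIGHT)]              # policies 5-6: enter a baited arm and stay
    + [(LOWER, second) for second in locations]   # policies 7-10: sample the cue, then move
)

# Policies are numbered from 1 in the text, so subtract 1 to index
epistemic_left = policies[8 - 1]
epistemic_right = policies[9 - 1]
```

Policies 8 and 9 are the epistemic policies: they visit the cue in the lower arm and then proceed to the left or right arm.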

In this example, the first couple of trials alternate between the two contexts with rewards on the right and left. After this, the context (indicated by the cue) remained unchanged. For the first 20 trials, the agent selects epistemic policies—first going to the lower arm and then proceeding to the reward location (i.e., left for policy 8 and right for policy 9). After this, the agent becomes increasingly confident about the context and starts to visit the reward location directly. The differences in performance—between these epistemic and pragmatic behaviors—are revealed in the second panel as a decrease in reaction time and an increase in the average utility. This increase follows because the average is over trials and the agent spends two trials enjoying its preferred outcome when seeking reward directly, as opposed to one trial when behaving epistemically. Note that on trial 12, the agent received an unexpected (null) outcome that induces a degree of posterior uncertainty about which policy it was pursuing, indicated by the red dot. This is seen as a nontrivial posterior probability for three policies: the correct (context-sensitive) epistemic policy and the best alternatives that involve staying in the lower arm or returning to the center. This loss of certainty is accompanied by a low-utility outcome and a suppression of phasic dopamine responses reporting the confidence in behavior.

The marked reduction in reaction times, with the emergence of pragmatic behavior, reflects the fact that the estimation of hidden states under policies that have a small posterior probability is omitted. This is a common device in Bayesian model averaging, where the evidence for implausible models that fall outside Occam’s window is not evaluated. Here, we removed policies with a relative posterior probability of 1/128 or less. Neurobiologically, this would entail a selective suspension of belief updating, mediated by neuromodulatory projections (omitted from Figure 1). When the agent becomes increasingly confident about the context, the precision of competing policies increases, enabling it to focus on a smaller number and select one quickly and efficiently.
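The pruning step can be sketched in a few lines. The function name is hypothetical, and the reading of "relative" as relative to the most probable policy is an assumption. The threshold of 1/128 is the value quoted above.

```python
def occam_window(posterior, threshold=1.0 / 128.0):
    """Return indices of policies that remain inside Occam's window.

    A policy is dropped when its probability, relative to the most
    probable policy (an assumed reading of 'relative'), is at or
    below the threshold.
    """
    best = max(posterior)
    return [i for i, p in enumerate(posterior) if p / best > threshold]

# Hypothetical posterior over four policies
posterior = [0.6, 0.3, 0.097, 0.003]
retained = occam_window(posterior)
```

Only the retained policies would have their hidden states updated on subsequent iterations, which is what shortens the simulated reaction times.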

The third panel shows a succession of simulated event-related potentials following each outcome. These are the rates of change of neuronal activity, encoding expectations about hidden states. The fourth panel shows phasic fluctuations in posterior precision that can be interpreted in terms of dopamine responses. Here, the phasic component of simulated dopamine responses corresponds to the rate of change of precision (multiplied by eight) and the tonic component to the precision per se (divided by eight; see appendix E). The phasic part reflects the precision prediction error (cf. reward prediction error: see equation 2.8). These simulated responses reveal a phasic response to the cue (CS) during epistemic trials that emerges with context learning over repeated trials. This reflects an implicit transfer of dopamine responses from the US to the CS. When the reward (US) is accessed directly, there is a profound increase in the phasic response relative to the response elicited after it has been predicted by the CS.
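The decomposition into phasic and tonic components can be sketched as follows. The finite-difference scheme (central differences inside, one-sided at the ends), the function names, and the precision time course are assumptions for illustration. The scaling factors of eight are those stated above.

```python
def gradient(x):
    """Finite-difference rate of change: one-sided at the ends, central inside."""
    n = len(x)
    if n < 2:
        return [0.0] * n
    g = [x[1] - x[0]]
    g += [(x[i + 1] - x[i - 1]) / 2.0 for i in range(1, n - 1)]
    g.append(x[-1] - x[-2])
    return g

def dopamine_components(precision):
    """Phasic part: 8 x rate of change of precision. Tonic part: precision / 8."""
    phasic = [8.0 * g for g in gradient(precision)]
    tonic = [p / 8.0 for p in precision]
    return phasic, tonic

# Hypothetical time course of expected precision within a trial
precision = [1.0, 1.0, 1.5, 2.0, 2.0]
phasic, tonic = dopamine_components(precision)
```

The phasic component is nonzero only while precision is changing, whereas the tonic component tracks its level. The same rate-of-change operation applied to expectations about hidden states would give the simulated event-related potentials.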

The final panel illustrates learning in terms of the accumulated posterior expectations about the initial state. The implicit learning reflects an accumulation of evidence that the reward will be found in the same location. In other words, initially ambiguous priors over the first two hidden states come to reflect the agent’s experience that it always starts in the first hidden state. It is this context learning that underlies the pragmatic behavior in later trials. We talk about context learning (as opposed to inference) because, strictly speaking, Bayesian updates to model parameters (between trials) are referred to as learning, while updates to hidden states (within trial) correspond to inference.

### 3.2 Electrophysiological Correlates of Variational Belief Updating

Figure 5 shows responses during the first trial in a way that speaks to empirical responses in studies of spatial navigation and decision making. The upper left panel shows simulated neuronal activity (firing rate) for units encoding hidden states using an image (or raster) format. There are eight hidden states for each of the three epochs or moves. These responses are organized such that the first eight rows show the probability of the eight states in the first observation epoch (i.e., period before moving), while subsequent epochs are shown in the middle and lower rows. This format illustrates the encoding of states over time, where the past lies in the upper diagonal blocks and the future in the lower diagonal blocks. To interpret these responses in relation to empirical results, we assume that outcomes are sampled every 250 ms. Although this is a little fast for overt exploratory movements in a maze, it corresponds to the intervals between saccadic eye movements in visual exploration (Srihasam, Bullock, & Grossberg, 2009) and the rate at which syllables are articulated in normal speech (Gross et al., 2013). Furthermore, it corresponds to the timescale of neuronal dynamics in the hippocampus (e.g., the duty cycle of theta activity).

Note the changes in activity after each new outcome is observed. For example, the two units encoding the first two hidden states in the first epoch (circled) maintain their firing rate at equivalent levels, reflecting uncertainty about which of the two hidden states is occupied. However, after observing the cue, their activity diverges to properly infer that the first state was the central location under the second context. In other words, representations of the past are informed by current outcomes. The implicit postdiction enables the agent to update its representation (i.e., memory) of the initial state (i.e., past), which it can call on for context learning (see below).

The upper right panel plots the same information, highlighting two units (in solid lines), encoding the upper left and right location on the third epoch. These are the chosen and unchosen states, respectively. Initially, both units encode the same uncertain beliefs about the state that will be occupied, which are resolved in the second epoch and confirmed in the third. The ensuing pattern of firing reflects a saltatory or stepwise evidence accumulation in which expectations about occupying the chosen and unchosen states diverge as the trial progresses. This belief updating is formally identical to evidence accumulation described by drift diffusion or race-to-bound models (Solway & Botvinick, 2012; Zhang & Maloney, 2012; de Lafuente et al., 2015; Kira et al., 2015) and nicely recapitulates the emergence of a choice as evaluation of options proceeds (Hunt et al., 2012). Furthermore, the separation of timescales implicit in variational updating reproduces the stepping dynamics seen in parietal responses during decision making (Latimer et al., 2015).

The right middle panel shows the associated local field potentials, which are simply the rate of change of neuronal firing shown on the upper right. These simulated responses show that units encoding locations later in the trial peak earlier, as successive outcomes are observed. This necessarily results in a phase precession (Burgess et al., 2007; Lisman & Buzsaki, 2008; Lisman & Redish, 2009). In other words, units (e.g., place cells) encoding the same location at the same point in the trial reach their maximum activity more quickly with each successive (theta) cycle of evidence accumulation (see the arrows in the middle right panel of Figure 5). This phenomenon reflects the fact that locations visited toward the end of a trial only receive sensory evidence when they are encountered, at which point they quickly converge to their posterior expectations. The implicit encoding of trajectories through (state) space has many similarities with the notion of a to-do list that has been invoked to explain phase precession (Jensen, Gips, Bergmann, & Bonnefond, 2014).

The lower left panel illustrates simulated dopamine responses. Here, we see a phasic suppression when the cue (conditioned stimulus—CS) is located, followed by a phasic burst when the reward (unconditioned stimulus—US) is secured. The suppressive responses to the CS shown here are during the first trial. As noted above, these reductions quickly reverse and come to resemble the responses to the US after a few trials. We will return to this; however, we first consider the place coding responses of units representing hidden states.

### 3.3 Theta-Gamma Coupling and Place Cell Activity

The lower right panel of Figure 5 shows the same firing rate responses as above but highlights units encoding the three locations visited (the thick green, blue, and red lines). These responses reflect increases in activity (during the second theta epoch) in the same sequence that the locations are visited. Empirically, this phenomenon is called a theta sequence: short (3–5) sequences of place cells that fire sequentially within each theta cycle, as if they were encoding time-compressed trajectories (Lisman & Redish, 2009).

In our setting, theta-gamma coupling is a straightforward consequence of belief updating every 250 ms (i.e., theta), where each observation induces phasic updates that necessarily possess high-frequency (i.e., gamma) components. This is illustrated in the middle left panel of Figure 5, which shows the response of the second (rewarded hidden state) unit before (dotted line) and after (solid line) filtering at 4 Hz. These responses are superimposed on a time frequency decomposition of the local field potential averaged over all units. The key observation here is that depolarization in the theta range coincides with induced responses, including gamma activity. The implicit theta-gamma coupling during navigation can be seen more clearly in Figure 6. This figure reports simulated electrophysiological responses over the first eight trials, with the top panel showing the responses of units encoding hidden states and the second panel showing the associated time frequency response (and depolarization of the first unit, after filtering at 4 Hz). The final two panels show the simulated local field potentials and dopamine responses using the same format as the previous figure. The key observation here is that fluctuations in gamma power (averaged over all units) are tightly coupled to the depolarization in the theta range (of single units).

Phase precession and theta-gamma coupling are typically observed in the context of place cell activity, in which units respond selectively when an animal passes through particular locations. This sort of response is easy to demonstrate under the current scheme. Figure 7 (upper right panel) plots the activity of two units encoding the rewarded locations at the right (green dots) and left (red dots) arms as a function of the location in the maze over the first eight trials. The trajectories (dotted lines) were constructed by adding random displacements (with a standard deviation of an eighth) to the trajectory prescribed by action. The dots indicate times at which the unit approached its maximal firing rate (i.e., greater than 80%) and illustrate place cell activity that is specific to the locations they encode. However, this response profile is unique to the units encoding the final location: units encoding the location in the second epoch fire maximally at both the target location and the preceding (cue) location (lower right panel).

We present these results to address an interesting question. Hitherto, we have assumed that units encode states (location) in a frame of reference that is locked to the beginning of a trial or trajectory. The alternative is that each unit encodes the state in relation to the current time, in a moving time frame. This distinction is shown schematically in the lower left panel of Figure 7. If we use a fixed frame of reference, the successive activities of the two units are described by rows of the raster, indicated with white numbers. Conversely, if the encoding uses a moving frame of reference, these units would show the activity along the leading diagonal of the raster, indicated by the red numbers. Crucially, in a moving frame of reference, all units would show classical place cell responses, whereas in a fixed frame of reference, some units will encode the location of states that will be visited in the future. This would lead to a more complicated relationship between neuronal firing and the location of the animal.

## 4 Context Learning

Having established that the Bayesian updates of expected hidden states and parameters have a degree of biological plausibility, we now turn to the correlates of parameter learning. In this article, the only parameters that are updated are the concentration parameters encoding prior beliefs about the initial state or context. In what follows, we look at the effects of context learning on electrophysiological responses and what would happen if we removed prior preferences to reveal purely epistemic behavior.

### 4.1 Repetition Suppression and Dopamine Transfer

Figure 8 uses the same format as Figure 6; however, here we compare two identical trials that differ only in terms of the agent’s prior beliefs about context. These trials are indicated by the arrows on the insert from Figure 4 (upper right in Figure 8) and have been associated with oddball and standard trials, respectively. The only difference is that the agent has become familiar with the context in which it enacts its epistemic policy. The increased efficiency and confidence afforded by context learning are expressed in terms of a faster encoding of hidden states and the emergence of a phasic dopamine (precision) response to the CS. In other words, the familiarity effects of repetitions of standard trials suppress evoked responses in units encoding the first state in the second epoch (blue circles). This can be seen clearly if we subtract the evoked response during the standard trial from the equivalent response during the oddball trial at the point of anticipation, in the second epoch. The result is shown in the right panel as a negative difference waveform that peaks at around 80 ms (or 180 ms allowing 100 ms conduction delays to occipital cortex). This is exactly the form of difference elicited in empirical oddball studies using sequences of repeating stimuli, where it is known as the mismatch negativity (Bendixen et al., 2012).

This repetition suppression is accompanied by profound changes in simulated dopamine responses that effectively reproduce the transfer of phasic dopamine responses from unconditioned to conditioned stimuli during learning (Schultz, Apicella, & Ljungberg, 1993; Bromberg-Martin & Hikosaka, 2009). In this instance, the learning corresponds to increasing confidence about the context in which choices are made (Fiorillo et al., 2003). This translates into a higher precision of beliefs about competing policies once the CS has resolved residual uncertainty. Note that this transfer from the US to the CS is direct and does not require any representation of intervening states (see FitzGerald, Dolan et al., 2015, for a fuller discussion). The differences in responses in these two trials can be explained only by differences in prior beliefs about context, because the actions and outcomes were identical. But what about responses when outcomes are unpredicted?

### 4.2 Violation Responses and Simulated P300 Waveforms

Figure 9 uses the same format as Figure 6 but focuses on consecutive trials after a degree of context learning (the trials indicated by the arrows above the insert from Figure 4). The first trial is a standard one in which the agent interrogates the cue location and then acquires the reward from the appropriate arm. In the subsequent trial, we forced the agent to stay at the cue location (by preventing it from moving), thereby inducing protracted belief updating about hidden states. This is most evident in the hidden state encoding the true location in the third (final) epoch (blue circles). These violation responses reach peak amplitude at about 100 ms—or 200 ms in peristimulus time (allowing for 100 ms conduction delays). Although earlier than classical P300 and N400 responses, this protracted and late response is reminiscent of violation responses in event-related potential (ERP) studies when the outcome is inconsistent with the preceding succession of states, such as semantic violations in sentence processing and action observation (Friederici, 2005; Maffongelli et al., 2015). These late violation responses contrast with the early mismatch responses in the previous figure. Finally, note that the phasic dopamine response to the unexpected outcome is attenuated although not abolished. This may reflect the fact that the agent finds it difficult to believe it has not secured its reward. In other words, the agent partly believes it has pursued the epistemic policy despite evidence to the contrary (see upper panel).

### 4.3 Foraging for Information

One might ask what would happen if rewards were devalued by setting their (relative) utility to zero. Figure 10 shows the results of a simulation, using the same setup as in Figure 4. The only difference here was that there were no explicit preferences or utilities. However, the resulting behavior is still structured and purposeful because it is driven by epistemic value. In every trial, the agent moves to the cue location to resolve ambiguity about the context (see lower panels). After the cue is sampled, uncertainty cannot be reduced further, and the agent either stays where it is or returns to the central location, avoiding the baited arms. It avoids the baited arms because they are mildly ambiguous (given our partial reinforcement schedule). This sort of simulation can, in principle, be used to simulate foraging for information using saccadic eye movements.

This simulation illustrates the fact that behavior can still be purposeful even in the absence of extrinsic value or prior preferences about outcomes. In other words, epistemic value can, on its own, specify behavior even if there are no explicit or extrinsic goals. The implication here is that the balance between purely exploratory and exploitative behavior rests on the precision of prior preferences. In the simulations, the removal of preferences corresponds to making every outcome equally plausible, thereby setting the precision of prior preferences to zero. Having said this, outcomes are still limited to those entertained by an agent’s beliefs about the world. The set of outcomes entailed by a particular generative model could be construed as preferred outcomes, where all other possible outcomes have been eliminated and, effectively, have a large negative utility. This means that in one sense, explorative or epistemic behavior is always restricted to outcomes that, a priori, an agent prefers or, equivalently, outcomes that characterize an agent.

### 4.4 Summary

In summary, we have reviewed several simulated responses that bear a remarkable resemblance to empirical electrophysiological responses in spatial navigation and classical ERP paradigms. We have also seen responses characteristic of dopaminergic activity during instrumental learning, when examining the encoding of precision. Although the similarity between simulated and empirical responses is at best metaphorical, it is interesting to note that all of these behaviors emerged from a standard variational scheme that was applied to a generic state-space model. In other words, there was no attempt to reproduce empirical findings by hand-tuning the generative model or the inversion scheme. The only thing we assumed was that outcomes are sampled every 250 ms. More specifically, the neuronal dynamics in equation 2.8 follow from a gradient descent on variational free energy, where variational free energy is defined completely by the generative model, and the generative model is based on a generic Markovian process. This is important because it provides a putative variational principle for neural dynamics that can be described in terms of a Lyapunov function (variational free energy) from a dynamical systems perspective (Stam, 2005). Alternatively, we can think of neuronal activity as conforming to Hamilton’s principle of least action, where action is the path integral of free energy (Friston, 2013). In short, the simulations above constitute a construct validation of the ensuing process theory in relation to empirical electrophysiology and the numerous normative models inspired by these empirical phenomena. Clearly, variational principles do not, in and of themselves, prescribe the aspects of neurobiology we have considered, in the same sense that natural selection does not prescribe a particular phenotype. However, they may offer a relatively straightforward and teleological perspective on neuroanatomy and physiology (Friston & Buzsaki, 2016).

## 5 Conclusion

We have described an active inference scheme for discrete state-space models of choice behavior that is suitable for modeling a variety of paradigms and phenomena. This generic scheme offers a process theory that is based on a standard (gradient descent) minimization of variational free energy—or approximate Bayesian inference. The ensuing process theory provides a simple (perhaps oversimplified) account of many empirical phenomena that include repetition suppression, omission responses, violation responses, place cell activity, phase precession, theta sequences, theta-gamma coupling, evidence accumulation, race-to-bound dynamics, and transfer of dopamine responses. It is worth reiterating that these emergent properties follow from, and only from, the form of the underlying generative model.

In this sense, the challenge is to identify the generative models that best explain empirical responses. We have focused on a simple and generic form, but there are clearly many alternatives and extensions. Key among these are hierarchical models with deep temporal structure (George & Hawkins, 2009; Specht, 2014), and models in which prior preferences are absorbed into beliefs about state transitions or contingencies. Appendix F touches on further extensions that consider not the path integral of expected free energy but the expected path integral of free energy and the distinction between naive and sophisticated schemes. This distinction may be particularly important for understanding planning and metacognition and their physiological correlates (Lisman & Redish, 2009; Penny et al., 2013; Pezzulo et al., 2014).

In closing, one should acknowledge that good process theories “should explain what is already known more parsimoniously than any other theory of comparable explanatory scope, but they should also stick their neck out to specify what is forbidden, and what new phenomena have not been observed yet but should be or could be” (personal communication from an anonymous reviewer). We will not meet this challenge here; however, it is interesting to note that the epistemic imperatives implied by minimizing variational free energy lead to a parsimonious (minimally complex) yet accurate description of observable outcomes or facts (see equation 2.4). In this sense, active inference may offer a formal (metatheoretical) description for the scientific process itself.

## Appendix A: Belief Updating

## Appendix B: Generalized Coordinate Descent

## Appendix C: Belief Propagation

## Appendix D: Exact Bayesian Inference

## Appendix E: Simulating Dopamine Responses

## Appendix F: Sophisticated Schemes

In this case, the expected free energy after the next outcome is evaluated in the same way as the expected free energy at the current time for each (fictive) outcome by using the posterior over current hidden states as the prior. Clearly, this scheme is computationally more involved than the naive scheme and calls on recursive variational updating. This means that sophisticated agents are metacognitive in some sense because they perform belief updating (based on fictive outcomes) to optimize their belief updating.

Heuristically, the difference between naive and sophisticated schemes can be seen in terms of the first choice in the current paradigm. For the naive agent, the best policy is to sample the cue location and stay there, because moving to a baited arm has, on average, no extrinsic value (and provides ambiguous outcomes). Conversely, for the sophisticated agent, the expected free energy of retrieving a reward after observing the cue is low for both (fictive) outcomes. This means the best policies are to behave epistemically on the first move and then pragmatically on the second move. Note that the sophisticated agent, unlike the naive agent, can entertain future switches between policies.
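The heuristic above can be made concrete with a deliberately reduced sketch. It considers only the extrinsic (utility) part of expected free energy and ignores ambiguity and epistemic terms. It also replaces softmax policy selection with a simple maximum. The reward probability (0.9), the utilities (plus or minus 3), the cue accuracy, and all function names are hypothetical choices made for illustration, not values taken from the simulations.

```python
def expected_utility(belief_left, arm, p_reward=0.9, utility=3.0):
    """Expected utility of visiting `arm`, given P(reward is in the left arm)."""
    p_baited = belief_left if arm == "left" else 1.0 - belief_left
    p_win = p_baited * p_reward + (1.0 - p_baited) * (1.0 - p_reward)
    return p_win * utility - (1.0 - p_win) * utility

def cue_update(prior_left, cue, accuracy=1.0):
    """Return P(cue) and the posterior P(left | cue) by Bayes' rule."""
    like_left = accuracy if cue == "left" else 1.0 - accuracy
    like_right = 1.0 - like_left
    p_cue = like_left * prior_left + like_right * (1.0 - prior_left)
    posterior = like_left * prior_left / p_cue if p_cue > 0 else prior_left
    return p_cue, posterior

def best_arm_value(belief_left):
    """Utility of the better arm under the given beliefs."""
    return max(expected_utility(belief_left, arm) for arm in ("left", "right"))

def naive_value(prior_left):
    """Value of the second move when beliefs are not updated by the cue."""
    return best_arm_value(prior_left)

def sophisticated_value(prior_left, accuracy=1.0):
    """Average, over fictive cues, of the value under the updated beliefs."""
    total = 0.0
    for cue in ("left", "right"):
        p_cue, posterior = cue_update(prior_left, cue, accuracy)
        total += p_cue * best_arm_value(posterior)
    return total

naive = naive_value(0.5)
sophisticated = sophisticated_value(0.5)
print("naive:", naive, "sophisticated:", sophisticated)
```

With a flat prior over contexts, the naive evaluation assigns essentially no value to visiting either arm, whereas evaluating each fictive cue outcome separately yields a clearly positive value for the second move. Taking the maximum within each fictive outcome is what lets this agent plan a different arm for each cue, in the spirit of entertaining future switches between policies.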

## Acknowledgments

K.J.F. is funded by the Wellcome Trust (088130/Z/09/Z). P.S. is a recipient of a DOC fellowship of the Austrian Academy of Sciences at the Centre for Cognitive Neuroscience, University of Salzburg. G.P. gratefully acknowledges support of HFSP (Young Investigator Grant RGY0088/2014). We thank our reviewers for detailed help in formulating these ideas.

We have no disclosures or conflict of interest.

## References

## Note

^{1}

Note that the negative mutual information (which is never positive) is not an expected KL divergence (which is never negative). This is because the expectation is under the joint distribution over outcomes and hidden states. Furthermore, extrinsic value is never positive, which means that the best one can do is to have an extrinsic value of zero; in other words, a preferred outcome is expected with probability one.