## Abstract

When modeling goal-directed behavior in the presence of various sources of uncertainty, planning can be described as an inference process. A solution to the problem of planning as inference was previously proposed in the active inference framework in the form of an approximate inference scheme based on variational free energy. However, this approximate scheme was based on the mean-field approximation, which assumes statistical independence of hidden variables, is known to produce overconfident posterior beliefs, and may converge to local minima of the free energy. To better capture the spatiotemporal properties of an environment, we reformulated the approximate inference process using the so-called Bethe approximation. Importantly, the Bethe approximation allows for a representation of pairwise statistical dependencies. Under these assumptions, the minimizer of the variational free energy corresponds to the belief propagation algorithm, commonly used in machine learning. To illustrate the differences between the mean-field and the Bethe approximation, we simulated agent behavior in a simple goal-reaching task with different types of uncertainty. Overall, the Bethe agent achieves higher success rates in reaching goal states. We relate the better performance of the Bethe agent to its more accurate predictions about the consequences of its own actions. Consequently, active inference based on the Bethe approximation extends the application range of active inference to more complex behavioral tasks.

## 1 Introduction

When trying to achieve goals, an acting agent typically lacks complete knowledge and is exposed to several sources of uncertainty in its environment. This makes the pursuit of goals a nontrivial task (Arthur, 1994; Simon, 1990).

Computational models for goal-directed behavior are typically based on the widely used computational framework of reinforcement learning (Sutton & Barto, 1998) with a large number of applications (Doll, Simon, & Daw, 2012; Rangel & Hare, 2010; Dayan & Niv, 2008; O'Doherty et al., 2004; Montague, Hyman, & Cohen, 2004). However, a strong limitation of classical reinforcement learning models is that they do not take into account the influence of various sources of uncertainty on human behavior (Rushworth & Behrens, 2008; Doya, 2008; Behrens, Woolrich, Walton, & Rushworth, 2007; Yu & Dayan, 2005). An increasing number of empirical findings have provided evidence that belief updating in humans closely follows that of a rational Bayesian agent (FitzGerald, Hämmerer, Friston, Li, & Dolan, 2017; Meyniel, Schlunegger, & Dehaene, 2015; Lake, Salakhutdinov, & Tenenbaum, 2015; Vossel et al., 2013; Payzan-LeNestour, Dunne, Bossaerts, & O'Doherty, 2013; Behrens, Hunt, Woolrich, & Rushworth, 2008; Daw, Niv, & Dayan, 2005). This suggests that humans actively use a representation of uncertainty when inferring the current and past states of the world and when making decisions (Friston & Kiebel, 2009; Lee & Mumford, 2003; Knill & Pouget, 2004; Dayan, Hinton, Neal, & Zemel, 1995).

In complex everyday environments, decision making is affected by various sources of uncertainty; hence, in such settings, it is useful to treat planning and action selection as an inference problem (Pearl, 1988; Attias, 2003; Botvinick & Toussaint, 2012; Friston et al., 2013). Under the planning-as-inference formulation, it is assumed that agents form beliefs (in a Bayes optimal manner) over possible future behaviors to decide on the sequence of actions that allows them to reach their goals.

When modeling human decision making, one typically postulates that the human brain uses an approximate inference scheme to update beliefs and generate plans (Mathys, Daunizeau, Friston, & Stephan, 2011; Nassar, Wilson, Heasly, & Gold, 2010; Daunizeau et al., 2010; Yuille & Kersten, 2006; Baker, Saxe, & Tenenbaum, 2005). Such an approximation is required to achieve computationally tractable and fast adjustments to behavior in dynamic environments (Nassar et al., 2010).

One approximate inference approach that is generically used in a wide range of applications is variational inference (Blei, Kucukelbir, & McAuliffe, 2017; Wainwright & Jordan, 2008; Beal, 2003; Bishop, 2006). Variational inference also forms the formal basis of the free energy principle (Friston, 2010), which states that both action and perception serve to minimize the variational free energy of the past, current, and expected future sensations. As the variational free energy defines an upper bound on surprise (Bishop, 2006; Friston, 2010), minimizing the free energy minimizes an agent's surprise about its sensations. In turn, minimizing surprise improves an agent's representation of the environment and drives an agent to visit states from which the future is more predictable. This formulation was subsequently extended to model goal-directed behavior under uncertainty and is referred to as *active inference* (Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016). In recent studies, active inference was successfully applied in the analysis of behavioral (Friston et al., 2014; Schwartenbeck et al., 2015) and neuroimaging data (Schwartenbeck, FitzGerald, & Dolan, 2016; Schwartenbeck, FitzGerald, Mathys, Dolan, & Friston, 2014).

Here we will revisit the variational treatment of planning as inference—motivated by core concepts of active inference—and provide step-by-step derivations of an active inference agent starting from basic definitions of planning as inference (Attias, 2003; Botvinick & Toussaint, 2012). Importantly, we will base the derivations on the so-called Bethe approximation (Bethe, 1931, 1935), which will allow us to establish a formal link between the free energy principle and the set of update equations known as *belief propagation* (Friston, Parr, & de Vries, 2017; Yedidia, Freeman, & Weiss, 2005; Pearl, 1988).

The standard approach for deriving an active inference agent is to base approximate inference on the so-called mean-field approximation (Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016). The key difference between the Bethe and the mean-field approximation lies in the way approximate beliefs about trajectories are encoded. Technically, the mean-field approximation assumes that posterior beliefs about a sequence of states are approximated by a distribution in which beliefs over states are independent between time points. Crucially, this ignores the statistical dependencies inherent in state transitions, meaning that the approximate posterior estimates might converge to local optima of the free energy and exhibit overly confident belief representations throughout the decision-making process (Weiss, 2001; Murphy, 2012).
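The difference between the two factorizations can be written explicitly. Using the notation introduced below (see Table 1), the mean-field posterior over a state trajectory factorizes across time points, whereas the Bethe posterior on a chain retains pairwise marginals over consecutive states (interior nodes appear once in the denominator to avoid double counting):

```latex
q_{\mathrm{MF}}(h_{1:T} \mid \pi) \;=\; \prod_{t=1}^{T} q(h_t \mid \pi),
\qquad
q_{\mathrm{Bethe}}(h_{1:T} \mid \pi) \;=\; \frac{\prod_{t=1}^{T-1} q(h_t, h_{t+1} \mid \pi)}{\prod_{t=2}^{T-1} q(h_t \mid \pi)}
```

On a loop-free chain, the Bethe form is not an approximation at all: any chain-structured distribution can be written exactly this way.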

For example, if I know that being in state 1 will always result in a transition to state 2, then the surprise about moving from state 1 to state 3 can be evaluated only if I have a joint distribution over both states. This is precluded in the mean-field approximation but is retained in the Bethe approximation. This follows because the approximate posterior beliefs about any particular state are conditioned on the previous state. Often these pairwise statistical dependencies under the Bethe approximation even correspond to the true spatiotemporal dependencies of hidden states in a dynamic environment, so that the approximate posterior provides a tighter bound on the surprise and hence exhibits less deviation from the exact posterior (Weiss, 2001). In principle, this means that any approximate Bayesian inference about trajectories in the past, or in the future, should be more accurate under a Bethe approximation, leading to better behavior. For this reason, the belief propagation algorithm is often applied in the machine learning field to sequential inference problems (Bishop, 2006; Yedidia et al., 2005; Yu & Kobayashi, 2003; Fan, 2001; Rabiner, 1989; Gelb, 1974; Kalman, 1960).
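This loss of information can be made concrete with a minimal numerical sketch (the two-state beliefs below are hypothetical and not taken from the simulations in this letter): a pairwise belief rules out certain transitions, while the product of its marginals does not.

```python
import numpy as np

# Pairwise belief q(h_t, h_{t+1}) over two states: the chain either goes
# 0 -> 1 or 1 -> 0, never stays put (a hypothetical example).
q_pair = np.array([[0.0, 0.5],
                   [0.5, 0.0]])

# The mean-field approximation keeps only the marginals and multiplies
# them back together:
q_mf = np.outer(q_pair.sum(axis=1), q_pair.sum(axis=0))

# The product assigns probability 0.25 to the impossible transition 0 -> 0,
# which the pairwise (Bethe-style) belief correctly rules out.
```

The same mechanism reappears in the results below, where the mean-field agent assigns posterior mass to impossible state-space trajectories.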

In what follows, we provide a detailed, and rather didactic, technical overview of the basic elements needed to define planning as an inference problem and relate its exact Bayesian solution to an approximate solution obtained using the variational approximation under either the mean-field or the Bethe approximation. To illustrate the approximation-dependent differences in goal-directed behavior in the presence of uncertainty, we will introduce mean-field and Bethe-based agents to a simple navigation task in a noisy grid world. Finally, using this proof-of-principle task, we will show that an agent based on the Bethe approximation exhibits enhanced performance as compared to a mean-field-based agent.

## 2 Methods

### 2.1 Generative Process

In this letter, we consider a sequential decision-making task in which an agent executes a finite number of actions (choices) in order to reach a goal in a specific environment. Each choice is associated with a discrete time step $t \in [1, T]$, where $T$ denotes the total number of time steps. Here we model this decision process as a partially observable Markov decision process (Drake, 1962; Martin, 1967; Astrom, 1965; Monahan, 1982).

The task is defined as a 5-tuple $(H,A,\Theta ,O,\Omega )$ (see Figure 1) where:

- $H$ denotes a finite set of hidden states.
- $A$ denotes a finite set of actions.
- $\Theta$ denotes the action-dependent conditional transition probabilities between states.
- $O$ denotes a set of observations.
- $\Omega$ denotes the state-dependent conditional observation probabilities.

Each time step $t$ of the generative process consists of the following components. Depending on the current state $h_t \in H$, an observation $o_t \in O$ is sampled from the generative probability $\Omega(o_t \mid h_t)$. Given an agent's choice of action $a_t \in A$, the environment transitions to a new state $h_{t+1}$, sampled from the transition probability $\Theta(h_{t+1} \mid h_t, a_t)$. This process is repeated until the final time step $T$ is reached.
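A single step of this generative process can be sketched as follows, with hypothetical categorical parameters `Omega` and `Theta` (the variable names and toy numbers are our own, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def generative_step(h, a, Theta, Omega):
    """One step of the generative process: emit o_t ~ Omega(. | h_t),
    then transition to h_{t+1} ~ Theta(. | h_t, a_t)."""
    o = rng.choice(Omega.shape[0], p=Omega[:, h])        # o_t
    h_next = rng.choice(Theta.shape[0], p=Theta[:, h, a])  # h_{t+1}
    return o, h_next

# Hypothetical two-state example: Omega[o, h] = p(o | h)
Omega = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
# Theta[h_next, h, a] = p(h_next | h, a); a single action that swaps states
Theta = np.zeros((2, 2, 1))
Theta[1, 0, 0] = 1.0
Theta[0, 1, 0] = 1.0

o, h_next = generative_step(h=0, a=0, Theta=Theta, Omega=Omega)
```

Iterating this function for $t = 1, \dots, T$ reproduces the full generative process described above.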

### 2.2 Generative Model

To efficiently solve the task, the agent needs an accurate representation of the generative process. The so-called generative model is a formal description of an agent's model of the hidden states of the environment and the rules that define their evolution. We formally define the generative model as a joint probability distribution over observations $o_t$, hidden states $h_t$, and behavioral policies $\pi$, which define a sequence of control states $u_t$. Note that the control states denote a subjective abstraction of an action (e.g., a neuronal command to execute a specific action in the environment). For simplicity, we assume a one-to-one mapping between a selected control state $u_t$ and executed action $a_t$ in each time step $t$. Table 1 provides an overview of the notation used in this letter.

| Expression | Specification | Explanation |
| --- | --- | --- |
| $h_{1:T}$ | $(h_1, \dots, h_T)$ | Hidden states |
| $h_t$ | $\{h^1, \dots, h^{n_h}\}$ | Current hidden state |
| $\bar{h}$ | $(h_1, \dots, h_t)$ | Past (visited) hidden states, including current hidden state $h_t$ |
| $\tilde{h}$ | $(h_{t+1}, \dots, h_T)$ | Future hidden states |
| $o_{1:T}$ | $(o_1, \dots, o_T)$ | Observations |
| $o_t$ | $\{o^1, \dots, o^{n_o}\}$ | Current observation |
| $\bar{o}$ | $(o_1, \dots, o_t)$ | Past (fixed) observations, including current observation $o_t$ |
| $\tilde{o}$ | $(o_{t+1}, \dots, o_T)$ | Future observations (unknown) |
| $u_{1:T-1}$ | $(u_1, \dots, u_{T-1})$ | Control states |
| $u_t$ | $\{u^1, \dots, u^{n_u}\}$ | Current control state |
| $\pi$ | $u_{1:T-1}$ | Policy, a sequence of control states |
| $p(o_{1:T}, h_{1:T}, \pi)$ | | Generative model, the agent's model of the rules of the environment |
| $\bar{p}(\tilde{o})$ | | Prior beliefs over future outcomes; these encode the agent's preferences, or the utility of certain observations |
| $f(\tilde{o}, h_{1:T}, \pi \mid \bar{o})$ | | True posterior, to be approximated |
| $q(\tilde{o}, h_{1:T}, \pi)$ | $q(\tilde{o}, h_{1:T} \mid \pi)\, q(\pi)$ | Approximate posterior |
| $q(\tilde{o}, h_{1:T} \mid \pi)$ | | Agent's estimate of states and observations |
| $q(\pi)$ | $\frac{1}{Z} p(\pi) e^{-V_\pi - G_\pi}$ | Probability of following policy $\pi$ |
| $F[q]$ | $V[q] + G[q]$ | Full variational free energy, minimized by the approximate posterior |
| $V[q]$ | | Observed free energy |
| $V_\pi$ | | Conditional observed free energy under policy $\pi$ |
| $G[q]$ | | Predicted free energy |
| $G_\pi$ | | Conditional predicted free energy under policy $\pi$ |


The distribution $p(\bar{o}, \bar{h})$ denotes the joint probability over observed outcomes and past hidden states. In practice, we will derive the relations that define agent behavior (see Figure 1) by inverting the generative model. In what follows, we describe in more detail the components of the full generative model. For a visualization of the statistical dependencies between the random variables, see Figure 2.

The beliefs about state trajectories factorize along the chain as

$$p(h_{1:T} \mid \pi) = p(h_1) \prod_{t=2}^{T} p(h_t \mid h_{t-1}, \pi),$$

where $p(h_1)$ denotes the prior beliefs about the initial state $h_1$, and $p(h_t \mid h_{t-1}, \pi)$ denotes an agent's beliefs about possible transitions between states, conditioned on the policy $\pi$. This conditional dependency is illustrated by the right-pointing arrows in Figure 2. Note that each behavioral policy $\pi$ defines a specific control state at each time step $t$, that is, $\pi(t) = u_t$. Hence, the notation above is equivalent to replacing all $\pi$ terms with the corresponding control states $u_t$ at time step $t$.

### 2.3 Planning as Inference

The core concept of planning as inference is that besides the hidden states and future observations (see Figure 2), we treat the behavioral variables (control states, that is, policies) as hidden variables to be inferred (Attias, 2003; Botvinick & Toussaint, 2012). This approach has the advantage that these two different processes can be described within the same mathematical framework of Bayesian inference (Doya, 2007; Botvinick & Toussaint, 2012). For this reason, the concept of describing planning as an inference process has found increasing interest within the cognitive neuroscience community (Botvinick & Toussaint, 2012; Solway & Botvinick, 2012; Friston, Daunizeau, Kilner, & Kiebel, 2010).

The steps of the inference procedure can be illustrated as follows (see Figure 1). After making an observation, the agent updates its beliefs about current and past (hidden) states $\bar{h}$ (perception); from the inferred current state, the agent forms beliefs about future states $\tilde{h}$ and observations $\tilde{o}$ for each policy $\pi$ (planning).

Importantly, the beliefs over policies (sequences of control states) are modulated by an agent's preferences over unobserved future outcomes $\tilde{o}$. We will represent these preferences as prior beliefs $\bar{p}(\tilde{o})$. Thus, $\bar{p}(\tilde{o})$ defines the agent's goals and thereby encodes the utility of various future outcomes (observations). For example, if the goal is to reach a specific location (e.g., a position in a maze), the prior over future outcomes corresponds to assigning a high probability to observing an outcome specific to the goal location and low probability to all other observations. Note that the prior beliefs over future outcomes are in general distinct from the marginal expectations over future outcomes, that is, $\bar{p}(\tilde{o}) \neq p(\tilde{o})$. The difference is that these prior beliefs encode which observations the agent wants to make, while the marginal expectations represent where the agent will be at a future time step, given the time evolution of the environment.
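For the maze example, such a preference prior over a single future outcome could be sketched as follows (spreading the residual probability mass uniformly over nongoal observations is our own simplifying assumption):

```python
import numpy as np

def preference_prior(n_obs, goal, rho):
    """Prior over a future outcome, assigning probability rho to the goal
    observation and spreading the remainder uniformly (a sketch)."""
    p = np.full(n_obs, (1.0 - rho) / (n_obs - 1))
    p[goal] = rho
    return p

# Hypothetical 16-observation world with the goal at index 15
p_bar = preference_prior(n_obs=16, goal=15, rho=0.999)
```

The parameter `rho` corresponds to the prior preference strength $\rho$ varied in the results section.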

In practice, as the generative model may represent an arbitrarily complex environment, inferring posterior beliefs over hidden variables is typically not analytically tractable (Bishop, 2006). Therefore, to perform inference and select a policy, an agent would have to approximate the posterior beliefs (Friston & Kiebel, 2009; Friston et al., 2010).

### 2.4 Active Inference

The active inference solution to the problem of planning as inference rests on variational inference. Typically, under active inference, the variational free energy has been used to find an approximation to the true posterior, equation 2.6 (Friston et al., 2010, 2013, 2015; Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016).

#### 2.4.1 Variational Free Energy

Variational inference is a widely used approximate inference method (Blei et al., 2017; Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016; Friston et al., 2015, 2013; Wainwright & Jordan, 2008; Bishop, 2006; Beal, 2003). For our particular problem of planning as inference, it will allow us to approximate the true posterior distribution $f(\tilde{o}, h_{1:T}, \pi \mid \bar{o})$ with an approximate distribution $q(\tilde{o}, h_{1:T}, \pi)$.
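In its generic form (writing $x$ for all hidden variables and $o$ for the observed data, and setting aside the preference-dependent target distribution $f$ used in this letter), the variational free energy bounds surprise from above:

```latex
F[q] \;=\; \mathbb{E}_{q(x)}\bigl[\ln q(x) - \ln p(o, x)\bigr]
\;=\; D_{\mathrm{KL}}\bigl[q(x) \,\|\, p(x \mid o)\bigr] \;-\; \ln p(o)
\;\geq\; -\ln p(o)
```

Minimizing $F[q]$ with respect to $q$ therefore drives the approximate posterior toward the true posterior while tightening the bound on surprise.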

We refer to the conditional free energy of the past, $V_\pi$, as the *observed free energy*. It constrains the posterior beliefs over policies $\pi$ to only those policies that could have generated the observed sequence given the agent's generative model. We refer to the conditional free energy of the future, $G_\pi$, as the *predicted free energy*. The predicted free energy will be the main factor influencing policy selection. Here, the first term corresponds to the pragmatic value, or extrinsic value, namely the (negative) expected utility, or log preferences over outcomes $\bar{p}(\tilde{o})$.

In what follows, we derive the update equations for the conditional posterior $q(\tilde{o}, h_{1:T} \mid \pi)$ for two different approximations: the mean-field and the Bethe approximation.

#### 2.4.2 Mean-Field Approximation

#### 2.4.3 Bethe Approximation

Under the mean-field approximation, statistical independence of hidden variables was assumed. This has the advantage of simplicity, as it makes it possible to analytically calculate the approximate posterior directly from the full free energy. When performing a sequential decision-making task, however, hidden states of the environment are most likely not independent of each other; instead, the current hidden state might depend on the previous hidden state. In other words, if the environment has a sequential structure, the mean-field approximation may not be able to capture this structure accurately. To address this issue of representing a sequential structure within the approximate posterior, the Bethe approximation (Pearl, 1988; Yedidia, Freeman, & Weiss, 2000) can be used, as it allows for pairwise statistical dependencies between hidden variables in the approximate posterior. These dependencies map closely to the true statistical dependencies present in the generative model (see Figure 2).

For this reason, the Bethe approximation has found widespread use in the machine learning community (Felzenszwalb & Huttenlocher, 2006; Coughlan & Ferreira, 2002; Sudderth, Mandel, Freeman, & Willsky, 2004; Hua, Yang, & Wu, 2005; Meltzer, Yanover, & Weiss, 2005). With this more complex approximate posterior, the variational free energy becomes more complex to evaluate as well. It has been shown that the estimation of the approximate posterior under the Bethe approximation corresponds to the belief propagation update rules (Pearl, 1988; Yedidia et al., 2000). Belief propagation provides a framework for calculating the posterior beliefs using messages sent between the nodes of the graph of the generative model. On a graph without loops, this message-passing solution is exact, so that it always converges to the global minimum of the variational free energy. (For a detailed overview of belief propagation, the Bethe approximation, and their relation to the variational free energy, we point readers to Yedidia, Freeman, & Weiss, 2003.)

Figure 3 shows a graphical representation of the posterior beliefs and messages on the graph of the generative model. Information from forward and backward inference processes is integrated for perception and planning. We denote these distinct pathways as *forward messages* and *backward messages*, respectively. Forward messages carry information from the past to the future, given the observations that were made and the states that were inferred. Backward messages pass information back from the prior beliefs about future outcomes and their corresponding states, and from observations already made, to update the estimates of earlier states. The messages will differ between control states, which makes them dependent on the policy $\pi$. For graphs without loops, these update rules converge to a unique solution at the global minimum of the free energy, for which the approximate marginals equal the marginals of the true posterior, $q(x_j \mid \pi) = f(x_j \mid \bar{o}, \pi)$ (Pearl, 1988; Yedidia et al., 2000). Note that the beliefs do not converge to the posterior $p(x_j \mid \bar{o}, \pi)$ according to the generative model but to the true posterior $f(x_j \mid \bar{o}, \pi)$. This means that the beliefs do not correspond to optimal predictions but are averaged over expected (preferred) future outcomes.
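For a single chain of hidden states without loops, the forward and backward messages described above reduce to a standard sum-product (forward-backward) pass. The sketch below is a generic illustration of belief propagation on a chain, not the policy-conditioned scheme derived in this letter:

```python
import numpy as np

def belief_propagation_chain(A, B, obs, prior):
    """Sum-product message passing on a hidden Markov chain.
    A[j, i] = p(h_{t+1} = j | h_t = i), B[o, i] = p(o_t = o | h_t = i).
    On a loop-free graph this yields the exact posterior marginals."""
    T, n = len(obs), len(prior)
    fwd = np.zeros((T, n))  # forward messages (unnormalized filtering)
    bwd = np.ones((T, n))   # backward messages

    fwd[0] = prior * B[obs[0]]
    for t in range(1, T):                      # carry information forward
        fwd[t] = B[obs[t]] * (A @ fwd[t - 1])
    for t in range(T - 2, -1, -1):             # pass information backward
        bwd[t] = A.T @ (B[obs[t + 1]] * bwd[t + 1])

    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

# Hypothetical two-state example
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
B = np.array([[0.8, 0.3],
              [0.2, 0.7]])
prior = np.array([0.5, 0.5])
marginals = belief_propagation_chain(A, B, obs=[0, 1, 1], prior=prior)
```

Because the chain has no loops, the returned marginals coincide with those obtained by brute-force enumeration of all state trajectories.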

### 2.5 Action Selection

We consider two limiting cases of action selection: *maximum selection*, in which the agent executes the next action of the most probable policy, and *averaged selection*, in which the agent selects the next action from posterior beliefs about control states obtained by averaging over policies. For simplicity, we restrict action selection to these two limiting cases. Note that it would be straightforward to introduce additional hidden variables that allow the agent to balance its behavior between model selection and model averaging (FitzGerald, Dolan, & Friston, 2014), as previously proposed in Friston et al. (2013).
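One plausible reading of these two limiting cases can be sketched as follows (the function and variable names are ours; `q_pi` stands for the posterior beliefs over policies $q(\pi)$):

```python
import numpy as np

rng = np.random.default_rng(1)

def select_action(q_pi, policies, t, mode="maximum"):
    """Two limiting cases of action selection (a sketch).
    q_pi[i]        : posterior probability of policy i,
    policies[i, t] : control state u_t prescribed by policy i."""
    if mode == "maximum":
        # execute the next action of the single most probable policy
        return policies[np.argmax(q_pi), t]
    # averaged selection: marginalize policy beliefs per control state,
    # then sample the next control state from that marginal
    n_u = policies.max() + 1
    p_u = np.zeros(n_u)
    for u in range(n_u):
        p_u[u] = q_pi[policies[:, t] == u].sum()
    return rng.choice(n_u, p=p_u / p_u.sum())

# Hypothetical example: three policies over two actions
policies = np.array([[0, 1],
                     [0, 0],
                     [1, 1]])
q_pi = np.array([0.5, 0.3, 0.2])
a_max = select_action(q_pi, policies, t=0, mode="maximum")
```

Under averaged selection, a sharper $q(\pi)$ (e.g., from a larger $\rho$) concentrates the action marginal, which is why the success rates in the results below depend so strongly on $\rho$ in that mode.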

### 2.6 Toy Environment

To illustrate and compare the goal-directed behavior that results from the above derived update equations based on the mean-field and Bethe approximation, we will use a navigation task in a $4 \times 4$ grid world. The agent's task is to navigate from a starting position (red-shaded square) to a goal position (blue-shaded square; see Figure 4). Although a simple task, it is complex enough to illustrate the differences between the two approximations and provide insights into the limitations of the mean-field approximation.

At each time step, the agent makes an observation $o_t$ that provides information about its current hidden state. In each state (node of the grid world), the agent can choose from $n_u = 4$ control states: go up, go down, go left, or go right. The task for the agent is to reach the goal state after making four choices. The number of time steps modeled in each run is $T = 5$. Note that if the agent is at a boundary, a movement in the direction of the boundary fails, and the agent will not change its position.
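The deterministic backbone of this grid world (before adding the observation and transition uncertainties described below) can be sketched as follows; the action encoding is our own choice:

```python
def grid_step(state, action, size=4):
    """Deterministic transition in a size x size grid world.
    Actions: 0 = up, 1 = down, 2 = left, 3 = right (our encoding).
    A movement into a boundary fails, leaving the position unchanged."""
    moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
    row, col = divmod(state, size)
    dr, dc = moves[action]
    r, c = row + dr, col + dc
    if 0 <= r < size and 0 <= c < size:
        return r * size + c
    return state  # movement into the boundary fails

# With four choices per run, there are 4**4 = 256 candidate policies.
```

Replacing the deterministic `grid_step` with a categorical transition distribution recovers the uncertain versions of the task.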

After making an observation, the agent has to infer current and past states and build expectations about future states and observations. This process corresponds to calculating the approximate posterior $q(\tilde{o}, h_{1:T} \mid \pi)$. Given the policy-dependent posterior, the agent evaluates the total free energy $F_\pi$ over all $N_\pi = 256$ possible policies. The total free energy defines the posterior beliefs over behavioral policies $q(\pi)$. In this specific environment, only six policies will lead to the goal state in the given time frame.

To illustrate the agent's behavior, we expose the agent to two different environments: (1) a grid world with varying observation uncertainty (see Figure 5a) and (2) a grid world with varying state transition uncertainty (see Figure 5b). With increasing observation uncertainty, the probability of making an observation associated with a neighboring state increases, while with increasing state transition uncertainty, the probability of remaining in the current state increases.

In this environment, to make the agent rely on inference about the state space in order to reach the goal, we place the initial state in the area with high observation uncertainty. Therefore, in the initial state and depending on the initial observation, the agent's beliefs will be distributed over the possible starting states. Whether the agent reaches the goal state strongly depends on the initial observation. Importantly, of the policies that lead to the goal, some lead through states with high observation uncertainty, while others lead through states with low observation uncertainty. An interesting question here is whether the agent more often follows policies that lead toward states with low observation uncertainty, that is, whether the agent tends to reduce its initial uncertainty about the state space.

## 3 Results

Here we present the behavioral differences between the Bethe approximation-based agent and the mean-field approximation-based agent for the two environments in the grid world. All presented cases were obtained as an average over 1000 runs in each environment.

### 3.1 Prior Preferences and Performance

A model parameter with a strong influence on the agent's behavior is the prior over future outcomes $\bar{p}(\tilde{o})$ (see equation 2.39). This prior defines the agent's preferences over future observations and modulates the predicted free energy of a behavioral policy (see equation 2.19). To investigate the impact of the prior preferences on the performance of the agents, we varied the value of the prior $\bar{p}(o_\tau = g) = \rho$ between 0.5 and 0.999 and estimated the corresponding average success rate, defined as the percentage of trials in which the agent is at the goal location at the last time step $T$.

In Figure 6, we show the resulting success rates as a function of prior preference $\rho $ in different conditions and action selection methods. Several patterns are clearly visible. First, the success rates of agents using averaged action selection (top row of Figure 6) increase strongly with an increasing $\rho $, while the success rates of agents using maximum selection (bottom row) remain mostly constant and at higher levels compared to averaged selection. Second, in the environment with observation uncertainty (left column), the Bethe agent achieves consistently higher success rates, independent of the action selection method. Finally, in the environment with state transition uncertainty, the success rates of the agents are closely matched, with a slight advantage of the mean-field agent using the averaged selection for high prior preferences. In what follows, we explain what gives rise to this specific pattern of performance differences between agents and action selection methods.

The influence of the prior preference $\rho$ on the success rates depends on the components that define the posterior beliefs over policies. The key factor that determines the value of the conditional free energy $F_\pi$, and with it the posterior $q(\pi)$, is the cross entropy $-\sum_{o_\tau} q(o_\tau \mid \pi) \ln \bar{p}(o_\tau)$. Hence, the ranking of the policies is independent of the value of the prior preference $\rho$; however, their relative probabilities change. In other words, in the case of maximum selection, the value of $\rho$ does not influence which policy is selected by the agent, whereas in the case of averaged selection, the relative value of different policies has an effect on action selection. Thus, an increasing $\rho$ under averaged selection makes the agents' behavior more goal directed and thereby more successful.
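The invariance of the policy ranking to $\rho$ can be checked with a small numerical sketch (assuming, as in a simple preference prior, that the residual preference mass is spread uniformly over the $n_o - 1$ nongoal observations):

```python
import numpy as np

def cross_entropy_score(q_goal, rho, n_obs=16):
    """Cross entropy -sum_o q(o | pi) ln p_bar(o) for a policy predicting
    the goal observation with probability q_goal (remaining predictive
    mass assumed to fall on nongoal observations, spread uniformly)."""
    p_other = (1.0 - rho) / (n_obs - 1)
    return -(q_goal * np.log(rho) + (1.0 - q_goal) * np.log(p_other))

q_goals = np.array([0.9, 0.5, 0.1])  # three hypothetical policies
scores_low = cross_entropy_score(q_goals, rho=0.6)
scores_high = cross_entropy_score(q_goals, rho=0.999)
# The ordering of the policies is identical for both values of rho,
# but the implied probabilities q(pi) ~ exp(-score) are not.
```

This mirrors the observation above: maximum selection is insensitive to $\rho$, while averaged selection is not.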

### 3.2 Prediction Accuracy

To pinpoint the reason for the large difference in the performance between the two agents in the environment with observation uncertainty, we looked into the posterior beliefs over policies evaluated in the first time step $t = 1$. Because the predicted probability of making the preferred observation in the final time step $q(o_T = g \mid \pi)$ is the main contributor to the probability $q(\pi)$ of following a policy, we examined whether the agents correctly predict that they will or will not reach the goal state when evaluating policies.

To do this, we calculated the true-positive and false-positive classification rates. When an agent correctly predicted reaching the goal state when evaluating one of the six policies that lead to the goal, we counted this as a true positive. When the agent incorrectly predicted reaching the goal when evaluating one of the remaining policies, we counted this as a false positive. Figure 7a shows the true-positive rate of both agents in the first time step. The Bethe agent has a 95% true-positive rate, meaning that when evaluating a policy that could lead to the goal, it almost always correctly predicts that the policy will be successful. In contrast, the mean-field agent has a true-positive rate below 60%, incorrectly classifying policies as not leading to the goal state despite their being successful policies. This low true-positive rate skews the approximate posterior $q(\pi)$, so that policies that would be good to follow have a low value, leading to erroneous behavior and explaining the second effect, the overall lower success rates.

In Figure 7b, the false-positive rates are shown. The Bethe agent has a false-positive rate close to 0%, whereas the mean-field agent always has a false-positive rate greater than zero. In other words, the Bethe agent almost never assigns nonzero probability to policies that do not lead to the goal state, whereas the mean-field agent predicts that some policies will lead to the goal when they do not. The false-positive rate of the mean-field agent increases with an increasing value of the prior preference over the goal outcome $\rho$. This gives rise to the third effect, the drop in success rates for high prior values $\rho$, as the agent will follow policies that cannot lead to the goal state.

These differences in performance of the two agents can be related to the sensitivity of the gradient descent procedure (see equation 2.25) to the initial conditions (see equation 2.26). Indeed, we observe that changing the initial conditions of the gradient descent influences the final solutions, and hence the performance of the mean-field agent. However, for different environments, different initial conditions are required to improve the performance of the mean-field agent. This points to an underlying issue of the mean-field approximation when applied to sequential inference: we found that the approximate posterior over future states can converge to impossible state-space configurations. Thus, the agent predicts that it will execute an impossible state transition (i.e., jump across the grid), which causes an erroneous evaluation of the posterior over policies and elicits unfavorable behavior.

Interestingly, the closely matched success rates and the higher success rate of the mean-field agent in the environment with state transition uncertainty can also be related to the prediction of impossible state transitions. Even in this environment, the mean-field agent accurately predicts the goal state only for the optimal policy (the path without transition uncertainty). For the majority of the other policies, we again observe predictions of impossible state transitions. This erroneous inference leads to a higher posterior value of the optimal policy, $q(\pi_{\mathrm{optimal}})$, which in effect improves the mean-field agent's performance, as it results in a higher probability of following the optimal policy when using averaged selection (see Figure 6b). Importantly, the higher $\rho$ is, the larger the penalty for policies predicted not to reach the goal state, which makes the mean-field agent better than the Bethe agent for the largest $\rho$.

### 3.3 Optimal Policy Selection

To illustrate the differences in agents' behavior in the two environments, we show in Figures 8 and 9 the average paths followed by the agent for the prior preference fixed to $\rho =0.999$.

In the case of the environment with observation uncertainty (see Figure 8), we see clear differences between the selected paths of the Bethe and mean-field agents. In contrast to the mean-field agent, the Bethe agent consistently follows only goal-reaching policies, producing a fairly symmetric selected-path structure. The slight bias toward policies going to the right is not a result of the agents' higher valuation of policies that reduce uncertainty about the state space; rather, it is due to the stochastic nature of the first observation and the subsequent difference in inference about the starting state. Indeed, we find that the initial uncertainty about the occupied state is passed on to predictions about future states, so that the entropy of the agent's estimate about future states does not decrease, even when evaluating a policy that contains an informative (low-uncertainty) state (see section 4).

Although the mean-field agent follows similar paths when reaching the goal state, it surprisingly selects policies that lead to the left, away from the goal. These are striking examples of trajectories where the agent falsely predicts that some policies will lead to the goal when they do not. The cause of this behavior is erroneous inference about the initial state in the presence of observation uncertainty, leading to false beliefs that the goal is not reachable from its initial state. When the agent believes that it is too far from the goal state, all policies are treated as equally likely, as the expectation is that none of them would lead to the goal state. This is why the agent chooses steps to the left even in maximum selection mode.

Interestingly, the false predictions of the mean-field agent (the convergence of posterior beliefs to impossible trajectories) are the main factor driving the behavior in the environment with observation uncertainty. Here, the agent's overconfidence about current policies and current states prevents it from switching to a different policy, even though the observations do not carry sufficient information. Furthermore, the mean-field agent shows a strong preference for policies leading through the high-uncertainty regime under maximum selection. We found that the reason for this lies in reduced convergence issues for beliefs over future states (i.e., a more accurate representation of state transition paths) when policies that lead through high-uncertainty regions are evaluated. The true-positive rate for these policies is higher than for other policies leading to the goal.

In the environment with state transition uncertainty (see Figure 9), the behavior of the two agents is very similar. Importantly, in the case of maximum action selection, both agents correctly value the path with the least uncertainty as optimal; hence, they always choose the optimal policy. In averaged selection mode, when actions are chosen by averaging over the values of policies, nonoptimal actions have a nonzero probability of being chosen. This effect increases with the number of policies. This causes a branching out from the optimal path and a subsequent drop in success rate. As discussed above, avoiding uncertainty is not a driving factor in the agent's evaluation of policies. Rather, policies are weighted according to the probability of reaching the goal state.
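The two selection modes can be sketched as follows. This is a minimal illustration with a hypothetical posterior over four policies; the values in `q_pi` and `first_action` are invented for the example and are not taken from the simulations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over four policies; each policy's first action.
q_pi = np.array([0.55, 0.25, 0.15, 0.05])
first_action = np.array([0, 0, 1, 1])   # action taken at the current step

def max_selection(q_pi, first_action):
    # Maximum selection: follow the single most probable policy.
    return first_action[np.argmax(q_pi)]

def averaged_selection(q_pi, first_action, rng):
    # Averaged selection: marginalize policy probabilities onto actions,
    # then sample an action from the resulting distribution.
    n_actions = first_action.max() + 1
    p_a = np.array([q_pi[first_action == a].sum() for a in range(n_actions)])
    return rng.choice(n_actions, p=p_a)

print(max_selection(q_pi, first_action))            # always action 0
print(averaged_selection(q_pi, first_action, rng))  # action 0 w.p. 0.8, action 1 w.p. 0.2
```

Under averaged selection, every action backed by at least one nonzero-probability policy can be sampled, which is the branching-out effect described above.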

In summary, we found severe drawbacks in the mean-field agent's planning process. When inferring the future states for a given policy, the agent's beliefs would converge to impossible configurations of future states. In our formulation, both forward and backward messages shape the beliefs about the future. Such a setup leads to multimodal true posteriors as a result of a divergence between the forward and backward predictions. Under the gradient-descent procedure used here (see equation 2.25) for the mean-field agent, its beliefs settle around one of the modes (local optimum of the free energy). Since the value and probability of a policy are determined by the predicted probability of reaching the goal state, inaccurate beliefs about future states lead to inaccurate posterior beliefs over possible policies. Depending on the environment, this anomalous inference can lead to either reduced or increased performance of the mean-field agent.

The Bethe agent in our simulations, however, always accurately predicted future states given past observations, as the pairwise statistical dependencies explicitly prevent such a divergence. Furthermore, the beliefs maintained the multimodality stemming from the superposition of the forward and backward messages. As the convergence of the beliefs to the true posterior is guaranteed under the belief propagation update rules, the Bethe agent will always optimally predict future states. A correct prediction of the probability of reaching the goal state automatically leads to a more accurate policy evaluation as compared to the mean-field agent.
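As a minimal sketch of the message-passing scheme, consider the sum-product updates on a hypothetical two-state, three-step hidden Markov chain (a toy model of our own construction, not the task model used in the simulations). Because belief propagation is exact on chain-structured models, the resulting marginals can be checked against exhaustive enumeration:

```python
import numpy as np
from itertools import product

# Hypothetical chain: prior p(s_1), transitions p(s_{t+1}|s_t),
# likelihoods p(o_t|s_t), and an observed sequence o = (0, 1, 1).
prior = np.array([0.6, 0.4])
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])          # A[i, j] = p(s_{t+1}=j | s_t=i)
lik = np.array([[0.9, 0.1],
                [0.2, 0.8]])        # lik[i, o] = p(o | s=i)
obs = [0, 1, 1]
T = len(obs)

# Forward (alpha) and backward (beta) messages of sum-product BP.
alpha = np.zeros((T, 2)); beta = np.ones((T, 2))
alpha[0] = prior * lik[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t-1] @ A) * lik[:, obs[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (lik[:, obs[t+1]] * beta[t+1])

# Posterior state marginals are proportional to alpha * beta.
marg_bp = alpha * beta
marg_bp /= marg_bp.sum(axis=1, keepdims=True)

# Brute-force check by enumerating all 2^T state sequences.
marg_exact = np.zeros((T, 2))
for s in product(range(2), repeat=T):
    p = prior[s[0]] * lik[s[0], obs[0]]
    for t in range(1, T):
        p *= A[s[t-1], s[t]] * lik[s[t], obs[t]]
    for t in range(T):
        marg_exact[t, s[t]] += p
marg_exact /= marg_exact.sum(axis=1, keepdims=True)

print(np.allclose(marg_bp, marg_exact))  # True
```

The combination of forward and backward messages is what preserves multimodality: each marginal reflects both past observations and constraints propagated back from future (e.g., goal) states.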

## 4 Discussion

We revisited a specific solution of planning as inference for modeling goal-directed behavior given by the active inference framework, where posterior beliefs about hidden states, future observations, and policies are obtained by minimizing the variational free energy. Importantly, we provide an alternative approach to the derivation of the key update equations of active inference agents. In contrast to previous formulations of active inference, the agent's behavior aims at minimizing the expectation over the predicted free energy instead of the expected free energy as postulated previously (Friston et al., 2015; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016). This allowed us to reveal the effects of the mean-field approximation in the face of uncertainty. In future work, we will investigate and compare behavior that results from both formulations.

Besides the typically used mean-field approximation (Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016; Friston et al., 2015) we provide a variational treatment of planning as inference based on the Bethe approximation. In contrast to the mean-field approximation, under which statistical independence of hidden variables is assumed, the Bethe approximation assumes pairwise statistical dependencies between hidden variables in the approximate posterior. To demonstrate the key differences between acting agents based on the Bethe approximation and the mean-field approximation, we have designed two illustrative toy environments in which the agents had to perform a multitrial goal-reaching task while being exposed to either observation uncertainty or state transition uncertainty. We found that assuming pairwise statistical dependence between hidden variables improves an agent's inference of hidden states. This leads to more accurate predictions about the future and, consequently, evaluation of policies. These improvements resulted in more optimal goal-directed behavior and higher success rates.
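For a hidden-state sequence $s_{1:T}$ on a chain, the two approximate posteriors can be summarized as follows (standard forms, written in our own schematic notation):

```latex
% Mean-field: fully factorized posterior over hidden states
q_{\mathrm{MF}}(s_{1:T}) = \prod_{\tau=1}^{T} q(s_\tau)

% Bethe: pairwise beliefs over neighboring states, divided by single-node
% beliefs to correct for overcounting (exact on chain- and tree-structured models)
q_{\mathrm{B}}(s_{1:T}) = \frac{\prod_{\tau=1}^{T-1} q(s_\tau, s_{\tau+1})}{\prod_{\tau=2}^{T-1} q(s_\tau)}
```

The pairwise beliefs $q(s_\tau, s_{\tau+1})$ are exactly what allows the Bethe posterior to represent which state transitions are possible, while the mean-field form cannot express any dependence between consecutive states.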

In the environment with observation uncertainty (see Figure 5a), the state estimation was dependent on noisy observations. This environment illustrates a condition in which goal-directed behavior is generated under limited information about the current state of the environment. For example, in a maze task, an agent might not know exactly where it is due to ambiguity in the environment. Here, the Bethe agent showed consistently and dramatically higher success rates in goal-reaching behavior due to a more robust, policy-dependent inference of past, current, and future states and observations. We linked the low success rates of the mean-field agent to the erroneous formation of beliefs about hidden states. This misrepresentation of hidden states is caused by the convergence of posterior beliefs to configurations that are impossible under any given policy. This is due to the fact that agents infer the sequence of most probable states rather than the most probable sequence. When dealing with inference under uncertainty, the true posterior is often a multimodal distribution. However, under the gradient descent procedure used here, the posterior beliefs mostly converged to unimodal distributions, so that one of the peaks of the true multimodal distribution becomes enlarged while all other peaks vanish. As a result, the agent either misrepresents uncertainties, so that its beliefs represent only the most likely state, or the agent predicts states that are impossible from the perspective of forward planning but are likely from the perspective of backward planning (i.e., going from the goal state backward). Due to the overconfidence in beliefs over current states and expectations about future states, the mean-field agent cannot recover from an initial erroneous inference, even after sampling further observations that would otherwise support more accurate beliefs over hidden states; this results in an erroneous evaluation of behavioral policies.
In contrast, the Bethe agent was able to rapidly adjust its evaluation of policies even if it had been misled by a noisy first observation.
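The distinction between the sequence of most probable states and the most probable sequence can be made concrete with a toy joint posterior over two consecutive states (the numbers are invented for illustration): taking the per-timestep argmax of the marginals can select a trajectory that has zero probability under the joint.

```python
import numpy as np

# Hypothetical joint posterior over two consecutive states (3 states each).
# Rows index s1, columns index s2.
joint = np.array([[0.0, 0.2, 0.2],
                  [0.2, 0.1, 0.0],
                  [0.2, 0.0, 0.1]])

m1 = joint.sum(axis=1)   # marginal over s1: [0.4, 0.3, 0.3]
m2 = joint.sum(axis=0)   # marginal over s2: [0.4, 0.3, 0.3]

# "Sequence of most probable states": per-timestep argmax of the marginals.
seq_marginal = (int(np.argmax(m1)), int(np.argmax(m2)))

# "Most probable sequence": argmax of the joint distribution.
i, j = np.unravel_index(np.argmax(joint), joint.shape)

print(seq_marginal, joint[seq_marginal])  # (0, 0) 0.0 -> an impossible trajectory
print((int(i), int(j)), joint[i, j])      # (0, 1) 0.2 -> a valid trajectory
```

This is the same failure mode described above: each marginal is individually correct, yet the combination of their modes corresponds to a state transition that cannot occur.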

Although a possible remedy for the mean-field agent may be to adapt the initial conditions in the gradient descent optimization procedure, these initial conditions would most likely be, as we found for our simulations, environment- and task-specific. Another way to resolve this issue for the mean-field agent might be to use a more sophisticated method than a simple gradient descent. It would also be possible to base the predictions only on the forward inference process (as done in previous work: Friston et al., 2015; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016), instead of combining forward and backward inference. While this would lead to more accurate predictions of the future and possibly fewer convergence issues, it would strip the agent of the possibility to infer which states are on its way to the goal. We found that the Bethe approximation provides a principled solution, as it is able to capture the temporal structure of the environment, and convergence to the global optimum is guaranteed for the tree-structured (chain) models that arise in sequential decision tasks.

In the environment with state transition uncertainty (see Figure 5b), hidden states were directly observable, but actions were executed stochastically. Here, the effect of erroneous state-space representation on success rates of the mean-field agent was reduced, in comparison to the environment with observation uncertainty. Both agents avoided high-uncertainty regions, illustrating that the driving factor in goal-directed behavior is the predicted probability of reaching the goal state.

Such avoidance of high-uncertainty states was not seen in the observation uncertainty condition, showing that agents do not intrinsically value informative states in our formulation using the predictive free energy, in contrast to previous formulations of active inference (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Schwartenbeck et al., 2015). Visiting a state associated with low observation uncertainty can be interpreted as information gathering, as the observation would be more informative about the underlying hidden state. We did not observe this behavior in the agents, which we relate to the fact that initial uncertainty about the state space is passed on to predictions about future states, keeping the expected entropy of a future state high and thereby making such a state not more valuable to the agent. In previous work on active inference (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Schwartenbeck et al., 2015), policy evaluation was done using a prior over policies defined using the expected free energy. The expected free energy contains a term evaluating the epistemic value (the informativeness of an action) of each policy. Using the expected free energy, agents follow informative policies with high epistemic value, meaning they tend to visit states with low observation uncertainty. As the epistemic value term does not arise under the derivation presented here (see section 2.4.1), where we derived a policy evaluation based on the predicted free energy, it is not surprising that we do not observe such behavior (see the appendix for details on the expected free energy).

The formulation of active inference under the mean-field ansatz has previously been put forward as a process theory of neuronal function (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016). Furthermore, Friston, Parr, and de Vries (2017) recently proposed a neuronal connection scheme for belief propagation update rules under active inference. However, the authors considered a modified belief propagation scheme in which the conditional dependencies among hidden states are ignored, hence allowing them to obtain update rules using the mean-field approximation. Under the Bethe approximation, the interpretation in terms of neural coding does not necessarily change and can be linked to past work on possible implementations of belief propagation in neuronal networks. For example, Shon and Rao (2005) and Ott and Stoop (2006) demonstrated an implementation of belief propagation using a neuronal network in cases when the generative model contains only pairwise interactions (like Bayesian graphs or Markov random fields). In this formulation, neurons are interpreted as nodes of the graph of the generative model and connections as conditional probabilities. In this scheme, the intuitive idea is that the activation of neurons encodes the beliefs about hidden variables, while the messages are transmitted by neural signal transmission. Similarly, Deneve (2004) showed as a proof of principle that inference based on belief propagation can be implemented in a network of spiking neurons. Interestingly, following this line of work, Lee and Mumford (2003) and Jardri and Denève (2013) discussed a possible link between belief propagation in cortical networks and optical illusions and hallucinations.

A potential issue with neuronal implementation of belief propagation arises when the generative model becomes more complex than the one used in this work, for example, when it requires interactions among more than two variables. Mathematically, the Bethe approximation and the resulting belief propagation update equations scale well to these more complex models. However, in this case, the mapping of conditional beliefs and messages to neuronal architecture becomes more challenging and is subject to ongoing discussion. It might be necessary to have an extra neuronal pool to calculate the messages (George & Hawkins, 2009; Steimer, Maass, & Douglas, 2009).

An example of a more complex model is a hierarchical generative model (Friston, Rosch, Parr, Price, & Bowman, 2017). Here, a mixture of approximate representations of the posterior could be used. In this case, different levels of the hierarchy could be represented independently in the posterior (mean-field approximation), and pairwise interactions would be captured only within the same levels of representation (Bethe approximation). Additionally, learning principles have recently been introduced to active inference (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016), which could easily be combined with the Bethe approximation. It would be interesting in the future to explore whether the appropriate factorization of the posterior can be learned over time, which could lead to an emergence of the most effective approximation of a task environment.

In summary, we have presented a method for incorporating belief propagation within the active inference framework using the Bethe approximation. The presented update equations of the active inference framework complement past work (Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Friston, FitzGerald, Rigoli, Schwartenbeck, O'Doherty et al., 2016; Friston et al., 2015) and extend, in principle, the application range of active inference to complex behavioral tasks with various sources of uncertainty.

## Appendix: Relation between the Predicted and Expected Free Energy

In this formulation, the expected free energy contains two terms. The first term encodes the extrinsic value of a policy, as it is minimized when the agent predicts that a specific policy will fulfill the prior expectations over future outcomes. The second term defines the expected ambiguity, that is, expected observational uncertainty at future time steps $\tau $. This term is minimized when an agent visits informative states.
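A common decomposition of the expected free energy at a future time step $\tau $ under policy $\pi $, following, for example, Friston et al. (2015), can be written schematically as (notation ours):

```latex
G(\pi, \tau) =
\underbrace{D_{\mathrm{KL}}\!\left[\, q(o_\tau \mid \pi) \,\|\, p(o_\tau) \,\right]}_{\text{extrinsic value (risk)}}
\; + \;
\underbrace{\mathrm{E}_{q(s_\tau \mid \pi)}\!\left[\, H\!\left[\, p(o_\tau \mid s_\tau) \,\right] \,\right]}_{\text{expected ambiguity}}
```

The first term is minimized when predicted outcomes match the prior preferences $p(o_\tau)$; the second is minimized by visiting states whose outcome distributions have low entropy, that is, informative states.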

However, under the predicted free energy, all terms but the norms of the messages cancel out (see equation 2.36) once the results for the approximate posterior are inserted. These norms can be interpreted as a trial-dependent surprise, encoding the discrepancy between the forward planning and the prior expectations over future outcomes. With the predicted free energy, independent of the decomposition, the probability of reaching the goal state is the driving factor for agent behavior.

Importantly, simulating agent behavior using the expected rather than the predicted free energy leads to a relative tendency to choose paths toward states with low observation uncertainty. When these states are visited, an observation is more informative about its underlying hidden state. An agent thereby reduces its uncertainty about its current state. In future work, we will investigate whether we can recover this information-seeking behavior with the formalism based on the predicted free energy.

## Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (SFB 940/2, Projects A9 and Z2).