Abstract
Complex information processing systems that are capable of a wide variety of tasks, such as the human brain, are composed of specialized units that collaborate and communicate with each other. An important property of such information processing networks is locality: there is no single global unit controlling the modules, but information is exchanged locally. Here, we consider a decision-theoretic approach to study networks of bounded rational decision makers that are allowed to specialize and communicate with each other. In contrast to previous work that has focused on feedforward communication between decision-making agents, we consider cyclical information processing paths allowing for back-and-forth communication. We adapt message-passing algorithms to suit this purpose, essentially allowing for local information flow between units and thus enabling circular dependency structures. We provide examples that show how repeated communication can increase performance given that each unit’s information processing capability is limited and that decision-making systems with too few or too many connections and feedback loops achieve suboptimal utility.
1 Introduction
A fundamental organizing principle in any complex system is modularity. Whether it is cells coming together to form organisms, diverse neurons comprising brains, or humans establishing relationships and creating firms and societies, the shared characteristic is that organized groups have the capacity to accomplish tasks that are beyond the reach of any single individual. This idea was captured and studied extensively in a wide variety of fields, including artificial intelligence (Amer & Maul, 2019; Ellefsen et al., 2015), cognitive science (Fodor, 1983; Katahira et al., 2012), psychology (Bechtel, 2003; Samuels, 2012), management theory (Langlois, 2002), and evolutionary biology (Constantino & Daoutidis, 2019). Organized group behavior is often modeled in terms of a network of decision-making units that operate with constrained resources while optimizing a global objective through local interactions and communication. The global objective is usually specified by a shared utility function, sometimes supplemented with a cost function that captures information processing resources, thus forcing a trade-off between maximizing utility and minimizing cost.
Previously, we have studied networks of bounded rational information processing decision makers that implement such a trade-off between a global utility function and a local information processing cost in each decision node. So far we have focused on strictly hierarchical models with feedforward information flow (Genewein et al., 2015; Gottwald & Braun, 2019). To capture the full potential of such systems, however, we need to allow for recurrent communication protocols, which also requires a new mechanism that allows for the optimization of cyclical communication paths. In this article, we study how far belief propagation algorithms can be adapted to serve such a purpose. We consider an arbitrary network of bounded rational decision-making agents collaborating to solve a problem defined by a single utility function. Each agent operates autonomously, focusing on solving a local problem that contributes to the overarching task. However, as the whole system of agents optimizes a global utility function, information about other agents has to be transferred throughout the network. To this end, we combine the advantages of the belief propagation algorithm with the bounded rationality framework.
Belief propagation is a message-passing algorithm for inference in graphical models that exploits the graphical model structure to determine the marginal distributions of all the unobserved variables in the model, given the values of all observed variables (Pearl, 1988, 2013; Yedidia et al., 2003, 2005; Rehn & Maltoni, 2014; Steimer et al., 2009; Straszak & Vishnoi, 2019; Wainwright & Jordan, 2008). Belief propagation has found a wide range of applications: it is applied in computer vision for image segmentation and object recognition (Yan et al., 2023; Carbonetto et al., 2004), in bioinformatics for protein structure prediction and network analysis of biological data (Soni et al., 2010; Isci et al., 2013), and in social network analysis for identifying communities and influential nodes (Savage et al., 2014; Chunaev, 2020). In the context of bounded rationality, we might think of inference as the process of determining the probability distributions required for action sampling.
One central feature of belief propagation is its widespread applicability, including different types of graphical models, and its natural extension from tree graphs to graphs containing cycles. Cyclic graphs have gained significant attention in various research areas due to their ability to capture complex dependencies and dynamic relationships. In the field of machine learning, recurrent neural networks have emerged as powerful models for sequential data processing, utilizing cyclic connections to propagate information through time steps (Sak et al., 2014). This has paved the way for advancements in natural language processing, speech recognition, and time series analysis (Hongmei et al., 2013; Schmidt & Murphy, 2012). These models leverage cyclic connections to model contextual dependencies and improve accuracy. The exploration of cyclic graphs in various research domains continues to drive innovation and facilitate the development of more sophisticated models.
In what follows, we provide a short summary of related work in section 2 and an overview of the technicalities of bounded rational decision making in section 3, starting from preliminary comments on the (bounded) rational decision-making problem up until the organization of multiple decision-making units into multistep feedforward decision-making architectures. We also provide a description of a belief propagation algorithm, namely, the sum-product algorithm, and explain how we weave this algorithm into the bounded rational information processing framework to increase the efficiency of an otherwise often intractable update rule. Section 4 then generalizes this update rule to arbitrary graphical models, dismissing the necessity of a strictly feedforward information flow and allowing for repeated correspondence between decision nodes. In section 5, we give examples for the different information processing methods described in the previous section and provide evidence that decision-making processes containing loops outperform corresponding feedforward structures on various tasks, especially when information-theoretic resources for the decision makers involved are sparse. In section 6, we discuss our findings, and section 7 concludes the article.
2 Related Work
2.1 Bounded Rationality
One prominent approach that deals with utility-information trade-offs in complex decision networks is bounded rationality. Herbert Simon (1943, 1955) argued that bounded rational individuals face information-processing constraints when making decisions, which can result in deviations from models of perfect rationality. Simon’s work on bounded rationality catalyzed interdisciplinary studies in economics (Visco & Zevi, 2020), management (Schiliro, 2012, 2013), psychology (Viale et al., 2023), and sociology (Bögenhold, 2021), revealing how limited information processing capabilities affect decision making. It also influenced the development of algorithms in computer science and artificial intelligence to mimic human decision making in areas like automated reasoning and machine learning (Hüllermeier, 2021).
Information-theoretic notions of bounded rationality (Ortega & Braun, 2013; Ortega et al., 2015; Mattsson & Weibull, 2002; Sims, 2003; Tishby & Polani, 2010; Friston, 2010; Still, 2009; Todorov, 2009; Bhui et al., 2021; Kappen et al., 2012; Wolpert, 2004; Leibfried & Braun, 2016; McKelvey & Palfrey, 1995) generalize the maximum expected utility principle by including an additive computational cost in the form of a Kullback-Leibler divergence that measures deviations from prior behavior. If the prior is optimally chosen, then this cost is the mutual information between input and output. The resulting behavior trades off expected utility and computational cost, as represented by the optimal posterior taking the form of a Boltzmann distribution with an inverse temperature parameter β, interpolating between a completely rational decision maker (low temperature) and a decision maker that does not deviate from its prior behavior (high temperature).
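Written out explicitly (with illustrative notation, since the cited works vary in their symbols), the trade-off and its Boltzmann-form optimizer read:

```latex
% Free-energy trade-off for a single decision step:
% expected utility minus information cost at inverse temperature beta
F[P] \;=\; \sum_{a} P(a \mid w)\, U(w,a)
\;-\; \frac{1}{\beta} \sum_{a} P(a \mid w) \log \frac{P(a \mid w)}{p(a)}

% Maximizing F over P yields a Boltzmann distribution:
P^{*}(a \mid w) \;=\; \frac{p(a)\, e^{\beta U(w,a)}}{Z(w)},
\qquad Z(w) \;=\; \sum_{a'} p(a')\, e^{\beta U(w,a')}

% beta -> 0: P* -> p (prior behavior);
% beta -> infinity: P* concentrates on utility-maximizing actions.
```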
2.2 Multiagent Systems
A common way to deal with complex tasks is splitting them up into multiple simpler subtasks. In the realm of artificial intelligence and computing, multiple approaches have emerged from this idea to solve intricate problems. Multiagent optimization harnesses the collective intelligence of multiple autonomous agents that collaborate to find best solutions, where each agent typically possesses its own objectives and capabilities (Cerquides et al., 2014; Lobel et al., 2011; Terelius et al., 2011). Communication between agents allows for shared information and coordination needed for overcoming problems arising from limited knowledge about the overall system (Shirado & Christakis, 2017).
The approach of decentralized control distributes decision-making authority across various different agents, ensuring a resilient and adaptive system (Bakule, 2008; Yan et al., 2014; Schilling et al., 2021). Each agent in a decentralized system acts autonomously based on local information without relying on a central source of information or utility. Another approach that leverages decentralization, federated machine learning, consists of a method to train a model across multiple decentralized devices while keeping data localized (Yang et al., 2019; Wang et al., 2022). In a recent publication, Friston et al. (2023) extended the federated learning framework by introducing the concept of belief sharing based on active inference among agents and showed how communication enhances the collaborative learning process. Multiagent active inference was also previously used to describe the relationship between individual and collective inference in multiagent systems (Heins et al., 2022).
2.3 Game Theory
In the field of game theory, coordination games model interactions between players who benefit from aligning their choices with one another. In such games, communication is often a key ingredient for successful cooperation, as many experimental studies have pointed out (Cooper et al., 1992; Cooper, 1999; Cason et al., 2012). One solution concept used in the context of coordination games is rationalizability. It is based on rationality and the common belief in rationality and allows for uncertainty or incomplete information about the actions of other players (Bergemann & Morris, 2017). Furthermore, it has been examined to what extent players can find best responses to previous actions played by other players with whom they are connected within a network (Jackson & Watts, 2002; Watts, 2001).
2.4 Belief Networks for Action Selection
Classically, inference is applied to unobserved state-like random variables (considered hidden causes of observations), and actions are treated as (nonprobabilistic) model parameters. However, in the more recent literature, actions are often treated as random variables themselves, effectively transforming influence diagrams into Bayesian networks. In control as inference (Toussaint & Storkey, 2006; Kappen et al., 2012; Todorov, 2008; Toussaint, 2009; Levine, 2018), the analogous treatment of actions and state variables is also applied to inference, which is not only performed over state variables but also over actions. This allows the application of a vast amount of available inference techniques, such as exact Bayesian inference using conjugate priors, approximate inference using variational free energy optimization, and other methods (Levine, 2018). Formulating decision making as an inference problem also enables robust, adaptive, and probabilistic modeling (Shachter & Peot, 1992), allowing for uncertainty quantification (Bratvold et al., 2010) and the incorporation of prior knowledge (Huang et al., 2012). Moreover, approximate variational inference over actions is used in active inference and the (variational) free energy principle (Friston et al., 2017; Schwöbel et al., 2018; Millidge et al., 2021; Mitchell et al., 2019; Solway & Botvinick, 2012; Parr et al., 2019) to discuss the biological plausibility of inference mechanisms.
Note that while the variational free energy over action distributions and the free energy used in information-theoretic bounded rationality are formally equivalent under certain conditions (see section 6), they have different use cases and therefore appear in different scenarios: the latter is the result of trading off performance (usually measured in terms of expected utility) against informational costs, whereas the former, the variational free energy, is used in variational inference to approximate Bayes posteriors (see Gottwald & Braun, 2020, and Gershman, 2019, for a detailed comparison).
2.5 Message Passing
Message-passing algorithms, such as the classic belief propagation algorithm by Pearl (1988), are computational tools leveraging the structure of the underlying Bayesian network to efficiently perform inference, both for exact inference in tree-like graphs, as well as for approximate inference in graphs with loops (Yedidia et al., 2003, 2005).
Therefore, when performing inference over actions, it has also been shown to be a valuable tool for control as inference (Toussaint, 2009; Levine, 2018). Additionally, such message-passing schemes are examined for their potential to serve as inference descriptions in biological networks as an attempt to explain neural processing (Friston et al., 2017; Schwöbel et al., 2018; Parr et al., 2019).
2.6 Energy-Based Neural Models and Local Interactions
Local update rules and local interactions between units form the basis of some fundamental energy-based neural network models like Hopfield networks (Hopfield, 1982) and Boltzmann machines (Ackley et al., 1985). In classical and modern Hopfield networks, the updated state of a neuron depends on the states of its neighbors, enabling the storage and retrieval of patterns with low error (Tolmachev & Manton, 2020; Millidge et al., 2022), parameter estimation (Fazzino et al., 2021), and new approaches to convolutional neural networks (Miconi, 2021) and deep learning (Ramsauer et al., 2020). Boltzmann machines can be seen as the stochastic counterpart of Hopfield networks, as their global energy is identical in form to that of Hopfield networks and the update to each unit depends on probability distributions associated with its neighbors (Agliari et al., 2013; Osogami, 2017; Ota & Karakida, 2023).
3 Preliminaries
In this section we give a brief motivation and overview of previous formulations for bounded rational decision making with information constraints. This will provide the conceptual framework and the corresponding terminology required for section 4.
In the following, we consider a decision maker to be a mechanism of at least one (decision-making) unit that observes an input, adapts to the observation, and reacts accordingly to the best of its ability. In the case of a decision maker X, this mechanism can be described by specifying its observation w, its possible actions a, and the posterior probabilities P(a|w) of choosing action a given observation w (see Figure 1 for an illustration). We refer to the probability p(a) as the decision maker’s prior (choice) probability over actions, which can be thought of as the probability of making a blind guess for an action should the observation not be available. The utility function U is a measure of value or preference and assigns a real-valued scalar U(w, a) to each observation-action pair.
Unless stated otherwise, we use upper-case letters such as X to denote random variables, which are assumed to take values in finite sets. For simplicity, the decision maker who decides about random variable X is also simply referred to as X. The upper-case letter P is used for posterior probability distributions, the lower-case letter p is used for joint and prior distributions, and Δ denotes the set of all probability distributions defined on a given finite set.
3.1 Motivation
Consider a utility function U(w, a) whose value depends on an observation w and an action a. In the absence of further constraints, the optimal choice probability concentrates its probability mass on those actions that maximize the utility U(w, a) for a given observation w. In the deterministic case, we could simply devise an optimal mapping that assigns to each observation w an action a*(w) that maximizes the utility over the space of possible actions.
In the following, we are interested in distributed decision-making systems consisting of multiple decision-making units. For example, if we get just one more decision maker involved in the process, we could form a serial chain of decision makers that passes the observation through the first decision maker to the second. Without further constraints, adding a middleman does not add any capabilities and, in fact, can only make matters worse due to the data processing inequality, as a single decision maker can already represent any mapping from input to output. However, if we restrict the class of permissible mappings each individual decision maker can implement, we can gain representational power, as is well known in the case of multilayered feedforward networks.
If we were to allow for multiple duplicates of the decision maker, we could also use an additional decision maker as a selector or indicator that assigns different subsets of observations to different instantiations of the original decision maker. Each of the instantiations could then be trained separately and specialize in its subset of observations. In such a parallel information processing architecture, we would essentially end up with a mixture-of-experts system that can store diverse prior knowledge through different expert decision makers.
In this article, we are particularly interested in distributed decision-making systems with recurrences or loops. A simple example consists of two decision makers X1 and X2, where we allow for a recurrent sequence of updates between X1 and X2 with distributions P(x1 | x2) and P(x2 | x1). Such recurrent updating is reminiscent of alternating optimization schemes, where subsets of parameters are optimized in an alternating fashion, or of Gibbs sampling, where we repetitively sample from conditional distributions to generate samples from a joint distribution (Geman & Geman, 1984). The underlying reason that repetitive sampling can lead to improved performance is that the partial optimization and sampling processes that can be achieved by a single information processing step are not independent of the other partial information processing steps; otherwise, a single forward sweep might be sufficient. Thus, if there is no single agent that is powerful enough to optimize jointly, multiple agents that are only capable of partial optimization have to alternate. In the following, we are interested in exploring the power of recurrent information flow in the context of bounded rational decision-making networks by considering two scenarios. In the first scenario, we assume that all decision nodes are optimally equilibrated at each update step. Accordingly, when a node is part of a loop in the decision network, the same node is updated multiple times and will be represented by a different distribution after each update. In the second scenario, we consider a continual updating scheme, where each decision node is always represented by its latest update.
3.2 One-Step Bounded Rational Decision Making
The parameter β interpolates between a nonadaptive decision maker whose posterior probability matches its prior probability for every observation (β = 0) and a fully rational decision maker (β → ∞) that always selects the action with the highest expected utility. Hence, whenever the utility U(w, ·) has a unique maximum, a rational agent’s posterior is a Dirac distribution centered on the optimal action a*(w), assigning zero probability to all other actions. For every β, optimizing the objective 3.1 forces a trade-off between the expected utility and the transformation cost given by the relative entropy between posterior and prior. This type of optimization objective has the form of a free energy that arises from constraints (Genewein et al., 2015; Gottwald & Braun, 2019, 2020).
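A minimal numerical sketch of this interpolation (the utilities, prior, and β values below are made up for illustration):

```python
import numpy as np

def boltzmann_posterior(utility, prior, beta):
    """Bounded rational posterior P(a|w) proportional to p(a) * exp(beta * U(w, a)).

    utility: (num_observations, num_actions) array U(w, a)
    prior:   (num_actions,) array p(a)
    beta:    inverse temperature (0 = prior behavior, large = rational)
    """
    weights = prior[None, :] * np.exp(beta * utility)
    return weights / weights.sum(axis=1, keepdims=True)

# Made-up utility with 2 observations and 3 actions.
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
prior = np.ones(3) / 3.0

p_low = boltzmann_posterior(U, prior, beta=0.0)    # equals the prior
p_high = boltzmann_posterior(U, prior, beta=50.0)  # nearly deterministic
```

For β = 0 the posterior coincides with the prior for every observation; for large β it concentrates on each observation’s utility-maximizing action.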
Example. Imagine a teacher asking a student a question in an exam. For simplicity, there are 10 different possible questions that can roughly be categorized into two topics. For each question, there is a single best response that is rewarded with the best grade; all other responses that match at least the general topic get lower grades, and answers that do not even match the topic get the worst grades. In this scenario, the question asked by the teacher represents the observation, the student is the decision maker who has to find a good answer, and the student’s preparation can be regarded as the procedure of information processing. The parameter β then describes the student’s capabilities, where a higher value of β corresponds to a better student. Figure 1 illustrates this example using a graphical model (see Figure 1a) to describe the student’s answer depending on the question asked by the teacher. Figure 1b shows an exemplary utility function where high utility (dark blue) corresponds to better grades and low utility (white, light blue) corresponds to lower grades. Figure 1c displays the average utility gained by students with a certain parameter β, and Figures 1d and 1e show the posterior probabilities of the responses of a student with low and a student with high β.
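The example can be simulated with a hypothetical utility function (the grade values 1.0 / 0.5 / 0.0 and the five-questions-per-topic layout are our own illustrative assumptions, not read off Figure 1):

```python
import numpy as np

# 10 questions and 10 candidate answers, two topics of 5 each.
n = 10
topic = np.repeat([0, 1], 5)                  # topic of each question/answer
same_topic = topic[:, None] == topic[None, :]
# Best answer -> 1.0, same topic -> 0.5, wrong topic -> 0.0 (illustrative).
U = np.where(np.eye(n, dtype=bool), 1.0, np.where(same_topic, 0.5, 0.0))

def average_utility(beta):
    """Expected utility of a student with capability beta over uniformly
    drawn questions, assuming a uniform prior over answers."""
    posterior = np.exp(beta * U)              # uniform prior cancels out
    posterior /= posterior.sum(axis=1, keepdims=True)
    return float((posterior * U).sum(axis=1).mean())
```

Evaluating average_utility over a range of β values reproduces the qualitative picture described for Figure 1c: better students (higher β) earn higher average grades.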
3.3 Multistep Bounded Rational Decision Making
The notion of a single decision maker from the previous section can be generalized to a multistep decision-making system by allowing multiple decision makers to be involved in the decision-making process (see Figure 2 for an example). Again, starting with an observation and resulting in an action, the decision-making process is decomposed into multiple intermediate steps. Each intermediate step is carried out by a new decision maker, and the multistep system is characterized by the connections between the different decision makers as given by the prior and posterior models.
Following Gottwald and Braun (2019), variables in the multistep decision-making process are labeled according to the sequence of information flow; that is, the decision-making system is characterized by a set of random variables X0, X1, …, XN, such that Xj can obtain information about Xi only if i < j. Thus, X0 does not depend on the output of any other decision maker and none of the decision makers depend on the output of XN, meaning that X0 corresponds to the observation and XN corresponds to the action output of the system. In other words, there is a (not necessarily unique) linear ordering of the variables involved in the decision-making process reflecting the information flow between the decision makers, which we discuss below.
A specific multistep architecture can be visualized by a directed graph, such that its nodes correspond to the decision makers involved in the decision-making process, its arcs describe the dependencies, and there is a topological sort of its nodes that corresponds to the linear ordering of the decision makers given by the information flow. In other words, in this topological sort, the graph is traversed in a way that ensures each node is encountered only after all its dependencies have been visited. Although generally the topological sort of a directed graph is not unique, we describe the decision-making process by defining prior and posterior models that correspond to one viable topological sort.
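Such a topological sort can be computed with Kahn’s algorithm; a generic sketch (the node labels below are hypothetical):

```python
from collections import deque

def topological_sort(nodes, arcs):
    """Kahn's algorithm: return an ordering in which every node appears
    only after all of its parents (its dependencies) have appeared.

    nodes: iterable of hashable node labels
    arcs:  iterable of (parent, child) pairs of a directed acyclic graph
    """
    indegree = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in arcs:
        indegree[v] += 1
        children[u].append(v)
    ready = deque(v for v in indegree if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle; no topological sort exists")
    return order
```

The cycle check in the last lines is exactly the condition that fails for the loopy graphs considered in section 4, which is why a different update schedule is needed there.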
In the following, we use dashed arrows for prior-selector parent nodes and solid arrows for input nodes (see Figure 2 for examples and Figure 9 for an exemplary illustration of the interaction between prior-selector and input nodes). Moreover, as we allow the utility function to depend on an arbitrary subset of the nodes involved in the decision-making process, whose indices we collect in a dedicated index set, we differentiate between utility nodes and intermediate nodes by coloring the nodes in the graphs blue and white, respectively. Intermediate nodes serve as hidden signals that filter information for subsequent nodes.
In general, we can say that at every time step, a decision maker processes incoming signals once its predecessors have processed their corresponding incoming signals. The time ordering of information processing in a multistep decision-making process will play an important role in section 4, where we generalize the feedforward graphs from Gottwald and Braun (2019) summarized above to graphs with loops.
3.4 Belief Propagation and Inference
This section serves as a brief summary of belief propagation with the sum-product algorithm, as we will incorporate a variant of this message-passing algorithm into the Blahut-Arimoto iteration from the previous section.
This setup enables a message-passing scheme, known as the sum-product algorithm, involving two main operations: summing and multiplying. In the first step, the algorithm computes the sum of products of incoming messages for each node in the graph, which gives the relative probability that the node is in a specific state, and in the second step, the node sends messages to its neighboring nodes based on that relative probability. The process is repeated until convergence is achieved, and the final messages can then be used for inference (see appendix A.2 for details).
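As a minimal illustration of the two operations, consider a two-variable factor graph with made-up potentials; on such a tree, a single message pass suffices and the resulting beliefs are the exact marginals:

```python
import numpy as np

# Minimal sum-product pass on a two-variable factor graph
#   (g1) -- X1 -- (f) -- X2 -- (g2)
# with unary potentials g1, g2 and a pairwise potential f (all made up).
g1 = np.array([0.6, 0.4])
g2 = np.array([0.3, 0.7])
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Variable -> factor messages: product of the other incoming messages.
m_x1_to_f = g1                 # X1 receives only g1's message
m_x2_to_f = g2                 # X2 receives only g2's message
# Factor -> variable messages: sum over the other variable of the
# factor times the incoming message.
m_f_to_x2 = f.T @ m_x1_to_f    # sum over x1 of f(x1, x2) * msg(x1)
m_f_to_x1 = f @ m_x2_to_f      # sum over x2 of f(x1, x2) * msg(x2)

# Beliefs: normalized product of all incoming messages at each variable.
b1 = g1 * m_f_to_x1; b1 /= b1.sum()
b2 = g2 * m_f_to_x2; b2 /= b2.sum()

# Reference: exact marginals of the joint p(x1, x2) ~ g1 * g2 * f.
joint = g1[:, None] * g2[None, :] * f
joint /= joint.sum()
```

On loopy graphs the same message updates are simply iterated until (hopefully) convergence, which is the approximate regime discussed later.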
In the following section, we give an overview of how to modify a factor graph to solve a series of different inference tasks.
3.4.1 Augmented Factor Graphs for Inference
Given a factor graph that represents the joint probability distribution of a set of random variables X1, …, XN, the belief propagation algorithm can be used to find beliefs about the marginal distributions of each variable as discussed above (see appendix A.2 for a more detailed description). Additionally, whenever a set of variables appears together as input to a factor node, the belief about their joint distribution can be determined by the normalized product of all incoming messages to the factor and the corresponding factor itself.
In order to obtain beliefs that approximate conditional probability distributions, the factor graph can be modified by adding a factor node connected to the conditioning variable node whose potential function consists of an indicator function fixing that variable to its observed value. Similarly, in order to obtain beliefs about joint distributions of an arbitrary subset of variables, a new factor node with constant potential function that is connected to all variable nodes in this subset can be added to the factor graph. Running the belief propagation algorithm on such modified factor graphs results in beliefs about conditional or joint distributions, respectively.
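A small self-contained sketch of the conditioning modification (potentials made up): clamping a variable with an indicator factor makes the belief at the other variable equal the conditional distribution.

```python
import numpy as np

# Two-variable factor graph with unary potentials g1, g2 and pairwise f.
g1 = np.array([0.6, 0.4])
g2 = np.array([0.3, 0.7])
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Graph surgery: attach an extra factor to X2 whose potential is the
# indicator 1[x2 = evidence], fixing the observed value.
evidence = 0
delta = np.eye(2)[evidence]

# X2 now forwards the product of g2's message and the indicator message.
m_x2_to_f = g2 * delta
m_f_to_x1 = f @ m_x2_to_f
b1 = g1 * m_f_to_x1
b1 /= b1.sum()

# Sanity check against direct conditioning of the joint distribution.
joint = g1[:, None] * g2[None, :] * f
cond = joint[:, evidence] / joint[:, evidence].sum()
```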
In what follows, we describe how we connect and combine the different factor graph modifications for an efficient inference mechanism for multistep decision-making systems like the ones in the previous section.
3.4.2 Adapting Message Passing for Decision-Making Problems
The two places where probabilistic inference is necessary when calculating the optimal posteriors in section 3.3 are the optimal priors, equation 3.10, and the conditional probabilities needed for the expectations when determining the effective utilities (see equation 3.8). Hence, in order to use message passing, we transform the graphical model underlying the decision-making process into a factor graph and add modifications depending on the inference task (see appendix A.4).
4 Bounded Rational Decision-Making with Arbitrary Decision Networks
We now present our main contributions. The prior and posterior models for the multistep decision-making systems studied in previous work (Gottwald & Braun, 2019) and discussed in section 3.3 can be represented by a directed graph in which only leaves and roots of the graphical model are endpoints of the decision-making process and in which, crucially, communication between nodes happens in a feedforward manner through the network: a node Xj can obtain information about another node Xi only if there is a directed path from Xi to Xj. This description therefore does not include back-and-forth communication between decision makers, where information can flow in arbitrary directions.
In the following, we loosen the restrictions on the prior and posterior models such that the corresponding graphs may contain cycles and feedback loops, analogous to how the belief propagation algorithm is used on graphs containing cycles without convergence or correctness guarantees. As this removes the topological sort condition, the prior and posterior models are no longer sufficient to define the decision-making process as repeated communication between nodes is now allowed. Therefore, we expand the description of a decision-making system to also include the update schedule (see the example at the end of section 3.3) of the involved decision makers together with the prior and the posterior models. We denote this set of descriptions the communication structure of the decision-making process.
4.1 Information Processing with Communication Structures
In this section, we illustrate the effects on the decision-making processes under more general communication structures and how these structures allow for various interpretations of repeated communication between the involved decision makers. In particular, in section 4.2, we distinguish three different information-processing protocols depending on whether the posterior choice rules are optimized for each time step or continually updated and on whether the priors are temporally fixed or adaptive. We also demonstrate how these protocols can be interpreted using our running example from the previous section. In section 5, we provide further examples.
4.1.1 From Model-Specific to General Update Rules
As we drop the prerequisite that prior and posterior models be represented by acyclic graphical models, the sets of leaves as well as the roots of the now arbitrary graphs might be empty. Additionally, as there no longer exists a viable topological sort for cyclic graphs, there is no longer a corresponding information processing schedule. Thus, for graphs that allow for repeated communication between decision makers, it is necessary to define when and potentially how often decision makers communicate.
For example, if we change the posterior model, equation 3.12 of our running example, such that the posterior of one decision maker additionally depends on the output of a downstream decision maker, introducing a feedback loop between the two, the update rule from the previous section is no longer applicable: it is not defined which of the involved decision makers should be updated first, how long their communication lasts, and which of them should have the final update.
To tackle this problem, we define a general update schedule to be a family of sets S1, …, ST, where St contains the nodes that update their prior to their posterior at step t of the decision-making process. Note that each node of the decision-making process needs to appear at least once in the sets S1, …, ST. The introduction of the general update schedule means that instead of a single decision maker, an arbitrary set of decision makers is allowed to process information at each step, and, analogously to how each posterior in equation 3.9 depends on the information-processing costs of all subsequent nodes (as part of the effective utility, equation 3.8), the posteriors of the nodes in the set St shall depend only on the costs of the decision makers in the sets St, …, ST.
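A general update schedule can be represented simply as an ordered family of node sets; the following sketch (node labels and schedule invented for illustration) also encodes the stated dependency rule:

```python
# A general update schedule as a family of node sets S_1, ..., S_T:
# at step t, every node in schedule[t - 1] updates its prior to its
# posterior; nodes may appear repeatedly (loops, feedback).
schedule = [
    {"X1"},            # step 1: X1 processes its input
    {"X2", "X3"},      # step 2: X2 and X3 update in parallel
    {"X1"},            # step 3: feedback, X1 updates again
    {"X2"},            # step 4: X2 has the final update
]

def validate_schedule(schedule, nodes):
    """Every decision node must appear in at least one update set."""
    covered = set().union(*schedule)
    return set(nodes) <= covered

def dependency_sets(schedule):
    """Nodes updated at step t depend on the information-processing costs
    of the decision makers updated at steps t, t+1, ..., T."""
    return [set().union(*schedule[t:]) for t in range(len(schedule))]
```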
The choice of the update schedule is problem specific and describes when during the course of the process the decision makers communicate, whereas the prior and posterior models describe with whom the decision makers communicate. We call the integer constant T, the total number of update steps, the horizon of the decision-making process.
Analogously to the feedforward case, we find realizations of the decision-making process by Gibbs sampling. Since the product of all posteriors of a loopy decision-making system no longer forms a joint distribution over all decision variables, we instead consider the empirical distribution resulting from Gibbs sampling as an approximation of the actual joint distribution of the variables.
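A minimal sketch of this approximation, for a hypothetical two-node cycle in which each binary decision maker conditions on the other (the conditional posteriors below are invented stand-ins, not the article's models):

```python
# Sketch: approximating the joint over decision variables of a loopy
# system by the empirical distribution of Gibbs samples.
import random
from collections import Counter

random.seed(0)

# Hypothetical posteriors for two binary decision makers that condition
# on each other (a minimal cycle): p(x1 | x2) and p(x2 | x1).
def sample_x1(x2):
    return 1 if random.random() < (0.9 if x2 == 1 else 0.2) else 0

def sample_x2(x1):
    return 1 if random.random() < (0.8 if x1 == 1 else 0.3) else 0

x1, x2 = 0, 0
counts = Counter()
for _ in range(20000):
    x1 = sample_x1(x2)          # update node 1 given node 2
    x2 = sample_x2(x1)          # update node 2 given node 1
    counts[(x1, x2)] += 1

# Empirical distribution approximating the joint over (x1, x2).
n = sum(counts.values())
empirical = {k: v / n for k, v in counts.items()}
```

The product of the two conditionals is not itself a joint distribution, but the empirical frequencies of the sampled pairs serve as its approximation.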
4.1.2 Transformation from Prior to Posterior Distributions
So far, we have kept the posterior and prior updates inside the Blahut-Arimoto iteration of alternating equations 3.9 and 3.10 the same as in section 3.3, except that we now use the effective utilities, equation 4.2, for the posteriors of the nodes in . There are, however, more choices to be made, given that in a loopy graph each node is allowed to process information more than once. In particular, we consider priors to be either adaptive or fixed over communication steps, and we consider decision makers to either have a single posterior for all communication steps or a separate posterior for each step. Figure 4 displays the resulting information-processing protocols for our running example, which are described in more detail in the following section.
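For orientation, the alternation of equations 3.9 and 3.10 follows the classic Blahut-Arimoto scheme: the posterior is a Boltzmann distribution built from the prior and a (here, effective) utility, and the prior is the marginal of the posterior. The following sketch shows this alternation for a single bounded rational decision maker; the utility matrix, world distribution, and parameter value are hypothetical.

```python
# Sketch of the Blahut-Arimoto-type alternation for a single bounded
# rational decision maker.  The utility matrix U, world distribution
# rho, and inverse-temperature beta are hypothetical.
#   posterior: p(x|w) ∝ q(x) * exp(beta * U[w][x])
#   prior:     q(x) = sum_w rho(w) * p(x|w)
import math

U = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical utility U[w][x]
rho = [0.5, 0.5]                  # distribution over world states
beta = 2.0                        # resource (inverse-temperature) parameter
q = [0.5, 0.5]                    # initial prior over actions

for _ in range(100):
    # Posterior update: Boltzmann form with the current prior q.
    p = []
    for w in range(2):
        z = [q[x] * math.exp(beta * U[w][x]) for x in range(2)]
        s = sum(z)
        p.append([zi / s for zi in z])
    # Prior update: marginal of the posterior under rho.
    q = [sum(rho[w] * p[w][x] for w in range(2)) for x in range(2)]
```

In the loopy setting, the protocols below differ in which of these two updates is carried over from one communication step to the next.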
4.2 Information Processing Protocols
From the point of view of probability theory, we would treat each occurrence of in the schedule as a separate random variable with its own prior and posterior, attempting to transform the loopy graph back to a directed feedforward decision-making graph as studied in Gottwald and Braun (2019) and summarized in section 3. With such an approach, however, one would have to relabel nodes that appear multiple times and then decide which relabeled version should be used as input for which subsequent node in the unrolled graph. In larger graphs with multiple loops, there would be many such decisions, essentially replacing the loopy decision-making problem with an entirely different one. Instead, here, we keep the loopy dependency structure not only for the purpose of inferring the priors and effective utilities using loopy belief propagation, but also to motivate different processing protocols , depending on which information is shared among multiple instances of the same decision maker in (see algorithm 2 in the appendix for the resulting procedure).
In particular, we consider the following set of protocols :
Protocol 1: Each node in a cycle is represented by a posterior and a prior that are optimized at each update step. The optimal priors are determined in each cycle from belief propagation.
Protocol 2: Same as protocol 1, except that the priors are never changed from their initial setting.
Protocol 3: Each node is represented by a continuously updated posterior. The prior of each node is given by the previous posterior of that node.
Notice that since the update schedule is no longer grounded in a topological sort of the nodes of the graphical model, the same prior and posterior models can be used to describe multiple different decision-making processes that allow for repeated communication, whereas in the feedforward case, variables are considered to be updated according to the information flow (i.e., according to a corresponding topological sort of the nodes).
4.2.1 Protocol 1: Time-Optimal Posteriors and Priors
In the first protocol, we assume a separate posterior together with its optimal prior for each occurrence of decision maker in , related by equation 4.3. However, in contrast to unrolling over time, we do not rename the decision makers, but rather consider time-dependent probability distributions. In particular, we keep the loopy graph for belief propagation to infer the priors as the marginals of the corresponding posteriors and to infer the auxiliary distributions that determine according to equation 4.2.
For our running example, this means, for example, for decision maker , who processes information at and , that beliefs of the priors and are determined using belief propagation, together with the auxiliary distributions and to determine and , which allows calculating and according to equation 4.3.
This update protocol fits a situation of repeated communication where continually adapting to the optimal prior is advantageous. A negotiation that consists of multiple rounds, or a coordination game in which a player earns a higher payoff whenever they choose the same course of action as another player, would be examples of such scenarios. In section 5 we provide a more detailed example of a bounded rational decision-making system that represents players in such a coordination game.
4.2.2 Protocol 2: Time-Optimal Posteriors with Fixed Priors
One could also imagine that the timescale of a communication process is too short for a decision maker to change its prior, which could be considered to change only in the long run. Hence, we might assume that the time dependency of the posteriors, equation 4.3, originates only from the effective utilities . We therefore have the same expressions for and as in protocol 1, but with a fixed prior , which, for simplicity, in our experiments is taken to be the belief of the marginal over the posteriors , where denotes the first occurrence of node in the update schedule , that is, the first time decision maker processes information. However, one could also imagine a marginal over time, where all decisions in are averaged.
With respect to our running example, this would mean that decision maker would have a fixed prior that is used for both posteriors, and .
4.2.3 Protocol 3: Continual Updating Scheme
The previous two protocols can be considered to model a communication process, resulting in a set of posteriors , one posterior for each node and time (or rather for those for which ), from which we can sample according to the schedule to simulate an optimal process obtained by the Blahut-Arimoto iteration.
Coming back to our introductory example from the end of section 3.2, where a student (node in Figure 1a) learns to react to a question asked by a teacher (node in Figure 1a), information processing according to this update protocol can be thought of as a repeated process with repeated access to the utility function. The posterior is changed step by step rather than in a single instant, for example because the student is granted repeated interactions with a (for simplicity, constant) teacher, resulting in increased performance.
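A minimal sketch of this continual updating scheme, for a single hypothetical binary decision maker (utility values and beta are invented for illustration): at every communication step, the previous posterior becomes the new prior, so repeated updates gradually concentrate probability on the better action.

```python
# Sketch of the continual updating scheme (protocol 3): the prior at
# each communication step is the previous posterior.  The utility U
# and parameter beta are hypothetical.
import math

U = [0.0, 1.0]          # hypothetical utility of the two actions
beta = 0.5              # low information-processing capability
q = [0.5, 0.5]          # initial (uniform) prior

history = []
for step in range(5):   # number of repetitions, e.g. preparation time
    z = [q[x] * math.exp(beta * U[x]) for x in range(2)]
    s = sum(z)
    q = [zi / s for zi in z]       # posterior becomes the next prior
    history.append(q[1])

# Probability of the better action grows with each repetition,
# mimicking the student's gradual gain in performance.
```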
5 Examples and Experiments
In this section, we present examples of decision-making systems with general communication structures and measure their performance in terms of their expected utility.
5.1 Repetition Leads to Increased Performance
Continuing the introductory example of a student and teacher from section 3.2, we keep the simple graphical model that determines the prior and posterior models, modify the update schedule of the communication structure such that the student's posterior is updated multiple times (see Figure 5), and compare the differences in performance. For this example, we choose protocol 3, because a posterior that is gradually updated fits the description of a student who improves performance by extending preparation time . This way, a hard-working student (high ) with low computational resources (low ) achieves a similar expected utility to a student with high computational resources (high ) but fewer repetitions (low ). Figure 5 illustrates this relationship. The trade-off between and , or more generally between beta and the total number of updates to the decision makers in the decision-making process, is discussed in more detail in the following examples and in the discussion in section 6.
5.2 More Communication Fosters Enhanced Performance
We use the performance measured by the expected utility of the original feedforward decision maker, averaged over all 1000 randomized utility functions, as a baseline for the performance of the decision maker with the feedback loop and repeated communication. Figure 6 shows the baseline (red line parallel to the abscissa), the average performance of the loopy decision maker (blue dots), and a box plot containing the median and the first and third quartiles of all performances for every value of ranging from 2 to 20. All data are presented relative to the baseline.
As we can see, the relative performance of the loopy graph increases with growing and appears to converge for higher values of , implying that increased communication between the intermediate and final nodes leads to a higher expected utility. This holds for information processing protocols 1, 2, and 3. The plot in Figure 6 shows the results for the third protocol.
5.3 Fewer Updates Can Be Better
In game theory, a coordination game is a type of game in which a player will earn a higher payoff whenever they choose the same course of action as another player (Cooper et al., 1992; Cooper, 1999). Here, we consider a situation in which two players play one or more coordination games chosen from a set of game matrices (see Figure 7b). This situation is modeled by a decision-making system composed of an observation node that decides on the type of game that is played and two decision nodes and that represent the players.
Considering the pure coordination game (game 1 in Figure 7b), a suitable game-theoretic solution concept is rationalizability (Bernheim, 1984), where it is assumed that both players act rationally and it is common knowledge that both players act rationally. In the pure coordination game example, player plays up if he can reasonably believe that player could play left because up is a best response to left. Additionally, can reasonably believe that plays left if believes that could play up. So, believes that plays up if it is reasonable for to believe that could play up. Continuing this argument generates an infinite chain of reasonable beliefs that leads to the players playing (up, left). Analogously, a similar process can be repeated for (down, right).
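The self-reinforcing chain of beliefs can be made concrete with best-response reasoning in a pure coordination game. The payoff matrix below is a hypothetical stand-in for game 1 in Figure 7b (the article's actual payoffs are not reproduced here): both players earn a payoff only when they coordinate.

```python
# Sketch: best-response reasoning in a pure coordination game.  The
# payoff matrix is a hypothetical stand-in for game 1 in Figure 7b.
# Actions: 0 = up/left, 1 = down/right.
payoff = [[1, 0], [0, 1]]  # payoff[a1][a2], identical for both players

def best_response(opponent_action):
    return max((0, 1), key=lambda a: payoff[a][opponent_action])

# Starting from the belief that the opponent plays left, the chain of
# reasonable beliefs reproduces itself at every level of reasoning.
a = 0
for _ in range(10):
    a = best_response(a)
assert a == 0              # (up, left) is rationalizable

a = 1
for _ in range(10):
    a = best_response(a)
assert a == 1              # (down, right) is rationalizable as well
```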
Here, we approximate this process with bounded rational decision makers for all the different games simultaneously by using a utility function that represents all of the game matrices (see Figure 7 for details). The decision makers are updated according to protocol 1, as the deliberate decisions of the two players in the above example are best represented by decision makers who find the optimal prior and posterior for every time step of the deliberation process.
Figure 7 summarizes the results of the simulation. Panels d and e show the expected utility achieved by decision-making systems for different combinations of and parameters and update schedule. In this example, the schedule leads to better performance than across almost all combinations of and parameters, evening out for high values of both and . Furthermore, looking at the average information processing cost for both schedules and combinations of parameters (see Figures 7g and 7h), it is apparent that although the average processing cost for the update schedule is higher, its performance is worse. This means that, depending on the specific problem, fewer updates can lead to better results.
5.4 More Connectivity Does Not Imply Higher Utility
6 Discussion
Analogous to previous studies (Genewein et al., 2015; Gottwald & Braun, 2019), the overarching principle behind the ideas presented in this article is the trade-off between an increase of a desirable quantity and a cost associated with a corresponding change in behavior. This trade-off is necessary as resources such as computational or cognitive capacities are typically limited and decision makers are forced to choose actions that are good enough rather than optimal. Here, we offer an additional approach to increase the expected utility of a decision maker. Like a student who repeats a difficult topic again and again or two partners in a cooperative game who argue back and forth multiple times, we allow the structure of the graphical model that describes the decision-making process to include cycles and feedback loops to mimic the gradual gain in performance of the student or the cooperative partners in each iteration of their respective information processing instance. Although graphical models with cycles open up these new pathways, they come with several drawbacks in the form of an increased challenge in inference tasks, an absence of convergence and correctness guarantees, a higher ambiguity in causal relationships, and the resulting lower level of interpretability and intuition behind the graphical model. To deal with some of these flaws, other authors have suggested methods that delete edges in a directed graph (Castillo et al., 1998) or reverse edge orientations (Ariffin, 2018) to make the graph acyclic again. In contrast, we presented different information processing schemes to deal with the presence of loops in the graphical model of one or more bounded rational decision makers without deleting or changing any of the connections.
6.1 Repetition as Substitute for Information Processing Capabilities
As outlined by Figures 6, 7d and 7e, and 5b to 5d, the number of updates per Blahut-Arimoto iteration tends to contribute positively to the decision maker's performance, and its influence resembles that of the parameter for every decision node in the decision-making process, often allowing the parameter (and correspondingly the actual bound on the processing cost) to be decreased while maintaining the same or better performance for reasonably high values of . Figure 5f captures this relation for the simple toy example, showing that the value necessary to reach a specific performance threshold varies inversely with the parameter in this specific setting.
Although this seems to indicate that an increase in updates per node and per Blahut-Arimoto iteration increases the overall performance of the decision-making system, increasing the number of updates is a double-edged sword: more updates lead to a larger impact of subsequent updates on the effective utility, which might result in a decrease in performance. Two examples of this are Figure 7e (in comparison to 7d) and Figure 8, where we increased the number of updates to the node by increasing the number of nodes in the terminal update list and by varying the number of edges in a given graph, respectively. In both cases, a larger average number of updates to every node leads to a decline in performance, which is loosely reminiscent of the problem of information overload (Schroder et al., 1967); in the context of this article, this means that too many feedback connections decrease the overall performance of a system. Additionally, Figure 7h suggests that adding more nodes to the terminal update list does not increase the overall performance, even though it effectively doubles the number of updates to the nodes and results in an increased information processing cost.
To summarize, there is no straightforward strategy for determining the best instantiation of a graphical model for a given problem. Unlike the typically monotone increase in performance for growing parameters, the influence of the number of feedback loops and updates on the performance of different nodes in the graph stagnates and falls off for increasingly large values.
Nevertheless, feedback loops and recurrent processing have proven to be valuable tools, enabling top-down information processing and enhancing object recognition (Rao & Ballard, 1999; Ernst et al., 2021; Spoerer et al., 2020; Herzog et al., 2020). They also play a crucial role in modeling human-inspired attention mechanisms, including recent enhancements to transformer-like attention mechanisms with recurrent connections (Stollenga et al., 2014; Mittal et al., 2020; Zeng et al., 2021; Ju et al., 2022).
6.2 Limitations
The examples provided in the previous section represent only a few applications where general communication structures, consisting of a graphical model and an update schedule, can be used. Although the flexibility to compose nodes and edges into graphical models and corresponding update schedules without restriction, describing communication for almost every scenario, is one of the strengths of our approach, it makes a more exhaustive analysis of different communication structures (as was done in Gottwald & Braun, 2019) intractable.
One disadvantage of the free energy from constraints approach is the choice of the Lagrange multipliers that are introduced to the model. Their meaning comes from their connection to the constraints on the processing cost in the original constrained optimization problem, for which there is no closed-form mapping that describes their specific relationship. This means that in order to satisfy a certain constraint, the values of have to be computed numerically, which can be computationally expensive.
Additionally, although we argue that loosening the restrictions and prior assumptions on the graphical model is possible, the decision-making process as presented in the article is still restricted to a single global utility function that is the driving factor behind the change of behavior of all decision makers involved in the decision-making process. This fact restricts the use of the bounded-rational decision-making models to a certain subset of problems. For example, in game theory, the decision makers can only deal with cooperative or coordination games, where all players receive the same reward for each decision realization. Noncooperative games would require individual preferences within the set of decision-making nodes; that is, the model would have to be modified to allow for local or individual utility functions alongside the global utility function, expressing the preferences of a subset of decision makers or a single decision maker. The incorporation of additional localized utility functions will be a subject for future research.
6.2.1 Maximum Likelihood Estimation and Bayesian Inference
In the context of statistical modeling and inference, maximum (log-)likelihood estimation (MLE) is a popular approach to estimate model parameters by maximizing the likelihood of observed data. Maximum regularized likelihood estimation extends MLE by incorporating regularization terms into the objective function, which help to control the complexity of the model or impose desirable properties (Zhuang & Lederer, 2018). One popular regularizer is the Kullback-Leibler divergence, which discourages large differences between priors and posteriors (Le Cam, 1990; Ichiki, 2023). In the special case where the parameters themselves are modeled as random variables and the regularizer is the Kullback-Leibler divergence between their prior and posterior distributions, one obtains (variational) Bayesian inference.
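In generic notation (the symbols below are illustrative, not tied to this article's models), this progression from regularized likelihood to variational Bayesian inference can be written as:

```latex
% Maximum regularized likelihood estimation for data D and parameters theta:
\hat{\theta} \;=\; \arg\max_{\theta}\; \Big[\, \log p(\mathcal{D} \mid \theta)
  \;-\; \lambda\, R(\theta) \,\Big]

% With parameters treated as random variables and the Kullback-Leibler
% divergence as regularizer, the objective becomes that of variational Bayes:
q^{*}(\theta) \;=\; \arg\max_{q}\; \Big[\,
  \mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D} \mid \theta)\big]
  \;-\; D_{\mathrm{KL}}\big(q(\theta)\,\big\|\,p(\theta)\big) \,\Big]
```

When the maximization runs over all distributions $q$, the maximizer of the second objective is exactly the Bayes posterior $p(\theta \mid \mathcal{D})$.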
6.2.2 Variational Bayesian Inference
In this form, it is apparent that the choice for the utility renders the variational free energy to be identical to the free energy that is maximized in the one-step case from section 3. In particular, the bounded rational posterior , in this case, coincides with the Bayes’ posterior .
Also, note that from a purely technical perspective, removing the term from both optimization objectives reduces the bounded rational decision-making problem to a rational decision-making problem (unconstrained utility maximization) and variational Bayesian inference to standard maximum log-likelihood estimation. Both the rational decision maker and the maximum log-likelihood estimator put all their eggs in one basket, making them vulnerable to model misspecification. Thus, the term can simply be considered an entropy regularizer to utility maximization and log-likelihood estimation, respectively.
In summary, variational Bayesian inference has similar ingredients to information-theoretic bounded rationality: log-likelihood plays the role of the utility, and the entropy regularization that turns maximum log-likelihood into Bayesian inference plays the role of informational costs. As utilities can be arbitrary, whereas log-likelihoods are normalized such that they are probabilities when exponentiated, formally, our approach to systems of bounded rational decision-making units might therefore be considered as Bayesian inference using unnormalized log-likelihoods, allowing the description of networks of decision nodes that collaborate to optimize an arbitrary objective. Decision makers in the decision-making system that do not have a direct influence on the utility function serve as hidden signals that filter information for subsequent decision makers.
6.2.3 Message Passing for Decision Making
As outlined in section 2, message-passing algorithms are an invaluable tool in control as inference (Toussaint & Storkey, 2006; Kappen et al., 2012; Todorov, 2008; Toussaint, 2009; Levine, 2018), where actions are treated as state-like variables so that Bayesian inference can be applied for policy search.
In the literature on active inference and the free energy principle, however, message passing is lifted from a purely computational tool to serve as an explanation for certain processes in the human brain (Friston et al., 2017). There, variational inference over actions (and other unknown random variables) is considered under various approximations of the state and action distributions, ranging from rather restrictive assumptions, such as the mean-field approach, which assumes statistical independence of all involved variables (Friston, 2009, 2010; Friston et al., 2017), to more capable assumptions, such as the Bethe approximation (Schwöbel et al., 2018), which allows for pairwise statistical dependencies. The update equations resulting from optimizing the variational free energies under these approximations are then compared to message-passing equations and discussed with regard to their biological plausibility in terms of explaining inference processes in the brain.
In contrast, our use of message passing is in line with control as inference approaches, namely, as an efficient tool to perform approximate Bayesian inference in graphical models with loops. However, strictly speaking, we use belief propagation only to infer auxiliary distributions required for the effective utilities and priors that determine the action posteriors. Those are assumed to take the form of Boltzmann distributions and are therefore not determined directly using belief propagation (see sections 3 and 4). However, under a more general point of view, one could simply consider the Boltzmann distributions as a specific way to combine incoming messages from the surrounding nodes and factors, analogous to the product of incoming messages in the sum-product algorithm.
7 Conclusion
In this article, we have studied systems of bounded rational decision makers and extended the previous work of Genewein et al. (2015) and Gottwald and Braun (2019) by developing a generalization of the update algorithm that previously restricted the set of possible graphical structures to feedforward architectures, where nodes could only obtain information about other nodes if there was a directed path between them. Here, we argued that we can loosen those restrictions and expand the original algorithm to include arbitrary graphs and update schedules to form general communication structures.
To this end, we combined an inference tool, belief propagation, with bounded rational decision making and showed that increased levels of communication between decision makers can be a substitute for computational capacity whenever decision makers are limited.
A Appendix
A.1 Interaction between and
A.2 Belief Propagation
A.2.1 Factor Graphs
A.2.2 Sum-Product Message Passing Algorithm
A central task in a multitude of inference problems is to calculate a marginal distribution of a given joint probability distribution where . The sum-product message-passing algorithm is an efficient way to solve this task as it utilizes a (known) factorization of the joint probability. The algorithm is correct whenever the underlying graphical model is a tree and gives good approximate results for graphs that contain cycles (Murphy et al., 2013; Yedidia et al., 2003, 2005).
To describe the message-passing algorithm, we first need to define the messages and their update equations. Let be a factor graph that emerges from the factorization of the joint probability function with . Then, for every edge , the messages and are real-valued functions that describe the relative probability that node is in its different states. The messages are updated according to the following update equations:
- For all edges between a variable node $x$ and a factor node $a$:
$$\mu_{x \to a}(x) \;=\; \prod_{b \in N(x) \setminus \{a\}} \mu_{b \to x}(x) \tag{A.3}$$
- For all edges between a factor node $a$ and a variable node $x$:
$$\mu_{a \to x}(x) \;=\; \sum_{\mathbf{x}_a} f_a(\mathbf{x}_a)\, \mathbb{1}_{[x_a = x]} \prod_{y \in N(a) \setminus \{x\}} \mu_{y \to a}(y) \tag{A.4}$$
where is a neighborhood function that maps any node in the factor graph onto its adjacent nodes and is an indicator function that is equal to 1 whenever the component of corresponding to decision node is equal to . Messages are typically initialized as 1 for every message and every possible state of the variables involved. After convergence (and possible normalization) the variable-to-factor message describes the relative probability based on all information available to node , except for information about the factor . The factor-to-variable message describes this relative probability based on the factor .
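These updates can be sketched directly for a minimal tree-shaped factor graph with two binary variables and joint $p(x_1, x_2) \propto f_1(x_1)\, f_{12}(x_1, x_2)$; the factor values below are hypothetical. On a tree, the resulting belief equals the exact marginal, which the sketch verifies against a brute-force computation.

```python
# Sketch: sum-product messages on a tree-shaped factor graph with two
# binary variables, p(x1, x2) ∝ f1(x1) * f12(x1, x2).  Factor values
# are hypothetical; on a tree the resulting beliefs are exact.
f1 = [0.6, 0.4]                       # unary factor on x1
f12 = [[0.9, 0.1], [0.2, 0.8]]        # pairwise factor f12[x1][x2]

# Variable-to-factor message (eq. A.3-style): product of all incoming
# factor-to-variable messages except the one from the target factor.
m_x1_to_f12 = [f1[x1] for x1 in range(2)]   # x1's only other neighbor is f1

# Factor-to-variable message (eq. A.4-style): sum over the factor's
# other arguments, weighted by the incoming variable messages.
m_f12_to_x2 = [sum(f12[x1][x2] * m_x1_to_f12[x1] for x1 in range(2))
               for x2 in range(2)]

# Belief = normalized product of incoming messages (here only one).
z = sum(m_f12_to_x2)
belief_x2 = [m / z for m in m_f12_to_x2]

# Brute-force marginal of x2 for comparison.
joint = [[f1[x1] * f12[x1][x2] for x2 in range(2)] for x1 in range(2)]
total = sum(sum(row) for row in joint)
marginal_x2 = [sum(joint[x1][x2] for x1 in range(2)) / total
               for x2 in range(2)]
```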
A.2.3 Beliefs and Marginals
The message functions from the previous section can be used to generate beliefs for every node in the graph .
A.3 Overview: Free Energies in This Article
A.3.1 Free Energy from Constraints
A.3.2 Variational Free Energy
Formally, in this shape, the variational free energy can be considered a special case of the free energy from constraints with utility and , corresponding to a particular trade-off of maximizing the log-likelihood while being biased toward the prior. However, this formal relationship should not be overstated, as here the actual goal is to approximate Bayes' posterior , whereas, above, the goal was to optimize expected utility under a resource constraint.
A.3.3 Bethe Free Energy
A.3.4 Belief Propagation in Loopy Graphs
To briefly outline why belief propagation may be applied to loopy graphs, as mentioned in section 3.3, we note that many efficient iterative algorithms for approximate inference, especially the sum-product algorithm, can be viewed as variational free energy minimization (Gottwald & Braun, 2020). In contrast to the free energy from constraints, namely the objective in the one-step optimization case from section 3 that is derived from the constrained utility optimization problem, equation 3.1, the variational free energy associated with a corresponding factor graph of distribution emerges from the introduction of the belief distribution about the distribution and is defined as , where is the variational average energy and is the variational entropy. Minimizing the variational free energy with respect to the set of all possible distributions guarantees for all realizations . The Bethe approximation to the variational free energy arises from restricting the possible belief distributions to a certain class representing a region-based approximation, where regions are defined as either a single factor node together with its neighborhood or single variable nodes. As shown in Yedidia et al. (2005), the messages on the far right-hand side of expression 3.16 correspond to the stationary points of the Bethe approximation, making them suitable candidates for the approximation of marginal distributions.
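In generic notation (the belief $b$ and unnormalized factor product $f$ below are illustrative symbols), the decomposition described in this paragraph takes the standard form:

```latex
% Variational free energy of a belief b about p(x) = f(x) / Z:
F[b] \;=\; U[b] \;-\; H[b],
\qquad
U[b] \;=\; -\sum_{\mathbf{x}} b(\mathbf{x}) \log f(\mathbf{x}),
\qquad
H[b] \;=\; -\sum_{\mathbf{x}} b(\mathbf{x}) \log b(\mathbf{x})

% Since F[b] = D_KL(b || p) - log Z, minimizing over all distributions b
% recovers the target distribution exactly:
\arg\min_{b} F[b] \;=\; p, \qquad \min_{b} F[b] \;=\; -\log Z
```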
A.4 Message Passing for Decision-Making Problems
A.4.1 Transforming Graphical Models into Factor Graphs
A.4.2 Prior Probabilities
Calculating the prior probability for a decision maker comes down to determining the marginal distribution of the decision maker conditioned on states of its prior selecting nodes (see equation 3.10). This leads to two different cases, where either or for a nonempty set .
In the former case it is sufficient to run the belief propagation algorithm (see equations A.3 and A.4) on the unmodified factor graph that corresponds to the current state of posterior distributions within the Blahut-Arimoto algorithm. Upon convergence, the belief calculated for decision maker as given by equation A.5 is a good estimate of the true marginal distribution .
The converged belief propagation messages for each such modified factor graph then result in a belief . Combining all beliefs then yields .
A.4.3 Posterior Probabilities
Equations 3.8 and 3.9 from section 3 and 4.2 and 4.3 from section 4 outline that calculating the posterior probability mainly consists of evaluating the effective utility at specific realizations of a subset of variables involved in the decision-making process.
A.5 Pseudo-Codes
This section contains pseudo-codes for the Blahut-Arimoto-type algorithms described in sections 3 and 4 (see algorithm 1). Here, we use the function names , and to refer to the equations and techniques displayed in the main text. Additionally, we use the Boolean function to differentiate between constant nodes that do not process information and regular information processing nodes.
A.6 Counterexamples
Here, we point out that the techniques and tools used in algorithm 2 may lead to contradictions as exemplified by the two graph instances illustrated in Figure 11.
A.6.1 Invalid Posterior Probabilities
Consider an example of a small graph for which we can determine a condition under which posterior probabilities are invalid.
Let be a graph on two Boolean decision-making nodes and with domains . Furthermore, let contain two directed edges, one from to and one from to , making the graph a minimal cycle (see Figure 11 for an illustration). Assume that a modified version of the Blahut-Arimoto algorithm found the posterior probabilities and a joint probability according to equation 3.11.
Example: Unrealizable Belief Distributions. In this example, which was first pointed out in a discussion document titled “A Conversation about Bethe Free Energy and Sum-Product” (MacKay, 2001), it is demonstrated that the sum-product algorithm can converge to a set of beliefs that cannot be the marginals of any joint distribution.
Acknowledgments
This study was funded by the European Research Council (ERC-StG-2015-ERC Starting Grant, Project ID: 678082, “BRISC: Bounded Rationality in Sensorimotor Coordination”).