## Abstract

We consider the problem of the evolution of a code within a structured population of agents. The agents try to maximize their information about their environment by acquiring information from the outputs of other agents in the population. A naive use of information-theoretic methods would assume that every agent knows how to interpret the information offered by other agents. However, this assumes that the agent knows which other agents it observes, and thus which code they use. Our model precludes this: It is not clear which other agents an agent is observing, and the resulting usable information is therefore influenced by the universality of the code used and by which agents an agent is listening to. We further investigate whether an agent that does not directly perceive the environment can distinguish states by observing other agents' outputs. For this purpose, we consider a population of different types of agents talking about different concepts, and try to extract new ones by considering their outputs only.

## 1 Introduction

If we consider organisms capable of processing information, then we can argue that they must be able to internally assign meaning to the symbols they perceive in a code-based manner [10]. For instance, bacteria perceive chemical molecules in their environment and interpret them in order to better estimate environmental conditions and (stochastically) decide their phenotype [24, 1, 23, 27]. Plants detect airborne signals released by other plants, and are able to interpret them as attacks of pathogens or herbivores [13, 29]. Therefore, a correspondence between environmental conditions and chemical molecules must be established. It is in this way that Barbieri characterizes codes, and he proposes three fundamental characteristics for them: They connect two independent worlds; they add meaning to information; and they are community rules [2].

Codes connect two independent worlds by establishing a correspondence, or mapping, between them. These worlds are independent, and thus there are no material constraints for establishing arbitrary mappings. The meaning of information comes exclusively from the mapping: Symbols by themselves are meaningless. Finally, the third property requires that the correspondence between the two worlds constitute an integrated system.

For instance, human languages establish a correspondence between words and objects [2]; in bacteria it is between chemical molecules and environmental and social conditions [35, 36]. Words (or chemical molecules) by themselves do not have any meaning, and each individual of a population can define, arbitrarily to some extent, its own set with its mapping. However, populations of individuals sharing the same code are ubiquitous in nature. How is it that codes come to be shared by many individuals when their constitution involves arbitrary choices for each individual? This question is what we are investigating in the present article.

For this work, we assume a simple scenario where organisms live in a fluctuating environment. If they can perfectly predict the future environmental conditions, they can prepare themselves by adopting a proper phenotype, and therefore survive. However, when uncertainty about the environment remains, organisms will follow a bet-hedging strategy [31, 28], where they try to maximize their long-term growth rate by adopting the phenotype that matches the environment in proportions based on the information they have about it. For example, seeds of annual plants germinate stochastically in different periods in relation to the probability of rainfall, and their chances of survival are maximized when they match this probability [5].

The relation between information and long-term growth rate can be expressed elegantly in information-theoretic terms, where an increase in the environmental information of an organism is translated into an increase in its long-term growth rate [30, 17, 18, 8, 26]. Such models achieve the maximization of the long-term growth rate by maximizing an organism's information about the environment. If we assume this behavior in organisms, then those that obtain additional environmental information (other than that from their sensors, which we assume does not completely eliminate environmental uncertainty) from other individuals will have an advantage over those that do not, since they will be able to better predict the future conditions. However, for individuals to be able to communicate with each other, they must be able to translate symbols into environmental conditions, where the output of these symbols results from an individual's code. We consider the code of an individual as a stochastic mapping from its sensors' states to a set of outputs.

For this study, we consider outputs (or messages) of individuals (or agents) as conventional signs. In semiotics, the science of all processes in which signs are originated, stored, communicated, and rendered effective [10], two types of signs are traditionally recognized: conventional signs and natural signs [7]. In conventional signs there is no physical constraint on the possible mappings; they are established by conventions. Although in physical systems there can be limitations to the possible mappings that can be implemented, in this work we assume complete freedom of choice. On the other hand, in natural signs, there is always a physical link between the signifier and signified, such as smoke as a sign of fire, or odors as signs of food [3].

In this work, we are not interested in the particular detailed mechanisms by which an agent implements its code, nor in how the agent decodes the outputs of other agents. Instead, we focus on the theoretical limits on the amount of environmental information an agent can possibly acquire, resulting from different scenarios of population structure and code distribution. The natural framework to analyze such quantities is information theory [30]. However, it does not take semantic aspects into account; it only deals with frequencies of symbols, not what they symbolize. Codes, on the other hand, add meaning to information, which makes the integration of sciences such as semiotics with information theory nontrivial [9, 4]. In the following section, we present an information-theoretic model that incorporates the necessity of conventions by dropping from the model the usual implicit assumption of knowing the identity of the communicating units.

## 2 Model

To introduce the model in a progressive manner, let us first consider three agents, θ1, θ2, and θ3. Each of these agents depends on the same environmental conditions for survival, which are modeled by a random variable μ. Agents acquire information about the environment through their sensors, which are modeled by random variables Yθ1, Yθ2, and Yθ3, all three conditioned on μ, for agents θ1, θ2, and θ3, respectively. We assume each agent acquires the same amount and aspects of environmental information from μ, that is, p(Yθ1 | μ) = p(Yθ2 | μ) = p(Yθ3 | μ). Let us further assume that the information each agent acquires about the environment does not eliminate its uncertainty, that is, H(μ | Yθi) > 0 for 1 ≤ i ≤ 3. The code of an agent is a stochastic mapping from its sensor states into a set of outputs, and is represented by the conditional probabilities p(Xθ1 | Yθ1), p(Xθ2 | Yθ2), and p(Xθ3 | Yθ3) for agents θ1, θ2, and θ3, respectively (see Figure 1).

Figure 1.

Bayesian network representing the relationship between the sensor and output variables of three agents.

Let us assume that agent θ1 perceives only the outputs of agents θ2 and θ3. One possible way of computing the information about the environment agent θ1 has is to consider the mutual information between μ and the joint distribution of the sensor of θ1 and the outputs of θ2 and θ3: I(μ; Yθ1, Xθ2, Xθ3). However, by writing down this quantity, we are implicitly assuming that agent θ1 knows which output corresponds to θ2 and which output corresponds to θ3. Therefore, in this consideration, an agent can theoretically do the translations of the outputs according to some internal model of other agents and infer the mentioned amount of information about its environment.

### 2.1 Indistinguishable Sources of Messages

For this study, on the contrary, we consider an agent observing other agents' messages, but under the assumption that the originator of a message cannot be identified. In this way, the total amount of information an agent can infer from the outputs of other agents will depend on to what extent it either can identify who the other agents are or can rely on them using a coding scheme that does not depend too much on their particular identity. For instance, if agents θ2 and θ3 both agree on the output for each of the environmental conditions, then agent θ1 should be able to infer more environmental information than if they disagree on the output for each of the environmental conditions, given that agent θ1 does not know which of the agents it is observing.

To model this idea, let us assume a random variable Θ′ denoting the selected agent. This agent depends on the same environmental conditions for survival as θ1, which are modeled, as above, by a random variable μ. Agents acquire information about the environment through their sensors, which are modeled by a random variable YΘ′ conditioned on the index variable denoting the agent under consideration, Θ′, and μ. The amount of acquired sensory information of a specific agent θ′ about μ is given by I(μ; Yθ′). As above, the code of an agent is a stochastic mapping from its sensor states into a set of messages, and is represented by the conditional probability p(Xθ′ | Yθ′) for an agent θ′ (see Figure 2).

Figure 2.

Bayesian network representing the relationships described in the text.

However, now we want to model the fact that we do not know which agent is observed. In the case with maximum uncertainty, Θ′ is uniformly distributed, and this parametrization of the codes considers the outputs of all agents in Θ′ together, so that if we are not observing Θ′, we cannot identify which agent's output we are observing. With the sensor states of agent θ1 defined by Equation 1 and the sensor states of agents θ2 and θ3 defined by Equation 2, we consider two examples of codes for agents θ2 and θ3 (Equations 3 and 4). We compute how much information about the environment there is when the outputs of both agents (θ2 and θ3) are considered together by agent θ1.

If we assume p(θ2) = p(θ3) = 1/2, p(μ1) = p(μ2) = 1/2, and ϵ = 0.01, then if we consider the codes shown in Equation 3, we have that I(μ; Yθ1, XΘ′) = 0.97872 bits, where Θ′ consists of agents θ2 and θ3. However, if θ2 and θ3 have opposite codes as shown in Equation 4, then I(μ; Yθ1, XΘ′) = 0.9192 bits, which is exactly I(μ; Yθ1), that is, I(μ; XΘ′ | Yθ1) = 0 bits (agent θ1 cannot acquire any side information from the outputs of agents θ2 and θ3). We should note here that the sensor states y1 and y2 of agents θ2 and θ3 in the conditional probability shown in Equations 1 and 2 refer almost deterministically to the same environmental condition, and therefore the loss of side information is entirely due to the incompatible codes. The conditional probabilities of sensor states given the environmental conditions further defined throughout the article are also assumed to be almost deterministic.
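The two values quoted above can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a binary environment, sensors that report the true state with probability 1 − ϵ, and deterministic codes that are either identical or swapped between θ2 and θ3. The function names are hypothetical. Under these assumptions the computed values agree with the quoted 0.97872 and 0.9192 bits to about four decimal places.

```python
from itertools import product
from math import log2

EPS = 0.01  # sensor noise, as in the text


def sensor(y, mu, eps=EPS):
    """Assumed sensor p(y | mu): reports the true state with probability 1 - eps."""
    return 1 - eps if y == mu else eps


def mutual_information(joint):
    """I(A; B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def info_with_unknown_source(codes):
    """I(mu; Y_theta1, X_Theta') when the observed agent is drawn uniformly.

    Each entry of `codes` is a deterministic code: a dict mapping a sensor
    state to an output. The observer does not see which agent produced x.
    """
    joint = {}
    for mu, y1, agent, y_obs in product((0, 1), (0, 1), range(len(codes)), (0, 1)):
        p = 0.5 * sensor(y1, mu) * (1 / len(codes)) * sensor(y_obs, mu)
        x = codes[agent][y_obs]
        key = (mu, (y1, x))  # the agent index is deliberately not part of the key
        joint[key] = joint.get(key, 0.0) + p
    return mutual_information(joint)


same = [{0: 0, 1: 1}, {0: 0, 1: 1}]      # theta2 and theta3 agree on the code
opposite = [{0: 0, 1: 1}, {0: 1, 1: 0}]  # theta3 swaps the two outputs

print(info_with_unknown_source(same))
print(info_with_unknown_source(opposite))
```

With opposite codes, the mixture of the two agents' outputs is independent of μ, so only the information from θ1's own sensor remains.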

### 2.2 Environmental Information of a Population

The model shown in Figure 2 considers the environmental information of agent θ1, ignoring its own output Xθ1. Nevertheless, agents' ignoring their outputs is contrary to our assumption of the incapability of agents to identify the sources of the outputs. On the other hand, we are assuming a specific type of communication, one that could be classified as persistent among the different classes of stigmergy ([37, 33, 22]; see [14] for a summary). To incorporate this option in the model shown in Figure 2, we could consider the state space of Θ′ as the set {θ1, θ2, θ3}. Then, to express not only the environmental information of agent θ1, but the average environmental information of the whole population, we can parametrize the agent by a random variable Θ (defined over the same state space, representing the same set of agents as Θ′), such that p(YΘ|μ, Θ) = p(YΘ′|μ, Θ′) (i.e., YΘ′ is i.i.d. with respect to YΘ, and vice versa).

In this way, the average environmental information of a population of the agents selected by Θ is given by I(μ; YΘ, XΘ′) (see Figure 3). This measure can be considered as the objective function to maximize in our model. However, we would be making two important assumptions: First, this objective function assumes agents have access to the environmental conditions μ, which they indirectly do, but only through their sensors; and second, every agent must perceive the output of every other agent, including itself. In this work, instead, we propose that each agent displays a behavior that maximizes the similarity of its outputs (via their codes) to those that the agent perceives. A consequence of this behavior is that the average information about μ is also maximized. In addition, we will introduce a potentially flexible population structure, so that we can specify which agents interact with which.

Figure 3.

Bayesian network representing the sensor variables of a set of agents indexed by the random variable Θ, and the sensor and output variables of a copy of the set of agents indexed by Θ named Θ′.

### 2.3 Code Similarity

First, we introduce a copy of the codes of the agents, such that when we instantiate the variables XΘ and XΘ′, the probabilities are the same. The structure of the population is then given by p(Θ, Θ′) = p(Θ) p(Θ′). However, the conditional independence of Θ and Θ′ restricts significantly the diversity of the structures that can be represented. In such cases, the agents selected by Θ perceive the outputs of all the agents selected by Θ′ and vice versa. In order to model a general interaction structure between agents, we consider p(Θ, Θ′) not independent, as shown in the Bayesian network in Figure 4, where we introduce a helper variable Ξ. This allows different agents selected by Θ to perceive outputs from exclusive agents selected by Θ′.

Figure 4.

Bayesian network representing the relationship of the variables in the model of code evolution. YΘ′ is an i.i.d. copy of YΘ, and XΘ′ is an i.i.d. copy of XΘ. Here Θ′ covers the same set of agents as Θ, but its probability distribution is not necessarily the same.

We define the objective function as I(XΘ; XΘ′), that is, the average code similarity of a population of agents according to the population structure p(Θ, Θ′). For instance, if the interaction probability of two agents is zero, then the similarity of the codes of these two agents is irrelevant for the objective function. If, on the other hand, they interact with probability greater than zero (p(θ, θ′) > 0 for some agents θ and θ′), then the similarity of their codes will influence I(XΘ; XΘ′).
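As an illustration of how this objective could be evaluated, the sketch below computes I(XΘ; XΘ′) from a population structure, a distribution over μ, and per-agent sensors and codes. It assumes the factorization suggested by Figure 4 (both sensors depend on μ and are independent given μ); the function name and data layout are hypothetical and chosen only for this example.

```python
from math import log2


def code_similarity(structure, p_mu, sensors, codes):
    """I(X_Theta; X_Theta') in bits.

    structure[(a, b)] : interaction probability p(theta = a, theta' = b)
    p_mu[m]           : probability of environmental state m
    sensors[a][m][y]  : p(y | mu = m) for agent a
    codes[a][y][x]    : p(x | y) for agent a
    """
    def output_given_mu(agent, m):
        # p(x | mu = m) for one agent, marginalizing over its sensor state
        code = codes[agent]
        return [sum(sensors[agent][m][y] * code[y][x] for y in range(len(code)))
                for x in range(len(code[0]))]

    joint = {}
    for (a, b), p_ab in structure.items():
        for m, p_m in enumerate(p_mu):
            out_a, out_b = output_given_mu(a, m), output_given_mu(b, m)
            for x, px in enumerate(out_a):
                for x2, px2 in enumerate(out_b):
                    joint[(x, x2)] = joint.get((x, x2), 0.0) + p_ab * p_m * px * px2

    p_x, p_x2 = {}, {}
    for (x, x2), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_x2[x2] = p_x2.get(x2, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_x2[x2]))
               for (x, x2), p in joint.items() if p > 0)


# Toy example: two agents, noiseless binary sensors, well-mixed structure.
identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
well_mixed = {(a, b): 0.25 for a in range(2) for b in range(2)}

shared = code_similarity(well_mixed, [0.5, 0.5], [identity, identity], [identity, identity])
conflicting = code_similarity(well_mixed, [0.5, 0.5], [identity, identity], [identity, swapped])
```

In this toy case a shared one-to-one code reaches the full one bit available, whereas two conflicting codes in a well-mixed population leave the outputs uninformative about each other.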

If we consider our system as a process in time, then at each time step two agents are chosen according to p(Θ, Θ′). Agent Θ reads the output of agent Θ′ (generated via its code, which is i.i.d. over time); let us assume that it stores the pair (YΘ, XΘ′), that is, its current sensor state together with the perceived output. If this is repeated a large number of times, then the total amount of environmental information that can be inferred from the collected statistics by the population is bounded by I(μ; YΘ, XΘ′). This is the theoretical limit to which we refer in the introduction, and for this study we are not interested in how the inference is computed. However, we implicitly assume that agents decode the perceived outputs according to their codes.

### 2.4 Distance between Two Codes

In order to visualize the evolution of codes, we define the distance between the codes of two agents θi and θj as the square root of the Jensen–Shannon divergence [40, 19] between them:

$$
d(\theta_i, \theta_j) = \sqrt{\mathrm{JSD}(\theta_i, \theta_j)}, \qquad \mathrm{JSD}(\theta_i, \theta_j) = \tfrac{1}{2} D_{\mathrm{KL}}(P_i \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(P_j \,\|\, M),
$$

where P_i = p(Xθi | Yθi) denotes the code of agent θi and M = ½(P_i + P_j). This measure has the property that 0 ≤ JSD(θi, θj) ≤ 1 when log2 is used, and the square root yields a metric. Let us note that this distance requires the sensor states Y to be named identically (for the corresponding states of μ) among agents in order to be meaningful. As we stated above, this is (nearly) the case in all our experiments. This requirement over the sensor states precludes the possibility of using other measures such as mutual information.
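A minimal sketch of such a distance follows. The Jensen–Shannon divergence itself is standard; how the divergence is aggregated over sensor states is an assumption made here for illustration (a weighted average over the rows p(x | y), uniform by default), and the function names are hypothetical.

```python
from math import log2, sqrt


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits; lies in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def code_distance(code_i, code_j, weights=None):
    """Square root of the (assumed) weighted average JSD between two codes.

    code[y][x] = p(x | y). Rows are compared state by state, which is why
    sensor states must be named identically across agents.
    """
    n = len(code_i)
    weights = weights or [1 / n] * n
    total = sum(w * jsd(pi, pj) for w, pi, pj in zip(weights, code_i, code_j))
    return sqrt(max(0.0, total))  # guard against tiny negative rounding errors


identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
d_same = code_distance(identity, identity)
d_opposite = code_distance(identity, swapped)
```

Identical codes are at distance zero, and two deterministic codes that disagree on every sensor state are at the maximum distance of one.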

## 3 Methods

To illustrate the behavior of our model, we consider four different scenarios, which are described in Section 4. The common parameters for the first two experiments are the following: The population consists of 25 agents; the amount and quality of the acquired sensory information are the same for every agent, that is, p(YΘ|μ) = p(YΘ′|μ). For the third scenario, the only difference is that we consider only 15 agents, since the number of dimensions to consider with a flexible structure grows quadratically with the number of agents.

The optimization algorithm used in the following experiments is CMA-ES (covariance matrix adaptation evolution strategy), which is a stochastic derivative-free method for nonlinear optimization problems [12]. We utilized the implementation provided by the Shark library v3.0.0 [15] with its default parameters, which implements the CMA-ES algorithm described in [11]. The evolutionary algorithm used for optimization does not intend to represent the actual evolution of the codes. Instead, we are interested in the results of this optimization process, which are representative of the possible outcomes of evolution.

To visualize the evolution of the codes of the agents, we use the method of multidimensional scaling provided by R version 2.14.1 (2011-12-22). This method takes as input the distance matrix between codes and plots them in a two-dimensional space, preserving the distances as well as possible. To visualize not only the distances between the resulting codes, but also how they relate to the distances between initial codes, we provide a distance matrix of both initial and resulting codes. The initial codes are randomly set by the evolutionary algorithm.

## 4 Results

In this section, we analyze the outcome of the four different scenarios where code similarity is maximized. While the outcomes are particular for one simulation, they are illustrative of the richness that the model is able to capture, which is described for each scenario. The outcomes are typical solutions, and we cannot perform statistics over simulations, since the many solutions are qualitatively different. However, the outcome of each scenario is presented together with a description of alternative outcomes, giving indicators of achievement of a local or global optimum.

### 4.1 Well-Mixed Population

In the first scenario, each agent θi perceives the output of every other possible agent θj with the same probability, that is, p(θi, θj) = 1/25² for every i, j ∈ [1, 25]. The maximum average code similarity is bounded by I(YΘ; YΘ′) = 1.71908 bits, which is achieved under two conditions: First, every code must be a one-to-one mapping; second, the code must be universal. This is indeed the outcome of the performed optimization, as we show in Figure 5: The optimized codes (blue points) converged into a universal code (the distance between any pair of them is zero). Each red (diamond) point corresponds to an initial code.

Figure 5.

Two-dimensional plot of code distance: Red points are codes at the beginning of the optimization process; blue points are codes at the end of the optimization process (where the distance between every pair of codes is zero).

The resulting code adopted by the population is a one-to-one mapping between sensor states and outputs, and any of the 24 possible one-to-one mappings is a global maximum (there are four sensor states and four possible outputs). However, it is still interesting to briefly analyze the possible paths to a universal and optimal code. In Figure 6, we show the distribution of the codes adopted by the agents of the population in an iteration of the optimization process where the average code similarity is I(XΘ; XΘ′) = 1.18276 bits. Here, the most popular code is the suboptimal code shown in Figure 6a. This results from the particular initialized codes, driving the agents temporarily towards a suboptimal code. However, once any of the many-to-one codes becomes (nearly) universally adopted, any deviation of a single code that improves the code similarity will eventually drive the convention towards optimality. Because such an improvement does not require simultaneous changes in several codes, it is more likely to occur.

Figure 6.

Representation of the codes p(x|y) by a heatmap using inverse grayscale. For each evolved code, we output the number of agents adopting it. This code distribution was achieved with 25 agents in a well-mixed population.

### 4.2 Spatially Structured Population

In another setup, we assume the agents are structured in a 5 × 5 grid, where p(θ, θ′) = 1/105 if θ and θ′ are neighbors or if θ = θ′ (see Figure 7 for a representation of the structure). After randomly initializing the codes, the performed optimization plateaued on an average code similarity of I(XΘ; XΘ′) = 1.13536 bits. As in the previous scenario, here the optimal solution is also a universal code with a one-to-one mapping. In this case, however, the result is not a universal code, as can be appreciated in Figure 8. Spatially structured populations are sensitive to the initial codes and how codes are updated.
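The normalization constant 1/105 can be recovered by counting interacting pairs. The sketch below assumes a four-neighbor (up, down, left, right) grid without wraparound, which is an assumption made here because it reproduces the count of 105 ordered pairs (80 neighbor pairs plus 25 self-pairs); the function name is hypothetical.

```python
def grid_structure(rows=5, cols=5):
    """Uniform p(theta, theta') over self-pairs and four-neighbor pairs."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            agent = r * cols + c
            pairs.append((agent, agent))  # an agent also perceives its own output
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:  # no wraparound at the border
                    pairs.append((agent, rr * cols + cc))
    p = 1 / len(pairs)
    return {pair: p for pair in pairs}


structure = grid_structure()
print(len(structure))  # number of ordered pairs with nonzero probability
```

All pairs not listed in the dictionary have interaction probability zero, so their code similarity does not enter the objective.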

Figure 7.

Representation of the spatial structure utilized for the experiment. Agents are assumed to be distributed in a grid: An edge from one agent to another means that one agent perceives the output of the other. Agents are labeled (see Figure 9) and colored according to their adopted code.

Figure 8.

Two-dimensional plot of code distance: Points with diamond shape represent codes at the beginning of the optimization process; round points represent codes at the end of the optimization process. The points are colored in order to be able to relate this plot to Figure 7.

The resulting code distribution among the population is shown in Figure 9, with eight different codes in the population. Where well-mixed populations evolved the use of common codes, agreement on codes only occurred among neighbors in spatially structured populations. As a consequence, many local conventions are established within neighborhoods, and, once this situation is reached, the improvement of the total code similarity requires simultaneous changes to the agents' codes. For instance, the code shown in Figure 9e could increase the average similarity of the population if p(x2|y1) = 1, as it is in the rest of the codes. However, for this to happen (in this particular case), at least two agents need to change their code simultaneously (otherwise the average similarity decreases), which makes the deviation from the resulting code distribution unlikely.

Figure 9.

Representation of the codes p(x|y) by a heatmap using inverse grayscale. For each evolved code, we output the number of agents adopting it. This code distribution was achieved with 25 agents in a grid structure.

### 4.3 Flexible Population Structure

For the third scenario, we let the structure coevolve with the codes, without any constraint (the probability distribution of the interaction between agents, p(Ξ), is optimized together with the codes). In this case, the resulting average code similarity is nearly optimal, but the code is not necessarily universal. This is because, when the structure is not fixed, agents form roughly disconnected clusters of related codes. In this process, the interaction probability of agents with unrelated codes will vanish. However, once a cluster is formed, if it is not a single isolated agent (such that no other agent perceives its output), then codes of agents are universal within it. This is exemplified by the code distribution and population structure we obtained (see Figure 10). Here, we have two clusters with universal codes, one optimal (in red) and the other suboptimal (in yellow). Agents with codes dissimilar to those of every other agent they interact with will become isolated in the optimization process, as the example shows for two agents (light and dark blue).

Figure 10.

Each node in the graph corresponds to the code of an agent. There is a weighted edge between agents θi and θj if p(θi, θj) > 0 (which is the weight). We omit weights of edges in the graph, since they all are roughly of similar value. The temperature colors on top of the nodes indicate the amount of environmental information they would contribute to any agent perceiving only that agent's output.

To summarize, the optimal code similarity equals I(YΘ; YΘ′), and is achieved, for instance, when all agents adopt the same one-to-one mapping. Nevertheless, the interaction probability allows agents to form disconnected clusters of related codes, where several one-to-one mappings could result while still achieving optimality. Theoretically, we could have as many one-to-one mappings as the lesser of the number of agents and the total number of one-to-one mapping combinations (24 in this case).

### 4.4 Emerging Concepts in a Well-Mixed Heterogeneous Population

So far, we have only considered populations of agents that acquired the same aspects of information from μ (i.e., p(Yθi | μ) = p(Yθj | μ) for any pair of agents θi, θj). The assumption was that the information that was relevant for the survival of the agents was the same among the agents of the population, and this was represented by μ. Now, we consider a more general scenario, where different types of agents acquire different aspects from the environmental conditions μ. We investigate whether it is possible for an agent that does not directly perceive the environment at all (we call this type of agent blind) to predict conditions solely from the outputs of other agents. We consider a well-mixed population, such that different types of agents are forced to talk to each other. Considerations with a flexible population structure are not interesting for our purposes, since in these cases, each type of agent forms a cluster disconnected from clusters of other types. This was confirmed by simulations, which are not shown here.

Let us illustrate the idea with a relatively simple scenario: We consider five types of agents (we denote the ith type by ϕi), where each type can only distinguish whether the current state of the environment belongs to its colored region or not. The environment consists of nine states, and the probability of each state is uniformly distributed. We illustrate this environment by a 3 × 3 grid, as shown in Figure 11, although the square does not denote the physical structure of the environment. Then, the outputs of each type of agent will be related to the regions they capture. For instance, for agents of type ϕ2 with the same deterministic code, if Pr(μ ∈ {1, 2, 4, 5} | Xθ = x) equals one (for all θ of type ϕ2), then x will signify that this agent is currently in the region colored red in Figure 11. We say that a population of agents has a joint concept of the environment if, by considering its representation of the environmental information they capture, we can obtain information about the environment, that is, we require that I(μ; XΘ) > 0. For instance, the symbol x in the example above, assuming that it is only utilized by agents of the same type, can be understood as representing the concept top left of the grid.

Figure 11.

Representation of the conditional probabilities p(Yθ|μ) for an agent θ of each type. These are defined so that each type of agent can only distinguish between the colored region and the white region. For instance, the sensor of type ϕ2 is defined as Pr(Y = y1|μ) = 1 if μ ∈ {1, 2, 4, 5}, and zero otherwise, and Pr(Y = y2|μ) = 1 if μ ∉ {1, 2, 4, 5}, and zero otherwise. For type ϕ1, Pr(Y = y1|μ) = 0.5 and Pr(Y = y2|μ) = 0.5 (|Y| = 2 for all types of agents).

The amount of environmental information that an agent θ of type ϕ1 (a blind agent) captures is I(μ; Yθ) = 0 bits, while all agents θ of the other types capture I(μ; Yθ) = 0.991076 bits (note that the total entropy in μ to be resolved is H(μ) = 3.16993 bits). Throughout this study, we considered that agents predict the environment by considering their perceptions together with the outputs of other agents. Because the blind agent cannot capture any direct cue from μ, we instead consider it capable of perceiving the outputs of both agents selected by Θ and Θ′. With this relaxed consideration, we say a blind agent has a concept of the environment if I(μ; XΘ, XΘ′) > 0, that is, we consider the maximum amount of information an agent can possibly infer from the joint outputs XΘ and XΘ′.
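These quantities follow directly from the sensor definitions given for Figure 11. The sketch below recomputes them for type ϕ2 (region {1, 2, 4, 5}) and for the blind type ϕ1, using nine equiprobable states; the helper names are hypothetical.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)


def sensor_information(p_y_given_mu, p_mu):
    """I(mu; Y) = H(Y) - H(Y | mu) in bits, with p_y_given_mu[m][y] = p(y | mu = m)."""
    n_y = len(p_y_given_mu[0])
    p_y = [sum(pm * row[y] for pm, row in zip(p_mu, p_y_given_mu)) for y in range(n_y)]
    h_y_given_mu = sum(pm * entropy(row) for pm, row in zip(p_mu, p_y_given_mu))
    return entropy(p_y) - h_y_given_mu


STATES = range(1, 10)      # the nine environmental states
P_MU = [1 / 9] * 9         # uniform distribution over states
REGION_PHI2 = {1, 2, 4, 5}  # colored region of type phi_2, from the text

# Type phi_2 reports y1 inside its region and y2 outside; the blind type is pure noise.
phi2 = [[1.0, 0.0] if mu in REGION_PHI2 else [0.0, 1.0] for mu in STATES]
blind = [[0.5, 0.5] for _ in STATES]

h_mu = entropy(P_MU)
info_phi2 = sensor_information(phi2, P_MU)
info_blind = sensor_information(blind, P_MU)
```

Because the binary entropy function is symmetric, any deterministic region sensor covering four or five of the nine states yields the same 0.991076 bits.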

Let us recall that the structure of the population is well mixed, and thus the distribution of outputs of all agents is considered, including the blind ones, which are not able to express (via their outputs) any particular concept by themselves (for a blind agent θ, we have I(μ; Xθ) ≤ I(μ; Yθ) = 0, i.e., I(μ; XΘ) vanishes). Therefore, whether a blind agent has some concept of the environment will depend, first, on the universality of the codes of each type of agent (agents representing the same information with different symbols may create ambiguities), and second, on the cardinality of the alphabet of X (denoted by |X|) utilized by the population. A small alphabet will force agents to represent different concepts of the environment with the same symbols, while a large alphabet is likely to result in exclusive representations of concepts for each type of agent.

Taking this into account, we ask, is it possible for a blind agent to identify concepts of the environment? If so, how are these concepts related to the concepts of the individual agents (other than the blind ones)? Is the size of the available alphabet related to the quality of the concepts?

To study these questions, we performed several experiments in which we varied the size of the alphabet, |X|, while keeping all other parameters fixed. In these experiments, we optimized the similarity of codes for a population composed of 20 agents, with four agents of each of the five types. Table 1 shows that the cardinality of the alphabet of X affects the maximum amount of information a blind agent can possibly infer about the environment.

Table 1.

Results of experiments where the size of the alphabet of a population varies.

| \|X\| | I(μ; XΘ, XΘ′) |
| --- | --- |
|  | 0.34621 |
|  | 0.56555 |
|  | 0.71620 |
|  | 0.95467 |
|  | 1.08139 |
|  | 1.18362 |
|  | 1.30919 |
|  | 1.30919 |

Notes. The maximum amount of environmental information that a blind agent can infer is achieved with |X| = 8 and remains equal for bigger alphabets. As the size of the alphabet decreases, this information also decreases.

Now, if we measure the uncertainty of the environment for a blind agent for each combination of outputs XΘ and XΘ′, we find that for some of them it is zero. For instance, with |X| = 7, we found that Pr(μ = 5|XΘ = 1, XΘ′ = 2) = 1.0 (see Figure 12, where only combinations with XΘXΘ′ are shown). These distributions remain valid on swapping the values of XΘ and XΘ′, since in the well-mixed population the structure is symmetric. Looking at the conditional probabilities in Figure 12, we can find many other concepts, although none of them, apart from the one already discussed, uniquely identifies a state of the environment. For instance, we have Pr(μ|XΘ = 3, XΘ′ = 6) = 0.33 for each μ ∈ {3, 5, 7}, which is a concept for being on a particular diagonal of the environment.

Figure 12.

Conditional probability p(μ|XΘ, XΘ′) in inverse grayscale. Each row represents a combination of values of XΘ and XΘ′, and each column represents a state of μ.

In Figure 13 we show the resulting codes (which are universal for each type, including the blind one) for this particular experiment. Here, the types ϕ2 (red) and ϕ5 (orange) use the same symbols to represent different environmental conditions. By using a small alphabet for X, we force ambiguities in the population, but evolution selects them so that they are minimal. In this way, we maximize the amount of information we can infer from the outputs (although this can be a local optimum). For instance, across all the experiments, the outputs of the blind agents (type ϕ1) never overlapped those of other types (unless we use |X| = 2, where there is no choice). In other words, blind agents always choose a single symbol, so that they minimize the number of symbols they use out of those available to the whole population.

Figure 13.

Representation of codes p(XΘ|YΘ, Θ) by a heatmap using inverse grayscale for the experiment with |X| = 7. For each node, the rows represent a sensor state y, while the columns represent an output state x. The colors on top of the nodes are used to distinguish the type of agent to which the code belongs, and are related to the colors in Figure 11.

In all the experiments performed, we found that for values of |X| ≥ 6, the blind agent can perfectly predict the environmental state μ = 5 for at least one combination of outputs XΘ and XΘ′. Interestingly, this new concept, which in this particular experiment can be called the center of the world or environment, cannot be obtained by looking at individual concepts only.
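How a joint observation can single out one state even though no individual sensor does can be sketched with a toy example. It assumes nine equiprobable states of μ and two fixed agents with exclusive, deterministic codes. The first region is that of type ϕ2; the second region, {5, 6, 8, 9}, and the symbol assignments are hypothetical choices made for illustration and are not taken from the experiments.

```python
import math
from collections import defaultdict

STATES = range(1, 10)     # assumption: nine equiprobable states of mu
REGION_A = {1, 2, 4, 5}   # type phi_2, as given in the sensor description
REGION_B = {5, 6, 8, 9}   # hypothetical second type, chosen for illustration

def output(region, mu, symbols):
    """Deterministic code: first symbol inside the region, second outside."""
    return symbols[0] if mu in region else symbols[1]

# Exclusive symbols per agent, so an output also reveals who produced it.
joint = defaultdict(float)  # joint[((x, x'), mu)] = p(x, x', mu)
for mu in STATES:
    x = output(REGION_A, mu, (1, 2))
    x_prime = output(REGION_B, mu, (3, 4))
    joint[((x, x_prime), mu)] += 1 / len(STATES)

# Marginal probability of each observed pair of outputs.
p_pair = defaultdict(float)
for (pair, _), p in joint.items():
    p_pair[pair] += p

# Posterior p(mu | x, x') by Bayes' rule, and its entropy in bits.
posteriors = {}
for pair in sorted(p_pair):
    posterior = {mu: joint[(pair, mu)] / p_pair[pair]
                 for mu in STATES if (pair, mu) in joint}
    posteriors[pair] = posterior
    h = sum(p * math.log2(1 / p) for p in posterior.values())
    print(pair, sorted(posterior), f"H(mu | pair) = {h:.3f} bits")
```

In this sketch, each agent alone narrows μ down to four or five states at best, but the pair of outputs meaning "inside both regions" leaves only μ = 5, so the remaining uncertainty for that combination is zero.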

## 5 Discussion

We considered four different scenarios of code evolution. In the first one, every agent perceived the outputs of all agents, including its own. We argued that two main stages of evolution can be recognized: In the first stage, a universal code is established, which can be optimal or not. If it is not optimal, then a second stage will achieve optimality. The same result was obtained in [34], in a model of the evolution of the genetic code (represented as a probabilistic mapping between codons and amino acids), although there universality and optimality were achieved simultaneously.

In the mentioned work, which developed further the ideas of [38, 39], the authors argue that the universality of the genetic code is a consequence of early communal evolution, mediated by horizontal gene transfer (HGT) between primitive cells. In this evolutionary process, they argue, larger communities will have access (through the exchange of genetic material) to more innovations, leading to faster evolution than smaller ones. Then, “it is not better genetic codes that give an advantage but more common ones” [34, p. 3]. Although their model does not explicitly show this property, it is captured in our model. We show that a more common, but not optimal code is widely adopted within a population (see Figure 6). However, in our model, a code imposes itself as universal not because it provides access to more innovations (in our model there is no “code exchange”; only the outputs are shared), but because the population structure forces the adoption of the most popular code. After this stage, further changes in the code of the agents eventually lead to optimality.

In another related work, [21] explored the origins of language in a scenario consisting of artificial agents with coupled perception and production of speech sounds. Although this work is focused on plausible mechanisms for the origin of language, it assumes the same similarity principle as we do (hearing a vocalization increases the probability of producing similar vocalizations), arriving at the same outcome (a universal language, or code). Other works have considered similar principles in the evolution of languages: for instance, the naming game [32] and the imitation game [6]. However, these models assume some common conventions in order to evolve new ones. In this study, our main assumption was that the population of agents depended on common environmental conditions.

Our second scenario, where the structure of the population is a grid, showed how establishing local conventions in early stages of evolution constrains the outcome of the code distribution, since to reconcile different conventions, several simultaneous changes are needed. On the other hand, in our third scenario, where we let the structure of the population change simultaneously with the codes themselves, such situations are avoided by disconnecting clusters with dissimilar conventions. This property enhances evolution, and can potentially lead to the adoption of several different conventions within an increasingly fragmenting, or speciating, population.

Our last scenario assumed perceptual constraints on the environmental information of each agent, and we looked at emerging concepts within a well-mixed population. This scenario was also studied in [20], where, as in our study, new conceptualizations of the world emerged as a result of considering together the concepts of every agent. In both studies, the new concept was not representable individually by any agent. Unlike in that study, the new concepts obtained in our study were the result of a simple similarity maximization principle, while in the work of [20], concepts were obtained through the modeling of an explicit fitness function.

The evolution of conventional codes could be interpreted, in the widest sense, as a form of cultural evolution. For instance, considering the definition of culture given by [25, p. 5], “Culture is information capable of affecting individuals' behavior that they acquire from other members of their species through teaching, imitation, and other forms of social transmission,” it could be argued that a form of cultural information is present in organisms, such as bacteria or plants. Although there is a dependence among the different dimensions on which information is transmitted in organisms (if we assume the dimensions to be, for instance, genetic, epigenetic, behavioral, and symbol-based, as proposed by [16]), our model assumes freedom of choice in one dimension, without direct influence on the others.

Finally, communication between individuals of a population opens up the possibility of “signal cheaters,” which could be either individuals that do not produce signals themselves but still perceive those of the others, or individuals who exploit other individuals' learned responses to symbols to their advantage. However, our model does not allow such behavior, since the code producing the outputs functions, implicitly, as the interpreter of the perceived signals.

## 6 Conclusion

In the proposed model, we introduced a key assumption that allowed us to evolve, for some structures, universal and optimal codes. This assumption states that an agent cannot distinguish the sources of the outputs it perceives from other agents. Following from this, a universal code will necessarily introduce semantics by relating symbols to environmental conditions (via the internal states of the agent). Our model proposes an information-theoretic way of measuring the similarity within a population of codes.

In this work, we proposed, as an evolutionary principle, that agents try to maximize their side information about the environment indirectly by maximizing their mutual code similarity. This behavior produces several interesting outcomes in the code distribution of a structured population. Depending on the population structure, it captures the evolution of a universal and optimal code (well-mixed population structure), and also the evolution of different codes organized in clusters (in a freely evolving structure), which allows the establishment of optimal as well as suboptimal conventions.

Finally, we considered a well-mixed heterogeneous population with perceptual constraints on the agents about the environment, and showed how, just by looking at the outputs of agents, it is possible to extract concepts that relate to the environment, concepts that none of the agents of the population could individually represent.

## References

1. Balázsi, G., Van Oudenaarden, A., & Collins, J. J. (2011). Cellular decision making and biological noise: From microbes to mammals. Cell, 144(6), 910–925.
2. Barbieri, M. (2003). The organic codes: An introduction to semantic biology. Cambridge, UK: Cambridge University Press.
3. Barbieri, M. (2008). Biosemiotics: A new understanding of life. Die Naturwissenschaften, 95(7), 577–599.
4. Battail, G. (2009). Applying semiotics and information theory to biology: A critical comparison. Biosemiotics, 2(3), 303–320.
5. Cohen, D. (1966). Optimizing reproduction in a randomly varying environment. Journal of Theoretical Biology, 12(1), 119–129.
6. De Boer, B. (2000). Self-organization in vowel systems. Journal of Phonetics, 28(4), 441–465.
7. Deely, J. (2006). On semiotics as naming the doctrine of signs. Semiotica, 2006(158), 1–33.
8. Donaldson-Matasci, M. C., Bergstrom, C. T., & Lachmann, M. (2010). The fitness value of information. Oikos, 119(2), 219–230.
9. Favareau, D. (2007). The evolutionary history of biosemiotics. In M. Barbieri (Ed.), Introduction to biosemiotics (pp. 1–68). Berlin, Heidelberg: Springer.
10. Görlich, D., Artmann, S., & Dittrich, P. (2011). Cells as semantic systems. Biochimica et Biophysica Acta, 1810(10), 914–923.
11. Hansen, N., & Kern, S. (2004). Evaluating the CMA evolution strategy on multimodal test functions. In X. Yao, E. K. Burke, J. A. Lozano, J. Smith, J. J. Merelo-Guervós, J. A. Bullinaria, J. E. Rowe, P. Tiňo, A. Kabán, & H.-P. Schwefel (Eds.), Parallel problem solving from nature - PPSN VIII (pp. 282–291). Berlin, Heidelberg: Springer.
12. Hansen, N., & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 159–195.
13. Heil, M., & Karban, R. (2010). Explaining evolution of plant communication by airborne signals. Trends in Ecology & Evolution, 25(3), 137–144.
14. Heylighen, F. (2011). Stigmergy as a generic mechanism for coordination: Definition, varieties and aspects (working paper). Vrije Universiteit Brussel.
15. Igel, C., Heidrich-Meisner, V., & Glasmachers, T. (2008). Shark. Journal of Machine Learning Research, 9, 993–996.
16. Jablonka, E., & Lamb, M. J. (2005). Evolution in four dimensions, revised edition: Genetic, epigenetic, behavioral, and symbolic variation in the history of life. Cambridge, MA: MIT Press.
17. Kelly, J. (1956). A new interpretation of information rate. IEEE Transactions on Information Theory, 2(3), 185–189.
18. Kussell, E., & Leibler, S. (2005). Phenotypic diversity, population growth, and information in fluctuating environments. Science, 309(5743), 2075–2078.
19. Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151.
20. Möller, M., & Polani, D. (2008). Common concepts in agent groups, symmetries, and conformity in a simple environment. Artificial Life, 11, 420–427.
21. Oudeyer, P. (2005). The self-organization of speech sounds. Journal of Theoretical Biology, 233(3), 435–449.
22. Parunak, H. V. D. (2006). A survey of environments and mechanisms for human-human stigmergy. In D. Weyns, H. V. D. Parunak, & F. Michel (Eds.), Environments for multi-agent systems II (pp. 163–186). Berlin, Heidelberg: Springer.
23. Perkins, T. J., & Swain, P. S. (2009). Strategies for cellular decision-making. Molecular Systems Biology, 5(1), 326.
24. Platt, T. G., & Fuqua, C. (2010). What's in a name? The semantics of quorum sensing. Trends in Microbiology, 18(9), 383–387.
25. Richerson, P. J., & Boyd, R. (2005). Not by genes alone: How culture transformed human evolution. Chicago: University of Chicago Press.
26. Rivoire, O., & Leibler, S. The value of information for populations in varying environments. Journal of Statistical Physics, 142(6), 1124–1166.
27. Schuster, M., Sexton, D. J., Diggle, S. P., & Greenberg, E. P. (2013). Acyl-homoserine lactone quorum sensing: From evolution to application. Annual Review of Microbiology, 67, 43–63.
28. Seger, J., & Brockmann, H. J. (1987). What is bet-hedging? Oxford Surveys in Evolutionary Biology, 4, 182–211.
29. Shah, J. (2009). Plants under attack: Systemic signals in defence. Current Opinion in Plant Biology, 12(4), 459–464.
30. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
31. Slatkin, M. (1974). Hedging one's evolutionary bets. Nature, 250, 704–705.
32. Steels, L. (1995). A self-organizing spatial vocabulary. Artificial Life, 2(3), 319–332.
33. Theraulaz, G., & Bonabeau, E. (1999). A brief history of stigmergy. Artificial Life, 5(2), 97–116.
34. Vetsigian, K., Woese, C. R., & Goldenfeld, N. (2006). Collective evolution and the genetic code. Proceedings of the National Academy of Sciences of the United States of America, 103(28), 10696–10701.
35. Waters, C. M., & Bassler, B. L. (2005). Quorum sensing: Cell-to-cell communication in bacteria. Annual Review of Cell and Developmental Biology, 21, 319–346.
36. West, S. A., Griffin, A. S., Gardner, A., & Diggle, S. P. (2006). Social evolution theory for microorganisms. Nature Reviews Microbiology, 4(8), 597–607.
37. Wilson, E. O. (2000). Sociobiology: The new synthesis. Cambridge, MA: Harvard University Press.
38. Woese, C. R. (2002). On the evolution of cells. Proceedings of the National Academy of Sciences of the United States of America, 99(13), 8742–8747.
39. Woese, C. R. (2004). A new biology for a new century. Microbiology and Molecular Biology Reviews, 68(2), 173–186.
40. Wong, A. K., & You, M. (1985). Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(5), 599–609.

## Author notes

Contact author.

∗∗ Adaptive Systems Research Group, University of Hertfordshire, Hatfield, U.K. E-mail: a.c.burgos@herts.ac.uk (A.C.B.); d.polani@herts.ac.uk (D.P.)