The prefrontal cortex (PFC) supports goal-directed actions and exerts cognitive control over behavior, but the underlying coding and mechanism are heavily debated. We present evidence for the role of goal coding in PFC from two converging perspectives: computational modeling and neuronal-level analysis of monkey data. We show that neural representations of prospective goals emerge by combining a categorization process that extracts relevant behavioral abstractions from the input data and a reward-driven process that selects candidate categories depending on their adaptive value; both forms of learning have a plausible neural implementation in PFC. Our analyses demonstrate a fundamental principle: goal coding represents an efficient solution to cognitive control problems, analogous to efficient coding principles in other (e.g., visual) brain areas. The novel analytical–computational approach is of general interest because it applies to a variety of neurophysiological studies.
Flexible cognitive control is fundamental to our everyday activities. It relies on the ability to efficiently learn to extract and fulfill goals, different from habitual decisions that build on stereotyped responses (Dolan & Dayan, 2013). PFC supports goal-directed behavior by biasing response selection based on contextual information, goals, and other task-relevant information or task sets (Frank & Badre, 2012; Passingham & Wise, 2012; Reverberi, Görgen, & Haynes, 2012; Koechlin & Hyafil, 2007; Genovesio, Brasted, & Wise, 2006; Koechlin, Ody, & Kouneiher, 2003; Monsell, 2003; Miller & Cohen, 2001). The neural codes and mechanisms supporting these PFC abilities remain elusive, with contrasting proposals that include mixed selectivity for a large basis of task-related properties (Rigotti et al., 2013) and representations of prospective behavioral goals (Genovesio, Tsujimoto, & Wise, 2012; Yamagata, Nakayama, Tanji, & Hoshi, 2012).
To disentangle these alternatives, here, we simulated three tasks previously used to study monkey prefrontal function (Genovesio et al., 2012): a duration discrimination, a distance discrimination, and a match-to-sample task. To simulate these tasks, we used a probabilistic computational model that fuses unsupervised and value-driven learning. In particular, we used an approximate nonparametric probabilistic category learning method (Sanborn, Griffiths, & Navarro, 2010; Anderson, 1991) to infer from experience a set of candidate categories that guide stimulus–action value transitions and a reward-sensitive process to select the actual category to be used for action control at any given trial (Collins & Frank, 2013; Collins & Koechlin, 2012).
As in the monkey studies reported in Genovesio et al. (2012), neural representations of prospective goals (goal codes) emerged in the model as latent statistical categories grouping noisy stimulus–action value contingencies in optimal ways. The analysis of the model behavior demonstrates that the emergent goal codes afforded efficient learning and action selection. To assess whether the goal codes replicate the coding properties of PFC neurons in the Genovesio et al. (2012) study (that is, an advance representation of the identity of the to-be-selected target stimulus), we compared the monkey PFC data and the latent states learned by the computational model using an information-theoretic approach based on (conditional) mutual information, which permits assessing their specific coding properties while excluding confounds. The analyses revealed that the goal codes that emerged in the model replicate with high accuracy key properties of PFC neurons and that the goal information is not confounded with other characteristics of stimuli such as their color or magnitude. The results thus provide a novel mechanistic explanation of how PFC exerts cognitive control by learning prospective goal codes. Furthermore, our results integrate two influential streams of research on PFC functioning that focus on behavioral control (Passingham & Wise, 2012; Miller & Cohen, 2001) and category learning (Seger & Miller, 2010), respectively, showing that these are complementary processes within a nonparametric probabilistic learning system. In a broader perspective, our study points to hierarchical probabilistic inference as a general framework to understand prefrontal function (Pezzulo, Rigoli, & Friston, 2015; Donoso, Collins, & Koechlin, 2014; Friston et al., 2013; Friston, 2010; Doya, Ishii, Pouget, & Rao, 2007; Monsell, 2003).
Recap of the Experimental Procedure of the Target Monkey Experiment
Our test-bed is a set of three monkey tasks (duration discrimination, distance discrimination, and match-to-sample) where goals were implicitly defined by stimuli magnitudes and colors/shapes (Genovesio et al., 2012). In the original monkey experiment, trials began with the sequential presentation of visual context Stimuli S1 and S2 that were either a red square or a blue circle (hereon, we will refer only to their color). Then, each of the two target stimuli coding the identity (color and shape) of S1 and S2 appeared on a video monitor, randomly to the left or to the right of the screen center (Figure 1A). The monkey's task was to touch a switch below the stimulus that previously either lasted longer (duration discrimination task), appeared farther from screen center (distance discrimination task), or appeared twice in the trial (match-to-sample task). In each trial of the duration discrimination task, the duration and the identity of the stimuli varied, but not their distance. In the distance discrimination task, the distance and the identity of the stimuli varied, but not their duration. In the match-to-sample task, the duration of the stimuli varied, but not their identity or distance. (Note that this is a peculiar match-to-sample task because the same sample is presented twice, not once as usual, to have the same number of stimuli presentations as in the duration and distance discrimination tasks.) In the duration and match-to-sample tasks, the stimuli appeared at screen center and lasted 200–1200 msec, varying in steps of 200 msec (six levels in total). In the distance task, the stimuli lasted 1000 msec each, and the distance from the screen center varied from 1.6 to 9.4 visual degrees, in steps of 1.6° (six levels in total). S1 and S2 had equal probability of either lasting longer or shorter in the duration and matching-to-sample tasks; and of being either farther or closer from screen center in the distance task. 
The period between S2 and the presentation of the targets for the response lasted 400 or 800 msec.
During the training phase, monkeys learned the three tasks in sequence using a block design: first the duration task, then the distance task, and finally the match-to-sample task, with occasional presentation of blocks containing the previously learned tasks, until performance was adequate. The recording period started once the monkeys reached an average response accuracy of 80% or more in the duration and distance tasks and in the match-to-sample task. In this test period, the monkeys performed the three tasks in blocks with pseudorandom task order, during which the activity of dorsolateral and caudal PFC neurons was recorded (with means of n = 192 trials for the duration task, n = 151 trials for the distance task, and n = 92 trials for the match-to-sample task).
The average neural activity of interest was calculated in the 80- to 400-msec period after a subject could discriminate which was the stimulus with the greatest magnitude; during this period, PFC neurons were found to carry prospectively the information on the goal identity (see Genovesio et al., 2012, for other details).
Computational Modeling Methods
We simulated the experiments using a probabilistic generative model that learns to predict reward based on noisy sensory information—context (Stimuli S1 and S2 with noisy magnitude and color identity properties) and target (the color identities of S1 and S2)—actions (press S1 or S2), and latent states or categories inferred during learning (Figure 1C). We assumed that the latent categories correspond to a population of PFC neurons, where individual neurons have specific preferences (e.g., for a response) and compete for selection. The conditional probability that a given category is selected in a given context corresponds instead to the strength of connectivity between neurons in the latent category population and input neurons encoding context information (Pouget, Beck, Ma, & Latham, 2013). Thus, a given context conditionally drives the selection of the latent category most strongly related with it, which (together with the target stimulus) determines action selection, that is, the selection of the action with the higher probability to secure a reward.
The model includes a context variable that jointly encodes the perceived properties of the Stimuli S1 and S2: distance (six magnitude levels, from closer to farther from the center), duration (six magnitude levels, from shorter to longer), and color (two levels: red and blue). The four magnitude inputs (i.e., the duration and distance of both S1 and S2) were encoded noisily: Continuous Gaussian noise was added to each of the four inputs, and the resulting continuous values were rounded to obtain their final values (six magnitude levels each). A fully orthogonal coding of the context variable would require $(6 \times 6 \times 2)^2$ levels, but because, in our tasks, only one property (either duration or distance) varies in magnitude in a given trial, we could use a compact coding with only 288 levels.¹ The model also includes a target variable that encodes the properties of response-triggering stimuli: color (two levels) and position (two levels: left and right). The target variable was orthogonally coded (four levels). Finally, the model includes a response variable orthogonally coded to represent the two possible actions of the monkey (left and right response) and a reward variable orthogonally coded to represent two possible outcomes (rewarded and not rewarded).
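The level counts above can be checked with a short enumeration. The sketch below shows only one decomposition that reproduces the stated 288 levels; it is an assumption, not the authors' documented coding scheme. It treats the colors of S1 and S2 as free (2 × 2), lets the varying property be either duration or distance, and gives each stimulus six magnitude levels on that property. The function name `compact_contexts` is hypothetical.

```python
from itertools import product

COLORS = ("red", "blue")
LEVELS = range(1, 7)                      # six magnitude levels
PROPERTIES = ("duration", "distance")     # only one varies in a given trial

def compact_contexts():
    """Enumerate contexts in which a single magnitude property varies (assumed scheme)."""
    return [
        (prop, c1, m1, c2, m2)
        for prop in PROPERTIES
        for c1, c2 in product(COLORS, repeat=2)
        for m1, m2 in product(LEVELS, repeat=2)
    ]

# Fully orthogonal coding: distance x duration x color, for both S1 and S2.
full_orthogonal = (6 * 6 * 2) ** 2
compact = len(compact_contexts())
print(full_orthogonal, compact)
```

Under these assumptions the compact coding is 18 times smaller than the fully orthogonal one.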
Description of the Computational Model
The model uses a Bayesian reinforcement learning scheme to update the conditional probabilities based on the number of successes and failures. The update considers the (Bernoulli distribution of) predicted reinforcement and the actual action outcomes. In parallel, an approximate nonparametric learning method, a Dirichlet nonparametric mixture process, shapes the clustering of the contexts into latent categories according to their utility in obtaining reinforcement. First, observed contexts recruit existing clusters according to their overall selection frequency (i.e., popularity) or engage novel ones with a small probability, favoring compact clustering (Donoso et al., 2014; Gershman & Blei, 2012). Second, the conditional probabilities between contexts and categories are scaled depending on the observed stimulus–action value contingencies; in other words, the probability of obtaining a reward contributes to the shaping and selection of the categories (see Equation 2 below). This approximate learning method gradually leads to the acquisition of latent category units that “model” or “explain” the observed stimulus–action value contingencies and afford efficient reward acquisition (Collins & Frank, 2013; Collins & Koechlin, 2012). The specific method we adopted for the Dirichlet nonparametric mixture process is a local maximum a posteriori inference, developed by Sanborn et al. (2010) for category learning and later applied in the domain of reinforcement learning (Collins & Frank, 2013); this noniterative, local procedure is more biologically realistic than alternative (e.g., nonlocal) methods.
In our target monkey task, monkeys had to infer the correct “rule” or “goal” (say, red goal) based on the sequential presentation of two visual stimuli (a red square and a blue circle) and then select a response (press a button) when the two stimuli were subsequently presented together, in a randomized (left or right) position. By analogy, in our model, the two earlier presented Stimuli S1 and S2 correspond to context stimuli $c_t$ that are categorized into latent states, whereas targets correspond to response-triggering stimuli $s_t$ (Figure 1C). Importantly, the perceived context stimuli $c$ are clustered into latent states (categories) $z$ according to their utility to obtain a reward. The clustering is defined by a probability distribution $P(z \mid c)$ and initialized with a nonparametric probabilistic approach: a Dirichlet mixture process, also known as the Chinese restaurant process (CRP; Gershman & Blei, 2012). According to this popular metaphor, clusters, or categories, correspond to restaurant tables and contexts to customers. A newly experienced context (new customer) $c_{n+1}$ is assigned to a new cluster (empty table; new category) $z_{\text{new}}$ with a small probability $P(z_{\text{new}} \mid c_{n+1}) = \alpha / A$ (controlled by the concentration parameter $\alpha$, here $\alpha = 2$) or to an old cluster (occupied table; old category) $z_i$ according to a measure of its popularity $P(z_i \mid c_{n+1}) = \sum_j P(z_i \mid c_j) / A$ (table occupancy; category priors) across all contexts, where $A$ is a normalizing factor, $A = \alpha + \sum_{i,j} P(z_i \mid c_j)$, that essentially counts the number of experienced contexts. After the initialization, the perceived context $c_t$ evokes the most probable context-specific category (category inference) $z_t = \arg\max_i P(z_i \mid c_t)$, which in turn conditions the subsequent action selection process.
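The CRP initialization and the category inference step described above can be sketched as follows. This is a minimal illustration, assuming the assignment beliefs are stored as a matrix with one row per category and one column per experienced context; the function names and the toy matrix are hypothetical, and the reward-dependent scaling of the beliefs is not included.

```python
import numpy as np

def crp_assignment_probs(P, alpha=2.0):
    """Prior over categories for a newly experienced context.

    P has shape (n_categories, n_contexts) and holds P(z_i | c_j)
    for the contexts experienced so far.
    """
    popularity = P.sum(axis=1)        # sum_j P(z_i | c_j), the table occupancy
    A = alpha + popularity.sum()      # normalizing factor
    p_old = popularity / A            # P(z_i | c_new) for each existing category
    p_new = alpha / A                 # P(z_new | c_new)
    return p_old, p_new

def infer_category(P, context_index):
    """Local MAP inference: the most probable category for a known context."""
    return int(np.argmax(P[:, context_index]))

# Toy example: three experienced contexts, two existing categories.
P = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
p_old, p_new = crp_assignment_probs(P, alpha=2.0)
print(p_old, p_new, infer_category(P, 2))
```

The probabilities of joining an existing category and of opening a new one sum to one by construction, because the normalizing factor collects both terms.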
Critical to the organization of the category structure, the belief in this assignment $P(z_i \mid c_t)$ is scaled with the probability of the current reward (caused by the choice) (category learning; Collins & Frank, 2013; Sanborn et al., 2010; see Equation 2 below).
Finally, we calculate the posterior conditioned reinforcement distribution $P_{t+1}(r = 0, r = 1)$ by normalizing $n_{t+1}^{r=1}$ and $n_{t+1}^{r=0}$ to sum to one (conditional value learning).
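The conditional value learning step amounts to keeping success and failure counts and normalizing them. The sketch below illustrates this for a single combination of target, action, and latent state; the class name and the initial pseudo-count of 1 per outcome are assumptions, since the text does not state how the counts are initialized.

```python
import numpy as np

class RewardCounts:
    """Success/failure counts for one (target, action, latent state) combination."""

    def __init__(self, prior_count=1.0):
        # counts[0] tracks r = 0 (no reward), counts[1] tracks r = 1 (reward).
        # The initial pseudo-count is an assumption made for this sketch.
        self.counts = np.full(2, prior_count, dtype=float)

    def update(self, reward):
        """Increment the count of the observed outcome (0 or 1)."""
        self.counts[int(reward)] += 1.0

    def posterior(self):
        """Normalize the counts so that P(r = 0) + P(r = 1) = 1."""
        return self.counts / self.counts.sum()

cell = RewardCounts()
for r in (1, 1, 0, 1):
    cell.update(r)
print(cell.posterior())   # [P(r = 0), P(r = 1)]
```

A full learner would hold one such pair of counts for every combination of target stimulus, action, and latent state.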
Method for Calculating Noise in the Model Stimuli and Responses
The perception of sensory magnitudes such as time and distance is intrinsically noisy (e.g., Tudusciuc & Nieder, 2007), such that the perceived magnitudes vary from trial to trial and the variability $w_n$ of the perceived magnitudes scales with the magnitudes according to the Weber–Fechner law (Dehaene, 2003; Whalen, Gallistel, & Gelman, 1999; Gibbon, 1977). The noise causes considerable response variability and hinders the learning of magnitude-dependent tasks because of reduced consistency of the experienced stimulus–action value contingencies. Because magnitude comparison plays a crucial role in the duration and distance tasks, performing a realistic simulation of the monkey experiments requires equipping the computational model with input stimuli having an adequate level of perceptual noise. To this aim, we adopted a typical point-wise internal noisy magnitude representation, the so-called mental number line, characterized by Gaussian noise $w_n = n w_0$ that scales with the magnitude $n$ and that is parameterized by a variability coefficient $w_0$ (e.g., Whalen et al., 1999). We then conducted a reanalysis of the monkeys' behavior in the original experiment (Genovesio et al., 2012) to adequately estimate the variability coefficient $w_0$ of the relevant (duration and distance) properties.
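The number-line noise model can be sketched as below. The code adds zero-mean Gaussian noise whose standard deviation grows linearly with the magnitude and then rounds the result back to one of six levels. Expressing magnitudes as level indices 1 to 6 and clipping out-of-range values to the nearest valid level are assumptions of this sketch; the function name `perceive` is hypothetical.

```python
import numpy as np

def perceive(levels, w0, rng, n_levels=6):
    """Return noisy, discretized percepts of the given magnitude levels.

    Noise standard deviation scales with magnitude: sd = w0 * level.
    """
    levels = np.asarray(levels, dtype=float)
    noisy = levels + rng.normal(0.0, w0 * levels)
    # Round to the nearest level; clipping to [1, n_levels] is an assumption.
    return np.clip(np.rint(noisy), 1, n_levels).astype(int)

rng = np.random.default_rng(seed=0)
true_levels = np.array([1, 2, 3, 4, 5, 6])
print(perceive(true_levels, w0=0.34, rng=rng))
```

With a variability coefficient of zero the percepts equal the true levels, and larger magnitudes are perceived with proportionally more scatter.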
The first step of the analysis is the calculation of magnitude discriminability coefficients at the behavioral level, the so-called Weber fraction w (Figure 1B). They were obtained by using the monkey responses in the postlearning test period to build psychometric response profiles for each of the three tasks (Figure 1B). For the duration and distance discrimination tasks, the profiles display the probability of selecting the identity of Stimulus S1 as a function of the log ratio between the magnitude properties of S1 and S2. Consistent with what is reported in the literature, on a log scale, the profile of our magnitude comparison tasks is a symmetric sigmoidal curve (see, e.g., Stoianov & Zorzi, 2012). Essentially, the larger the absolute value of the log ratio between the compared magnitudes, the larger the probability to select the correct stimulus, or in other words, the more different the magnitudes are, the greater are the odds for a correct response. The profiles are summarized by a magnitude discriminability coefficient, a Weber fraction w that describes the slope of the sigmoid: the smaller the w, the more vertical is the slope and the more precise is the response (Pica, Lemer, Izard, & Dehaene, 2004). For the match-to-sample task, the psychometric profile of Figure 1B plots the monkeys' response accuracy as a function of the duration ratio between the two stimuli, which was manipulated in the task but irrelevant to solve it. As expected, the response appears independent of ratio.
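To make the link between the Weber fraction and the sigmoid slope concrete, the sketch below uses a common formulation that follows from scalar variability: if each magnitude is perceived with Gaussian noise of standard deviation w times the magnitude, the probability that S1 is perceived as larger than S2 is a cumulative Gaussian of the normalized difference. Whether the psychometric profiles in Figure 1B were fitted with exactly this function is not stated here, so the formula and the function name should be read as assumptions.

```python
from math import erf, sqrt

def p_choose_s1(n1, n2, w):
    """Probability of judging S1 larger than S2 under scalar variability.

    The difference of the two percepts is Gaussian with mean n1 - n2
    and standard deviation w * sqrt(n1**2 + n2**2).
    """
    z = (n1 - n2) / (w * sqrt(n1 ** 2 + n2 ** 2))
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF at z

# The curve depends only on the ratio n1 / n2 and steepens as w decreases.
for w in (0.2, 0.5, 1.0):
    print(w, p_choose_s1(600, 400, w))
```

Because the expression depends only on the ratio of the two magnitudes, swapping S1 and S2 mirrors the probability around 0.5, which gives the symmetric sigmoid on a log-ratio axis.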
An ideal, noise-free perception of magnitude feeding an errorless decision-making system would result in a magnitude discriminability coefficient $w$ equal to zero. However, here, we observed relatively large Weber fractions in the duration and distance discrimination tasks (Figure 1B), suggesting that, in the monkey brain, these processes are quite noisy. The moderate error rate in the match-to-sample task, in which magnitude perception is not essential, indicates that other noisy mental processes (e.g., memory storage and implicit task selection) contribute to the variability of response selection in this task and in the other tasks. To account for the variability of these nonperceptual processes, we subtracted a noise term $\sigma_{\text{nonperceptual}} = 0.15$ from the calculated behavioral discriminability coefficients $w$ (shown in Figure 1B). This permitted us to estimate the variability coefficients of the internal representations of duration discrimination, $w_0 = 0.34$, and distance discrimination, $w_0 = 0.33$. We used these parameters to generate noisy magnitude stimuli using the aforementioned number-line model; note, however, that two control simulations reported below show that the learning mechanism is robust to greater levels of noise. Finally, we added constant variability at response selection by randomly alternating 5% of the responses.
Information Theory Measures Used in the Analyses
To compare PFC neurons in the Genovesio et al. (2012) study and the clusters evolved by the computational model, we used information theoretic measures. In particular, we analyzed the information content (in terms of properties of interests such as goals or colors and their combination) conveyed by PFC neurons and the clusters evolved by the computational model.
To apply information measurements to responses with many levels (or continuous values), the responses first need to be discretized into only a few levels (e.g., two or three); this procedure is necessary to prevent sparse observations of multiple response levels from artificially distorting the information measures (Panzeri, Senatore, Montemurro, & Petersen, 2007). We adopted a simple three-level response discretization procedure based on normalized values: First, we normalized the response by subtracting its mean and dividing by its variance; then, we created three categories separated by levels −0.5 and 0.5. A preliminary entropy-measuring analysis revealed that the three-level discretization increased the overall information content relative to a simpler two-level discretization; furthermore, it did not lose too much information relative to a four-level discretization.
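The three-level discretization can be sketched as follows. The code follows the description literally (subtract the mean, divide by the variance, cut at −0.5 and 0.5). The function name is hypothetical, the use of the population variance is an assumption, and a constant response (zero variance) is not handled.

```python
import numpy as np

def discretize_three_levels(response, lower=-0.5, upper=0.5):
    """Map a raw response (firing rate or selection probability) to levels 0, 1, 2."""
    response = np.asarray(response, dtype=float)
    # Normalization as described in the text: mean-centered, scaled by the variance.
    normalized = (response - response.mean()) / response.var()
    # Level 0: below lower; level 1: between the two cuts; level 2: at or above upper.
    return np.digitize(normalized, [lower, upper])

rates = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(discretize_three_levels(rates))
```

The resulting integer labels can be passed directly to plug-in estimators of entropy and mutual information.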
We used the three-level discretization to perform information criteria analyses of two kinds of raw responses: the firing rate of PFC neurons and the selection probabilities of the latent states of the computational model. The properties of interest had already few levels: The goal had two levels (red and blue target), the task had three levels (duration, distance, and match-to-sample), the index of the larger stimulus had two levels (either S1 or S2), and the color of the first stimulus also had two levels (either red or blue). The information theoretic measures we used are introduced below.
Entropy $H(x) = -\sum_i p(x_i) \log_2 p(x_i)$ is a measure of the overall quantity of information conveyed by response $x$, and it essentially measures response variability. Here, $p(x_i)$ is the probability of each specific response level $x_i$.
Mutual information $I(x; s) = \sum_{i,j} p(x_i, s_j) \log_2 \frac{p(x_i, s_j)}{p(x_i)\, p(s_j)}$ measures the information carried by the neural response $x$ about a stimulus property $s$, where $p(s)$, $p(x)$, and $p(x, s)$ are the marginal and joint empirical probabilities of the property and the response. Critically, in the case of multiple related properties, it does not measure the specific amount of information carried by each of them. In our task, this might potentially confound the interpretation of goal-coding neurons. Because goals could possibly encode a mixture of color, magnitude, and task information, it is possible that neurons encoding one of these properties carry nonspecific information about the goal.
To rule out such confounds, we used conditional mutual information $I(x; s \mid s') = \sum_k p(s'_k) \sum_{i,j} p(x_i, s_j \mid s'_k) \log_2 \frac{p(x_i, s_j \mid s'_k)}{p(x_i \mid s'_k)\, p(s_j \mid s'_k)}$, which measures the amount of information about a property $s$ while controlling for another property $s'$. Note that relative to $I(x; s)$, $I(x; s \mid s')$ can decrease, remain invariant, or increase. The more it decreases, the more the response encodes $s$ by virtue of $s'$. Following our goal-coding hypothesis, we expected that the mutual information between the response $x$ and the goal property conditioned on all related properties would not drop to (or be close to) zero. Indeed, such a drop would imply that most of the goal-related information is explained away by the property $s'$; for example, if $I(x; \text{goal} \mid \text{color})$ drops close to zero, then color coding would be a more parsimonious explanation than goal coding. To verify that the conditional information is statistically different from zero, we used the formal nonparametric method of Ince, Mazzoni, Bartels, Logothetis, and Panzeri (2012).
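The three measures can be computed from discretized trial labels with simple plug-in estimators, as sketched below. This is an illustration only: it uses raw empirical frequencies, base-2 logarithms (so that values are in bits), and no bias correction or significance test, whereas the analyses in the text rely on the method of Ince et al. (2012) for significance. All function names are hypothetical.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Plug-in entropy (bits) of a sequence of discrete labels."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, s):
    """I(x; s) = H(x) + H(s) - H(x, s), equivalent to the double-sum definition."""
    return entropy(x) + entropy(s) - entropy(list(zip(x, s)))

def conditional_mutual_information(x, s, s_prime):
    """I(x; s | s'): mutual information within each level of s', weighted by p(s')."""
    x, s, s_prime = (np.asarray(v) for v in (x, s, s_prime))
    total = 0.0
    for level in np.unique(s_prime):
        mask = s_prime == level
        total += mask.mean() * mutual_information(x[mask].tolist(), s[mask].tolist())
    return total

# Toy example: the response follows the goal perfectly.
response = [0, 0, 1, 1]
goal = [0, 0, 1, 1]
confound_same = [0, 0, 1, 1]     # a property identical to the goal
confound_indep = [0, 1, 0, 1]    # a property unrelated to the goal
print(mutual_information(response, goal))
print(conditional_mutual_information(response, goal, confound_same))
print(conditional_mutual_information(response, goal, confound_indep))
```

In the toy example, conditioning on a property identical to the goal explains the goal information away, whereas conditioning on an unrelated property leaves it intact; this is the logic used to separate genuine goal coding from confounds.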
Tuning Functions and Contrastive Preference of Neurons for Stimuli Features
The probabilistic computational model was first trained and then tested in a block design, in keeping with the target monkey experiment protocol. Each block presented pseudorandomly selected patterns from a given task, with noise added as described earlier.
Below, we report the results of the simulations of the monkey experiments (behavioral and neural data) and of several control experiments. All the results reported below are calculated as the average of 30 simulations.
The first critical test for our model was the ability to adequately replicate the monkey behavioral data. Using the monkey protocol and applying perceptual and action-selection variability measures derived from the analysis of the monkey psychometric profile (see the Methods section), we administered the three experimental tasks to 30 replicas of the model, which learned them successfully within a few thousand trials (Figure 7) despite the considerable noise of the stimulus–action value contingencies. Response accuracy during the test period was equivalent to monkeys' performance: 80% (SE = 0.5%) in the duration task, 80% (SE = 0.5%) in the distance task, and 95% (SE = 0.4%) in the match-to-sample task (reward-driven learning also continued in this test period). At the behavioral level, the model exhibited monkey-like response accuracy, psychometric profiles (per-task correlations, $R^2 > .88$), and magnitude discriminability in the test period (Figure 1B and D).
After an adequate simulation of monkeys' behavior (Figure 1B and D), we performed a neural-level analysis of the way the model obtains flexible control of behavior. In keeping with the goal-coding hypothesis, we predicted that (1) the model would have learned the observed stimuli–action value contingencies by clustering the large number of context stimuli into a small set of latent states or categories and (2) the latent states would correspond to goals (i.e., the identity—here, the color—of the target stimulus that had to be selected). The other possible clustering structures were a common representation of magnitude (e.g., clusters coding for “the larger” and “the smaller” stimuli) or a sensory-oriented encoding, clustering one or a combination of context stimuli properties (e.g., cluster coding for colors).
We found that the computational model consistently used a very limited number of popular categories, each aggregating a large number of contexts (Figure 2A and B). However, this result per se is not yet sufficient to assess the goal-coding hypothesis, that is, that the used latent categories actually carry goal-related information; to this aim, it is essential to analyze what these latent states encode. The next relevant questions were as follows: Are the learned categories purely perceptually driven (Freedman, Riesenhuber, Poggio, & Miller, 2001)? Did they code basic perceptual categories such as stimuli magnitude? Or did they cluster prospective goals or control signals? Did they correspond to the coding characteristics of the monkey PFC neurons studied in the same tasks? We approached these critical questions with an analysis based on (conditional) mutual information measures.
Analyses Using Information Measures
We used information theoretic measures to assess the coding properties of PFC neurons in the Genovesio et al. (2012) study and the clusters created by the model and, in particular, to assess if they code (or carry information on) prospective goals, in keeping with the goal-coding hypothesis. Here, “goal information” indicates a property or a set of properties that are relevant for a (future) choice, for example, whether red or blue should be the choice for the target. In the monkey experiment, there are two possible targets for the choice, so goal information can be measured as a contrastive preference (see Equation 5) between the to-be-chosen target versus the not-to-be-chosen target (e.g., red vs. blue).
We first analyzed all the 324 PFC neurons recorded from the three tasks and all latent categories whose activity conveyed at least 0.8 bits of overall information (measured with entropy) and 0.10 bits of mutual information about the goal (arbitrary thresholds). This analysis identified n = 117 latent categories (among the 30 replicas of the model) and n = 31 PFC neurons encoding goals (Figure 2C and D; black dots indexing the mutual information conveyed by the unit response about the goal).
However, as explained in the Methods section, the mutual information analysis cannot rule out the possibility that goal coding is “spurious” and confounded by other properties of the stimuli. To verify whether these units conveyed genuine goal information, we then calculated the conditional mutual information conveyed by the units about the goals, considering various task-related potential confounds of the goal property (Figure 2C and D; color dots indexing the corresponding conditional mutual information for each unit). As explained in the Methods section, if the mutual information conditioned on a given property substantially decreases (relative to the nonconditional mutual information) and approaches zero, then this property would explain away the information conveyed by the unit about the goals (or, in other words, the units would encode a confounding property, not a goal). However, for all the units identified as goal coding, the conditional mutual information was significantly greater than zero (p < .05 using the method of Ince et al., 2012), thus revealing genuine domain-general representation of prospective goals in both PFC neurons and the latent categories, although noisier in the neural data.
To corroborate this conclusion, we investigated the coding properties of the model units and PFC goal-coding neurons (as identified using the above analysis) by compactly expressing goal coding for each task in terms of contrastive preference (see Equation 5) to respond to the color corresponding to the prospective goal (e.g., “red”) but not the other (e.g., “blue”). Figure 3 shows the scatterplots of preferences, calculated separately for each pair of tasks. Note that the dots representing task sets are exclusively present in the top right and bottom left quadrants and lie along the main diagonal. This result indicates that, as in the monkey data, the goal-coding responses are task invariant (i.e., the task sets have the same goal preference in all three tasks), which corroborates the hypothesis of a domain generality of goal codes. Note that the coding properties of goal codes are different from (and cannot be explained in terms of) simpler color-selective (or color-coding) responses because the neurons/units were only active when the color they were selective for corresponded to the behavioral goal (see Figure 4). This result corroborates the hypothesis that latent categories are true goal codes rather than having a mere preference for sensory properties like color or magnitude.
In a further analysis, we also investigated the coding properties of the 20 most frequently selected clusters in every replica of the model and, in particular, if they encoded one or more of the following properties: goal coding (Figure 4A and B); order-based magnitude preference, by calculating the contrastive preference for greater first (S1 > S2) and second (S1 < S2) stimuli magnitude (Figure 4D); and color preference of one of the context stimuli, S1, by calculating the contrastive preference for “red” and “blue” S1 stimuli (Figure 4G).
Once again, we found that the goal-coding preference was task invariant (Figure 4A and B), thus providing strong evidence about its domain generality. Moreover, and consistent with monkey data, no latent category units exclusively encoded other stimuli dimensions such as a common representation of magnitude across tasks (Figure 4D).
Overall, the emerged goal codes were not purely stimulus related but constituted a task-relevant abstraction: a possible way PFC might convert order- and feature-based stimulus codes into a domain-general but goal-specific code that affords efficient action selection.
Analysis of the Emergent Goal-coding Mechanism
To better understand the principle of goal coding, we analyzed the emergent neural-level mechanism of goal coding and response selection in one sample learner, whose latent-state selection is shown in Figure 2A; the mechanism is shown in Figure 5. At the top is the probability $P(z_i \mid c_j)$ of selecting each of the three most popular latent states $z_i$ for each combination of context stimuli $c_j$ properties (color and duration/distance of Stimuli S1 and S2), separated in six different panels by the implicit stimuli-dependent task and color of S1. The bottom of the figure shows the probability $P(r = 1 \mid s_k, a_l, z_i)$ of obtaining reward given each combination of target stimuli $s_k$ (red-blue or blue-red) and possible action $a_l$ (press left or press right), separately for each of the three most popular latent states $z_i$ whose selection preference is shown above. This probability is used by the learner to select the action that brings reward.
Thus, upon the (noisy) perception of Context stimuli S1 and S2, this learner would preferentially select the first latent state if the stimuli were perceived as having red goal (i.e., larger or longer red stimulus or red color-matching stimuli) and the second or third latent state if a blue goal was instead perceived (i.e., larger or longer blue stimulus or blue color-matching stimuli). The selected latent state conditions the successive response selection upon target appearance. For example, if the red goal was selected (i.e., the first latent state is active) and the target stimuli correspond to “red to the left, blue to the right,” the learner would press the left button to obtain reward.
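The two-stage selection just described (context to latent state, then latent state and target to action) can be sketched as below. The probability values are invented purely for illustration and are not taken from Figure 5; the array layout and names are hypothetical, and only two latent states are shown for brevity.

```python
import numpy as np

TARGETS = ("red-left/blue-right", "blue-left/red-right")
ACTIONS = ("press left", "press right")

# Hypothetical P(r = 1 | target, action, latent state), indexed as
# reward_prob[latent_state, target, action]. Values are illustrative only.
reward_prob = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # latent state 0 acts as a "red goal" code
    [[0.1, 0.9], [0.9, 0.1]],   # latent state 1 acts as a "blue goal" code
])

def select_action(category_probs, target_index):
    """Pick the most probable latent state, then the action most likely to be rewarded."""
    z = int(np.argmax(category_probs))                 # category inference
    a = int(np.argmax(reward_prob[z, target_index]))   # reward-maximizing action
    return z, ACTIONS[a]

# The context is perceived as a red goal; the targets show red on the left.
print(select_action(np.array([0.8, 0.2]), target_index=0))
```

With the same targets, a context perceived as a blue goal selects the other latent state and therefore the opposite button.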
The simulations thus far replicated the monkey data in the specific test-bed conditions at both the behavioral and neural levels. In addition, we have conducted a set of control simulations, with two aims: (1) demonstrating the generality, robustness, and scalability of our computational methods and (2) generating novel empirical predictions—a paramount feature of computational modeling, whose results should extend beyond the mere replication of existing data.
Effects of Perceptual Noise
An important question is whether the proposed computational scheme is robust to various levels of noise in magnitude perception or, in other words, whether it permits extracting appropriate goals with high or low levels of noise. We investigated this issue in two control simulations. In one, the noise was half of that used in the main simulation, whereas in the other, the noise was doubled. In general, we expected to observe corresponding changes in behavioral discriminability but not qualitative differences such as the impossibility for the model to extract goal codes (unless of course the model experiences ceiling effects of noise, which would preclude learning in general, not only goal coding).
As predicted, much smaller perceptual variability (w0 = 0.17 for both distance and duration) largely improved magnitude discriminability at the behavioral level (duration: w = 0.28, SE = 0.01; distance: w = 0.27, SE = 0.01), and neural-level analysis revealed the same control mechanism based on goal coding (Figure 6A–C). Note that this simulation would represent a closer approximation of an experiment with adult human participants, whose magnitude (or analog number) processing system is generally more precise than that of monkeys.
The result of the second control simulation (with higher levels of noise) exceeded our expectations. The subjects learned to identify and select correct targets despite very large perceptual noise (w0 = 0.68 for both distance and duration) and, consequently, with much worse magnitude discriminability relative to the main simulation (duration: w = 0.99, SE = 0.039; distance: w = 0.89, SE = 0.034). Importantly, a neural-level analysis revealed that, even in this case, the model extracted goals during the first two noisy tasks (Figure 6D–F), thus demonstrating the robustness of the goal-coding principle.
Redundancy of Goal Coding
We then verified the role of the concentration parameter α, which controls the probability that a newly experienced context evokes a new latent state (a new “table” in the Chinese Restaurant) rather than joining one of the popular latent states. Control simulations with various levels of this parameter (α = 1, 2, 5, 10) revealed that the performance, sensitively measured with the behavioral Weber fraction w, was essentially unaffected by α. At the same time, the number of exploited latent states increased along with α, as expected. However, for a given level of α, the same (popular) latent states were used in all three tasks. Together with a contrastive preference analysis of these latent states like that in Figure 3, this corroborated the finding in the main simulation that, once a specific (goal-coding) role of a latent state is established, it remains invariant in the successfully learned tasks, which in turn is critical for transfer learning (see the following). Finally, using the information theory analysis introduced earlier, we also found that the goal-coding specificity was essentially the same as in the main simulation. Thus, the concentration parameter α essentially controls the redundancy of goal representation, but it does not change the goal-coding principle.
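The effect of α on the number of latent states can be illustrated with the standard Chinese Restaurant Process prior, in which the (n + 1)-th context opens a new table with probability α / (n + α) and otherwise joins an existing table in proportion to its occupancy. The sketch below samples from this prior only; it omits the reward-driven component of the model, and the function name, number of contexts, and seed are illustrative assumptions.

```python
import random


def crp_table_counts(n_customers, alpha, rng):
    """Seat n_customers under a Chinese Restaurant Process prior.

    Returns a list whose k-th entry is the occupancy of table k.
    """
    counts = []
    for n in range(n_customers):
        # New table with probability alpha / (n + alpha).
        if rng.random() < alpha / (n + alpha):
            counts.append(1)
            continue
        # Otherwise join table k with probability proportional to counts[k].
        r = rng.random() * n
        cumulative = 0
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                counts[k] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point edge cases
    return counts


def mean_tables(alpha, n_customers=288, n_runs=20, seed=0):
    """Average number of occupied tables over several prior samples."""
    rng = random.Random(seed)
    runs = [len(crp_table_counts(n_customers, alpha, rng)) for _ in range(n_runs)]
    return sum(runs) / n_runs


for alpha in (1, 2, 5, 10):
    print(alpha, mean_tables(alpha))
```

Under the prior alone, the number of occupied tables grows with α, which matches the increase in exploited latent states reported above; the invariance of performance across α depends on the value-driven learning that this sketch leaves out.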
Goal-coding Selection Criterion
We further verified the impact of the specific selection criteria on our novel neural-level analysis. Putting a threshold on entropy was necessary to ensure that the analysis focused only on units having a reasonable amount of response variability. Halving the threshold, from 0.80 to 0.40, left unchanged the pool of goal-coding PFC neurons and extended the pool of latent states by only five units, obtaining essentially invariant results. We expected a more significant impact of the mutual-information selection criterion that specifically sought goal-coding units. Indeed, doubling it, from 0.10 to 0.20, restricted the selected pool to only n = 9 PFC neurons and slightly decreased the number of latent states. As expected, these very specific goal-coding units showed a more clearly dichotomous distribution of contrastive goal preferences. Conversely, halving the mutual information criterion, from 0.10 to 0.05, extended the pool to n = 53 PFC neurons and slightly increased the number of latent states. As expected, this less specific pool had a slightly less clearly dichotomous distribution of contrastive goal preferences.
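A minimal sketch of a two-threshold selection of this kind is given below. It assumes that each unit's response has been discretized per trial, that entropy and mutual information are measured in bits, and that a unit is retained when both quantities reach their thresholds; the discretization, the units, and the function names are assumptions for illustration and may differ from the analysis used in the study.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def is_goal_coding(responses, goals, min_entropy=0.80, min_mi=0.10):
    """Keep a unit only if it is variable enough and informative about goals."""
    return (entropy(responses) >= min_entropy
            and mutual_information(responses, goals) >= min_mi)


# Toy data: one unit that follows the goal and one that ignores it.
goals = ["red", "red", "blue", "blue"]
print(is_goal_coding([1, 1, 0, 0], goals))
print(is_goal_coding([1, 0, 1, 0], goals))
```

Raising `min_mi` shrinks the selected pool toward the most goal-specific units, and lowering it admits less specific ones, which is the qualitative pattern described above.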
Effects of Block Design versus Interleaved Design
The main simulation presented the stimuli in a block design, but we hypothesized that the same goal-based control mechanism would emerge without task blocking as well. To test this prediction, we conducted a control simulation in which the tasks were learned using an interleaved design by pseudorandomly selecting the task type in each trial; everything else was kept unchanged. The result confirmed the prediction. Relative to the main simulation, magnitude discriminability only slightly increased (duration: w = 0.54, SE = 0.015; distance: w = 0.50, SE = 0.012), but goal coding and the associated control strategy were consistently found (Figure 6G–I).
Transfer Learning and the Differences between Standard (Flat) Reinforcement Learning and Our Proposed (Structured) Method
Classical reinforcement learning methods (e.g., temporal difference learning) are sufficient to learn the correct policy in our tasks, provided that each pair of context-target stimuli is observed several times. However, we hypothesized that our proposed (structured) method based on a nonparametric component, which essentially extracts goal-to-response mappings, would be advantageous when learning novel tasks that share similarities with the already-acquired ones (i.e., transfer learning), pointing to a specific adaptive value of structured models and goal coding in PFC.
To verify this prediction, we ran a control simulation using a flat generative probabilistic model of reward P(rt|st, at, ct), with the same binomial distributions as in the main model. As hypothesized, the flat model successfully learned the tasks. However, the flat model showed a slow learning process that had the same trend for each new task, with poor or no generalization. This is in contrast with the learning trend of the structured model, which instead reuses its knowledge to learn each novel task faster (Figure 7). Thus, a key advantage of the nonparametric component is the predisposition to build and reuse already-acquired goal contingencies across different domains and in novel situations, which is consistent with a role of PFC in supporting one-shot learning and providing behavioral flexibility without catastrophic forgetting (Koechlin & Hyafil, 2007; Shima, Isoda, Mushiake, & Tanji, 2007; Doya, 1999).
Unsupervised Clustering and the Role of Supervised Category Learning
To show the critical role of categorization driven by the reinforcement signal, we performed another control simulation in which the latent units were initialized according to the Dirichlet mixture process (as in the hierarchical model used in the main simulation) but not further trained to account for the conditions leading to reward (i.e., not applying Equation 2). The procedure, parameters, number of replicas, and learning schedule were the same as in the main simulation. We expected that the prior CRP would randomly associate the input context stimuli with a limited number of active latent variables, providing no useful internal representation of the context and thus producing low performance. Indeed, as shown in Figure 7, this model was not able to learn the tasks and responded at chance level, further emphasizing the importance of value-driven learning in shaping a behaviorally relevant categorization process.
Scalability of the Computational Learning Approach
To assess the scalability of the proposed method to more challenging experimental conditions, we generalized the setup by increasing the number of colors and available responses. The identities of Context stimuli S1 and S2 were each randomly drawn among k colors, and the target stimulus was a random permutation of all available identities (colors). As in the main simulation, the action consisted in selecting the identity of the stimulus that lasted longer, was more distant, or matched the context stimuli, but this time, the stimulus was located in one of k possible positions among all other identities. Note that the complexity of the task increases dramatically along with k. For k = 3, there are 648 context stimuli and six target displays, whereas for k = 4, there are 1152 context stimuli and 24 target displays.
The model also successfully learned these two problems, after a higher number of learning trials reflecting the increasing complexity (k = 3, 20,000 trials; k = 4, 50,000 trials). As in the main simulation, the reinforcement learning procedure used a goal-coding strategy to solve the tasks, and the goal-response mappings created in the first phase greatly simplified the learning of the second and third tasks (Figure 8), further supporting the generality of the proposed approach.
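The context and target counts quoted for k = 3 and k = 4 are consistent with the 72-level overall magnitude index described in the appendix multiplied by k × k color combinations of S1 and S2, and with k! orderings of the k target colors. The sketch below checks this arithmetic; the decomposition is our reading of the reported numbers rather than a formula stated in the text, and the function name is illustrative.

```python
from itertools import permutations
from math import factorial

# 36 duration levels + 36 distance levels, as in the context coding
# described in the appendix.
MAGNITUDE_LEVELS = 72


def task_size(k):
    """Return (number of context stimuli, number of target displays)."""
    n_contexts = MAGNITUDE_LEVELS * k * k  # magnitude index x color pairs
    n_targets = factorial(k)               # orderings of k target colors
    return n_contexts, n_targets


for k in (2, 3, 4):
    print(k, task_size(k))

# The target displays for k = 3 can also be enumerated explicitly.
displays = list(permutations(("red", "blue", "green")))
print(len(displays))
```

For k = 2 the same count gives the 288 context levels and two target displays of the main simulation.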
PFC lies at the apex of the brain control hierarchy (Fuster, 1997) and is uniquely positioned to integrate context, reward, and control-related information and to learn their (noisy) contingencies. This gives PFC great flexibility in supporting goal-directed behavior but also implies that it has to solve complex, multidimensional learning and selection processes. From a statistical viewpoint, this problem can be finessed by learning a hierarchical generative model that links stimuli, actions, and rewards to internal (“hidden” or “latent”) states or categories, which need to be inferred, too (Friston, 2010). In keeping with this idea, in tasks requiring subjects to learn a large set of actions, PFC was found to develop abstractions or categories of actions that guide behavior (Shima et al., 2007); however, other kinds of categories have been reported in PFC such as object categories (Freedman et al., 2001), leaving open the question of what kind of categories better supports goal-directed behavior. Here, we tested the idea that goal coding constitutes a solution to the problem faced by PFC: learning abstract categories that are useful to steer goal-directed action and cognitive control. We hypothesized that goal representations (or goal codes), here prospective representations of the to-be-selected target stimuli, emerge in PFC as “latent states” (or categories) of a generative model that clusters relevant statistical properties of stimuli and value information, and that these latent states subsequently bias response selection toward goal-relevant outcomes.
To test the hypothesis, we used a probabilistic generative model that combines unsupervised nonparametric learning (for latent state learning and categorization) and reinforcement learning (to guide the categorization toward task-relevant abstractions). Nonparametric Bayesian networks have a flexible structure that allows them to learn rich internal representations of complex data (Ghahramani, 2013). Previously, approximate nonparametric learning successfully developed categories (Sanborn et al., 2010), and nonparametric value-driven learning was used to build task sets (Collins & Frank, 2013; Collins & Koechlin, 2012), supporting the viability of the method.
The test-bed for our simulation was a series of studies reported in Genovesio et al. (2012), where the experimenters collected monkeys' dorsolateral and caudal PFC single-cell data during the postlearning period and reported goal-coding cells common to all three tasks. The results of our computational simulations and information analyses successfully replicated these data, and beyond that, they showed that goals emerge as latent task dimensions that encode behaviorally relevant task regularities and stimuli properties, thus offering a normative explanation for the domain-general representation of prospective goals found in the monkey PFC (Genovesio et al., 2012; Yamagata et al., 2012).
Furthermore, although the CRP can produce a very high number of latent states and usually selects (or “populates”) a number of them that grows as a logarithmic function of the number of experienced context stimuli (Gershman & Blei, 2012), we found that our model, which uses reinforcement signals in combination with the CRP, consistently uses very few of them. This result is consistent with the idea that, although sets of broadly tuned PFC neurons might provide a “basis” or “repertoire” to execute a variety of cognitive control tasks, each specific task might critically depend on a smaller set of cells that have highly selective and task-specific (e.g., goal-related) properties. Future studies looking at the dynamics of PFC representations during learning might permit testing this hypothesis and studying whether the learning process benefits from the putatively “critical” properties being already present in a PFC “repertoire” (transfer learning) and/or from adaptive coding: the ability of PFC neurons to flexibly adapt their properties to convey task-relevant information (Duncan, 2001).
The close matching of the model and the data at both behavioral and neural levels, and the results of our control simulations in more challenging experimental conditions, support our hypothesis that PFC goal coding might be a fundamental organizing principle for efficient flexible control. Furthermore, our novel analyses based on information theory measures corroborate the goal-coding hypothesis by ruling out the possibility that the neuronal coding of goals was confounded by other task-related features.
Key to our results is the combination of two forms of learning: An unsupervised category learning process extracts relevant behavioral abstractions from the input data, but the selected categories are sculpted by a value-driven process according to their adaptive value (see Equation 2). Both forms of learning have been extensively reported in PFC (Frank & Badre, 2012), but they are typically studied in isolation. Our proposal thus brings an integrative perspective that reconciles two influential streams of research on prefrontal function that focus on behavioral control (Passingham & Wise, 2012; Miller & Cohen, 2001) and category learning (Seger & Miller, 2010), respectively.
Furthermore, in contrast to most (model-free) reinforcement learning models of cognitive control, which use direct stimulus–response mappings and in which goals are implicitly encoded in a value function of states and actions (O'Reilly, Herd, & Pauli, 2010; Botvinick, Niv, & Barto, 2009; Dayan, 2009; Sutton & Barto, 1998), in our method, goals are explicitly coded. Explicit goal representations are a characteristic feature of most model-based probabilistic architectures for goal-directed behavior, such as planning-as-inference (Pezzulo, Rigoli, & Chersi, 2013; Pezzulo, 2012; Solway & Botvinick, 2012) and active inference (Clark, 2013; Friston, 2010), where they have a key role in guiding action selection and control (Lepora & Pezzulo, 2015; Pezzulo, van der Meer, Lansink, & Pennartz, 2014; Pezzulo, Verschure, Balkenius, & Pennartz, 2014; Verschure, Pennartz, & Pezzulo, 2014; Pezzulo & Castelfranchi, 2009). Our model complements these proposals by offering a mechanistic explanation of how their required goal representations might be learned in the first place. Furthermore, our results suggest that encoding the prospective goal might simplify cognitive control tasks by splitting them into two distinct phases, goal identification and target selection, and by carrying over only limited information (the target identity) from the former to the latter. In this perspective, an advance representation of the identity of the to-be-selected target stimulus is an efficient way to encode context information in cognitive control tasks, which couples accuracy (it permits the model, or the monkey, to respond adequately when the target appears) and parsimony (only the identity of the target stimulus needs to be remembered, not its other features such as its magnitude).
The results of our study parallel a body of evidence in human neuroscience that shows the relevance of nonparametric methods to understand human learning and cognitive control (Donoso et al., 2014; Collins & Frank, 2013; Collins & Koechlin, 2012). Reassuringly, all these complementary research streams show that the same set of computational methods can apply to a variety of data obtained using different techniques, single-cell neurophysiological responses in monkeys and human fMRI or EEG data. The convergence of results in these computational studies suggests that some of the benefits of the nonparametric model, such as its usefulness for transfer learning, might have general application as they have been reported in previous simulations (Collins & Frank, 2013) and confirmed here in a very different setup.
Our study also points to the importance of using appropriate state or task representations for solving cognitive control tasks. The Wilson, Takahashi, Schoenbaum, and Niv (2014) study established a role for the OFC in state representations but did not address the problem of how to learn them. Here, instead, we discuss the (nonparametric) computational mechanisms that permit learning state representations that encode prospective goals and mapping these mechanisms to single-cell properties of the monkey PFC.
We verified the robustness, generality, and scalability of the obtained results using various control learning simulations. First, we showed that the result is robust with respect to the level of perceptual noise. To this aim, we applied the same behavioral protocol of the main simulation, but we introduced either half or double perceptual noise. We found that accuracy correspondingly increased or decreased, as expected, but the same goal-coding principle emerged (Figure 6A–F). We then assessed whether the goal coding we found was specific to the block-design task presentation or if it also emerged using a more ecologically valid design in which multiple tasks are interleaved. To this aim, we applied a behavioral paradigm in which the tasks were presented in an interleaved design, that is, they were pseudorandomly selected across all learning trials. The results showed that the same goal-coding strategy emerged and guided the behavior, demonstrating the generality of the approach (Figure 6G–I). Finally, we verified whether the goal-coding principle would scale beyond simple dichotomous choices. We designed a more challenging task with multiple possible goals in which the identity (i.e., color) of Context stimuli S1 and S2 was pseudorandomly selected among k colors, and all the k colors were presented in random order as target stimuli. Simulations with three and four target colors resulted in successful learning, and the analyses revealed that, also in these more challenging situations, the behavior was guided by emergent goals (Figure 8).
Overall, the control simulations indicate that the goal-coding principle extends beyond the specific conditions of our reference monkey neurophysiological study (Genovesio et al., 2012) and applies to various more challenging conditions, demonstrating the scalability of the nonparametric value learning approach to situations that include contingencies between stimuli, actions, and rewards, even when those are very noisy and presented in variable order. For example, the large-noise control simulations suggest how infants whose perceptual system is not fully developed could nevertheless robustly extract implicit goals despite very noisy internal stimuli representations (e.g., Feigenson, 2011). More generally, the control simulations correspond to novel empirical predictions that remain to be tested by future research.
The current model also has some limitations, and in particular, it eschews the full complexity of PFC responses in cognitive control tasks. For example, Genovesio, Tsujimoto, Navarra, Falcone, and Wise (2014) report that, in one of the tasks studied here, PFC neurons carry information that is not related to the current trial (e.g., information about past goal and outcomes). This information was irrelevant to the task (the monkeys were not required to maintain it in memory to perform correctly), and future studies are needed to assess whether it is used for action selection or other functions such as monitoring. Nevertheless, this evidence raises the intriguing possibility that the carryover of information from one trial to another might be used to learn the long-term statistics of the task or its trial-by-trial structure, which would require an extension of the current model.
The novel analytical–computational approach adopted in this study is of general interest because it applies to a variety of neurophysiological studies. The nonparametric Bayesian approach we used (Gershman & Blei, 2012) affords efficient approximate learning of complex nonlinear latent features within a probabilistic generative framework (Sanborn et al., 2010), providing an excellent vehicle for neural-level analysis like connectionist neurocomputational modeling (Stoianov & Zorzi, 2012). We framed the proposed learning procedure at a high, so-called “computational” level of analysis. However, plausible biological implementations have been proposed for the belief propagation methods that we used for the inference (Legenstein & Maass, 2014; Doya et al., 2007), along with approximate inference methods that permit addressing larger state spaces (Friston, 2010). The overall nonparametric approach has a viable biological implementation, too, and points to hierarchical statistical learning in prefrontal hierarchies (Frank & Badre, 2012; Friston, 2008; Koechlin & Summerfield, 2007) shaped by reinforcement-related signals through prefrontal–(ventral) BG loops (O'Reilly & Frank, 2006). Collins and Frank (2013) showed that the nonparametric Dirichlet process used here can be neurally implemented and has good quantitative fits to the behavior produced by a neural network (in which the sparseness of the connectivity matrix from contexts to PFC was linked to the alpha clustering parameter), although the mapping is not exact. Exploring the detailed biological mechanisms underlying the proposed nonparametric model is an open objective for future research.
Summary and Conclusion
We report a computational study suggesting that goal coding at the single-cell level represents an efficient solution to cognitive control problems: It permits selecting among the available actions based on the current task and goal contingencies (Passingham & Wise, 2012; Miller & Cohen, 2001), has low memory requirements, and permits faster learning of novel tasks (transfer learning) by aggregating novel unseen contexts to context categories learned in previous tasks and thus reusing existing sensory-motor strategies (Figures 2A and 6). Goal coding might be a fundamental organizing principle of PFC (Passingham & Wise, 2012; Koechlin et al., 2003), analogous to efficient coding principles in other (e.g., visual) brain areas, and one of its neural signatures might be the modulation of PFC tuning profiles depending on task-relevant rules (Stokes et al., 2013).
This study was supported by the European Commission's Seventh Framework Programme (grant no. FP7 270108 to G. P. and PIEF-GA-2013-622882 to I. S.) and the Human Frontier Science Program (grant no. RGY0088/2014 to G. P.). The GEFORCE Titan used for this research was donated by the NVIDIA Corporation. G. P. and I. S. conceived the experiments. A. G., G. P., and I. S. discussed the results and wrote the paper. I. S. wrote and ran the code and analyzed data.
Reprint requests should be sent to Giovanni Pezzulo, Institute of Cognitive Sciences and Technologies, National Research Council, Via S. Martino della Battaglia 44, 00185 Rome, Italy, or via e-mail: firstname.lastname@example.org.
To obtain the context variable, the magnitude properties and the color properties of S1 and S2 were combined using mixed multiplicative–additive coding. First, for each magnitude property (duration and distance), the magnitudes of S1 and S2 were multiplicatively combined in an index with 6 × 6 = 36 levels. Second, the two (duration and distance coding) indexes were additively combined in an overall magnitude index with 36 + 36 = 72 levels; the rationale for the additive combination is that, in our tasks, only one of duration or distance varied. Third, the color properties of S1 and S2 were multiplicatively combined in an index with 2 × 2 = 4 levels. Finally, the overall representation of the context input was built by multiplicatively combining the magnitude and the color properties in an index with 72 × 4 = 288 levels.
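The mixed multiplicative–additive coding described above can be sketched as a mapping from stimulus properties to a single integer index. The zero-based level numbering, the ordering of the factors inside the index, and all names below are illustrative assumptions; only the level counts (36, 72, 4, and 288) come from the description.

```python
from itertools import product

N_MAG = 6  # magnitude levels per stimulus
N_COL = 2  # colors per stimulus
# Additive combination: duration codes occupy 0..35, distance codes 36..71.
OFFSET = {"duration": 0, "distance": N_MAG * N_MAG}


def context_index(varying, mag1, mag2, col1, col2):
    """Map S1/S2 properties to one context index in range(288).

    varying: which magnitude property varies ("duration" or "distance").
    mag1, mag2: zero-based magnitude levels of S1 and S2 (0..5).
    col1, col2: zero-based color codes of S1 and S2 (0..1).
    """
    pair = mag1 * N_MAG + mag2          # multiplicative: 6 x 6 = 36 levels
    magnitude = OFFSET[varying] + pair  # additive: 36 + 36 = 72 levels
    color = col1 * N_COL + col2         # multiplicative: 2 x 2 = 4 levels
    return magnitude * (N_COL * N_COL) + color  # 72 x 4 = 288 levels


def all_indices():
    """Enumerate the index of every possible context."""
    return [
        context_index(v, m1, m2, c1, c2)
        for v in OFFSET
        for m1, m2 in product(range(N_MAG), repeat=2)
        for c1, c2 in product(range(N_COL), repeat=2)
    ]


print(len(set(all_indices())))
```

Each combination of properties receives a distinct index, so the context variable carries no built-in notion of similarity between contexts; any grouping into goals has to be learned.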