## Abstract

Primate vision is characterized by constant, sequential processing and selection of visual targets to fixate. Although expected reward is known to influence both processing and selection of visual targets, similarities and differences between these effects remain unclear mainly because they have been measured in separate tasks. Using a novel paradigm, we simultaneously measured the effects of reward outcomes and expected reward on target selection and sensitivity to visual motion in monkeys. Monkeys freely chose between two visual targets and received a juice reward with varying probability for eye movements made to either of them. Targets were stationary apertures of drifting gratings, causing the end points of eye movements to these targets to be systematically biased in the direction of motion. We used this motion-induced bias as a measure of sensitivity to visual motion on each trial. We then performed different analyses to explore effects of objective and subjective reward values on choice and sensitivity to visual motion to find similarities and differences between reward effects on these two processes. Specifically, we used different reinforcement learning models to fit choice behavior and estimate subjective reward values based on the integration of reward outcomes over multiple trials. Moreover, to compare the effects of subjective reward value on choice and sensitivity to motion directly, we considered correlations between each of these variables and integrated reward outcomes on a wide range of timescales. We found that, in addition to choice, sensitivity to visual motion was also influenced by subjective reward value, although the motion was irrelevant for receiving reward. Unlike choice, however, sensitivity to visual motion was not affected by objective measures of reward value. Moreover, choice was determined by the difference in subjective reward values of the two options, whereas sensitivity to motion was influenced by the sum of values. 
Finally, models that best predicted visual processing and choice used sets of estimated reward values based on different types of reward integration and timescales. Together, our results demonstrate separable influences of reward on visual processing and choice, and point to the presence of multiple brain circuits for the integration of reward outcomes.

## INTRODUCTION

Primates make approximately three to four saccadic eye movements each second, and thus, the choice of where to fixate next is our most frequently made decision. The next fixation location is determined not only by visual salience (Itti & Koch, 2000) but also by internal goals and reward expected from the foveated target (Schütz, Trommershäuser, & Gegenfurtner, 2012; Markowitz, Shewcraft, Wong, & Pesaran, 2011; Navalpakkam, Koch, Rangel, & Perona, 2010). Brain structures known to be involved in the control of saccadic eye movement have been extensively studied as a means of understanding the neural basis of decision-making (Sugrue, Corrado, & Newsome, 2005; Glimcher, 2003). Interestingly, the same structures also appear to contribute to the selective processing of targeted visual stimuli that tends to accompany saccades (Squire, Noudoost, Schafer, & Moore, 2013). Thus, it is conceivable that reward outcomes and expected reward (i.e., subjective reward value) control saccadic choice and processing of targeted visual stimuli via similar mechanisms.

Our current knowledge of how reward outcomes and subjective reward value influence the processing of visual information and saccadic choice comes from separate studies using different experimental paradigms. For instance, the effects of reward on saccadic choice are studied using tasks that involve probabilistic reward outcomes (Farashahi, Azab, Hayden, & Soltani, 2018; Chen & Stuphorn, 2015; Strait, Blanchard, & Hayden, 2014; Liston & Stone, 2008; Platt & Glimcher, 1999) as well as tasks with dynamic reward schedules (Costa, Dal Monte, Lucas, Murray, & Averbeck, 2016; Donahue & Lee, 2015; Schütz et al., 2012; Lau & Glimcher, 2007; Barraclough, Conroy, & Lee, 2004; Sugrue, Corrado, & Newsome, 2004), both of which require estimation of subjective reward value. In contrast, the effects of reward on the processing of visual information have been mainly examined using tasks involving unequal expected reward outcomes without considering the subjective valuation of reward outcomes (Rakhshan et al., 2020; Barbaro, Peelen, & Hickey, 2017; Hickey & Peelen, 2017; Anderson, 2016; Hickey, Chelazzi, & Theeuwes, 2010, 2014; Anderson, Laurent, & Yantis, 2011a, 2011b; Della Libera & Chelazzi, 2006, 2009; Peck, Jangraw, Suzuki, Efem, & Gottlieb, 2009). More importantly, none of the previous studies has explored the effects of reward on choice and processing of visual information simultaneously. As a result, the relationship between these effects is currently unknown.

Understanding this relationship is important because the extent to which reward influences sensory processing could impact decision-making independently of the direct effects of reward on choice. For example, in controlled decision-making paradigms or natural foraging settings, recent harvest of reward after saccade or visits to certain parts of the visual field or space could enhance processing of features of the targets that appear in those parts of space, ultimately biasing choice behavior. Such an influence of reward on sensory processing could have strong effects on choice behavior during tasks with dynamic reward schedules that require flexible integration of reward outcomes over time (Bari et al., 2019; Farashahi, Donahue, et al., 2017; Farashahi, Rowe, Aslami, Lee, & Soltani, 2017; Donahue & Lee, 2015; Soltani & Wang, 2006, 2008; Lau & Glimcher, 2007; Sugrue et al., 2004). In addition to better understanding choice behavior, elucidating the relationship between sensory and reward processing can also be used to disambiguate neural mechanisms underlying attention and reward (Maunsell, 2004, 2015; Hikosaka, 2007) and how deficits in deployment of selective attention, which is characterized by changes in sensory processing, are affected by abnormalities in reward circuits (Volkow et al., 2009).

Here, we used a novel experimental paradigm with a dynamic reward schedule to simultaneously measure the influences of reward on choice between available targets and processing of visual information of these targets. We exploited the influence of visual motion on the trajectory of saccadic eye movements (Schafer & Moore, 2007), motion-induced bias (MIB), to quantify sensitivity to visual motion as a behavioral readout of visual processing in a criterion-free manner. Using this measure in the context of a saccadic free-choice task in monkeys allowed us to simultaneously estimate how reward feedback is integrated to determine both visual processing and decision-making on a trial-by-trial basis. We then used different approaches to compare the effects of objective reward value (i.e., total harvested reward, and more vs. less rewarding target based on task parameters) and subjective reward value (i.e., estimated reward values of the two targets using choice data) on decision-making and visual processing. To estimate subjective reward values on each trial, we fit choice behavior using multiple reinforcement learning (RL) models to examine how animals integrated reward outcomes over time and to determine choice. On the basis of the literature on reward learning, the difference in subjective values should drive choice behavior. The MIB could be independent of subjective reward value, or it could depend on subjective values similarly to or differently than choice. To test these alternative possibilities, we then used correlation between the MIB and estimated subjective values based on different integrations of reward feedback and on different timescales to examine similarities and differences between the effects of subjective reward value on choice and visual processing.

We found that both choice and sensitivity to visual motion were affected by reward although visual motion was irrelevant for obtaining reward in our experiment. However, there were separable influences of reward on these two processes. First, choice was modulated by both objective and subjective reward values, whereas sensitivity to visual motion was mainly influenced by subjective reward value. Second, choice was most strongly correlated with the difference in subjective values of the chosen and unchosen targets, whereas sensitivity to visual motion was most strongly correlated with the sum of subjective values. Finally, choice and sensitivity to visual motion were best predicted based on different types of reward integration and integration on different timescales.

## METHODS

### Subjects

Two male monkeys (Macaca mulatta) weighing 6 kg (Monkey 1) and 11 kg (Monkey 2) were used as subjects in the experiment. The two monkeys completed 160 experimental sessions (74 and 86 sessions for Monkeys 1 and 2, respectively) on separate days in the free-choice task for a total of 42,180 trials (10,096 and 32,084 trials for Monkeys 1 and 2, respectively). Each session consisted of approximately 140 and 370 trials for Monkeys 1 and 2, respectively. All surgical and behavioral procedures were approved by the Stanford University Administrative Panel on Laboratory Animal Care and the consultant veterinarian and were in accordance with National Institutes of Health and Society for Neuroscience guidelines.

### Visual Stimuli

Saccade targets were drifting sinusoidal gratings within stationary, 5°–8° Gaussian apertures. Gratings had a spatial frequency of 0.5 cycle/degree and Michelson contrast between 2% and 8%. Target parameters and locations were held constant during an experimental session. Drift speed was 5°/sec in a direction perpendicular to the saccade required to acquire the target. Targets were identical on each trial with the exception of drift direction, which was selected randomly and independently for each target.

After acquiring fixation on a central fixation spot, the monkey waited for a variable delay (200–600 msec) before the fixation spot disappeared and two targets appeared on the screen simultaneously (Figure 1A). Targets appeared equidistant from the fixation spot and diametrically opposite one another. The monkeys had to make a saccadic eye movement to one of the two targets to select that target and obtain a possible reward allocated to it (see Reward Schedule section). Both targets disappeared at the start of the eye movement. If the saccadic eye movement shifted the monkey's gaze to within a 5–8°-diameter error window around the target within 400 msec of target appearance, the monkeys received a juice reward according to the variable reward schedule described below.

Figure 1.

The free-choice task and reward schedule example. (A) Task design. On each trial, a fixation point appeared on the screen, followed by the presentation of two drifting–grating targets. The monkeys indicated their selection with a saccade. Targets disappeared at the onset of the saccade. A juice reward was delivered on a variable schedule after the saccade. Event plots indicate the sequence of presentation of the visual targets; dashed lines denote variable time intervals. Horizontal eye position traces are from a subset of trials of an example experiment and show selection saccades to both the left target (TL, downward deflecting traces) and the right target (TR, upward deflecting traces). (B) Examples of reward probability as a function of the percentage of left choices, fL, separately for the left and right targets (prL[fL, r, x] and prR[fL, r, x]) for different values of reward parameter r and penalty parameter x (see Equation 1). (C) Plotted is the reward harvest rate on each target as a function of the percentage of TL selections, fL, for r = 80 and x = 0. (D) Total reward harvest rate as a function of reward parameter r and fL for x = 0. The gray dashed line shows fL = r corresponding to matching behavior. The black dashed line indicates the percentage of TL selections that results in the optimal reward rate. Slight undermatching corresponds to optimal choice behavior in this task. rew. = reward.

### Quantifying the MIB

Eye position was monitored using the scleral search coil method (Judge, Richmond, & Chu, 1980; Fuchs & Robinson, 1966) and digitized at 500 Hz. Saccades were detected using previously described methods (Schafer & Moore, 2007). Directions of drifting gratings were perpendicular to the saccade required to choose the targets. Saccades directed to drifting–grating targets are displaced in the direction of visual motion, an effect previously referred to as the MIB (Schafer & Moore, 2007). The MIB for each trial was measured as the angular deviation of the saccade vector in the direction of the chosen target's drift, with respect to the mean saccade vector from all selections of that target within the session. This method of measuring deviation would yield approximately the same results as vertical displacement because the locations of targets were held constant throughout the session and angles were small, making angles a good approximation for the tangent of angles times the horizontal distance of the targets (vertical displacement). To compare MIB values across sessions with different target contrasts and locations, we used z score values of the MIB in each session to avoid confounds because of systematic biases.
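The MIB computation described above can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names, the representation of saccades as `(dx, dy)` vectors, and the `drift_signs` convention (+1 when the drift points counterclockwise relative to the mean saccade direction, -1 otherwise) are assumptions made here. The use of the population standard deviation for the z score is also an assumption, because the text does not specify it.

```python
import math
from statistics import mean, pstdev

def motion_induced_bias(saccades, drift_signs):
    """Signed angular deviation (degrees) of each saccade toward the drift direction.

    saccades:    list of (dx, dy) saccade vectors to ONE target within one session
    drift_signs: +1 if the drift is counterclockwise relative to the mean saccade
                 direction, -1 if clockwise (hypothetical convention)
    """
    # Mean saccade vector from all selections of this target in the session.
    mean_dx = mean(dx for dx, _ in saccades)
    mean_dy = mean(dy for _, dy in saccades)
    mean_angle = math.atan2(mean_dy, mean_dx)
    mib = []
    for (dx, dy), sign in zip(saccades, drift_signs):
        dev = math.atan2(dy, dx) - mean_angle
        dev = math.atan2(math.sin(dev), math.cos(dev))  # wrap into (-pi, pi]
        mib.append(sign * math.degrees(dev))            # positive = toward the drift
    return mib

def zscore(values):
    """z score within a session (population SD; undefined if all values are equal)."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]
```

In this sketch, positive values indicate end points displaced in the direction of grating drift, and z scoring would be applied to the per-trial values within each session.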

### Reward Schedule

For each correct saccade, the monkey could receive a juice reward with a probability determined by a dynamic reward schedule based on the location of the foveated target (Abe & Takeuchi, 1993). More specifically, the probability of reward given a selection of the left (TL) or right (TR) target, prL and prR, was equal to
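$\begin{aligned} p_{rL}(f_L, r, x) &= \frac{1}{1+\exp\left(-\frac{-f_L + r + 10}{s}\right)} - x \\ p_{rR}(f_L, r, x) &= \frac{1}{1+\exp\left(-\frac{+f_L - r + 10}{s}\right)} - x \end{aligned}$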
(1)
where fL is the local fraction (in percentage) of TL selections estimated using the previous 20 trials, r (reward parameter) is a task parameter that was fixed on a given session of the experiment and determined which option was globally more valuable (TL for r > 50 and TR for r < 50), s is another task parameter that determines the extent to which the deviation from matching (corresponding to fL = r) results in a decrease in reward probability and was set to 7 in all experimental sessions, and x is a penalty parameter that reduced the global probability of a reward. Positive values of x decreased reward probability on saccades to both left and right targets to further motivate monkeys to identify and choose the more rewarding location at the time. x was kept constant throughout a session and was assigned to one of the following values on a fraction of sessions (reported in the parentheses in percentage): 0 (77%), 0.15 (6%), 0.30 (6%), or 0.40 (11%). Although the introduction of penalty decreased the reward probability and rate on both targets, it did not change the local choice fraction (fL) at which the optimal reward rate or matching could be achieved. Because of the penalty parameter and the structure of the reward schedule, prL(fL, r, x) and prR(fL, r, x) are not necessarily complementary. Finally, to ensure that the reward probabilities would not have negative values, any negative reward probability (based on Equation 1) is replaced with 0.

On the basis of the above equations, the reward probabilities on saccades to the left and right targets are equal at fL = r, corresponding to matching behavior, which is slightly suboptimal in this task. As shown in Figure 1C and D, an optimal reward rate is obtained via slight undermatching. As the value of s approaches zero, matching and optimal behaviors become closer to each other.
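As an illustration of the reward schedule in Equation 1, the following sketch computes the two reward probabilities from the local choice fraction. The function names are hypothetical, and the handling of the first trials of a session (fewer than 20 preceding choices) is an assumption, because the text does not describe it.

```python
import math

def reward_probabilities(f_left, r, x=0.0, s=7.0):
    """Reward probabilities on the left and right targets (Equation 1).

    f_left: local percentage (0-100) of left choices over the previous 20 trials
    r:      reward parameter in percent; x: penalty; s: steepness (7 in all sessions)
    """
    p_left = 1.0 / (1.0 + math.exp(-(-f_left + r + 10.0) / s)) - x
    p_right = 1.0 / (1.0 + math.exp(-(f_left - r + 10.0) / s)) - x
    # Negative probabilities are replaced with 0, as stated in the text.
    return max(p_left, 0.0), max(p_right, 0.0)

def local_fraction_left(choices, window=20):
    """Percentage of left choices ("L") among the last `window` choices.

    Assumption: early in a session, all available choices are used.
    """
    recent = choices[-window:]
    return 100.0 * sum(c == "L" for c in recent) / len(recent)
```

For example, `reward_probabilities(80, 80)` returns two equal probabilities, reflecting that the reward probabilities on the two targets coincide at fL = r.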

### RL Models

In our experiment, reward was assigned based on target location (left vs. right), and thus, the targets' motion directions were irrelevant for obtaining reward. Nevertheless, we considered the possibility that monkeys could incorrectly assign value to motion direction. We used various RL models to fit choice behavior to determine whether monkeys attributed reward outcomes to target locations or target motions and how they integrated these outcomes over trials to estimate subjective values and guide choice behavior. Therefore, we considered RL models that estimate subjective reward values associated with target locations as well as RL models that estimate subjective reward values associated with the motion of the two targets.

In the models based on the location of the targets (location-based RLs), the left and right targets (TL and TR) were assigned subjective values VL(t) and VR(t), respectively. In the models based on motion direction of the targets (motion-based RLs), subjective values VU(t) and VD(t) were assigned to the upward and downward motion (TU and TD), respectively. For both types of models, values were updated at the end of each trial according to different learning rules described below. In addition, we assumed that the probability of selecting TL (or TU in motion-based RLs) is a sigmoid function of the difference in subjective values as follows:
$p(T_{L/U}) = \frac{1}{1+\exp\left(-\left(V_{L/U}(t) - V_{R/D}(t)\right) - b\right)}$
(2)
where b quantifies the bias in choice behavior toward the left target (or upward motion) and VL/U denotes the subjective value of the left target in the location-based RL or upward motion in the motion-based RL, respectively. Similarly, VR/D denotes the subjective value of the right target in the location-based RL or downward motion in the motion-based RL, respectively.
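A minimal sketch of the choice rule in Equation 2 is given below. The function name is hypothetical. The sign convention for b is an assumption chosen so that positive b favors the left target (or upward motion), consistent with the description of b as a bias toward that option.

```python
import math

def p_choose_left(v_left, v_right, b=0.0):
    """Probability of selecting TL (or TU) as a sigmoid of the value difference.

    Assumed sign convention: positive b increases the probability of choosing left.
    """
    return 1.0 / (1.0 + math.exp(-(v_left - v_right) - b))
```

With equal subjective values and b = 0, the function returns 0.5, so any remaining preference in the fitted model is carried by b.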

At the end of each trial, subjective reward values of one or both targets were updated depending on the choice and reward outcome on that trial. We considered different types of learning rules for how reward outcomes are integrated over trials and grouped these learning rules depending on whether they estimate a quantity similar to “return” (average reward per selection) or “income” (average reward per trial). More specifically, on each trial, the monkeys could update subjective reward value of the chosen target only, making the estimated reward values resemble local (in time) return. Alternatively, the monkeys could update subjective reward values of both the chosen and unchosen targets, making these values resemble local income. We adopted these two methods for updating subjective reward values because previous work has shown that both local return and income can be used to achieve matching behavior (Soltani & Wang, 2006; Corrado, Sugrue, Seung, & Newsome, 2005; Sugrue et al., 2004). In addition, subjective reward values for the chosen and unchosen targets could be discounted when updating these values on subsequent trials similarly or differently, and monkeys could learn differently from positive (reward) and negative (no reward) outcomes. We tested all these possibilities using four different types of RL models.

In return-based RL models (RLret), only the subjective value of the chosen target (in terms of location or motion direction) was updated. More specifically, if TL(TU) was selected and rewarded on trial t, subjective reward values were updated as the following:
$\begin{aligned} V_{L/U}(t+1) &= \alpha V_{L/U}(t) + \Delta_r \\ V_{R/D}(t+1) &= V_{R/D}(t) \end{aligned}$
(3)
where Δr quantifies the change in subjective reward value after a rewarded trial and α (0 ≤ α ≤ 1) is the discount factor measuring how much the estimated subjective reward value from the previous trial is carried to the current trial. As a result, values of α closer to 1 indicate longer lasting effects of reward or integration of reward on longer timescales, both of which indicate slower learning. In contrast, values of α closer to 0 indicate integration of reward on shorter timescales corresponding to faster learning. We note that our learning rule is not a delta rule, and because of its form, (1 − α) in our models more closely resembles learning rate in RL models based on the delta rule. If TL(TU) was selected but not rewarded, subjective reward values of the two target locations or motion directions were updated as the following:
$\begin{aligned} V_{L/U}(t+1) &= \alpha V_{L/U}(t) + \Delta_n \\ V_{R/D}(t+1) &= V_{R/D}(t) \end{aligned}$
(4)
where Δn quantifies the change in subjective reward value after a nonrewarded trial. Similar equations governed the update of subjective reward values when TR(TD) was selected. Importantly, in these models, subjective reward value of the unchosen target (in terms of location or motion) is not updated, making these models return-based.
In contrast, in all other models, subjective reward values of both chosen and unchosen targets were updated in every trial, making them income-based models. Specifically, in the RLInc(1) models, the subjective value of the unchosen target was discounted on the subsequent trial similarly to the subjective value of the chosen target. For example, when TL(TU) was selected, the subjective values were updated as follows:
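$\begin{aligned} V_{L/U}(t+1) &= \alpha V_{L/U}(t) + \Delta_r \ (\text{or } \Delta_n \text{ for no reward}) \\ V_{R/D}(t+1) &= \alpha V_{R/D}(t) \end{aligned}$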
(5)
In the RLInc(2) models, subjective values of chosen and unchosen targets were discounted on the subsequent trial differently:
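$\begin{aligned} V_{L/U}(t+1) &= \alpha_c V_{L/U}(t) + \Delta_r \ (\text{or } \Delta_n \text{ for no reward}) \\ V_{R/D}(t+1) &= \alpha_u V_{R/D}(t) \end{aligned}$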
(6)
where αc and αu are the discount factors for the chosen and unchosen targets or motion directions, respectively.
In the RLInc(3) models, we updated the subjective value of unchosen target location (or unchosen motion direction) in addition to discounting the subjective values of chosen and unchosen locations:
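$\begin{aligned} V_{L/U}(t+1) &= \alpha_c V_{L/U}(t) + \Delta_r \ (\text{or } \Delta_n \text{ for no reward}) \\ V_{R/D}(t+1) &= \alpha_u V_{R/D}(t) + \Delta_u \end{aligned}$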
(7)
Note that the motion directions of the two targets were the same in half of the trials. This makes updating of subjective value of motion directions nontrivial in trials in which the chosen and unchosen motion directions are the same (referred to as match trials). Therefore, we tested different update rules for match trials to identify the model that best describes the monkeys' choice behavior. Specifically, we tested two possibilities: (1) update the subjective value of motion direction that was presented on a given match trial only and (2) update the subjective values of both present and nonpresent motion directions but in the opposite direction. We found that the second model, in which subjective values of both motion directions were updated, provided a better fit for our data (data not shown).

Finally, we also tested hybrid RL models in which subjective values of both target locations and motion directions were updated at the end of each trial and subsequently used to make decisions. Fitting based on these hybrid models was not significantly better than those using the RL models that consider only subjective values of target locations. Therefore, the results from these hybrid models are not presented here.
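The update rules in Equations 3–7 can be summarized in a single function, sketched below for the chosen and unchosen options. The model labels (`"ret"`, `"inc1"`, `"inc2"`, `"inc3"`) and the parameter dictionary are naming conventions introduced here for illustration. The special handling of match trials in the motion-based models is not included.

```python
def update_values(v_chosen, v_unchosen, rewarded, model, params):
    """One-trial update of subjective values following Equations 3-7.

    model:  "ret", "inc1", "inc2", or "inc3" (hypothetical labels)
    params: dict with delta_r and delta_n and, depending on the model,
            alpha, alpha_c, alpha_u, delta_u
    Returns the updated (chosen, unchosen) values.
    """
    delta = params["delta_r"] if rewarded else params["delta_n"]
    if model == "ret":    # Equations 3-4: the unchosen value is left unchanged
        return params["alpha"] * v_chosen + delta, v_unchosen
    if model == "inc1":   # Equation 5: the same discount factor for both values
        return params["alpha"] * v_chosen + delta, params["alpha"] * v_unchosen
    if model == "inc2":   # Equation 6: separate discount factors
        return params["alpha_c"] * v_chosen + delta, params["alpha_u"] * v_unchosen
    if model == "inc3":   # Equation 7: the unchosen value is also shifted by delta_u
        return (params["alpha_c"] * v_chosen + delta,
                params["alpha_u"] * v_unchosen + params["delta_u"])
    raise ValueError(f"unknown model: {model}")
```

Written this way, the return-based model differs from the income-based models only in whether the unchosen value changes on a given trial.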

### Model Fitting and Comparison

We used the maximum likelihood ratio method to fit choice behavior with different RL models described above and estimated the parameters of those models. To compare the goodness-of-fit based on different models while considering the number of model parameters, we used the negative log-likelihood (−LL), Akaike information criterion (AIC), and Bayesian information criterion (BIC). AIC is defined as
$AIC = -2 \times LL + 2 \times k$
(8)
where LL is log-likelihood of the fit and k is the number of parameters in a given model. BIC is defined as
$BIC = -2 \times LL + \ln(n) \times k$
(9)
where LL is log-likelihood of the fit, k is the number of parameters in a given model, and n is the number of trials in a given session. We then used the best RL model in terms of predicting choice behavior to examine whether the MIB is also affected by subjective reward value similarly to or differently than choice (see below).
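The goodness-of-fit measures in Equations 8 and 9 can be computed as in the sketch below. The function names are hypothetical, and the small constant used to avoid taking the logarithm of zero is a numerical convenience assumed here, not part of the described method.

```python
import math

def negative_log_likelihood(p_left, choices):
    """-LL of observed choices ("L" or "R") given model probabilities of choosing left."""
    eps = 1e-12  # guards against log(0); an assumption, not from the text
    nll = 0.0
    for p, c in zip(p_left, choices):
        p = min(max(p, eps), 1.0 - eps)
        nll -= math.log(p if c == "L" else 1.0 - p)
    return nll

def aic(log_lik, k):
    """Akaike information criterion (Equation 8)."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion (Equation 9); n is the number of trials."""
    return -2.0 * log_lik + math.log(n) * k
```

Note that `aic` and `bic` take the log-likelihood itself, so a value returned by `negative_log_likelihood` must be negated before it is passed in.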

### Effects of Subjective Reward Value on MIB

To estimate subjective reward values associated with a given target location, we used two methods of reward integration corresponding to income and return. To calculate the subjective income for a given target location on a given trial, we filtered the sequences of reward outcomes on preceding trials (excluding the current trial) using an exponential filter with a given time constant τ, assigning +1 to rewarded trials and Δn to nonrewarded trials if that target location was chosen and 0 if that target location was not chosen on the trial. To calculate the subjective return of a given target location, we filtered reward sequence on preceding trials (again excluding the current trial) in which that target location was chosen using an exponential filter with a given time constant τ, assigning +1 to rewarded trials and Δn to nonrewarded trials. Finally, we calculated the correlation between the MIB and the obtained filtered values for different values of τ and Δn.
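The two filtering schemes can be sketched as follows. The function name and arguments are hypothetical. The text specifies an exponential filter with time constant τ but not its exact form, so the unnormalized weights exp(−lag/τ), with lag 1 assigned to the most recent contributing trial, are an assumption. For return, lag is counted over trials on which the target was chosen, which is also an assumption.

```python
import math

def filtered_value(choices, rewards, target, tau, delta_n, kind="income"):
    """Exponentially filtered reward history of `target` over preceding trials.

    choices: sequence of "L" or "R"; rewards: sequence of bools (same length)
    kind="income": all preceding trials count; trials on which `target` was not
                   chosen contribute 0
    kind="return": only preceding trials on which `target` was chosen count
    """
    outcomes = []
    for c, rew in zip(choices, rewards):
        if c == target:
            outcomes.append(1.0 if rew else delta_n)
        elif kind == "income":
            outcomes.append(0.0)
    # Assumed filter: unnormalized weights exp(-lag / tau), lag 1 = most recent.
    return sum(o * math.exp(-lag / tau)
               for lag, o in enumerate(reversed(outcomes), start=1))
```

The sequences passed in should contain only trials that precede the trial of interest, so that the current trial is excluded as described above.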

### Data Analysis

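To assess the overall performance of the monkeys, we compared their harvested reward with that of static and dynamic models designed to harvest maximum reward. In the static model, we assumed that selection between the two target locations in a given session was a stochastic process with a fixed probability that is optimized for a given set of parameters. Replacing fL with to-be-determined probability p(TL) in Equation 1, one can obtain the total average reward on the two targets, Rtot, as follows:
$\begin{aligned} R_{tot} &= p(T_L) \cdot p_{rL}(f_L, r, x) + p(T_R) \cdot p_{rR}(f_L, r, x) \\ &= p(T_L) \cdot \left[\frac{1}{1+\exp\left(-\frac{-p(T_L) + r + 10}{s}\right)} - x\right] + \left(1 - p(T_L)\right) \cdot \left[\frac{1}{1+\exp\left(-\frac{+p(T_L) - r + 10}{s}\right)} - x\right] \end{aligned}$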
(10)
The optimal probability, popt(TL), was then determined by maximizing Rtot:
$p_{opt}(T_L) = \underset{p(T_L)}{\arg\max}\ R_{tot}$
(11)
In the optimal dynamic model, we assumed that the decision maker has access to all the parameters of the reward schedule (r, s, x) and perfect memory of their own choices in terms of fL. Having this knowledge, the optimal decision maker could compute the probability of reward on the two options, prL(fL, r, x) and prR(fL, r, x), using Equation 1 and choose the option with the higher reward probability on every trial.
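The two benchmark models can be illustrated with the sketch below. The function names are hypothetical, and the grid search is one simple way to carry out the maximization in Equation 11, not necessarily the method used here. The fixed choice probability is expressed in percent inside the exponent and as a fraction when weighting the two targets, which is an assumption about units. Ties in the dynamic model are resolved in favor of the left target, also an assumption.

```python
import math

def reward_prob(f_left, r, x=0.0, s=7.0):
    """Equation 1, with negative probabilities replaced by 0."""
    p_l = 1.0 / (1.0 + math.exp(-(-f_left + r + 10.0) / s)) - x
    p_r = 1.0 / (1.0 + math.exp(-(f_left - r + 10.0) / s)) - x
    return max(p_l, 0.0), max(p_r, 0.0)

def total_reward_rate(f_left, r, x=0.0, s=7.0):
    """Equation 10, with the fixed choice probability given in percent."""
    p_l, p_r = reward_prob(f_left, r, x, s)
    w = f_left / 100.0
    return w * p_l + (1.0 - w) * p_r

def optimal_static_choice(r, x=0.0, s=7.0, step=0.1):
    """Equation 11 by grid search over left-choice percentages in [0, 100]."""
    grid = [i * step for i in range(int(round(100 / step)) + 1)]
    return max(grid, key=lambda f: total_reward_rate(f, r, x, s))

def optimal_dynamic_choice(f_left, r, x=0.0, s=7.0):
    """Choose the option with the higher current reward probability."""
    p_l, p_r = reward_prob(f_left, r, x, s)
    return "L" if p_l >= p_r else "R"
```

Under these assumptions, `optimal_static_choice(80)` returns a value below 80, consistent with the statement that slight undermatching is optimal in this task.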

We also compared the monkeys' choice behavior with the prediction of the matching law. The matching law states that the animals allocate their choices in a proportion that matches the relative reinforcement obtained by the choice options. In our experiment, this is equivalent to the relative fraction of left (respectively, right) choices to match the relative fraction of incomes on the left (respectively, right) choices. Therefore, to quantify deviations from matching, we calculated the difference between the relative fraction of choosing the more rewarding target (left when r > 50 and right when r < 50) and the relative fraction of the income for the more rewarding target. Negative and positive values correspond to undermatching (choosing the better option less frequently than the relative reinforcement) and overmatching, respectively.
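The deviation from matching defined above can be computed per session as in the following sketch. The function name and the encoding of choices and rewards are assumptions made for illustration.

```python
def matching_deviation(choices, rewards, better="L"):
    """Choice fraction minus income fraction for the more rewarding target.

    choices: sequence of "L" or "R"; rewards: sequence of 0/1 (or bools)
    better:  the more rewarding target ("L" when r > 50, "R" when r < 50)
    Negative values correspond to undermatching, positive values to overmatching.
    """
    n_better = sum(c == better for c in choices)
    total_reward = sum(rewards)
    reward_better = sum(rew for c, rew in zip(choices, rewards) if c == better)
    choice_fraction = n_better / len(choices)
    income_fraction = reward_better / total_reward  # undefined if no reward was earned
    return choice_fraction - income_fraction
```

For instance, a session in which the better target is chosen on 75% of trials but yields all of the harvested reward gives a deviation of −0.25, that is, undermatching.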

## RESULTS

We trained two monkeys to freely select between two visual targets via saccadic eye movement (Figure 1A). Saccades to each target resulted in delivery of a fixed amount of juice reward with a varying probability. Targets were stationary apertures of drifting gratings, and the reward probability was determined based on the location of the grating targets independently of the direction of visual motion contained within the gratings. More specifically, on a given trial, probabilities of reward on the left and right targets were determined by the reward parameter (r) and the choice history on the preceding 20 trials (Equation 1; Figure 1B). Critical for our experimental design, the motion contained within the targets caused the end points of eye movements made to those targets to be systematically biased in the direction of grating motion (MIB). We first show that this MIB can be used as a measure of sensitivity to visual motion on a trial-by-trial basis. Next, we use an exploratory approach to study whether and how the effects of reward on choice are different or similar to the effects of reward on sensitivity to visual motion measured by the MIB. In this approach, we rely on known effects of objective and subjective reward values on choice and then test those effects for the MIB.

### MIB Measures Sensitivity to Visual Motion

The MIB of a saccadic eye movement quantifies the extent to which the end points of saccades directed toward the drifting gratings were biased in the direction of grating motion (Figures 1A and 2A). Despite the stationary position of the grating aperture, motion in the drifting sinusoid nonetheless induces a shift in the perceived position of the aperture in human participants (De Valois & De Valois, 1991) and biases saccadic end points in the direction of grating drift in monkeys (Schafer & Moore, 2007). By examining the MIB in different conditions, we established that it can provide a measure of sensitivity to visual motion even when the grating motion is not behaviorally relevant.

Figure 2.

MIB measures sensitivity to visual motion. (A) Plotted are the example distributions of the angle of saccade vector (relative to the fixation dot) for upward (open) and downward (filled) drifting targets. (B) MIB significantly increased as the contrast of grating is increased from 2% (purple) to 3% (yellow). (C) Comparison of the z score normalized MIB when the directions of motion in the chosen and nonchosen targets matched or did not match. The MIB is z score normalized for each monkey separately within each session. The asterisk shows a significant difference between the two contrasts using two-sided t test (p < .05).

First, we found that the magnitude of the MIB depended on the grating contrast. More specifically, the MIB increased by 27% when the (Michelson) contrast of grating increased from 2% to 3% (two-sided independent measures t test, p = 7.85 × 10⁻⁹; Figure 2B). Second, we observed that the MIB depended almost exclusively on the motion direction of the selected target as it was only slightly affected by nonmatching motion in the unchosen target (Figure 2C). Specifically, the average z score normalized MIB measured in two monkeys across all trials (mean = 0.38) was altered by only 9% when the unchosen target differed in direction of the grating motion. Together, these results demonstrate that the MIB in our task is sensitive to the properties of sensory signal (grating motion direction and contrast) and thus can be used to measure the influence of internal factors such as subjective reward value on visual processing.

### Effects of Objective Reward Value on Choice Behavior

To examine the effects of global and objective reward value on monkeys' choice behavior, we first measured how monkeys' choice behavior tracked the target location that was globally (session-wise) more valuable, which in our task is set by reward parameter r. We found that target selection was sensitive to reward parameter in both monkeys and the harvested reward rate was high, averaging 0.66 and 0.65 across all sessions (including those with penalty) in Monkeys 1 and 2, respectively (Figure 3A, B, D, and E). To better quantify monkeys' performance, we also computed the overall harvested reward by a model that selects between the two targets with the optimal but fixed choice probability in a given session (optimal static model; see example in Figure 1C in Methods) or a model in which the target with a higher probability of reward was chosen on each trial (optimal dynamic model; see Methods). We found that the performance of both monkeys was suboptimal; however, the pattern of performance as a function of reward parameter for Monkeys 1 and 2 resembled the behavior of the optimal static and dynamic models, respectively (Figure 3B and E). Because each session of the experiment for Monkey 2 was longer, we confirmed that there was no significant difference in task performance between the first and second halves of sessions for Monkey 2 (difference: mean = 0.003, SEM = 0.008; two-sided paired t test: p = .7, d = 0.03). Together, these results suggest that both monkeys followed the reward schedule closely in each session, although their choice behavior was suboptimal.

Figure 3.

Global (session-wise) effects of reward on choice behavior. (A) Choice behavior was sensitive to reward parameter. Percentage of TL selections is plotted as a function of r, which varied across experimental sessions for Monkey 1. The colored lines are linear fits, and the black dashed line shows the optimal fL for a given value of r assuming selection between the two targets with a fixed probability (optimal static model). The gray dashed line shows unit slope. Each data point corresponds to one session of the experiment. (B) The overall performance was suboptimal. Plotted is harvested rewards per trial as a function of reward parameter r for zero penalty sessions for Monkey 1. The solid colored lines show fit using a quadratic function. The colored and black dashed lines indicate harvested reward rates of the optimal dynamic and static models, respectively. (C) Proportion of TL selections is plotted as a function of the fraction of harvested reward on the left target. The colored lines are linear fits, and the gray dashed line shows the diagonal line corresponding to matching behavior. Monkey 1 showed significant undermatching by selecting the more rewarding target with a choice fraction smaller than reward fraction. The inset shows the difference between choice and reward fractions with negative and positive values corresponding to undermatching and overmatching. The gray dashed lines indicate the medians of the distributions and asterisks show the significant difference from 0 (i.e., matching) using Wilcoxon signed rank test (p < .05). (D–F) Similar to A–C but for Monkey 2.

We also examined the global effects of reward on choice by measuring matching behavior. To that end, we compared choice and reward fractions in each session and found that both monkeys exhibited undermatching behavior (Figure 3C and F). More specifically, they selected the more rewarding location with a probability that was smaller than the relative reinforcement obtained on that location (Monkey 1 median [choice fraction − reward fraction] = −0.115; Wilcoxon signed rank test, p = 1.67 × 10⁻¹², d = −1.35; Figure 3C inset; Monkey 2 median [choice fraction − reward fraction] = −0.03, p = 1.76 × 10⁻¹², d = −0.83; Figure 3F inset). Furthermore, the degree of undermatching was larger for Monkey 1 than Monkey 2 (difference = −0.086; Wilcoxon rank sum test, p = 2.57 × 10⁻⁶, d = −0.37).
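
The comparison between choice and reward fractions can be illustrated with a minimal sketch. It is not the analysis code used for the paper: the function name, the "L"/"R" coding of choices, and the decision to express both fractions for the side that yielded more reward are assumptions made for illustration only.

```python
def matching_deviation(choices, rewards):
    """Return (choice_fraction, reward_fraction, deviation) for one session.

    choices: sequence of "L" or "R", one entry per trial (hypothetical coding).
    rewards: sequence of 0/1 outcomes for the chosen target on each trial.
    Both fractions are expressed for the side that yielded more reward
    (an assumption of this sketch), so a negative deviation indicates
    undermatching and a positive deviation indicates overmatching.
    """
    if len(choices) != len(rewards) or not choices:
        raise ValueError("choices and rewards must be non-empty and equal in length")
    total_reward = sum(rewards)
    if total_reward == 0:
        raise ValueError("reward fraction is undefined when no reward was harvested")

    left_choices = sum(1 for c in choices if c == "L")
    left_reward = sum(r for c, r in zip(choices, rewards) if c == "L")

    choice_fraction = left_choices / len(choices)
    reward_fraction = left_reward / total_reward
    if reward_fraction < 0.5:
        # Re-express both fractions for the right target.
        choice_fraction = 1.0 - choice_fraction
        reward_fraction = 1.0 - reward_fraction
    return choice_fraction, reward_fraction, choice_fraction - reward_fraction


# Toy session: 60% of choices but 80% of harvested reward on the left target.
example_choices = ["L"] * 6 + ["R"] * 4
example_rewards = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
cf, rf, deviation = matching_deviation(example_choices, example_rewards)
```

In this toy session the deviation is negative, which corresponds to undermatching; one such value per session would then enter a signed rank test against zero.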

### Effects of Objective Reward Value on Sensitivity to Visual Motion

In the previous section, we observed that choice behavior is affected by objective measures of reward value in a given session. We repeated similar analyses to examine whether objective reward values have similar effects on sensitivity to visual motion measured by MIB. To that end, we first computed the correlation between the difference in the session-based average MIB for saccades to the more and less rewarding target locations and reward parameter r in each session. However, we did not find any evidence for such correlation for either of the two monkeys (Spearman correlation; Monkey 1: r = .04, p = .8; Monkey 2: r = .11, p = .39). Second, we examined whether the average MIB for all saccades in a given session was affected by the overall performance in that session. Again, we did not find any evidence for correlation between the session-based average MIB and performance for either of the two monkeys (Spearman correlation; Monkey 1: r = −.07, p = .54; Monkey 2: r = .13, p = .21). Finally, we performed an analysis analogous to the matching analysis to examine whether differential MIB on the two target locations is related to objective reward values of those locations. In this analysis, we computed the correlation between the difference in average MIB on the better and worse target locations and the difference in total reward obtained on those locations but found no evidence for such correlation (Spearman correlation; Monkey 1: r = .003, p = .99; Monkey 2: r = .03, p = .83).
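
For readers unfamiliar with the statistic, the Spearman correlation used in these session-level analyses is the Pearson correlation of ranks. The sketch below is a plain-Python illustration with hypothetical function names; it computes only the coefficient (not the p value) and is not the code used for the reported results.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` could hold the reward parameter r for each session and `y` the session-wise difference in average MIB between the more and less rewarding locations.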

Together, these results indicate that, unlike choice, the MIB is not affected by objective reward value of the foveated target or the overall harvested reward. Observing this dissociation, we next examined the effects of subjective reward value on choice and the MIB.

### Effects of Subjective Reward Value on Choice Behavior

The analyses presented above show that the overall choice behavior was influenced by global or objective reward value of the two target locations in a given session. In contrast, sensitivity to visual motion was not affected by global or objective reward value. This difference between the influence of objective reward value on choice and MIB could simply reflect the fact that, because of task design, monkeys' choices and not MIB determine reward outcomes on current trials and influence reward probability on subsequent trials (reward probability was a function of r and monkeys' choices on the preceding trials). Therefore, we next examined similarities and differences between effects of subjective reward value on choice behavior and sensitivity to visual motion.

To investigate how reward outcomes were integrated over time to estimate subjective reward values and guide monkeys' choice behavior on each trial, we used multiple RL models to fit the choice behavior of individual monkeys on each session of the experiment. These models assume that selection between the two targets is influenced by subjective values associated with each target, which are updated on each trial based on reward outcome (see Methods). Although reward was assigned based on the location of the two targets (left vs. right) in our experiment, the monkeys could still assume that motion direction is informative about reward. Therefore, we considered RL models in which subjective values were associated with target locations as well as RL models in which subjective values were associated with the motion of the two targets, using four different learning rules. Considering the observed undermatching behavior, we grouped learning rules depending on whether they result in the estimation of subjective value in terms of local (in time) return or income.

In RLret models, only the subjective value of the chosen target (in terms of location or motion) was updated, making them return-based models. In RLInc(1) models, in addition to updating the subjective value of the chosen target, the subjective value of the unchosen target was discounted on the subsequent trials similarly to the subjective value of the chosen target, making these models income-based. In RLInc(2) models, the subjective values of chosen and unchosen targets were allowed to be discounted on the subsequent trials differently. Finally, in RLInc(3) models, we also assumed a change in the subjective value of the unchosen target or motion direction in addition to the discounting across trials. Because the subjective values of both chosen and unchosen target locations were updated on each trial in RLInc(2) and RLInc(3) models, we refer to these models as income-based similarly to RLInc(1). However, we note that only RLInc(1) models are able to estimate local income accurately.
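
The four learning rules differ only in how the unchosen option is treated, which a short sketch can make concrete. The exact update equations are given in Methods and are not reproduced here; the linear form below, and the parameter names `delta_r`, `delta_n`, `alpha_u`, and `delta_u`, are assumptions chosen for illustration.

```python
def update_values(v_chosen, v_unchosen, rewarded, rule, alpha,
                  delta_r=1.0, delta_n=0.0, alpha_u=None, delta_u=0.0):
    """Apply one trial of a (hypothetical) value update and return the new pair.

    alpha   : discount factor applied to the chosen value.
    delta_r : increment after a rewarded trial (assumed form).
    delta_n : increment after an unrewarded trial (assumed form).
    alpha_u : separate discount factor for the unchosen value (RLInc(2), RLInc(3)).
    delta_u : additional change in the unchosen value (RLInc(3) only).
    """
    outcome = delta_r if rewarded else delta_n
    new_chosen = alpha * v_chosen + outcome

    if rule == "RLret":
        # Return-based: the unchosen value is left untouched.
        new_unchosen = v_unchosen
    elif rule == "RLInc(1)":
        # Income-based: the unchosen value decays like the chosen value.
        new_unchosen = alpha * v_unchosen
    elif rule in ("RLInc(2)", "RLInc(3)"):
        if alpha_u is None:
            raise ValueError("alpha_u is required for RLInc(2) and RLInc(3)")
        # The unchosen value decays with its own discount factor.
        new_unchosen = alpha_u * v_unchosen
        if rule == "RLInc(3)":
            # An additional change is applied to the unchosen value.
            new_unchosen += delta_u
    else:
        raise ValueError(f"unknown rule: {rule}")
    return new_chosen, new_unchosen
```

With equal discounting of both options, as in the RLInc(1) branch, the value of each location tracks discounted reward per trial (income) rather than reward per choice (return), which is the distinction used to group the rules.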

We first compared the goodness-of-fit between the location-based and motion-based RLs using −LL, AIC, and BIC to test which of the two types of models can predict choice behavior better. Such comparisons based on the three measures yield the same results because the two types of models have the same number of parameters for a given learning rule. We found that, for both monkeys, all the location-based models outperformed the motion-based RLs (Table 1). This demonstrates that both monkeys attributed reward outcomes to target locations more strongly than to target motions and used subjective value attributed to target locations to perform the task.

Table 1.
Comparison of Goodness-of-Fit between Location-based and Motion-based RL Models Using −LL, AIC, or BIC

| | RLret | RLInc(1) | RLInc(2) | RLInc(3) |
| --- | --- | --- | --- | --- |
| Monkey 1, Δ(−LL, AIC, or BIC) | −5.48 | −7.19 | −6.53 | −8.06 |
| Monkey 1, p | 2.55 × 10⁻⁷ | 2.58 × 10⁻⁹ | 1.39 × 10⁻⁸ | 5.32 × 10⁻¹⁰ |
| Monkey 2, Δ(−LL, AIC, or BIC) | −60.96 | −105.71 | −103.46 | −107.25 |
| Monkey 2, p | 2.74 × 10⁻²¹ | 2.58 × 10⁻²⁶ | 2.58 × 10⁻²⁶ | 2.58 × 10⁻²⁶ |

Δ(−LL, AIC, or BIC) shows the median of the difference between location-based and motion-based RL models fitted for each session separately. Note that all differences in goodness-of-fit measures (based on −LL, AIC, and BIC) are similar because the number of parameters is the same across location-based and motion-based models. p values indicate the significance of the statistical test (two-sided sign test) for comparing the goodness-of-fit between the location-based and motion-based RLs.
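
The reason the three measures agree can be checked with a short sketch. It uses the standard textbook definitions of AIC and BIC, which may differ by a constant factor from the convention used for the table; the numerical values are hypothetical and do not come from the data.

```python
import math


def aic(neg_ll, n_params):
    """Akaike information criterion from a negative log-likelihood."""
    return 2.0 * n_params + 2.0 * neg_ll


def bic(neg_ll, n_params, n_trials):
    """Bayesian information criterion from a negative log-likelihood."""
    return n_params * math.log(n_trials) + 2.0 * neg_ll


# Hypothetical fits of one session with the same number of parameters.
neg_ll_location, neg_ll_motion = 100.0, 105.0
k, n = 3, 400

delta_neg_ll = neg_ll_location - neg_ll_motion
delta_aic = aic(neg_ll_location, k) - aic(neg_ll_motion, k)
delta_bic = bic(neg_ll_location, k, n) - bic(neg_ll_motion, k, n)
```

Because the penalty terms cancel when the two models have the same number of parameters and are fit to the same trials, the AIC and BIC differences coincide and share the sign of the −LL difference, so a sign test gives the same outcome for all three measures.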

After establishing that monkeys used target location to integrate reward outcomes, we next examined how this integration was performed by comparing the quality of fit in location-based models with different learning rules. We found that, for Monkey 1, RLret and RLInc(1) models provided the best fit of choice data; although goodness-of-fit measures were not significantly different between these models, these models provided better fits than the RLInc(2) and RLInc(3) models (Figure 4). Interestingly, fitting choice behavior with the RLInc(1) model resulted in discount factors (α) that were close to 1 for many sessions (mean and median of α were equal to 0.77 and 1.0, respectively). This result indicates that Monkey 1 integrated reward over many trials to guide its choice behavior. This is compatible with the pattern of performance as a function of reward parameter for this monkey (Figure 3B), which resembles the pattern of the optimal static model.

Figure 4.

Comparison of goodness-of-fit between different location-based RL models reveals that the RLInc(1) model provided the best overall fit. (A) The difference between BIC for fits based on the RLInc(1) model and the three competing models (indicated on the x axis). Bars show the median of the difference in BIC, and errors are SEM. Reported p values are based on a two-sided sign test. Each data point shows the goodness-of-fit for one session of the experiment. For Monkey 1, fits based on the RLInc(1) and RLret models were not significantly different. (B) The same as in A but based on the difference in AIC. (C, D) Similar to A and B but for Monkey 2. An asterisk indicates that the difference between two models is significantly different from 0 using two-sided sign test (p < .05).

The same analysis for Monkey 2 revealed a similar integration of reward outcomes but on a different timescale. More specifically, we found that the RLInc(1) model provided the best fit for choice behavior as the goodness-of-fit in this model was better than that of the return-based model (RLret) and the more detailed income-based models (RLInc[2] and RLInc[3]; Figure 4). In contrast to Monkey 1, the estimated discount factors based on the RLInc(1) model were much smaller than 1 for many sessions for Monkey 2 (mean and median α were equal to 0.32 and 0.33, respectively). These results indicate that Monkey 2 integrated reward over a shorter timescale (a few trials) than Monkey 1 to guide its choice behavior. This is compatible with the pattern of performance as a function of reward parameter for this monkey (Figure 3E), which more closely resembles the pattern of the optimal dynamic model.

Together, fitting of choice behavior shows that both monkeys associated reward outcomes with the location of the chosen target. Moreover, both monkeys estimated subjective reward values in terms of income by integrating reward outcomes over multiple trials and used these values to make decisions.

### Effects of Subjective Reward Value on Sensitivity to Visual Motion

Our experimental design allowed us to simultaneously measure choice and the MIB, as an implicit measure of sensitivity to visual motion, on each trial. We next examined whether subjective reward value based on the integration of reward outcomes over time influenced sensitivity to visual motion.

To that end, we first examined whether reward feedback had an immediate effect on the MIB in the following trial. Combining the data of both monkeys, we found that the MIB was larger in the trials that were preceded by a rewarded rather than an unrewarded trial (mean = 0.03, SEM = 0.009; two-sided t test: p = 6.95 × 10⁻⁴, d = 0.18). When considering data from each monkey individually, however, this effect only retained significance for Monkey 1 (Monkey 1: mean = 0.05, SEM = 0.01; two-sided t test, p = 6.5 × 10⁻⁴, d = 0.09; Monkey 2: mean = 0.01, SEM = 0.01; two-sided t test, p = .21, d = 0.09). These results suggest that the MIB is weakly affected by the immediate reward outcome in the preceding trial.
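
The one-trial-back comparison amounts to splitting trials by the outcome of the preceding trial. The sketch below illustrates that split for a single session; the function name and the assumption that trials are consecutive within one session are illustrative, and the toy numbers are not data.

```python
from statistics import mean


def mib_difference_by_previous_outcome(mib, rewarded):
    """Mean MIB after rewarded trials minus mean MIB after unrewarded trials.

    mib      : per-trial MIB values for consecutive trials of one session.
    rewarded : per-trial outcomes (truthy if the trial was rewarded).
    """
    if len(mib) != len(rewarded):
        raise ValueError("mib and rewarded must have the same length")
    # Pair each trial's MIB with the outcome of the trial before it.
    pairs = list(zip(mib[1:], rewarded[:-1]))
    after_reward = [m for m, prev in pairs if prev]
    after_no_reward = [m for m, prev in pairs if not prev]
    if not after_reward or not after_no_reward:
        raise ValueError("need at least one trial after each outcome type")
    return mean(after_reward) - mean(after_no_reward)


# Toy example with alternating outcomes.
toy_difference = mib_difference_by_previous_outcome(
    [0.1, 0.5, 0.2, 0.6, 0.3], [1, 0, 1, 0, 1]
)
```

A positive difference means the MIB was larger following reward, which is the direction of the effect reported above.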

In the previous section, we showed that the best model for fitting choice behavior was one that estimates subjective reward value based on the income on each target location and uses the difference in incomes to drive choice behavior (RLInc[1] model; Figure 4). However, it is not clear if the MIB is influenced by subjective reward values of the two targets in a similar fashion. To test this relationship, we computed correlations between the trial-by-trial MIB and estimated subjective reward values of the chosen target location, the unchosen target location, and their sum and difference. We considered subjective reward values based on both income and return (see Effects of Subjective Reward Value on MIB section in Methods).

We made several key observations. First, we found that the MIB was positively correlated with subjective reward values of both the chosen and unchosen targets (Figure 5A, B, E, and F) and, as a result, was most strongly correlated with the sum of subjective reward values of the two targets (Figure 5C and G). In contrast to choice, the MIB was poorly correlated with the difference in subjective reward values of the chosen and unchosen targets (Figure 5D and H; Supplementary Figure 1 [http://ccnl.dartmouth.edu/Soltani_etal_20_JoCN/SuppFig1.pdf]). Therefore, choice was most strongly correlated with the difference in subjective reward values, whereas the MIB was most strongly correlated with the sum of subjective reward values from the two targets. Second, although the aforementioned relationships were true for subjective reward value based on return and income, we found that correlations between the MIB and subjective return values were stronger than correlations between the MIB and subjective income values (compare Figure 5 and Supplementary Figure 2 [http://ccnl.dartmouth.edu/Soltani_etal_20_JoCN/SuppFig2.pdf]). Third, the maximum correlation occurred for the values of τ at around 15–20 trials and for negative values of Δn, similarly for both monkeys. This indicates that, for both monkeys, the MIB was influenced by reward integrated over many trials, and the absence of reward on a given trial had a negative influence on the MIB on the following trials (Δn < 0).
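
To make the quantities entering these correlations concrete, the following sketch estimates per-trial values from choice and reward history and returns their sum and difference. The exact estimator is defined in Methods; the exponential decay with time constant τ, the use of Δn as the increment on unrewarded trials, and all function and argument names are assumptions made for illustration.

```python
import math


def value_traces(choices, rewards, tau, delta_n, mode="return"):
    """Return per-trial (sum, difference) of chosen and unchosen values.

    choices : sequence of "L" or "R" (hypothetical coding).
    rewards : sequence of truthy/falsy outcomes for the chosen target.
    tau     : integration time constant in trials (assumed exponential decay).
    delta_n : increment applied after an unrewarded trial.
    mode    : "return" updates only the chosen location;
              "income" also decays the unchosen location on every trial.
    """
    if mode not in ("return", "income"):
        raise ValueError("mode must be 'return' or 'income'")
    decay = math.exp(-1.0 / tau)
    values = {"L": 0.0, "R": 0.0}
    sums, diffs = [], []
    for choice, rewarded in zip(choices, rewards):
        other = "R" if choice == "L" else "L"
        # Record values as they stood before this trial's outcome.
        sums.append(values[choice] + values[other])
        diffs.append(values[choice] - values[other])
        outcome = 1.0 if rewarded else delta_n
        values[choice] = decay * values[choice] + outcome
        if mode == "income":
            values[other] = decay * values[other]
    return sums, diffs
```

The trial-by-trial MIB can then be correlated with `sums` or `diffs` over a grid of `tau` and `delta_n` values, which is the kind of comparison summarized in Figure 5.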

Figure 5.

MIB was most strongly correlated with the sum of subjective reward values of the two targets based on return. (A–D) Plotted are the correlations between the MIB and subjective reward values of the chosen (A) and unchosen (B) targets based on return as well as their sum (C) and their difference (D) for different values of τ and Δn. The inset in each panel shows the correlation between the MIB and the corresponding subjective return values for different values of τ and a specific value of Δn (indicated with an arrow in the main panel C) for Monkey 1. The arrow in C points to the value of Δn that results in the maximum correlation between the MIB and the sum of subjective return values of the two targets for Monkey 1. (E–H) The same as in A–D but for Monkey 2. The arrow in G points to the value of Δn that results in the maximum correlation between the MIB and the sum of subjective return values of the two targets for Monkey 2.

Considering that local choice fraction (fL in Equation 1) has opposite effects on prL and prR because of task design, we also tested the relationship between estimated subjective values (based on return) for the two target locations. We found that the correlation between estimated reward values depends on the values of τ and Δn and is not always negative (data not shown). Nevertheless, for all values of τ and Δn, the MIB was most strongly correlated with the sum of subjective reward values while being positively correlated with the value of both chosen and unchosen targets. These results indicate that the dependence of the MIB on the sum of subjective value is not driven by our specific task design.

Finally, to better illustrate distinct effects of reward on decision-making and visual processing, we used two sets of parameters (τ = 15 and Δn = 0; τ = 15 and Δn = −0.5) that resulted in significant correlations between choice and targets' subjective income values (Supplementary Figure 1) and between the MIB and targets' subjective return values in all cases (Figure 5). We then used these two sets of parameters and choice history of the monkeys on the preceding trials to estimate subjective income values and return values in each trial (see Effects of Subjective Reward Value on MIB section in Methods). We then grouped trials into bins according to estimated subjective reward values of TL (left target) and TR (right target) for choice or of the chosen and unchosen targets for the MIB and computed the average probability of choosing the left target and the average MIB for each bin. We found that the probability of choosing the left target for both monkeys was largely determined by the difference in subjective values of the left and right targets, as can be seen from contours being parallel to the diagonals (Figure 6A, B, E, and F). In contrast, the MIB was largely determined by the sum of subjective values, as can be seen from contours being parallel to the second diagonals (Figure 6C, D, G, and H). These results clearly demonstrate that reward has distinct effects on choice behavior and sensitivity to visual motion.
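
The binning step can be sketched as a two-dimensional grouping followed by a per-bin average. The bin width, the function name, and the toy inputs below are hypothetical; the same routine would apply to choice (outcome coded as 1 for a left choice) and to the MIB.

```python
import math
from collections import defaultdict


def binned_mean(x_values, y_values, outcomes, bin_width):
    """Average an outcome within 2-D bins of two estimated values.

    x_values, y_values : per-trial estimated values (e.g., left/right or
                         chosen/unchosen targets).
    outcomes           : per-trial quantity to average (choice or MIB).
    bin_width          : width of the square bins (hypothetical parameter).
    Returns a dict mapping (x_bin, y_bin) indices to the mean outcome.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x, y, outcome in zip(x_values, y_values, outcomes):
        key = (math.floor(x / bin_width), math.floor(y / bin_width))
        totals[key] += outcome
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in counts}


# Toy example: two trials fall in bin (0, 0) and one in bin (2, 0).
toy_bins = binned_mean([0.1, 0.2, 1.1], [0.1, 0.3, 0.2], [1, 0, 1], bin_width=0.5)
```

If the averaged quantity depends on the difference of the two values, bins with equal averages line up along one diagonal of this grid; if it depends on the sum, they line up along the other.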

Figure 6.

The choice probability for both monkeys was largely determined by the difference in estimated subjective values, whereas the MIB was largely determined by the sum of subjective values of targets. (A, B) Plots show the probability of choosing the left target as a function of subjective values of the left and right targets for Monkey 1, using τ = 15 and two values of Δn as indicated on the top. (C, D) Plots show the MIB as a function of subjective values of the chosen and unchosen targets for Monkey 1, using τ = 15 and two values of Δn as indicated on the top. (E–H) The same as in A–D but for Monkey 2.

## DISCUSSION

Experimental paradigms with dynamic reward schedules have been extensively used in different animal models to study how reward shapes choice behavior on a trial-by-trial basis (Donahue & Lee, 2015; Li, McClure, King-Casas, & Montague, 2006; Lau & Glimcher, 2005; Barraclough et al., 2004; Sugrue et al., 2004; Herrnstein, Loewenstein, Prelec, & Vaughan, 1993). A general finding is that animals integrate reward outcomes on one or more timescales to estimate subjective reward value and determine choice. In contrast, the influence of reward on selective processing of visual information, which is often described as attentional deployment, has been mainly studied using fixed reward schedules with unequal reward outcomes (Barbaro et al., 2017; Hickey & Peelen, 2017; Hickey et al., 2010, 2014; Anderson et al., 2011a, 2011b; Della Libera & Chelazzi, 2006, 2009; Peck et al., 2009). The main findings from these studies are that targets or features associated with larger reward can more strongly capture attention and alter visual processing immediately or even after extended periods (reviewed in Anderson, 2013, 2016).

However, it has proven difficult to link the effects of reward on saccadic choice and selective processing of visual information mainly because of separate measurements of these effects in different tasks. Indeed, the poorly described relationship between reward expectation and the processing of visual information has been implicated as a confounding factor in the interpretation of many past behavioral and neurophysiological results (Maunsell, 2004, 2015). An exception to this is a study by Serences (2008) in which the author utilized a task with a dynamic reward schedule to demonstrate that the activity in visual cortex is modulated by reward history (i.e., integrated reward outcomes over many trials). Compatible with these results, we find that processing of visual information is affected by subjective reward value estimated by integration of reward outcomes over many trials.

Using tasks designed specifically to dissociate subjective reward value from a target's behavioral significance, or salience, a few studies have identified brain areas that respond primarily to the expected reward or the salience of a target (or both) in various species including rats (Lin & Nicolelis, 2008), monkeys (Roesch & Olson, 2004), and humans (Litt, Plassmann, Shiv, & Rangel, 2011; Cooper & Knutson, 2008; Jensen et al., 2007; Anderson et al., 2003). However, in these studies, the saliency signal observed in neural responses might reflect a number of different processes, such as motivation, attention, motor preparation, or some combination of these. In the present work, we exploited the influence of visual motion on saccades as an independent and implicit measure of visual processing during value-based decision-making. This enabled us for the first time to measure choice and visual processing simultaneously and to test whether reward has differential effects on these two processes.

Although motion was not predictive of reward and thus processing of motion direction was not required to obtain a reward, we found that, similar to decision-making, visual processing was influenced by subjective reward values of the two targets. However, the effects of subjective reward values on visual processing differed from their effects on choice in three ways. First, although choice was correlated most strongly with the difference between subjective values of chosen and unchosen targets, visual processing was most strongly correlated with the sum of subjective values of the two targets. The latter indicates that the overall subjective value of targets in a given environment could influence the quality of sensory processing in that environment. Second, choice was more strongly affected by the subjective income value of the target, whereas sensitivity to visual motion was more strongly affected by subjective return values of the targets. Third, the time constant of reward integration and the impact of no reward were different between decision-making and visual information processing. In contrast to subjective reward value, we found that objective reward value only affected choice and not sensitivity to visual motion. Together, these results point to multiple systems for reward integration in the brain.

We found certain differences between the results for the two monkeys that could indicate that they used different, idiosyncratic strategies for performing the task. For example, fitting results of RL models indicated that Monkey 1 used the reward history over many trials to direct its choice behavior. In contrast, Monkey 2 used the reward history over few trials to direct its choice behavior. This difference was also apparent in the correlation between choice and the difference in subjective values of the two target locations. Despite this difference in integration time constant, choice in both monkeys was most strongly correlated with the difference between estimated subjective values of the two targets. Furthermore, the MIB for both monkeys was most strongly correlated with the sum of estimated subjective values of the two targets, although they integrated reward outcomes on different timescales.

The observed differences in reward effects on visual processing and decision-making have important implications for the involved brain structures and underlying neural mechanisms. First, they suggest that brain structures involved in decision-making and processing of visual information receive distinct sets of value-based input, for example, ones that integrate reward over a different number of trials. The set of input affecting decision-making carries information about subjective reward value of individual targets, whereas the set that affects visual processing carries information about the sum of subjective values of targets. Indeed, there are more neurons in ACC and other prefrontal areas that encode the sum of subjective value of available options than the subjective value of a given option (Kim, Hwang, Seo, & Lee, 2009), and these neurons might contribute to enhanced sensory processing. In addition, it has been shown that the activity of basal forebrain neurons increases with the sum of subjective values of choice array options (Ledbetter, Chen, & Monosov, 2016), and this could enable basal forebrain to guide visual processing and attention based on reward feedback independently of how reward controls choice behavior (Monosov, 2020). Finally, the FEF also receives inputs from the supplementary eye field, which contains neurons whose activity reflects subjective reward value of the upcoming saccade (Chen & Stuphorn, 2015). Such input from the supplementary eye field could drive target selection in the FEF. Importantly, our findings can be used in future experiments to tease apart neural substrates by which reward influences visual processing and decision-making.

Second, a plausible mechanism that could contribute to the observed differences in the effects of reward is the differential influence of dopaminergic signaling on the functions of FEF neurons. Recent work demonstrates that the modulatory influence of the FEF on sensory activity within visual cortex is mediated principally by D1 receptors and that D2-mediated activity is not involved (Noudoost & Moore, 2011). However, activity mediated through both receptor subtypes contributes to target selection, albeit in different ways (Soltani, Noudoost, & Moore, 2013; Noudoost & Moore, 2011). This evidence indicates that the neural mechanisms underlying target selection and visual processing are separable if only in terms of the involvement of different dopaminergic signals. Considering the known role of dopamine in reward processing (Schultz, 2007) and synaptic plasticity (Calabresi, Picconi, Tozzi, & Di Filippo, 2007), these two dopaminergic signaling pathways may provide a mechanism for the separate effects of reward on sensory processing and selection.

Third, in most choice tasks with dynamic reward schedules, local subjective return and income values are typically correlated, and the question of which quantity is the critical determinant of behavior has been debated for many years (Soltani & Wang, 2006; Corrado et al., 2005; Gallistel, Mark, King, & Latham, 2001; Gallistel & Gibbon, 2000; Mark & Gallistel, 1994; Herrnstein & Prelec, 1991). The observation that differences in subjective income values are a better predictor of choice behavior may reflect the fact that income values provide information about which target is globally more valuable in each session of the task. In contrast, the dependence of visual processing on the sum of subjective return values is more unexpected. This indicates that visual processing may more strongly depend on target-specific reward integration because the return value of a given target is updated only after selection of that target.
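The distinction between income and return can be made concrete with a minimal sketch. It is illustrative only and is not the model fitted in this study: the delta-rule form, the learning rate `alpha`, and all names (`TargetValues`, `update`, `simulate`) are assumptions introduced here. Income is treated as reward per trial, so it is updated for both targets on every trial, whereas return is treated as reward per choice, so it is updated only for the selected target.

```python
from dataclasses import dataclass


@dataclass
class TargetValues:
    """Running value estimates for one target (hypothetical names)."""
    income: float = 0.0  # reward per trial; updated on every trial
    ret: float = 0.0     # reward per choice; updated only when chosen


def update(values: TargetValues, chosen: bool, reward: float,
           alpha: float = 0.2) -> TargetValues:
    """Apply one trial of a delta-rule update (alpha is an assumed rate)."""
    # Income: an unchosen target earns nothing on this trial,
    # so its estimate moves toward zero.
    earned = reward if chosen else 0.0
    income = values.income + alpha * (earned - values.income)
    # Return: only trials on which the target was chosen are informative,
    # so the estimate is left untouched otherwise.
    ret = values.ret + alpha * (reward - values.ret) if chosen else values.ret
    return TargetValues(income=income, ret=ret)


def simulate(trials, alpha: float = 0.2):
    """trials: iterable of (choice, reward), choice in {"left", "right"}."""
    vals = {"left": TargetValues(), "right": TargetValues()}
    for choice, reward in trials:
        for name in vals:
            vals[name] = update(vals[name], name == choice, reward, alpha)
    return vals


if __name__ == "__main__":
    v = simulate([("left", 1.0), ("left", 0.0), ("right", 1.0)])
    # Quantities of the kind contrasted in the text:
    income_difference = v["left"].income - v["right"].income
    return_sum = v["left"].ret + v["right"].ret
    print(income_difference, return_sum)
```

In this sketch, the return estimate of a target that is not selected stays fixed while its income estimate decays, which is one way the two quantities can dissociate even though they are usually correlated.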

Finally, the separable influences of reward could be crucial for the flexible behavior required in dynamic and high-dimensional reward environments (Farashahi, Rowe, et al., 2017). For example, when the saccade target has multiple visual features, processing its visual information according to the sum of the subjective reward values of the available targets could allow processing of previously neglected information from the less rewarding targets and thus improve exploration. Future studies are needed to test whether disrupting this processing reduces flexibility in target selection and choice behavior.

## Funding Information

Predoctoral NRSA fellowship, Grant number: F31MH078490. NIH, Grant numbers: DA047870, EY014924. NDSEG fellowship. NSF, Grant number: EPSCoR Award #1632738.

## Acknowledgments

We thank Vince McGinty for helpful comments on an earlier version of this article. We also thank D. S. Aldrich for technical assistance. This work was supported by NIH Grant EY014924 (T. M.), NIH Grant DA047870 (A. S.), NSF EPSCoR Award 1632738 (A. S.), an NDSEG fellowship (R. J. S.), and predoctoral NRSA fellowship F31MH078490 (R. J. S.).

Reprint requests should be sent to Alireza Soltani, Department of Psychological and Brain Sciences, Dartmouth College, Hanover, NH 03755, or via e-mail: soltani@dartmouth.edu.

## REFERENCES

Abe, N., & Takeuchi, J. (1993). The “Lob-Pass” problem and an on-line learning model of rational choice. In Proceedings of the 6th Annual Conference on Computational Learning Theory (pp. 422–428). New York: Association for Computing Machinery.

Anderson, A. K., Christoff, K., Stappen, I., Panitz, D., Ghahremani, D. G., Glover, G., et al. (2003). Dissociated neural representations of intensity and valence in human olfaction. Nature Neuroscience, 6, 196–202.

Anderson, B. A. (2013). A value-driven mechanism of attentional selection. Journal of Vision, 13, 7.

Anderson, B. A. (2016). The attention habit: How reward learning shapes attentional selection. Annals of the New York Academy of Sciences, 1369, 24–39.

Anderson, B. A., Laurent, P. A., & Yantis, S. (2011a). Value-driven attentional capture. Proceedings of the National Academy of Sciences, U.S.A., 108, 10367–10371.

Anderson, B. A., Laurent, P. A., & Yantis, S. (2011b). Learned value magnifies salience-based attentional capture. PLoS One, 6, e27926.

Barbaro, L., Peelen, M. V., & Hickey, C. (2017). Valence, not utility, underlies reward-driven prioritization in human vision. Journal of Neuroscience, 37, 10438–10450.

Bari, B. A., Grossman, C. D., Lubin, E. E., Rajagopalan, A. E., Cressy, J. I., & Cohen, J. Y. (2019). Stable representations of decision variables for flexible behavior. Neuron, 103, 922–933.

Barraclough, D. J., Conroy, M. L., & Lee, D. (2004). Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience, 7, 404–410.

Calabresi, P., Picconi, B., Tozzi, A., & Di Filippo, M. (2007). Dopamine-mediated regulation of corticostriatal synaptic plasticity. Trends in Neurosciences, 30, 211–219.

Chen, X., & Stuphorn, V. (2015). Sequential selection of economic good and action in medial frontal cortex of macaques during value-based decisions. eLife, 4, e09418.

Cooper, J. C., & Knutson, B. (2008). Valence and salience contribute to nucleus accumbens activation. Neuroimage, 39, 538–547.

Corrado, G. S., Sugrue, L. P., Seung, H. S., & Newsome, W. T. (2005). Linear–nonlinear–Poisson models of primate choice dynamics. Journal of the Experimental Analysis of Behavior, 84, 581–617.

Costa, V. D., Dal Monte, O., Lucas, D. R., Murray, E. A., & Averbeck, B. B. (2016). Amygdala and ventral striatum make distinct contributions to reinforcement learning. Neuron, 92, 505–517.

Della Libera, C., & Chelazzi, L. (2006). Visual selective attention and the effects of monetary rewards. Psychological Science, 17, 222–227.

Della Libera, C., & Chelazzi, L. (2009). Learning to attend and to ignore is a matter of gains and losses. Psychological Science, 20, 778–784.

De Valois, R. L., & De Valois, K. K. (1991). Vernier acuity with stationary moving Gabors. Vision Research, 31, 1619–1626.

Donahue, C. H., & Lee, D. (2015). Dynamic routing of task-relevant signals for decision making in dorsolateral prefrontal cortex. Nature Neuroscience, 18, 295–301.

Farashahi, S., Azab, H., Hayden, B., & Soltani, A. (2018). On the flexibility of basic risk attitudes in monkeys. Journal of Neuroscience, 38, 4383–4398.

Farashahi, S., Donahue, C. H., Khorsand, P., Seo, H., Lee, D., & Soltani, A. (2017). Metaplasticity as a neural substrate for adaptive learning and choice under uncertainty. Neuron, 94, 401–414.

Farashahi, S., Rowe, K., Aslami, Z., Lee, D., & Soltani, A. (2017). Feature-based learning improves adaptability without compromising precision. Nature Communications, 8, 1768.

Fuchs, A. F., & Robinson, D. A. (1966). A method for measuring horizontal and vertical eye movement chronically in the monkey. Journal of Applied Physiology, 21, 1068–1070.

Gallistel, C. R., & Gibbon, J. (2000). Time, rate, and conditioning. Psychological Review, 107, 289–344.

Gallistel, C. R., Mark, T. A., King, A. P., & Latham, P. E. (2001). The rat approximates an ideal detector of changes in rates of reward: Implications for the law of effect. Journal of Experimental Psychology: Animal Behavior Processes, 27, 354–372.

Glimcher, P. W. (2003). The neurobiology of visual-saccadic decision making. Annual Review of Neuroscience, 26, 133–179.

Herrnstein, R. J., Loewenstein, G. F., Prelec, D., & Vaughan, W. (1993). Utility maximization and melioration: Internalities in individual choice. Journal of Behavioral Decision Making, 6, 149–185.

Herrnstein, R. J., & Prelec, D. (1991). Melioration: A theory of distributed choice. Journal of Economic Perspectives, 5, 137–156.

Hickey, C., Chelazzi, L., & Theeuwes, J. (2010). Reward changes salience in human vision via the anterior cingulate. Journal of Neuroscience, 30, 11096–11103.

Hickey, C., Chelazzi, L., & Theeuwes, J. (2014). Reward-priming of location in visual search. PLoS One, 9, e103372.

Hickey, C., & Peelen, M. V. (2017). Reward selectively modulates the lingering neural representation of recently attended objects in natural scenes. Journal of Neuroscience, 37, 7297–7304.

Hikosaka, O. (2007). Basal ganglia mechanisms of reward-oriented eye movement. Annals of the New York Academy of Sciences, 1104, 229–249.

Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489–1506.

Jensen, J., Smith, A. J., Willeit, M., Crawley, A. P., Mikulis, D. J., Vitcu, I., et al. (2007). Separate brain regions code for salience vs. valence during reward prediction in humans. Human Brain Mapping, 28, 294–302.

Judge, S. J., Richmond, B. J., & Chu, F. C. (1980). Implantation of magnetic search coils for measurement of eye position: An improved method. Vision Research, 20, 535–538.

Kim, S., Hwang, J., Seo, H., & Lee, D. (2009). Valuation of uncertain and delayed rewards in primate prefrontal cortex. Neural Networks, 22, 294–304.

Lau, B., & Glimcher, P. W. (2005). Dynamic response-by-response models of matching behavior in rhesus monkeys. Journal of the Experimental Analysis of Behavior, 84, 555–579.

Lau, B., & Glimcher, P. W. (2007). Action and outcome encoding in the primate caudate nucleus. Journal of Neuroscience, 27, 14502–14514.

Ledbetter, N. M., Chen, C. D., & Monosov, I. E. (2016). Multiple mechanisms for processing reward uncertainty in the primate basal forebrain. Journal of Neuroscience, 36, 7852–7864.

Li, J., McClure, S. M., King-Casas, B., & Montague, P. R. (2006). Policy adjustment in a dynamic economic game. PLoS One, 1, e103.

Lin, S.-C., & Nicolelis, M. A. L. (2008). Neuronal ensemble bursting in the basal forebrain encodes salience irrespective of valence. Neuron, 59, 138–149.

Liston, D. B., & Stone, L. S. (2008). Effects of prior information and reward on oculomotor and perceptual choices. Journal of Neuroscience, 28, 13866–13875.

Litt, A., Plassmann, H., Shiv, B., & Rangel, A. (2011). Dissociating valuation and saliency signals during decision-making. Cerebral Cortex, 21, 95–102.

Mark, T. A., & Gallistel, C. R. (1994). Kinetics of matching. Journal of Experimental Psychology: Animal Behavior Processes, 20, 79–95.

Markowitz, D. A., Shewcraft, R. A., Wong, Y. T., & Pesaran, B. (2011). Competition for visual selection in the oculomotor system. Journal of Neuroscience, 31, 9298–9306.

Maunsell, J. H. R. (2004). Neuronal representations of cognitive state: Reward or attention? Trends in Cognitive Sciences, 8, 261–265.

Maunsell, J. H. R. (2015). Neuronal mechanisms of visual attention. Annual Review of Vision Science, 1, 373–391.

Monosov, I. E. (2020). How outcome uncertainty mediates attention, learning, and decision-making. Trends in Neurosciences, 43, 795–809.

Navalpakkam, V., Koch, C., Rangel, A., & Perona, P. (2010). Optimal reward harvesting in complex perceptual environments. Proceedings of the National Academy of Sciences, U.S.A., 107, 5232–5237.

Noudoost, B., & Moore, T. (2011). Control of visual cortical signals by prefrontal dopamine. Nature, 474, 372–375.

Peck, C. J., Jangraw, D. C., Suzuki, M., Efem, R., & Gottlieb, J. (2009). Reward modulates attention independently of action value in posterior parietal cortex. Journal of Neuroscience, 29, 11182–11191.

Platt, M. L., & Glimcher, P. W. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400, 233–238.

Rakhshan, M., Lee, V., Chu, E., Harris, L., Laiks, L., Khorsand, P., et al. (2020). Influence of expected reward on temporal order judgment. Journal of Cognitive Neuroscience, 32, 674–690.

Roesch, M. R., & Olson, C. R. (2004). Neuronal activity related to reward value and motivation in primate frontal cortex. Science, 304, 307–310.

Schafer, R. J., & Moore, T. (2007). Attention governs action in the primate frontal eye field. Neuron, 56, 541–551.

Schultz, W. (2007). Multiple dopamine functions at different time courses. Annual Review of Neuroscience, 30, 259–288.

Schütz, A. C., Trommershäuser, J., & Gegenfurtner, K. R. (2012). Dynamic integration of information about salience and value for saccadic eye movements. Proceedings of the National Academy of Sciences, U.S.A., 109, 7547–7552.

Serences, J. T. (2008). Value-based modulations in human visual cortex. Neuron, 60, 1169–1181.

Soltani, A., Noudoost, B., & Moore, T. (2013). Dissociable dopaminergic control of saccadic target selection and its implications for reward modulation. Proceedings of the National Academy of Sciences, U.S.A., 110, 3579–3584.

Soltani, A., & Wang, X.-J. (2006). A biophysically based neural model of matching law behavior: Melioration by stochastic synapses. Journal of Neuroscience, 26, 3731–3744.

Soltani, A., & Wang, X.-J. (2008). From biophysics to cognition: Reward-dependent adaptive choice behavior. Current Opinion in Neurobiology, 18, 209–216.

Squire, R. F., Noudoost, B., Schafer, R. J., & Moore, T. (2013). Prefrontal contributions to visual selective attention. Annual Review of Neuroscience, 36, 451–466.

Strait, C. E., Blanchard, T. C., & Hayden, B. Y. (2014). Reward value comparison via mutual inhibition in ventromedial prefrontal cortex. Neuron, 82, 1357–1366.

Sugrue, L. P., Corrado, G. S., & Newsome, W. T. (2004). Matching behavior and the representation of value in the parietal cortex. Science, 304, 1782–1787.

Sugrue, L. P., Corrado, G. S., & Newsome, W. T. (2005). Choosing the greater of two goods: Neural currencies for valuation and decision making. Nature Reviews Neuroscience, 6, 363–375.

Volkow, N. D., Wang, G.-J., Kollins, S. H., Wigal, T. L., Newcorn, J. H., Telang, F., et al. (2009). Evaluating dopamine reward pathway in ADHD: Clinical implications. Journal of the American Medical Association, 302, 1084–1091.

## Author notes

\* These authors contributed equally to this work.