Abstract

The temporal difference (TD) learning framework is a major paradigm for understanding value-based decision making and related neural activities (e.g., dopamine activity). The representation of time in neural processes modeled by a TD framework, however, is poorly understood. To address this issue, we propose a TD formulation that separates the time of the operator (neural valuation processes), which we refer to as internal time, from the time of the observer (experiment), which we refer to as conventional time. We provide the formulation and theoretical characteristics of this TD model based on internal time, called internal-time TD, and explore the possible consequences of the use of this model in neural value-based decision making. Due to the separation of the two times, internal-time TD computations, such as TD error, are expressed differently, depending on both the time frame and time unit. We examine this operator-observer problem in relation to the time representation used in previous TD models. An internal-time TD value function exhibits the co-appearance of exponential and hyperbolic discounting at different delays in intertemporal choice tasks. We further examine the effects of internal time noise on TD error, the dynamic construction of internal time, and the modulation of internal time with the internal-time hypothesis of serotonin function. We also relate the internal TD formulation to research on interval timing and subjective time.

1.  Introduction

The framework of temporal difference (TD) learning is central to the reinforcement learning paradigm (Sutton & Barto, 1998) and has become a major platform for investigating the neural basis of value-based decision making and reward-oriented behavior. The TD framework has greatly improved our understanding of these neural functions and related neural activities, most notably that of dopamine (DA) neurons (Schultz, 1998; Montague, Hyman, & Cohen, 2004; Hikosaka, Nakamura, & Nakahara, 2006). The representation of time in TD learning models for neural value-based decision making, however, remains poorly understood (Gibbon, Malapani, Dale, & Gallistel, 1997; Dayan & Niv, 2008). To understand the effects of time representation on neural valuation, we propose separating the time used in experiments (the time of the observer) from the time of neural valuation processes (or TD model, the time of the operator). We reformulated the TD learning framework using the operator's time and found that this new framework helps to clarify several issues in neural valuation. Here we explain our motivation for conducting this study by describing issues related to our subject.

First, when TD models are applied to neural valuation, they usually use a discrete time formulation. Therefore, we must clarify the relation between discrete time and continuous time, because time by nature is continuous, and experiments are thus intrinsically performed in continuous time. Accordingly, in this study, we make a clear connection between discrete-time and continuous-time TD models.

Second, when discrete-time TD models are applied to neural valuation, time is treated in two essential roles: as an operational unit, a duration unit, or both. Later, we demonstrate that a discrete-time unit, once connected to continuous time, plays a dual role as a unit for both operation and duration (see section 2.2). Conceptually, once we view a TD model's operations in the light of Markov decision processes (MDPs) (Sutton & Barto, 1998), a discrete-time unit acts as an operational unit, because each increment of the unit induces almost all the operations involved in the TD model (e.g., state transition). In relation to continuous time, a discrete-time step may also act as a duration unit. A clear example is the tapped-delay-line representation for modeling conditioning behavior (e.g., Desmond & Moore, 1988; Sutton & Barto, 1990): all time steps are typically considered to have the same length in continuous time and are used as placeholders between externally salient events. These events are considered to evoke a series of inputs to the TD models, which act as inputs in the time steps between the events and thus become time representations in the steps. Markov properties are usually assumed with respect to these events (Sutton & Barto, 1990; Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997). Provided that this assumption is valid and the event-evoked inputs representing time have sufficient representational capacity, we can treat the times between events under MDPs for TD models.

With this understanding, we observe that for application to neural valuation, time evolution is not a generic part of the discrete-time TD formulation using MDPs (Daw, Courville, & Touretzky, 2006, being a notable exception). To address this issue, this study sets the time evolution of neural valuation processes directly in the TD formulation. For this purpose, it is important to note that time is a difficult concept to define, but at the very least, it must be an entity that is independent of the methods of measurement or coordinate systems. Given this invariant principle, we must mention that time, as we think we know it, is our own time system; it is indeed only one method of measuring time. We call this time system conventional time, and it is regarded as the time of the experiment or the observer, as we observe, record, and discuss experiments and their results mostly in conventional time.

Third, the discrete-time unit of neural valuation's operation process can be different from the conventional discrete-time unit. Results from a wide range of experiments are broadly consistent with this view, although they may be directly focused not on operations of neural valuation but only on neural operations in general. Neural time processing is not a unitary system but involves several neural systems (Ivry & Spencer, 2004; Mauk & Buonomano, 2004). Various systems differentially contribute to different timescales (Rammsayer, 1999; Buhusi & Meck, 2005; Buonomano, 2007). Time may be processed differently for different modalities, yielding perception, decision, or motor times (Lewis & Miall, 2003; Gold & Shadlen, 2007; Nobre, Correa, & Coull, 2007; Eagleman, 2008). Time processing often contains errors, characterized by Weber's law, the scalar property (Gibbon, 1977; Gallistel & Gibbon, 2000), and other such errors (Matell & Meck, 2000; Nakahara, Nakamura, & Hikosaka, 2006). Even the same time interval can be processed differently, depending on the initial neural activity of the process (Karmarkar & Buonomano, 2007). Furthermore, consider the prospective use of time on a long timescale. If the conventional-time unit is the operator's unit, it must always be processed in the same way at any time; a second—whether now or 1 month later—must be processed in the same way. This proposition, admittedly naively stated, is somewhat difficult to accept. Indeed, several studies point to different neural time processing of such prospections (e.g., mental time travel, constructing representations of future events, and similar mental activities regarding the future; Gilbert & Wilson, 2007; Szpunar, Watson, & McDermott, 2007; Buckner & Carroll, 2007; Arzy, Molnar-Szakacs, & Blanke, 2008; Boyer, 2008; Liberman & Trope, 2008).

Taking these considerations into account, in this study, we treat the time system of the neural valuation's operation process as distinct from the system of the observer, or the conventional time system. For simplicity and clarity, we call the time of the neural valuation's operation process in the TD formulation (or the operator's time) specifically internal time. This study investigates what we call the internal-time TD formulation (internal TD) that constructs a TD model using internal time, together with its theoretical characteristics and possible consequences in neural valuation. We concentrate on the most basic formulation of the TD model— the case of discounted rewards without an eligibility trace. We also contrast internal TD with what we call conventional-time TD (or conventional TD), which is formulated using conventional time. We are particularly interested in how internal TD behaves in discrete conventional time—the time in which experimental observations are most likely made. Two fundamental elements of TD models are extensively used when applied to neural valuation: one is exponential discounting, which is a long-timescale property of a TD model, and the other is reward prediction error (TD error), which is a short-timescale property of a TD model reflecting its nature as an online learning algorithm. We investigate both properties of the internal TD.

Clarification at this point may be helpful for relating this study to issues investigated in the interval timing literature (Gibbon et al., 1997; Buhusi & Meck, 2005). In our view, while the issues examined in this study and the interval timing literature are closely related, the focus of this study is distinct from that of interval timing. A central issue in the interval timing literature is how time intervals are subjectively timed (e.g., estimated, produced, or reproduced by subjects); accordingly, these studies often distinguish between objective and subjective time, whereby objective time is the time of the experiment/observer (what we call conventional time) and subjective time is considered different from objective time. Subjective time is used as a theoretical construct to explain the subjective timings or timing behaviors of the subjects. Several different mappings between the two time systems are proposed in the literature (e.g., Gibbon, 1977; Killeen & Fetterman, 1988; Staddon & Higa, 1999; Buhusi & Meck, 2005). Most timing models in the literature are used to examine these possible mappings and the implications for behavior through the perspective of timing behaviors (Church, 1984; Killeen & Fetterman, 1988; Staddon & Higa, 1999; Gallistel & Gibbon, 2000; Grondin, 2001; Dragoi, Staddon, Palmer, & Buhusi, 2003; Cerutti & Staddon, 2004; Buhusi & Meck, 2005). In other words, a distinction between subjective and objective time is the distinction between two different time systems through timing behaviors. On the other hand, such subjective timing or timing behavior is not a generic part of conventional TD models applied to neural valuation, because conventional TD models regard valuation and action selection as primary interests and treat time as only an auxiliary variable. Thus, although studies of interval timing and studies using these TD models often examine related or similar experimental tasks and thus often ask closely related questions, they address different central questions (Montague et al., 1996; Gallistel & Gibbon, 2002), such as subjective timing versus valuation (Daw et al., 2006, being a notable exception). This study investigates a most basic TD formulation using internal time, which is distinct from conventional time, in that internal time is defined as the time of neural valuation processes to which TD models are applied. In this sense, the distinction between conventional and internal time is the one between two time systems through neural valuation. With this perspective, we consider that the two distinctions (objective versus subjective time and conventional versus internal time) are closely related, because studies of timing models and TD models are often closely related to each other, and thus, if taken rather broadly, the two distinctions might appear to be the same. In this study, however, we consider it beneficial to maintain the different distinctions in a strict sense, because the two lines of studies approach central questions differently, and it is currently unclear how subjective timing should be mapped onto or treated as part of neural valuation or internal time under TD formulation. In section 5, we consider how subjective timing can be included in the internal TD formulation for future research.

We first summarize the formulation of the internal-time TD (see section 2), including different discrete-time TD error expressions and what we call the operator-observer problem. In section 3, we investigate the implications of internal TD for neural valuation. One implication is the relationship of different TD error expressions to the time representations used in previous TD models (see section 3.1). In section 3.2, using intertemporal choice tasks, an internal TD is shown to exhibit the co-appearance of exponential and hyperbolic discounting at different delays, which accounts for the observed choice reversal, and also to be decomposable into multiple subsystems. Further consideration is given to the short-timescale properties of internal TD (see section 4), namely, the effect of ongoing noise in internal time on TD error (see section 4.1), dynamic internal time construction (see section 4.2), and internal time modulation, along with an internal-time hypothesis of serotonin neuronal functions (see section 4.3). Finally, section 5 contains the discussion.

2.  Internal-Time TD Formulation

We begin in section 2.1 by clarifying the general relationship between two time systems and then the relationship of the discrete time used by TD models to continuous time. This provides a basis for comparing internal and conventional TDs in section 2.2. We then summarize different expressions of discrete-time internal TD error in section 2.3 and the operator-observer problem in section 2.4.

2.1.  Preliminaries.

To clarify the general relationship among different time systems, we denote the two time systems by y and t and their constant discrete-time units by Δy and Δt, respectively. Although in later sections we specify y and t as internal and conventional time, respectively, they can be regarded in this section as any time systems. The fundamental relationship between any two time systems is schematically shown in Figure 1. It is critical to understand that a constant unit of one time system may have different lengths in another time system. This characteristic affects our understanding of neural valuation when the operator's time is distinct from the observer's time. To see this mathematically, let us define the relationship between the two time systems (y-system and t-system, respectively) by

y = f(t),

where f is assumed to be monotonically increasing and differentiable (the second condition is for simplicity, because it can be relaxed), thus yielding

Δy = f′(t) Δt.    (2.1)

Note that in general, to define a function y = f(t), we must first define the reference frame and origin (e.g., if the origin of t, or more generally the reference point of t, changes, the form of f(.) changes). Hence, we also sometimes write y = f(t; t0), where t0 is the reference point (origin), but only when clarification is necessary, in sections 3.2 and 4.2. When the y-system is represented by the t-system (i.e., when Δt is constant; see Figure 1, bottom), Δy becomes a function of t, Δy(t), and thus equation 2.1 is read as Δy(t) = f′(t)Δt. Consequently, Δy(t) varies according to f′(t). In the opposite case, when Δy is constant (when the t-system is represented by the y-system), equation 2.1 is read as

Δt(y) = Δy / f′(t(y)), where t(y) = f−1(y),    (2.2)

so that now Δt(y) varies according to f′(t(y)).
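For concreteness, equations 2.1 and 2.2 can be checked numerically; the following sketch (ours, not part of the formulation) uses a hypothetical mapping f(t) = log(1 + t):

```python
import numpy as np

# Two time systems related by y = f(t); f is a hypothetical example
# (monotonically increasing and differentiable, as required).
f = lambda t: np.log1p(t)
f_inv = lambda y: np.expm1(y)

# y-system seen from the t-system (eq. 2.1): constant dt, variable dy(t).
dt = 1.0
t = np.arange(0.0, 10.0, dt)
dy_of_t = f(t + dt) - f(t)           # shrinks as t grows (f' decreases)

# t-system seen from the y-system (eq. 2.2): constant dy, variable dt(y).
dy = 0.25
y = np.arange(0.0, 2.5, dy)
dt_of_y = f_inv(y + dy) - f_inv(y)   # grows as y grows

print(dy_of_t[:3], dt_of_y[:3])
```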
Figure 1:

Relationship between two time systems. Time itself is independent from any measurement system (top). It can be measured and represented within different measurement systems. We show here two systems: the t-system on the right and the y-system on the left (middle). Although they are shown here as general systems, they may be considered to correspond to conventional time and internal time, respectively, especially in the descriptions given in later sections. Each of the two systems has its own unit of time (Δ t and Δ y, respectively) that is constant within its own system. Once one system is represented by the other, however, the units of the first system may have different lengths in the second system (bottom). For example, once mapped onto the t-system, Δ y may have different lengths. Three different such mappings are shown.


Next, we illustrate the most typical or standard relationship between continuous-time and discrete-time TD formulations (Puterman, 1994; Bradtke & Duff, 1995; Doya, 2000) by showing the discrete-time TD error as a discretized approximation of the continuous-time TD error, although there are other formalisms, notably the semi-Markov approach (Puterman, 1994; Daw et al., 2006). The continuous-time value function is defined by

V(y) = ∫y∞ k(x − y) r(x) dx,    (2.3)

where k(y) = exp(−y/τ) and τ is a time constant. Differentiating this equation with respect to y (i.e., dV(y)/dy = (1/τ)V(y) − r(y)) and moving all terms to one side, the continuous-time TD error is given as

δ(y) = r(y) − (1/τ) V(y) + dV(y)/dy,    (2.4)

which is zero when the value function is consistent. Let us set Δy as the discretizing unit. We substitute dV(y)/dy ≈ (V(y + Δy) − V(y))/Δy in equation 2.4, together with one increment of Δy, and have

δ(y + Δy) Δy ≈ r(y + Δy) Δy + exp(−Δy/τ) V(y + Δy) − V(y).    (2.5)

This equation indicates the standard relationship of the two TD errors. Comparing equation 2.4 with equation 2.5, we see that both the reward function r(y) and the TD error δ(y) intrinsically have the dimension of a density (per time), whereas k(y) and V(y) are scalar functions. This affects discrete-time TD error expressions.
Note that to derive the standard relationship (see equation 2.5) from equation 2.4, the discretizing unit (Δy) was specifically chosen to be the same as the unit of the time system for the continuous-time value function in equation 2.3. It is generally possible, however, to choose a discretizing unit that is different from the value function's unit. In this case, using the discretizing unit from another time system, Δt, and writing y = y(t) = f(t), we have

δ(y(t + Δt)) Δy(t) ≈ r(y(t + Δt)) Δy(t) + exp(−Δy(t)/τ) V(y(t + Δt)) − V(y(t)), where Δy(t) = f(t + Δt) − f(t).    (2.6)

This equation represents a more general relationship of the discrete-time TD error with the continuous-time value function or TD error. It is noteworthy that in equation 2.6, the discretizing unit Δy(t) is now variable in the y-system.
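As a numerical illustration of equations 2.5 and 2.6 (our sketch, with a hypothetical f and an arbitrarily placed single reward), both discretizations give a vanishing TD error for the converged value function:

```python
import numpy as np

# Discretizing the same continuous-time value function (eq. 2.3) with the
# y-system unit (eq. 2.5) and with the t-system unit (eq. 2.6). f and the
# reward placement are hypothetical choices for illustration.
tau, R = 6.0, 1.0
f = lambda t: np.log1p(t)                # internal time y = f(t)
y1 = f(50.0)                             # a single reward at t = 50
V = lambda y: R * np.exp(-(y1 - y) / tau) * (y <= y1)   # converged value

# (a) Unit dy (eq. 2.5): constant discount exp(-dy/tau).
dy = 0.1
y = np.arange(0.0, y1 - dy, dy)
delta_a = np.exp(-dy / tau) * V(y + dy) - V(y)          # r = 0 before reward

# (b) Unit dt (eq. 2.6): variable dy(t), hence variable discount.
dt = 1.0
t = np.arange(0.0, 49.0, dt)
dy_t = f(t + dt) - f(t)
delta_b = np.exp(-dy_t / tau) * V(f(t + dt)) - V(f(t))

print(np.max(np.abs(delta_a)), np.max(np.abs(delta_b)))  # both ~0
```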

2.2.  Internal-Time and Conventional-Time TD Formulations.

The internal TD constructs its value function in internal time, whereas the conventional TD model constructs its value function in conventional time. Hereafter, we use y specifically to indicate internal time and t to indicate conventional time. To contrast the two TD models, we first briefly summarize conventional TD. Conventional TD does not distinguish y and t, thus effectively constructing its continuous-time value function in conventional time. When y = t is set in equation 2.3, the value function is

V(t) = ∫t∞ k(s − t) r(s) ds,

where we used s inside the integral to indicate integration in the t-system, while reserving x for integration in the y-system (e.g., equation 2.3). When we set y = t and Δy = Δt, the discrete-time TD error is given by equation 2.5. To write it in a familiar form, we first define an index function i(t) that returns the index of the discrete time steps, given a continuous time t (in the t-system); i(t) = ⌈t/Δt⌉, and in continuous time, i(t) refers to a period of continuous time [t − Δt, t]. Then the TD error is given by

δi(t+Δt) = ri(t+Δt) Δt + γ Vi(t+Δt) − Vi(t),    (2.7)

where we dropped Δt on the left-hand side (i.e., from δi(t+Δt) Δt) for simplicity and define the discount factor, γ, by

γ = exp(−Δt/τ).

Because Δt is conventionally hidden, equation 2.7 is usually written as

δi = ri + γ Vi − Vi−1,    (2.8)

where the subscript i abbreviates i(t + Δt) and we now have ri = ri(t+Δt) Δt, with the duration Δt absorbed into the reward.

Here Δt, being a fixed constant, plays dual roles. First, it acts as a duration unit, thereby connecting continuous time to discrete time and permitting the use of a hiding convention. Second, it is also an iteration or operation unit of the discrete-time TD model, as almost all TD operations are defined with respect to iteration; moreover, being constant, it also makes the discount factor constant. The TD error expressed in the hiding convention, equation 2.8, hides these issues.
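A minimal tabular sketch of equation 2.8 (the task layout and parameters are our own choices) makes the dual role of Δt concrete: it sets both the iteration step and, through γ = exp(−Δt/τ), the discount:

```python
import numpy as np

# Minimal tabular TD(0) in conventional time, in the hiding convention of
# equation 2.8. The task layout (20 steps, reward at the last step) and
# parameters are illustrative assumptions.
tau, dt, alpha = 6.0, 1.0, 0.1
gamma = np.exp(-dt / tau)        # constant because dt is constant
n_steps = 20
V = np.zeros(n_steps + 1)        # V[n_steps] is an absorbing zero state
r = np.zeros(n_steps)
r[-1] = 1.0                      # reward on the final step

for trial in range(500):
    for i in range(n_steps):
        delta = r[i] + gamma * V[i + 1] - V[i]   # equation 2.8
        V[i] += alpha * delta

# After learning, V[i] approaches gamma ** (n_steps - 1 - i):
# exponential discounting with the constant factor gamma.
```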

In contrast, distinguishing y from t, the internal TD is constructed using the internal time system. Here we consider y ≠ t in general or, at the very least, that the conventional TD's assumption y = t does not always hold. The continuous-time internal TD value function is given by equation 2.3. It is rewritten in the t-system as

V(f(t)) = ∫t∞ k(f(s) − f(t)) r(f(s)) f′(s) ds.

V(y), an exponentially discounted function in the y-system, is no longer necessarily exponentially discounted in the t-system. This simple example restates the fact that even before discretization, a function constructed in one time system can be expressed differently in another time system.

To see the discrete-time TD error of the internal TD (see equation 2.5) in a familiar form, we define another index function j(y) using the y-system (i.e., Δy), such that j(y) = ⌈y/Δy⌉ and j(y) refers to the period [y − Δy, y], and the discount factor is given by

γy = exp(−Δy/τ).

Then, the internal TD's discrete-time TD error is given by

δj(y+Δy) = rj(y+Δy) Δy + γy Vj(y+Δy) − Vj(y),    (2.9)

where Δy is dropped on the left-hand side. If we use the hiding convention, equation 2.9 is written as δj = rj + γy Vj − Vj−1, which, superficially, is identical to equation 2.8. However, they can be different TD errors, depending on the function f.
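The following sketch (again with a hypothetical f) illustrates how the hiding convention conceals the difference: the internal TD uses one constant γy per step, yet equal internal steps occupy very unequal conventional-time durations:

```python
import numpy as np

# In the hiding convention, the internal TD error looks identical to the
# conventional one; the difference lies in where the steps fall in
# conventional time. f(t) = log(1 + t) is a hypothetical example.
tau, dy = 6.0, 0.25
gamma_y = np.exp(-dy / tau)            # constant in the y-system
f_inv = lambda y: np.expm1(y)          # inverse of f(t) = log(1 + t)

y = np.arange(0.0, 3.0, dy)
dt_of_y = f_inv(y + dy) - f_inv(y)     # conventional-time duration per step
print(gamma_y)                          # one discount factor ...
print(dt_of_y[[0, -1]])                 # ... but very unequal durations
```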

2.3.  Different Expressions of Discrete-Time Internal TD Error.

Here we summarize different expressions and meanings of discrete-time TD errors of the internal TD (see Table 1). The internal TD's discrete-time TD error in equation 2.9 is discretized by using Δy and expressed in the y-system (see Table 1). This TD error may be regarded as a sort of proper discrete-time TD error of the internal TD, in the sense that the same time system (internal time) underlies both the value function construction and the discretizing unit. The same discrete-time TD error can still be expressed in the t-system, as given in Table 1 (bottom-left entry), which we can derive by using t = t(y) = f−1(y) and Δt(y), noting that Δt(y) is determined for any t by equation 2.2.

Table 1:
Different Expressions of Internal TD and Their Respective TD Errors.

                          Operator's View: Unit Δy                Observer's View: Unit Δt
Time System to Express    Duration           Discount Factor      Duration           Discount Factor
Internal time             Δy, constant       γy, constant         Δy(t), variable    exp(−Δy(t)/τ), variable
Conventional time         Δt(y), variable    γy, constant         Δt, constant       exp(−Δy(t)/τ), variable

The continuous-time internal TD value function can also be discretized using the unit of the t-system, Δt, as shown in equation 2.6. In this case, we use the index function for the t-system, i(t), and also note that Δy(t) = f(t + Δt) − f(t). Then equation 2.6 can be expressed in both the t- and y-systems as shown in Table 1 (right column).

2.4.  Different TD Error Expressions and Operator-Observer Problems.

We found that the discrete-time TD error expression of the internal TD depends on both the time system in which it is expressed (y- or t-system) and the time unit by which it is discretized (Δ y or Δ t), leading to 2 × 2 expressions (see Table 1). In contrast, the conventional TD error (see equation 2.7) is a special case of the internal TD, when we do not distinguish between the operator's and the observer's times (i.e., equating the two time systems, y=t, or effectively a linear function, y=ct, where c is a constant; see the appendix). In this case, all TD error expressions become equivalent to one another (see equation 2.7).

Dissociating the two time systems raises several issues. Most importantly, this dissociation, together with the internal TD formulation, makes what we call the operator-observer problem explicit: we should clarify whether the model is constructed using the operator's or the observer's point of view. Taking the operator's view, the model is used to directly formulate neural valuation processes per se. In contrast, taking the observer's view, the model is used to construct a description of the processes but not necessarily to directly formulate the processes themselves.

The time of the operator or neural valuation processes is by definition internal time, and therefore, we construct value functions in internal time, not in conventional time, under our internal TD formulation. Given continuous-time internal TD value functions, the distinction between the operator's and the observer's views differentiates what discrete-time internal TD error expressions should be used for modeling. The internal time unit (Δ y) and the conventional time unit (Δ t) have specific meanings for the internal TD error expressions: the former is the operator's unit and the latter is the observer's unit. The TD error in the operator's view is expressed differently, depending on whether it is expressed in the y- or t-system (left-side entries in Table 1). The discount factor is the same as γy in both expressions, while the duration is constant (Δ y) in internal time but variable (Δ t(y); see equation 2.2) in conventional time. The TD error in the observer's view is shown on the right side of Table 1. The discount factor is no longer constant but is a function of t. Duration is constant (Δ t) in the t-system but variable (Δ y(t)) in the y-system. The difference in the changes over the two time systems between the two views comes from the fact that r(y) and k(y) intrinsically have different dimensions. Because experimental observations are usually made, and thus fitted by TD models, mostly in conventional time, it is useful to compare the TD error expressions of the operator's and the observer's views in conventional time (bottom entries in Table 1). A succinct summary (but noting that internal time also affects the value function) is that the effect of internal time is expressed by a variable Δ t (y) in the operator's view and by a variable discount factor in the observer's view. This characteristic is partly related to the interchangeability between the discrete-time unit and the discount factor, which exists even within the conventional TD (see the appendix).
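The 2 × 2 pattern of Table 1 can be made concrete numerically; the sketch below (hypothetical f, arbitrary sample point) computes the duration and discount factor for one step under each view:

```python
import numpy as np

# The 2 x 2 expressions of Table 1, evaluated for one step at a sample
# point t0 = 4 under a hypothetical f(t) = log(1 + t).
tau, dt, dy = 6.0, 1.0, 0.25
f = lambda t: np.log1p(t)
f_inv = lambda y: np.expm1(y)
t0 = 4.0
y0 = f(t0)

# Operator's view (unit dy): constant duration dy and discount gamma_y in
# the y-system; variable duration dt(y) in the t-system.
gamma_op = np.exp(-dy / tau)
dt_of_y = f_inv(y0 + dy) - f_inv(y0)

# Observer's view (unit dt): constant duration dt in the t-system;
# variable duration dy(t), hence a variable discount, in the y-system.
dy_of_t = f(t0 + dt) - f(t0)
gamma_obs = np.exp(-dy_of_t / tau)

print(dy, gamma_op, dt_of_y, dt, dy_of_t, gamma_obs)
```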

3.  Implications of Internal TD for Neural Valuation

Here we first relate the time representations most typically used in the previous TD models to different expressions of internal TD error in section 3.1. We then investigate the effect of internal time on discounting in an intertemporal choice task in section 3.2.

3.1.  Relationships of Different Time Representations of Previous TD Models to Different Internal TD Error Expressions.

3.1.1.  Previous TD Models Using Different Time Representations.

The internal TD formulation provides us with the flexibility to unify the different time representations of previous TD models used in neural valuation studies. Among these, we focus on three representative approaches: the equally dissected, event-dissected, and semi-Markov models (see Table 2), based on their characterizations in conventional time, because the distinction between internal and conventional time was not made in previous studies. The three approaches are summarized based on how they treat time as well as on their forms of TD error.

Table 2:
Three Types of Previously Used Time Representations.

Type                Duration   Discount Factor   TD Error
Equally dissected   Δt         γ                 δi = ri + γVi − Vi−1
Event-dissected     Δt(y)      γy                δk = rk + γyVk − Vk−1
Semi-Markov         Δt(y)      γ^(Δt(y)/Δt)      δk = rk + γ^(Δt(y)/Δt)Vk − Vk−1 (equation 3.1)

The equally dissected model uses an equally sized or constant unit for time representation in conventional time, thus maintaining a rigid relationship between a discrete-time step and a small continuous-time period. It thus uses Δy = Δt or, equivalently, y = t, and the corresponding TD error is given by equation 2.7 (see Table 2). A representative model in this approach is often called a tapped-delay-line model (e.g., Desmond & Moore, 1988; Sutton & Barto, 1990). The tapped-delay-line model treats time representation as a part of the stimulus representation and has been adopted for modeling DA activity by TD models since it was first proposed (Houk, Adams, & Barto, 1995; Montague et al., 1996; Schultz et al., 1997), and it has been extensively used in subsequent studies, including the examination of various neurophysiological and functional magnetic resonance imaging (fMRI) studies on neural valuation (Suri, 2001; Suri & Schultz, 2001; O'Doherty, Dayan, Friston, Critchley, & Dolan, 2003; Tanaka et al., 2004; Pan, Schmidt, Wickens, & Hyland, 2005).

It is also common for the discrete time steps of a TD model (and related models) to be adjusted by events in experiments; in the simplest case, sometimes only salient external events are mapped to the TD model's steps. We call this modeling approach an event-dissected model. This approach has also been extensively used to analyze both fMRI and neurophysiological studies (Nakahara, Doya, & Hikosaka, 2001; Nakahara, Itoh, Kawagoe, Takikawa, & Hikosaka, 2004; Bayer & Glimcher, 2005; Glascher & Buchel, 2005; Haruno & Kawato, 2006; Seymour, Daw, Dayan, Singer, & Dolan, 2007; Lohrenz, McCabe, Camerer, & Montague, 2007). An increment of the model's discrete time step is adjusted to the events of the experiment so that computations of the variables of interest are adjusted accordingly. Thus, the time duration between events can be variable in the model, while the discount factor is kept constant, regardless of the duration between events. If we denote the duration of each event, eventk, in the t-system by Δtk, we can write Δtk = tk+1 − tk, where yk and tk refer to the onset of eventk in the y- and t-systems, respectively. Then we can write

Δt(yk) = Δtk, with yk+1 = yk + Δy,

and the corresponding TD error is then given as in Table 2.
Daw et al. (2006) introduced a type of semi-Markov formulation to model neural valuation. This formulation was combined with approaches using average TD formulations (Daw & Touretzky, 2002; Tsitsiklis & Van Roy, 2002). First, we examine the nature of the semi-Markov property. In this model, the probabilistic structure of the time between the events of an experiment is given. Each event of the experiment, eventk, causes a state transition. At the transition, the duration of the state, Δtk, is probabilistically determined, where the underlying probability distribution is prespecified (possibly differently) with respect to each eventk. Thus, Δtk is written as Δtk = zk, where zk is a sample of a random variable Zk of the prespecified distribution Pk. The semi-Markov model is then expressed as an internal TD by

Δt(yk) = zk, with yk+1 = yk + Δy.

In the case of discounted rewards, the discount factor is arranged to reflect the variable durations (Puterman, 1994; Bradtke & Duff, 1995) (and the reward as well, which we omit in the following) by using the constant unit Δt, so the discount factor is re-expressed as γ^(zk/Δt), and thus the corresponding TD error (see Table 2) is given by

δk = rk + γ^(zk/Δt) Vk − Vk−1.    (3.1)
As mentioned earlier, Daw et al. (2006) did not use this TD error but rather a TD error in the average TD. This complicates the direct comparison of their semi-Markov average TD formulation with our internal TD formulation. Hence, we examine the semi-Markov approach under the discounted case in the subsequent sections (but also see section 5).

3.1.2.  Remarks on Different Expressions for Internal TD Error and on Time Representations in Previous TD Models.

The three models just discussed collectively represent the approaches most typically used for time representation when the operator's and the observer's times are not dissociated. If we forced the discrete-time conventional TD model to have a rigid relation to continuous time, it would be an equally dissected TD model. For example, the TD or related models used in fMRI studies are considered to be of this type if all sampled time points of the model have equal durations, including the intertrial interval (ITI). The model's time representation implies that any variability in the timing of events in an experiment affects the behavior of the TD model, possibly making it difficult for the model to account for neural data. A simple example can be seen in experiments with a variable ITI (Daw et al., 2006). This is usually avoided by using an episodic task schedule (Sutton & Barto, 1998), whereby the model's simulation is run ignoring the variable ITI. An equally dissected model with such an episodic schedule becomes similar to an event-dissected model once the ignored variable ITI is treated by the latter model as part of the simulation, with variable durations. Accordingly, the distinction between these two models is often blurred. If we let the discrete-time conventional TD model have a rather flexible relation to duration, it would become an event-dissected or semi-Markov model (see Table 2); it is possible to have an event-dissected model determine the variable duration in the same way as the semi-Markov model and, conversely, to let the semi-Markov model determine the duration using the same method as the event-dissected model. Therefore, the most critical difference between the semi-Markov and event-dissected models is whether the discount factor is adjusted by each duration (measured in conventional time). The semi-Markov model treats a unit of operation (or duration) in the same way as the event-dissected TD does but computes the discount factor using conventional instead of internal time.
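The contrast between the event-dissected and semi-Markov treatments of the discount factor can be summarized in a few lines (the durations are made-up examples; Δy is set to Δt for the event-dissected discount):

```python
import numpy as np

# Event-dissected vs. semi-Markov discounting over the same event sequence.
# The durations z_k are made-up examples; tau and dt are as in our other sketches.
tau, dt = 6.0, 1.0
z = np.array([2.0, 5.0, 11.0])            # durations between successive events

gamma_event = np.exp(-1.0 / tau) * np.ones_like(z)  # constant per event step
gamma_semi = np.exp(-dt / tau) ** (z / dt)          # adjusted by each duration
print(gamma_event)   # [0.846 0.846 0.846]
print(gamma_semi)    # [0.717 0.435 0.160]
```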

Note that the internal TD errors from the operator's view, expressed in internal time (see Table 1), may correspond to the TD error of the equally dissected internal TD (see Table 2). Once re-expressed in conventional time, the equally dissected internal TD error has the same form as the event-dissected TD error (see Table 2). This implies that the event-dissected TD model is a subclass of the equally dissected internal TD model, because it specifically defines the operator's unit as corresponding to duration among external events. Thus, this provides one justification, albeit only partially, for using the event-dissected TD model to examine neural valuation; the model makes an explicit choice using variable durations as units of the neural operation process.

3.2.  Value Function Discounted Exponentially in Short Delay and Hyperbolically in Long Delay.

The TD framework (in the case of discounted rewards) suggests that a temporally distant reward is valued with exponential discounting (Sutton & Barto, 1998). More generally, exponential discounting has theoretically favored properties, often regarded as “rational” (Samuelson, 1937; Montague & Berns, 2002; Mazur, 2006). Studies in psychology, neuroscience, and economics, however, often indicate that the subject's behavior reveals not exponential but hyperbolic discounting (Ainslie, 1975; Thaler & Shefrin, 1981). Questions regarding whether exponential or hyperbolic discounting is appropriate have been long debated (Loewenstein & Prelec, 1992; Frederick, Loewenstein, & O'Donoghue, 2002; Berns, Laibson, & Loewenstein, 2007), and recently fMRI studies, in which investigators directly compare behavior with neural (BOLD) activity in different brain areas, have highlighted these issues (Montague, King-Casas, & Cohen, 2006).

A hallmark of hyperbolic discounting is choice reversal in intertemporal choice tasks (Ainslie, 1975; Thaler & Shefrin, 1981; Laibson, 1997; Frederick et al., 2002). The basic nature of hyperbolic discounting, compared to exponential discounting, is that the discount rate is relatively steep at short delays but shallow at long delays. This can account for choice reversal, as detailed below. On the other hand, Schweighofer et al. (2006) advanced an interesting, contrary viewpoint, in which they used a variant of an intertemporal choice task, but with a much shorter time range, and found that discounting is not hyperbolic but exponential. Given these studies, we first ask whether the internal TD can exhibit something similar to hyperbolic discounting in conventional time: more specifically, exponential discounting at short delays but hyperbolic discounting at long delays in conventional time. We show in section 3.2.2 that this is the case and also that choice reversal occurs.

At first glance, this finding may appear at odds with several fMRI studies investigating discounting functions (Tanaka et al., 2004; McClure, Laibson, Loewenstein, & Cohen, 2004; McClure, Ericson, Laibson, Loewenstein, & Cohen, 2007). Those studies showed that multiple brain areas generally play a role in such a decision and further suggested that different areas play a dominant role in decision making on different timescales: for example, two valuation subsystems, in which one is more involved in immediate reward and the other in delayed reward. This suggestion was made in relation to the view that even when each subsystem uses exponential discounting, hyperbolic discounting may occur if multiple subsystems are involved in the decision (Laibson, 1997; Montague et al., 2006; Fudenberg & Levine, 2006; Berns et al., 2007). They thus examined whether two subsystems, each using exponential discounting, can account for the behavior in an intertemporal choice task better than a single valuation system using exponential discounting (McClure et al., 2007; Glimcher, Kable, & Louie, 2007). In contrast, our findings (shown in section 3.2.2) indicate that under the internal TD formulation, single exponential discounting (in internal time) could induce a hyperbolically discounted behavior (in conventional time), suggesting that multiple subsystems are unnecessary. This is broadly consistent with an alternative view based on experimental findings (Kable & Glimcher, 2007), in which the authors claim that there are no differentially activated multiple systems but instead a single system, over several brain areas, that directly encodes a subjective or hyperbolically discounted value (but see also a recent criticism: Hare, O'Doherty, Camerer, Schultz, & Rangel, 2008). The internal TD might underlie the neural process of the single system.

Nevertheless, we consider that even when the internal TD is the underlying neural mechanism, our finding does not necessarily contradict the possible existence of multiple subsystems in neural valuation (Laibson, 1997; Tanaka et al., 2004; McClure et al., 2004, 2007; Fudenberg & Levine, 2006). To make this point, we address two questions in section 3.2.3, assuming that the internal TD underlies or generates the behavioral data. The first question is whether the summation of two exponential discounting functions produces a better fit to the data than single exponential discounting, as reported in previous experimental studies. Second, we present one way to decompose the internal TD into multiple subsystems in section 3.2.1 and ask whether the subsystems show differential dominant roles at different timescales, as reported in previous experimental studies.

3.2.1.  Mathematical Formulation.

Here we summarize the mathematical formulations used in both sections 3.2.2 and 3.2.3. We first construct, based on mathematical intuition, k(f(t)), which yields the discounting function discussed above. If k(f(t)) behaves approximately like

k(f(t)) ≈ exp(−t/τ) at short delays and k(f(t)) ≈ 1/(1 + kt) at long delays,    (3.2)

we obtain our desired k(f(t)). Here, it is assumed that the origin, when t is (nearly) equal to zero, is the time of valuation. A good approximation for such a construction (Zhang, 1996) is that given g(t) = log(1 + exp(t)), we have g(t) ≈ t for a sufficiently large t > 0 and g(t) ≈ exp(t) ≈ 0 for a sufficiently small t < 0, and thus t − a g((t − c)/a) behaves like t below a crossover point c and like c above it. Hence, we expect that the following discounting function,

k(f(t)) = exp(−t/τ + a g((t/τ − log(1 + kt) − b)/a)),

with appropriate scaling and translating parameters, a and b, behaves similar to equation 3.2. The corresponding internal time y = f(t) is then given by

y = f(t) = t − τ a g((t/τ − log(1 + kt) − b)/a),

which is used in both sections 3.2.2 and 3.2.3.
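A numerical sketch of this construction (using the form of f as reconstructed above; parameters follow Figure 2) shows the two regimes of equation 3.2:

```python
import numpy as np

# Numerical check of the construction above (f as reconstructed here;
# parameters follow the Figure 2 legend).
tau, k, a, b = 6.0, 0.3, 0.31, 1.8
g = lambda u: np.log1p(np.exp(u))
f = lambda t: t - tau * a * g((t / tau - np.log1p(k * t) - b) / a)
D = lambda t: np.exp(-f(t) / tau)        # discounting in conventional time

t_short = np.array([0.0, 1.0, 2.0, 5.0])
t_long = np.array([60.0, 80.0, 100.0])
# Short delays: D tracks exp(-t/tau) closely.
print(np.c_[D(t_short), np.exp(-t_short / tau)])
# Long delays: D decays in proportion to 1/(1+k*t) (hyperbolic-like tail).
print(np.c_[D(t_long), 1.0 / (1.0 + k * t_long)])
```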
Second, regarding the second question in section 3.2.3, we here formulate a simple decomposition of a value function into multiple subsystems. Let us consider a set of n subsystems, each of which contributes to a single value function V(y), with a set of weighting functions, wi(y) (i = 1, …, n), normalized by Σi wi(y) = 1. Then a single value function V(y) can be decomposed as V(y) = Σi Vi(y), where the value function of each system i, Vi(y), is given by

Vi(y) = wi(y) V(y).

When wi(y) varies depending on y, Vi(y) is expected to contribute to V(y) differentially at different timescales. Note that the TD error of each subsystem, denoted by δi(y), can be defined by

δi(y) = ri(y) − (1/τ) Vi(y) + dVi(y)/dy,    (3.3)

where ri(y) = wi(y) r(y) − (dwi(y)/dy) V(y), so that Σi δi(y) = δ(y). Each subsystem can acquire its own value function using this TD error. This subsystem TD error is an interesting quantity in its own right and worthy of future investigation. As a specific example of the decomposition, we use a sort of softmax, letting wi(y) = exp(vi(y)) / Σj exp(vj(y)). In the example in section 3.2.3, we further simplify this equation using v0 for n = 2:

w1(y) = 1 / (1 + exp(v0(y))), w2(y) = 1 − w1(y).

Below, we used the simplest choice: v0(y) = cy, where c is a positive constant.
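As a small sketch of this decomposition (the value of c = 0.5 is our arbitrary choice, and V is the single-reward value function of section 3.2.2 with R = 1):

```python
import numpy as np

# Sketch of the two-subsystem decomposition with the softmax-style weights
# and v0(y) = c*y; c = 0.5 is an arbitrary illustrative choice.
tau, c = 6.0, 0.5
V = lambda y: np.exp(-y / tau)               # delay y to reward, internal time
w1 = lambda y: 1.0 / (1.0 + np.exp(c * y))   # n = 2 simplification
w2 = lambda y: 1.0 - w1(y)

y = np.linspace(0.0, 20.0, 5)
V1, V2 = w1(y) * V(y), w2(y) * V(y)          # V = V1 + V2 by construction
# w1 decays with y, so V1's contribution is concentrated at short delays
# while V2 carries the long delays (cf. Figures 3B and 3C).
print(np.c_[V1, V2])
```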

3.2.2.  Discount Function and Choice Reversal.

Here, we use a single reward case, in which a reward of magnitude R is given at time y1 (in the y-system), so that V(y; y1) = R exp(−(y1 − y)/τ), where y1 indicates the time of reward and y indicates the time of valuation; below, we also use y0 and t0, y0 = f(t0), as the origin. In this case, the value function becomes equivalent to the discounting function by setting R=1, which is done here. V(y; y1) is reexpressed as

V(f(t); f(t1)) = exp(−(f(t1) − f(t))/τ),    (3.4)

where a specific form of f is chosen based on mathematical considerations (see section 3.2.1).

In Figure 2A, the discounting function is shown (solid line) in conventional time. It is computed using V(f(t0); f(t1)) as a function of t1, where the time of origin t0 is the time of valuation. The discounting function has a steep decay over a short period, approximately exponentially discounted, but a slow decay over a long period, approximately hyperbolically discounted. In Figure 2B, y=f(t) is shown as a solid line; y is almost linearly proportional to t in a short period, whereas it becomes logarithmic in a long period. In terms of Δy(t) (dashed line), it becomes increasingly small for large values of t. These effects underlie the coappearance of exponential and hyperbolic discounting at different delays in V(y)=V(f(t)).

Figure 2:

A value function of internal TD exhibits coappearance of exponential and hyperbolic discounting in conventional time. (A) When expressed in conventional time t, the discounting function of the internal TD (solid line) exhibits exponential discounting at shorter delays but is hyperbolic at longer delays. For comparison, exponential (exp(−t/τ); dotted line) and hyperbolic (1/(1+kt); dashed line) discounting functions are also shown. The two insets provide magnified views of the periods encompassing the shorter (dark gray shading) and longer (light gray shading) delays. Units for both value magnitude and time are arbitrary, which also applies to the other panels in this figure as well as to Figures 3 to 5. Parameters used in the figures are τ = 6, k=0.3, a=0.31, and b=1.8. They are manually set. We first set τ = 6 and k=0.3. We then searched for values of a and b in two steps. Using least squares, we first obtained a=0.31 and b=1.22, so that the internal TD discount function approximated the exponential and hyperbolic discount functions at their corresponding delays, [0, 5] and [60, 100], respectively. Then, setting a=0.31 and forcing the initial condition y=f(t)=0 at t=0 resulted in b=1.8. (B) The relation of internal time to conventional time is shown by y=f(t) (solid line, left axis) or Δy(t) (dashed line, right axis, where Δt is set to 1). (C) Choice reversal with hyperbolic discounting. Two choices are shown, choices A (dashed lines) and B (solid lines). A is chosen when the time of evaluation is t=50, whereas B is chosen when the time of evaluation is t=0. (D) Choice reversal similarly occurs for the same two choices A and B with the internal TD value function shown in A, expressed in conventional time. (E, F) The discounting curves of the same internal TD are now shown in internal time, separately for the case when the time of evaluation is t=0 in E and t=50 in F. See the text for an explanation of the inset in E. The inset in F shows the discounting curves of the main panel in conventional time.

Next, we show that choice reversal occurs with this value function in an intertemporal choice task. In this task, the subject is asked to choose one of two options: A or B. Each choice has a pair of reward magnitude and delay, denoted as (R, dt); for example, choices A and B are set as (R, dt)=(10, 50) and (50, 80), respectively (see Figure 2C). In this example, the time delay to the earlier choice (henceforth called the initial delay) is 50. Choice reversal refers to the behavioral phenomenon whereby the subject reverses the choice as the initial delay increases (Ainslie, 1975; Thaler & Shefrin, 1981). In contrast with the above example, consider another example where the initial delay is 0 so that choices A and B are set as (R, dt)=(10, 0) and (50, 30), respectively. Let us call the first and second examples the "longer" and "shorter" cases, respectively, referring to the length of the initial delay. Suppose the subjects choose A in the shorter case. With exponential discounting (in the t-system), A is then also chosen in the longer case, and thus there is no choice reversal, due to the shift-invariant property of exponential discounting (i.e., V(t; t1) = V(t − c; t1 − c), where c indicates a constant time shift in t). In contrast, hyperbolic discounting can account for choice reversal (see Figure 2C). In this discounting curve, the value of the curve at t is read as the discounted value when t is the time of valuation. Therefore, for the shorter case, we compare the discounted value magnitudes of choices A and B at t=50 and see that A is chosen. For the longer case, at t=0, B is now chosen, and thus the choice is reversed.

The discounting curve of the internal TD value function (see equation 3.4) is shown in conventional time (t-system) in Figure 2D. By reading the discounting curve in the same way as in Figure 2C, we see that choice reversal occurs between the two cases. We show the discounting curves of the same internal TD in internal time (y-system) for the longer case in Figure 2E and the shorter case in Figure 2F. The curves are exponentially discounted in internal time; nevertheless, reading them at the time of evaluation, y=0, we see that B is chosen in the longer case, whereas A is chosen in the shorter case, indicating a choice reversal. In the y-system, the reversal occurs because the delays of the two choices differ considerably in internal time between the two cases, although their difference is constant in conventional time (i.e., 80−50=30−0=30).
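The choice-reversal computation can be replayed directly (using the f reconstructed in section 3.2.1 and the choice pairs of Figure 2; delays are measured from the evaluation time, which re-anchors the origin, as discussed below):

```python
import numpy as np

# Replaying the choice-reversal computation with the internal TD value
# function (f as reconstructed in section 3.2.1; choices as in Figure 2).
tau, k, a, b = 6.0, 0.3, 0.31, 1.8
g = lambda u: np.log1p(np.exp(u))
f = lambda t: t - tau * a * g((t / tau - np.log1p(k * t) - b) / a)
value = lambda R, delay: R * np.exp(-f(delay) / tau)

# Choices as (reward magnitude, reward time): A = (10, 50), B = (50, 80).
for t0, case in ((50.0, "shorter (evaluated at t = 50)"),
                 (0.0, "longer (evaluated at t = 0)")):
    vA = value(10.0, 50.0 - t0)   # delays measured from the evaluation time,
    vB = value(50.0, 80.0 - t0)   # which re-anchors the origin of f
    print(case, "-> choose", "A" if vA > vB else "B")
# Output: A is chosen in the shorter case, B in the longer case (reversal).
```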

Let us clarify the relationships of the discounting curves expressed in internal and conventional time. The curve in the y-system (see Figures 2E and 2F) is given by V(y; y1) as a function of y, representing the time of valuation from y0 to y1. For the curve in the t-system (see Figure 2D), first recall that the mapping function y=f(t) is defined by assuming that the time of valuation is the origin. Hence, the curve is given by V(f(t0); f(t1 − t)) as a function of t, so that we can directly read the discounted value from Figure 2D when the time of evaluation is at t. The two expressions agree at the origin, where y0 = f(t0). In general, however, we have

V(f(t0); f(t1 − t)) ≠ V(f(t); f(t1)).

The inset of Figure 2E shows V(f(t); f(t1)), which is clearly different from V(f(t0); f(t1 − t)) in Figure 2D. The inset of Figure 2F shows V(f(t0); f(t1 − t)), which corresponds to the curve in the main panel of Figure 2F and is equivalent to a curve that is obtained by resetting t0 to 50 in Figure 2D. This serves as a good reminder that the time of valuation must be carefully considered when defining y=f(t) (see section 4.2).

3.2.3.  Discount Functions with Multiple Systems.

Here we address the two questions stated above, assuming that the internal TD discounting curve underlies behavior. First, can the internal TD be better approximated by two subsystems, each of which follows exponential discounting (in the t-system), compared to a single exponential discounting system (in the t-system)? To compare the two fits, we used the Akaike information criterion, as the double exponential is a larger-class statistical model of the single exponential. We found that the two subsystems with double exponential discounting fit the internal TD discounting curve significantly better than did the single exponential discounting (Figure 3A, see the legend for details of this fit). Thus, when the internal TD underlies the behavioral data and if the data are examined as in the previous experiments (McClure et al., 2007; Glimcher et al., 2007), the discounting curve of the data could be judged as consistent with double rather than single exponential discounting (McClure et al., 2007; Glimcher et al., 2007).

Figure 3:

Internal TD value function based on multiple valuation systems. (A) For comparison of fits to the internal TD, mean values of the Akaike information criterion (AIC) for single and double exponential functions are shown, taking residuals of the fits as gaussian noises (and thereby excluding the common terms in the two values). Error bars indicate variance. These were obtained using 100 data sets, resulting in AICs of −12.7 (0.29) for the single and −45.9 (0.20) for the double exponential, respectively. Each data set of nine points was generated by approximately following a typical setting in experiments, that is, a small number of samples. For each data set, the duration was divided into nine contiguous subintervals on a logarithmic scale (using the logspace command in Matlab), and one point was sampled randomly from each of the nine subintervals. Each data set was then fitted to the single and double exponentials using nonlinear least squares. (B) Value function is decomposed into two subsystems, indicated as V1 (dotted line) and V2 (dashed line). (C) Normalized values of the two subsystems (as Vi/(V1+V2)) are shown for the earliest and latest periods.

Second, can the internal TD value function be constructed from multiple subsystems? If so, does each subsystem show a differential dominant role at different timescales, as shown in the previous experiments? The internal TD value function was first decomposed into the value functions of two subsystems (see Figure 3B and section 3.2.1), V1 and V2. V1 predominantly contributed to representing the value function at short delays, whereas V2 predominantly contributed to representing the value function at long delays (see Figures 3B and 3C). Thus, each subsystem contributed differentially to valuation at different timescales and behaved similarly to the previously described multiple subsystems (Laibson, 1997; Tanaka et al., 2004; McClure et al., 2004; Montague et al., 2006; Fudenberg & Levine, 2006; Berns et al., 2007; McClure et al., 2007). On the other hand, neither of the subsystems followed exponential discounting (the inset in Figure 3B), partly because we did not perform any fine-tuning. It would be interesting to see if there is a case of decomposition where each subsystem approximately follows exponential discounting. From another viewpoint, however, it is not necessary to have each subsystem follow exponential discounting, as this theoretically favored property is preserved by the internal TD as a whole in internal time. The division of the internal TD value function into subsystems may reflect the different neural properties of each subsystem (Berns et al., 2007). It is noteworthy that each subsystem can possibly learn its value function by directly using its own TD errors (see equation 3.3).
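A sketch of the Figure 3A model comparison (our simplified re-creation: the sampling scheme and initial guesses are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

# Re-creation sketch of the Figure 3A comparison: sample nine delays on a
# log scale, evaluate the internal TD discounting curve (reconstructed f),
# and compare single- vs. double-exponential fits by AIC (gaussian residuals,
# common terms dropped).
tau, k, a, b = 6.0, 0.3, 0.31, 1.8
g = lambda u: np.log1p(np.exp(u))
f = lambda t: t - tau * a * g((t / tau - np.log1p(k * t) - b) / a)

edges = np.logspace(0.0, 2.0, 10)             # nine subintervals over [1, 100]
t = np.array([np.random.uniform(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])])
d = np.exp(-f(t) / tau)

single = lambda t, a1, l1: a1 * np.exp(-l1 * t)
double = lambda t, a1, l1, a2, l2: a1 * np.exp(-l1 * t) + a2 * np.exp(-l2 * t)
aic = lambda rss, n_par: len(t) * np.log(rss / len(t)) + 2 * n_par

p1, _ = curve_fit(single, t, d, p0=[1.0, 0.2], maxfev=10000)
p2, _ = curve_fit(double, t, d, p0=[0.5, 0.5, 0.5, 0.05], maxfev=10000)
rss1 = np.sum((d - single(t, *p1)) ** 2)
rss2 = np.sum((d - double(t, *p2)) ** 2)
print(aic(rss1, 2), aic(rss2, 4))  # double exponential should score lower
```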

4.  Further Remarks on Internal TD's Implications

4.1.  Effect of Internal Time's Noise on TD Error.

If the internal time process is confounded with noise, internal TD behaves differently from conventional TD, even when the units of the two times are the same (Δy = Δt). Indeed, recent experimental studies (Fiorillo, Newsome, & Schultz, 2008; Kobayashi & Schultz, 2008) indicate, contrary to the prediction of the conventional tapped-delay-line TD model, that phasic DA activity, the putative TD error, appears at reward onset and grows as the time interval between a conditioned stimulus (CS) and reward onset (called the interstimulus interval, ISI) increases, even when the ISI is fixed for a given CS. Conceptually, such a possibility has already been raised (Gallistel & Gibbon, 2000; Montague & Berns, 2002) with the suggestion that increases in the ISI induce more uncertainty in reward prediction. We now consider a situation in which only noise differentiates the y-system and the t-system. We first provide a mathematical formulation and then show simulation results. The purpose of the simulations was to show the basic phenomenon, and we did not fine-tune them.

4.1.1.  Mathematical Formulation.

An internal time y can be uniquely mapped to a conventional time t by t = f⁻¹(y), but in the presence of noise, this mapping is no longer one-to-one. Let us denote such a noisy internal time by ỹ (and the noisy counterpart of the mapping by f̃). In general, we have ỹ ≠ f(t). If we regard ỹ as a function of t, ỹ(t), and let ỹ(t) follow a probability density P, ỹ(t) ∼ P(ỹ | t), the value function is then given by the expectation E[V(ỹ(t))] under P, which is different from the one without noise. Similarly, TD error (in the operator's view, expressed in the t-system) is given by
$$\delta(t) = r\big(\tilde{y}(t) + \Delta y\big) + \gamma_y V\big(\tilde{y}(t) + \Delta y\big) - V\big(\tilde{y}(t)\big).$$
To show the effects of noise on TD error, we consider a simple case using the discrete-time formulation. We let f be the identity, y = f(t) = t. An additive noise, denoted by ϵ, is assumed, given by ỹ = y + ϵ, where ϵ is assumed to be independent and identically distributed (i.i.d.) across increments, with zero mean. The noise occurs in each increment of the discrete step and accumulates over all of the increments. Denoting the number of increments at t by i(t), we have ỹ(t) = t + ∑_{j=1}^{i(t)} ϵ_j. The TD error in the observer's view, expressed in the t-system, is thus given by
$$\delta(t) = \mathbb{E}_{\{\epsilon_j\}}\Big[\, r\big(\tilde{y}(t) + \Delta y\big) + \gamma_y V\big(\tilde{y}(t) + \Delta y\big) - V\big(\tilde{y}(t)\big) \Big].$$

4.1.2.  Simulation Results.

As an example, we simulate a simple Pavlovian task (Fiorillo et al., 2008; Kobayashi & Schultz, 2008) using episodic schedules (see Figure 4A). With noise, the internal time of reward in each trial, denoted by ỹR, varies around the actual conventional time of reward yR (= tR), and this induces TD errors (see Figure 4B). TD error appears at the time of reward, and the magnitude of the error increases as the ISI increases; the corresponding TD error at the time of the CS decreases. These results match the basic findings of the recent studies (Fiorillo et al., 2008; Kobayashi & Schultz, 2008).
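A minimal simulation sketch (ours, not the authors' code; the tabular episodic TD(0) structure is an assumption, with parameters taken from the Figure 4 caption) reproduces this qualitative pattern:

    import numpy as np

    def run_isi(isi, n_learn=500, n_probe=2000, gamma_y=0.95, alpha=0.1,
                p_err=0.003, seed=0):
        # Episode starts at CS onset (internal-time state 0); reward arrives
        # isi conventional-time increments later.
        rng = np.random.default_rng(seed)
        n = isi + 8                          # a few post-reward states
        V = np.zeros(n + 2)
        d_reward = 0.0
        for trial in range(n_learn + n_probe):
            y = 0
            for t in range(n):
                step = 1
                if rng.random() < p_err:     # counting error after CS: +/-1
                    step += rng.choice([-1, 1])
                y_next = min(max(y + step, 0), n + 1)
                r = 1.0 if t == isi else 0.0
                delta = r + gamma_y * V[y_next] - V[y]
                V[y] += alpha * delta
                if trial >= n_learn and t == isi:
                    d_reward += delta        # TD error at the reward time
                y = y_next
        return V[0], d_reward / n_probe      # CS-time value, mean reward-time TD error

    for isi in (5, 10, 20, 40):
        v_cs, d_r = run_isi(isi)
        print(f"ISI={isi:2d}: value at CS={v_cs:.3f}, TD error at reward={d_r:.3f}")

As the ISI grows, the accumulated counting error spreads the internal time of reward, so the mean TD error at reward time increases while the value (and hence the phasic response) at the CS decreases.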

Figure 4:

Noise in internal time induces TD error. (A) Conditioning experiment. The conditioned stimulus (CS) is followed by reward (R) with different interstimulus intervals (ISIs). (B) TD errors appear at both CS and R times with different magnitudes depending on the length of the ISI. (C) Distribution of the reward onset time in internal time. (D) Mean value function around the time of the reward for all ISIs with noise (and without noise as reference) in internal time increments. Simulation was done with each ISI condition as an episodic task. For each ISI, the time of CS was set to 10, and the time of R was set according to the ISI; different ISIs were simulated, as shown in the legend of the figures. We set Δ y = Δ t, with f taken as the identity (see section 4.1.1). Noise ϵ was given at each increment (only after CS presentation) as a counting error at a rate of 0.003 per increment, taking ±1 when an error occurred. Both TD errors and distributions were obtained after learning had converged (typically after 500 trials) and are shown as the average of 2000 further trials. We set γy = 0.95 and the learning rate to 0.1.


The distribution of ỹR is shown in Figure 4C. It is unimodal and symmetric, owing to the symmetric noise assumed in the simulation. As the ISI increases and the noise accumulates, the distribution of ỹR widens, resulting in a larger TD error at the time of reward (see Figure 4B). TD error also shows suppressive dips just before and after the time of reward (see Figure 4B). This can be understood from the value function becoming more diffuse around the time of reward as the ISI increases (see Figure 4D), where the value function in the no-noise case is shown by the thick dashed line for comparison.

This simulation result is still primitive but clearly demonstrates that the internal TD model produces phasic DA activity at reward onset as the ISI increases, even when reward is perfectly timed in conventional time. Temporal imprecision has been suggested as a cause of this DA activity in experimental studies (Fiorillo et al., 2008; Kobayashi & Schultz, 2008), and based on this notion, a mathematical description of this DA activity has been developed (Fiorillo et al., 2008). Relatedly, although the units were the same in the two times (Δ y = Δ t) in this simulation, in reality the internal time unit might also increase at a longer ISI or later within the ISI (Δ y > Δ t) (Fiorillo et al., 2008). The suppressive dips of DA activity in the simulation have not been experimentally observed, to our knowledge, but a gradual decrease in baseline DA activity before reward onset has been reported (see Figure 5 of Fiorillo et al., 2008). It would be interesting to extend the internal TD model by taking all of these features into account.

Figure 5:

Effect of internal time modulation. (A) The internal value function is increasingly discounted as internal time evolves faster (when f(t) is larger). Two different internal TD value functions (a) and their corresponding internal time evolutions (b) are shown. The internal time mapping is given by yi=fi(t)=f(t; ai, bi) (i=0, 1) (the form of f is the same as the one used in Figure 2B), where (a0, b0)=(0.31, 1.8) and (a1, b1)=(0.8, 0.7). (B) Internal time modulation through ḟ(t) as a variable duration Δ t(y) (a) and a variable discount factor γt (b), in which four curves are plotted using different time constants (τ), as shown in the legend.


4.2.  Internal Time Construction.

So far, internal time has been introduced statically: y = f(t) is given, and its rate of change ḟ(t) (= dy/dt) is derived accordingly. In this section, we discuss the basic notions for constructing internal time dynamically. In fact, the Pavlovian case in the previous section is such an example, although the only dynamic factor there was the noise.

4.2.1.  Dynamic Construction of Internal Time.

A dynamic construction sets ḟ(t) first and then induces f(t), that is,
$$f(t) = \int_0^t \dot{f}(s)\, ds.$$
In this dynamic view, ḟ(t) is an active process that can potentially be modulated by several factors (e.g., external events, internal processes, or noise). "Present" time has a special status within all conceptions of time, given that all humans and animals live, by definition, in their present time. At least two components are needed for a dynamic construction: the construction of ongoing time and that of prospective, or future, time. The former constructs internal time concurrently with "present" time, so that f(t) can be constructed only up to or around the present time (denoted by tp, with yp = f(tp)). The latter is needed to construct internal time for the prospective use of time, that is, for future time (y > yp or t > tp).
There is an interesting subtlety resulting from this dynamic construction: even the same conventional time may be represented differently in internal time, depending on which "present" time is used to specify internal time. Consider a conventional time t together with two temporal moments of the present time, tp and t′p. Unless the two components of the dynamic internal time construction are exactly the same, we have
$$f(t;\, t_p) \neq f(t;\, t'_p),$$
where tp and t′p are added to f to indicate the present time used for the internal time construction. Given this, it is conceivable to have
$$f(t'_p;\, t_p) \neq f(t'_p;\, t'_p).$$
This indicates that the internal time anticipated from tp to t′p (i.e., prospectively) can be different from the internal time elapsed from tp to t′p (i.e., retrospectively) (Roesch & Olson, 2005; Genovesio, Tsujimoto, & Wise, 2006; Doya, 2008), which raises interesting possibilities (see section 5).

4.2.2.  The Present Time and the Origin of Time.

Here we discuss potential relations between the present time and the origin of time. First, we note that if we change the origin, f changes. Indeed, the reference frame and origin of time must be defined in order to define the mapping y = f(t) in the first place (here the origin can be nonzero, being the corresponding point between y and t). There are two ends of a spectrum that can generally be used to define them. At one end is the absolute (time) framework, which sets the origin at a time and fixes it forever. Although the absolute framework is possible, its direct application to internal TD seems infeasible. At the other end is the relative (time) framework, which lets the origin change as time goes by. Let us write the relative framework as f(t; to), where to is the time (as defined by the absolute framework) used as the origin of the relative time framework. The simplest example is to set to at the present time (defined in the absolute framework), that is, to = tp. But in general, to does not have to be equal to tp, so we regard it as to(tp) (i.e., a function of tp but left unspecified; see section 5).

By definition, the present time (tp and yp) should be expressed in the relative framework f(t; to(tp)), so that it can act as the origin or corresponding point. Then we should examine f(t; to(tp)) for the dynamic construction of ongoing time. First, if ḟ is constant, we have f(t) = ct, where c is constant, so that internal TD is essentially the same as conventional TD (see the appendix). Next, there are two cases in which internal TD differs from conventional TD in the dynamic construction, that is, when ḟ is no longer constant. The first case is that ḟ becomes a function dependent on tp, so that the form of ḟ may change over time (conventional time in the absolute framework). The second case is that the function to(tp) may change; in this case, even if the form of ḟ is unchanged, the value of f differs as a function of to(tp). Taking the two cases together, we have
$$f(t;\, t_p) = \int_{t_o(t_p)}^{t} \dot{f}(s;\, t_p)\, ds, \tag{4.1}$$
so y = f(t; tp). Consequently, f(t; tp) is now described in the absolute framework, and in other sections, f(t) can also be understood in this way, although it was originally generated from the function in the relative framework. This dynamically constructed internal time may in principle have a variable duration in conventional time.
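A small numerical sketch (our own assumptions for to(tp) and ḟ, purely for exposition) illustrates equation 4.1 and the preceding inequality f(t; tp) ≠ f(t; t′p):

    import numpy as np

    def f_dynamic(t, t_p, ds=0.01):
        t_o = t_p - 5.0                            # assumed shifting origin t_o(t_p)
        s = np.arange(t_o, t, ds)
        # assumed rate: constant up to the present, slower for the future
        fdot = 1.0 / (1.0 + np.maximum(s - t_p, 0.0))
        return np.sum(fdot) * ds                   # accumulate as in equation 4.1

    # the same conventional time t = 20 under two different present times
    print(f_dynamic(20.0, t_p=10.0), f_dynamic(20.0, t_p=15.0))

The two outputs differ, capturing how the internal time assigned to the same conventional moment depends on the present time used for its construction.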

A similar argument applies to the dynamic construction of future time. The integral (similar to equation 4.1) may now run from to(tp) to a future time (t > tp). This dynamic construction can be done only mentally, in a way perhaps similar to mental time travel (Arzy et al., 2008; Boyer, 2008) or to constructing a representation of future events (Szpunar et al., 2007; Liberman & Trope, 2008). Alternatively, the integral may be given in a static way, perhaps distilled into a representation of prospective time by repeated experience of imagining the future. Either way, the unit of this internal time for future time may also have a variable duration in conventional time.

4.3.  Internal Time Modulation and Serotonin's Internal Time Hypothesis.

Here we summarize the effect of internal time modulation on neural valuation and then discuss the internal time hypothesis of serotonergic neuronal function.

4.3.1.  Internal Time Modulation.

Internal time modulation can be expressed through a change or modulation in f(t) (long timescale) or, equivalently, in ḟ(t) (short timescale). On a long timescale, we can generally state that the value function V(f(t)) is discounted more heavily at time t when internal time goes faster (that is, when f(t) is larger; see Figure 5A).

Internal time modulation expressed as a change in ḟ(t) illustrates a short-timescale property. It changes the duration in the t-system, Δ t(y), in the operator's view or, interchangeably, changes the discount factor γt in the observer's view (see Table 1 and the appendix). When internal time goes faster (ḟ(t) is larger), more units of the process are needed for the same time period of the t-system (see Figure 5B, a) and, interchangeably, the discounting is stronger (γt is smaller, which is partly dependent on the time constant τ; see Figure 5B, b). Suppose that ḟ(t) is modulated online in an experimental trial. Consider modulating ḟ(t) to be smaller while waiting for the reward, that is, while knowing that reward will eventually come but finding that it has yet to come (in the t-system); the modulated smaller ḟ(t) leads to a longer unit of process in the t-system (a larger Δ t(y)), so that the internal TD can still reside in the same time step rather than advancing to the next step. Such a modulation makes the TD model's valuation more resistant to temporal variation of rewards and events. This simple example, although oversimplified (see section 5), reveals the advantage of online internal time modulation.
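A hedged numeric sketch of this interchange (following the linear-f relation in the appendix, not the exact Table 1 expressions): if internal time advances at rate ḟ, one internal unit Δ y spans Δ t(y) = Δ y/ḟ of conventional time, and the observer's per-conventional-unit discount is γt = γy^ḟ:

    gamma_y, dy = 0.95, 1.0

    for fdot in (0.5, 1.0, 2.0):          # slower, equal, faster internal time
        dt_of_y = dy / fdot               # operator's unit seen in the t-system
        gamma_t = gamma_y ** fdot         # observer's discount per unit of t
        print(f"fdot={fdot}: unit duration={dt_of_y:.2f}, gamma_t={gamma_t:.4f}")

Slower internal time (ḟ = 0.5) yields a longer unit duration and a discount factor closer to 1, that is, weaker discounting in the observer's view.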

4.3.2.  Internal Time Hypothesis of Serotonin Functions.

Here, we propose an internal time hypothesis of serotonin function, which suggests that serotonin neurons modulate internal time in neural valuation. A more restricted version of the hypothesis states that an increase in serotonin neural activity makes internal time go more slowly (leading to a smaller f(t) and a smaller ḟ(t)). Admittedly, it is nearly impossible to describe all serotonin functions with a single theory (Daw, Kakade, & Dayan, 2002), given the extremely complicated nature of serotonin activities (Jacobs & Azmitia, 1992; Buhot, 1997; Sershen, Hashim, & Lajtha, 2000; Cardinal, 2006; Cools, Roberts, & Robbins, 2008). Nevertheless, a computational theory would help to clarify our normative understanding, and in this spirit, several hypotheses regarding serotonin function have been proposed in relation to TD models (Daw et al., 2002; Doya, 2002; Dayan & Huys, 2008). The internal time hypothesis follows this approach, largely inspired by the discount factor hypothesis (Doya, 2002). The discount factor hypothesis largely bases its argument on the experimental observations that the tendency to make impulsive decisions decreases when serotonin activity is higher (Mobini et al., 2000; Denk et al., 2005; Winstanley, Eagle, & Robbins, 2006; Schweighofer et al., 2008), and it thus proposes, based on conventional discrete-time TD, that an increase in serotonin neural activity leads to an increase in the discount factor, thereby reducing impulsive decisions.

When f(t) is restricted to be linear, internal and conventional TDs become largely equivalent, except for issues of interchangeability between the unit and the discount factor (see the appendix), and this makes the internal time hypothesis (in its restricted version) equivalent to the discount factor hypothesis, at least in the observer's view. In this view, the internal time hypothesis predicts that elevated serotonin activity makes the discount factor (γt) larger, in the same manner as the discount factor hypothesis. In contrast, under the operator's view, the internal time hypothesis predicts that elevated serotonin activity makes the unit duration in the t-system (Δ t(y)) larger. Indeed, this is consistent with several experimental observations that putative serotonin neural activity is tonically elevated while waiting for reward (Nakamura, Matsumoto, & Hikosaka, 2008; Miyazaki, Miyazaki, & Doya, 2007; Mizuhiki, Inaba, Toda, Ozaki, & Shidara, 2008). Behaviorally, hyperbolic rather than exponential temporal discounting is often observed in normal subjects, implying that f(t) is nonlinear (see section 3.2); thus, the nonlinearity of f(t) is important for understanding the effects of serotonin activity on impulsivity. Such impulsivity might be due to a differential modulation of time at different delays (Wittmann & Paulus, 2008), implying that the time duration (in the operator's view) or the discount factor (in the observer's view) is modulated differentially at different delays. Furthermore, this differential modulation may function differently in different circuits, such as different cortico-basal ganglia loops (Nakahara et al., 2001; Doya, 2002).

There is also an interesting connection of our hypothesis to another recent hypothesis (Dayan & Huys, 2008) suggesting that elevated serotonin activity prunes a decision tree specifically for negatively estimated outcome states and choices. Our hypothesis is closely related to this (except for the specificity for negative outcomes) if we move from a cache system (discussed as a TD model) to a decision tree system (as used in model-based TD learning), that is, if the internal time modulated by serotonin activity is regarded as the depth of a tree search in valuation rather than as the time unit of a cache system.

Finally, several caveats must be mentioned. First, a hypothesis of a single role for serotonin activity in decision making may be misleading if it is taken as excluding other functions, given the rich variety of serotonin functions. Other serotonin neurons appear to show rather tonically suppressed activity while waiting for reward (Nakamura et al., 2008). They also respond diversely to external events or behavioral correlates as recently reported (Ranade & Mainen, 2009). Second, some literature suggests that other neural modulators, such as DA and cholinergic activities, rather than serotonergic activities, are more involved in subjective interval timing (Buhusi & Meck, 2005; Matell, Bateson, & Meck, 2006; Meck, 1996). As discussed in section 1, it is not yet clear how we should map those findings onto the current internal TD formulation. Further studies are required to develop TD formulations that can explicitly include a subjective timing system with regard to the time of neural valuation processes and then to examine those findings in the TD formulations. Related to this issue, our hypothesized internal time modulation by serotonin neural activities can be viewed as a consequential rather than a causal effect (Ranade & Mainen, 2009; Ho, Velázquez-Martínez, Bradshaw, & Szabadi, 2002), as those activities may directly affect the temporal progression of neural activities that represent states in TD models. This possibility is of particular interest for future studies (see section 5.5).

5.  Discussion

In this study, we have investigated the internal-time TD formulation that uses the operator's time, distinct from the observer's time, to construct a continuous-time value function. We called the operator's time internal time and the observer's time (the time usually used in experiments) conventional time. We focused on formulating an internal-time TD framework and investigating its consequences to better understand neural value-based decision making.

5.1.  Operator-Observer Problem.

The internal TD formulation explicitly deals with the operator-observer problem, whether the TD model is used for modeling the operation processes of neural valuation or for providing a descriptive model of the processes in the observer's view (see section 2.4). The same issue regarding differential modeling in the operator's and the observer's views also applies to other types of TD algorithms (e.g., algorithms using action-value functions) if the time system of the operator is different from that of the observer. More broadly, it also applies to other reinforcement learning algorithms, as treating rewards with time delays is often central to these algorithms, and it also arises in processes other than neural valuation when the operator's and the observer's time are different. The approach taken in this study is thus potentially applicable to modeling many neural processes and, in a broader context, belongs to studies examining the effects induced by linking different time systems (see e.g., Acebron, Bonilla, Vicente, Ritort, & Spigler, 2005).

In neural valuation, the operator-observer problem does not have to be considered if the units of the operator and observer times (Δ y and Δ t) are always the same (with a fixed linear constant). In this case, the operations in that valuation are equivalent for the two views. Or if we assume that the two units are equivalent, conventional TD (more precisely, equally dissected conventional TD) can be used.

When the two units differ in some way, the operator and observer models are different in principle. If we choose to model a given process in the observer's view, the discrete conventional time unit can still be used. In this case, note that the discrete-time TD error in the observer's view is expressed using a discount factor that is variable, depending on the operation processes (see Table 1). This has several implications. For example, when we observe a DA response (putative TD error) in the discrete conventional time unit, it might be better modeled using variable discount factors. In this regard, it is interesting to note that DA activity was found to follow not exponential but hyperbolic discounting as the ISI increased (Kobayashi & Schultz, 2008). Also, an observation with a conventional time unit (Δ t) may correspond to observing multiple iterations of the operator's states concurrently (if Δ t > Δ t(y)) or the same state repeatedly (if Δ t < Δ t(y)). The conventional equally dissected TD model, which has been extensively applied to neural valuation, assumes that the unit of the neural operation is fixed in conventional time. This assumption might reasonably hold when the environment is quite stable and the time range (e.g., that of the ISI) is quite short.

5.2.  Temporal Discounting.

Temporal discounting represents the long-timescale property of TD models (in case of discounted rewards). The internal TD preserves a theoretically favored discounting (exponential discounting) in internal time, but this discounting is expressed differently in conventional time, when internal time is different from conventional time. From this perspective, traditional debates over either exponential or hyperbolic discounting using conventional time should be understood as based on the choice of the specific time system and thus on the observer's view. Describing the discounting in the observer's view has many advantages. We can certainly ask if subjects behave “rationally” (exponentially) or “subjectively” (hyperbolically) in the observer's view. It should not, however, be taken for granted as a description from the operator's viewpoint, because the description depends on what time system is used in the operation process. It is possible that the subjects are evaluating “rationally” (i.e., exponentially discounting in internal time) but can still be observed as evaluating “subjectively” (hyperbolically discounting in conventional time), as we demonstrated with the occurrence of choice reversal (see section 3.2.2). Furthermore, it is also possible that the subjects may have exponential and hyperbolic discounting at different delays when observed in conventional time, even though they perform only exponential discounting in internal time.

For a hyperbolically discounted behavior, it has been debated whether there is only a single underlying neural system that directly encodes a hyperbolically discounted value (Kable & Glimcher, 2007) or multiple (or two) subsystems (e.g., a cognitive versus emotional, or rational versus subjective, system), whereby at least one system is usually regarded as obeying exponential discounting (Laibson, 1997; McClure et al., 2004, 2007; Montague et al., 2006; Fudenberg & Levine, 2006; Berns et al., 2007). As the internal TD could induce the choice reversal using only single exponential discounting (in internal time), it is broadly consistent with the single system theory. We also showed in section 3.2.3, however, that, together with the decomposition of the internal TD to two subsystems, the internal TD's discounting curve could still be consistent with the multiple system's theory based on the previous experimental studies. In the decomposition, only the whole system preserves exponential discounting in internal time, but each subsystem need not use exponential discounting (in either internal or conventional time). These notions call for future experimental examinations in several directions, for example, whether the internal TD underlies “the” single system, or whether it is decomposed into multiple subsystems, neither of which obeys exponential discounting in conventional time.

The form of f(t) used in the demonstration of choice reversal was derived purely based on mathematical intuition; we do not have any direct experimental evidence supporting the use of f(t). Interestingly, though, a similar form of f(t) was recently suggested by Peter Shizgal and his colleagues in their research on opportunity cost (personal communication, November 2008; Solomon, Conover, & Shizgal, 2007). A pressing experimental question will be to examine what form of f the neural valuation's operation process really has. In such experiments, the time range must be considered carefully. In tasks like an intertemporal choice task, the time range can often be quite long—hours or months. The internal time construction can be fundamentally different in the ongoing and prospective use of time (see section 4.2), although there may be no clear-cut boundary between the two uses. For instance, prospection of a few seconds is perhaps more appropriately grouped with the ongoing use of time, whereas prospection of hours is definitely best grouped with the prospective use of time (Szpunar et al., 2007; Buckner & Carroll, 2007; Arzy et al., 2008). Separation of such a prospective time from ongoing time also implies that neural valuation can be free from conventional time, particularly in that time range. The relationship of time perception at such a timescale (e.g., logarithmic time perception) to types of discounting has been proposed (Takahashi, 2005; Gilbert & Wilson, 2007; Boyer, 2008; Wittmann & Paulus, 2008). Such a future time may be constructed either dynamically or statically (see section 4.2), but how it is done in neural systems specifically for discounting remains to be determined (Luhmann, Chun, Yi, Lee, & Wang, 2008). Future experiments should address these issues by explicitly probing the internal time construction with respect to the nature of discounting. Apart from the range of the timescale, anticipatory time and elapsed time can be different in the internal time construction, even when they are the same in conventional time (see section 4.2.1). Such a difference may affect neural valuation in both the ongoing and prospective use of time (Roesch & Olson, 2005; Genovesio et al., 2006), and it is also conceivable to apply an internal TD to valuation in retrospection, using the time construction for the time of the past, which is also worthy of future investigation (Schacter, Addis, & Buckner, 2007; La Camera & Richmond, 2008; Dayan, 2009).

This study demonstrated only one case of an internal TD, which has exponential and hyperbolic discounting during short and long delays, respectively. Certainly it is possible to construct a much wider variety of internal TDs, and thus the question about the form of f should be examined, combining both experimental and theoretical works. At any rate, discounting for a temporally distant reward is a more complicated issue than can be covered within the scope of this study (Loewenstein & Prelec, 1992; Frederick et al., 2002; Luhmann, Chun, Yi, Lee, & Wang, 2008). The time unit of discounting does not appear to be invariant but rather to be dependent on the task (Frederick et al., 2002; McClure et al., 2007), and several different factors have been suggested to differentially influence the mechanisms involved in the discounting process, depending on the task (Loewenstein & Prelec, 1992; Frederick et al., 2002; Rubinstein, 2003; Berns et al., 2007). To fully incorporate these factors into the internal TD formulation, the extensions of the formulation are required (see section 5.5).

5.3.  Internal Time Construction and Modulation.

Results from a wide range of experiments (see section 1) are broadly consistent with the notion that the unit of time specifically used for neural valuation (internal time) is different from that of conventional time. Nevertheless, it is currently unclear how internal time is constructed. Presumably internal time is constructed or affected by several different neural time systems. Even in Pavlovian conditioning, different systems with regard to different timescales might be differentially involved, depending on the ISI. In terms of modality, the systems used for perception time (e.g., for estimating a time interval) are likely involved, but the systems for decision or motor time (e.g., taking Pavlovian action or instrumental action in instrumental tasks) might be partially involved.

The modulation of internal time is limited. Under ideal conditions, internal time would be perfectly adjusted in the presence of a variable ISI until a reward was obtained. But this is not the case, even with a relatively small range of variable ISIs (Fiorillo et al., 2008). Neurons in some areas modulate their activity by reflecting an event's temporal probability distribution (Ghose & Maunsell, 2002; Leon & Shadlen, 2003; Janssen & Shadlen, 2005) (but an examination of this issue for DA activity indicated no such modulation, although with caveats, Fiorillo et al., 2008). These neural activities might modulate internal-time units and thus affect valuation, which is possibly related to issues such as motivation or opportunity cost (Wise, 2004; Niv, Daw, & Dayan, 2006).

5.4.  Current Internal TD Formulation.

Although it certainly addresses the dissociation between internal and conventional time, we consider the current internal TD formulation to still be rather primitive. First, we formulated the internal TD only in the limited case of a TD(0) algorithm of the TD(λ) family for the case of discounted rewards (Sutton & Barto, 1998). Several previous proposals to extend discrete-time TD models may also be used for extending internal TD formulations, such as the TD(λ) family, as well as the model-based/multi-timescale or partially observable approaches (Sutton, 1995; Sutton & Barto, 1998; Doya, Samejima, Katagiri, & Kawato, 2002; Daw, Niv, & Dayan, 2005; Daw et al., 2006). Also, we did not deal with the case of undiscounted, or averaged, rewards (Schwartz, 1993; Puterman, 1994; Mahadevan, 1996). A popular algorithm in this case, often called average TD, has a value function called a bias, or relative, value (Mahadevan, 1996) that uses a term, the average reward estimated over time, in addition to the temporal difference of reward and value. The average reward is affected by how much time has passed; thus, it is also affected by the distinction between internal and conventional time, for example, if it is measured using internal time rather than conventional time, although the effect of the distinction on the discount factor does not exist in this case. We are interested in examining the effects of the internal average TD on this issue. This would also help clarify the relation of the internal TD formulation to the semi-Markov approach using the conventional average TD (Daw et al., 2006). Furthermore, in this study, we dealt only with state-value functions for clarity of exposition. We did not make an explicit connection to actions that would cause state transitions. A promising future avenue is to extend internal TD formulations to the case involving action-value functions or state-value functions with an explicit action selection mechanism (e.g., an actor-critic algorithm). For example, under internal TD formulations, it is intriguing to include an action that would reset the origin of time (see section 5.5). The issues relating to actions must be examined together with the issue discussed below.

Second, in the current internal TD formulation, we chose to preserve the original theoretical formulation of a discrete-time TD, where the number of discrete-time steps (state iterations) is directly linked to the number of iterations of the (constant) discount factor (Sutton & Barto, 1998). Note that in the perspective of MDPs, states must first be defined before building discrete-time TD models, and a discrete-time step is then an auxiliary variable of the states. But we have not discussed states so far, primarily because this study focuses on the effects of dissociating the operator's and the observer's times in the TD formulation. When the original principle is applied to connecting continuous-time to discrete-time internal TD formulations, a rigid relationship between the two times is maintained, and the discrete-time unit plays a dual role as a unit for both operation and duration (which is also true for the case of conventional-time TD as equally dissected conventional TD). Thus, in the current TD formulation, the equally dissected model in internal time is the operator's discrete internal TD model. The rigid relationship further implies that the unit of a neural operational process, equivalent to the discrete internal time unit (or the duration), also corresponds to a state. Accordingly, the rigid relationship is also maintained with states. In contrast, it is often the case that the conventional TD (equally dissected conventional TD) specifies states only as external events (e.g., CS and US in Pavlovian conditioning), and event-triggered representations then act as placeholders of time durations between the events, provided that Markov properties are assumed with respect to the events. Because of this assumption, times between events, although not called states, can still effectively act as states under MDPs for iterations of TD operations. This approach is often a useful simplification for constructing a succinct TD model for neural valuation. It will not be sufficient for modeling, however, if the Markov assumption does not hold (e.g., Nakahara et al., 2004) or the neural valuation process does not use a discrete conventional unit. In these cases, we must address the nature of states, even at times between events, regardless of what approach we take, but the current internal TD is one approach that addresses this issue.

The equally dissected internal TD model can accommodate variable durations in conventional time (see Table 1), and the event-dissected conventional TD model is a special case of this internal TD model in which a discrete internal time unit is specifically assumed to be the duration between external events. Since the TD hypothesis of DA activity was first proposed, it has been puzzling that there is no convincing experimental evidence that DA responses (putative TD error) propagate backward continuously over conventional-time units, from the time of reward onset to the time of the CS, as the number of trials increases (Pan et al., 2005). The internal TD approach appears promising for dealing with this issue because TD error is propagated over iterations of internal-time units.

5.5.  Internal Time, Subjective Timing, and State.

As evident in the discussion, many issues remain to be investigated to extend the internal TD formulation. First, the question remains whether the rigid relationship among states, associated durations, and discount factors should always be maintained in the modeling of neural value-based decision making. Consider temporal discounting as an example. As discussed, a variety of factors can affect the nature of discounting. Some factors may be best understood in the context of their effects on internal time, for which application of the current internal TD would be promising, but other mechanisms may be required to deal with other factors. What is more fundamental, given the distinction between the operator's and the observer's times, is that as long as we attempt to model the operational process, we must identify the operator's unit, and then other parameters, such as a discount factor, must be determined with respect to that unit. Any change in the parameters must also be evaluated with respect to the operator's unit; that is, if the discount factor becomes variable, it should be modeled accordingly with respect to the unit. For example, the discount factor can be modulated by some factors (e.g., a function of duration in conventional time), and such a case would be a kind of semi-Markov internal TD. In this view, the semi-Markov conventional TD (see equation 3.1) belongs to this class in that it specifies duration as (probabilistically) equally dissected in internal time according to external events (event-dissected in conventional time) but the discount factor as an exponential function of duration in conventional time.

Also remaining to be investigated is how the origin of internal time is actually set in neural systems, especially regarding the ongoing use of time. We suspect that the origin of time often shifts with the ongoing time, but may also be locked to externally salient events. Furthermore, it is also conceivable that the origin may change, triggered by statistical inferences made by other neural systems. With such origins, the progression (and modulation) of internal time may be affected by subsequent events or factors involved in task situations, such as context, hidden states, uncertainty, time estimation, and event anticipation (Suri & Schultz, 2001; Nakahara et al., 2004; Daw et al., 2005; Janssen & Shadlen, 2005; Preuschoff, Bossaerts, & Quartz, 2006; Morris, Nevet, Arkadir, Vaadia, & Bergman, 2006; Yamada, Matsumoto, & Kimura, 2007; Kubota et al., 2009).

The construction and modulation of internal time is intimately related to the construction and modulation of states, or to state transition in neural systems. Discussions of time construction and modulation can be directly translated to those of states. For example, possible differential contributions of perception and motor time to the construction of time can be translated to the question of how neural perception and motor functions contribute to or affect neural correlates that would correspond to states in the TD formalism. More generally, states in neural valuation would ultimately be a pattern of neural activity. Several proposals have been made for understanding the roles of DA activity in neural valuation from perspectives often quite different from that of a normative TD framework (Brown, Bullock, & Grossberg, 1999; Redgrave & Gurney, 2006; O'Reilly, Frank, Hazy, & Watz, 2007; Izhikevich, 2007; Tan & Bullock, 2008; Potjans, Morrison, & Diesmann, 2009; Soltani & Wang, 2010; Hazy, Frank, & O'Reilly, 2010). They provide their accounts more in relation to patterns of neural activity. Future research is needed to investigate how extensively these ideas can be integrated with the viewpoint of the internal TD formulation. We hope that in the long run, the current internal TD formulation helps to build a closer link of these types of accounts with computational-level accounts, usually provided by TD models. Neural correlates of valuation are important clues for future experiments regarding internal time units and states; neural correlates of values must be exponentially discounted with respect to iterations of internal-time units. This characteristic can be used in experimental probes; by manipulating experimental conditions, we can inspect how similarly the neural correlates and behavioral valuation change over those conditions. Related to these issues, a recent proposal in TD formulation suggests that external events provoke a set of temporally progressive inputs to the TD value function, as a more sophisticated temporal stimulus representation (Ludvig, Sutton, & Kehoe, 2008). Time is essentially embedded in the temporal progression (Grossberg & Schmajuk, 1989), which can also be considered as states progressing over time. Then, if some explicit mechanisms were to adjust this temporal progression, they would act as changing states and thereby also influence valuation.

Finally, such an additional modulating mechanism, together with mechanisms setting the origin of time in ongoing use, appears very promising for addressing a critical issue mentioned in section 1: How should subjective timing (timing behavior or subjective interval estimation) be mapped onto the internal TD formulation? Daw et al. (2006) made a step in this direction using a semi-Markov conventional TD formulation (a class of the internal TD model, whereby subjective timing was modeled as a probabilistic inference of durations between external events), further combined with partial observability as well as use of average TD. The internal TD formulation helps to carry these ideas further and makes a further connection to operational processes of neural valuation, as the dissociation between internal and conventional time makes it possible to explicitly model the effects of subjective timing in a more flexible manner. Subjective timing may directly influence the progression of internal time units, act as modulating or switching states (which are not necessarily locked with external events), and/or have a direct influence at the level of valuation or action selection (e.g., changing the so-called temperature parameter in the soft max function for the selection). The interdependence and differential contributions of different attributes (e.g., motivational and temporal) for learning associations between CS and US have been studied in Pavlovian conditioning (Delamater & Oakeshott, 2007). Changes in the internal time units might be differentially affected by what time systems (e.g. perception or motor time) are acting at a given moment. A rich literature on interval timing provides important clues toward resolving these issues (Gallistel & Gibbon, 2002). Conflicting views exist, however, about several of these issues, for example, whether there is a dedicated interval timing system and, if so, whether it is composed of a single system or multiple systems (Gibbon et al., 1997; Killeen, Fetterman, & Bizo, 1997; Zeiler, 1998; Dragoi et al., 2003; Buhusi & Meck, 2005); whether the property known as Weber's law or the scalar property holds and, if so, how (Gibbon, 1977; Grondin, 2001); and whether the passage of subjective time is linear or nonlinear to conventional (objective) time in relation to different experimental paradigms (Gibbon & Church, 1981; Staddon & Higa, 1999; Cerutti & Staddon, 2004). Also different models and mechanisms have been proposed for subjective interval timing (Gibbon, 1977; Church, 1984; Killeen & Fetterman, 1988; Staddon & Higa, 1999; Gallistel & Gibbon, 2000; Dragoi et al., 2003; Buhusi & Meck, 2005). Given these considerations, although extending the internal TD formulation to include issues of interval timing is of great interest, it may be best addressed by first explicitly formulating a simple modulatory effect of subjective timing progression on internal time units and states.

Appendix:  Relation of Conventional and Internal TD When f(t) Is Linear

Let us consider the case where f is linear, y = f(t) = ct (c is a positive constant), with a single reward r delivered at internal time yR = c tR. The value function is given by
$$V(y) = \gamma_y^{\,y_R - y}\, r = \gamma_y^{\,c\,(t_R - t)}\, r. \tag{A.1}$$
First, internal and conventional TDs are essentially equivalent in this case. V(y) = V(f(t)) can be understood as a conventional TD with the unit Δ t and the constant discount factor γ1, where γ1 = γy^{cΔ t}. Another way to regard V(y) as a conventional TD is to note that y can be viewed simply as a rescaled conventional time t, and V(y) is then the rescaled conventional TD with the unit Δ y and the constant discount factor γ2 = γy^{Δ y}. Second, interchangeability between the discrete time step and the discount factor arises once the hiding convention in the discrete-time formulation is used. Consider a period T to be modeled with conventional TD. The number of discrete steps is n1 = T/Δ t and n2 = cT/Δ y with the first and second interpretations above, respectively. When we do not distinguish between Δ t and Δ y (denoting both by Δ) as with the hiding convention, we have n2 = c n1, and the discount factor is γ1 = γy^{cΔ} or γ2 = γy^{Δ}, respectively, so we have γ1 = γ2^{c}. Thus, the same value function (see equation A.1) can be interpreted interchangeably with either n1 and γ1 or n2 and γ2.
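As a concrete numerical check of this interchangeability (our own numbers, not from the text), take c = 2, Δ = 1, γy = 0.95, and T = 10. Then

$$n_1 = 10, \quad n_2 = 20, \quad \gamma_1 = \gamma_y^2 = 0.9025, \quad \gamma_2 = \gamma_y = 0.95,$$

and the total discounting over the period agrees under both interpretations: $\gamma_1^{n_1} = 0.9025^{10} \approx 0.358 \approx 0.95^{20} = \gamma_2^{n_2}$.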

Acknowledgments

We are grateful to S. Amari, K. Morita, and S. Suzuki for their comments on an early draft of the manuscript; to P. Bossaerts for his comments on choice reversal during the development of this work; and to J. Teramae for his comments on some citations. We are also very grateful for the reviewers' insightful comments. This study was partially supported by JSPS KAKENHI grant 21300129 and MEXT KAKENHI grant 20020034.

References

Acebron, J. A., Bonilla, L. L., Vicente, C. J. P., Ritort, F., & Spigler, R. (2005). The Kuramoto model: A simple paradigm for synchronization phenomena. Rev. Mod. Phys., 77(1), 137–185.
Ainslie, G. (1975). Specious reward: A behavioral theory of impulsiveness and impulse control. Psychol. Bull., 82(4), 463–496.
Arzy, S., Molnar-Szakacs, I., & Blanke, O. (2008). Self in time: Imagined self-location influences neural activity related to mental time travel. J. Neurosci., 28(25), 6502–6507.
Bayer, H. M., & Glimcher, P. W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47(1), 129–141.
Berns, G. S., Laibson, D., & Loewenstein, G. (2007). Intertemporal choice—toward an integrative framework. Trends Cogn. Sci., 11(11), 482–488.
Boyer, P. (2008). Evolutionary economics of mental time travel? Trends Cogn. Sci., 12(6), 219–224.
Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 393–400). San Francisco: Morgan Kaufmann.
Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J. Neurosci., 19(23), 10502–10511.
Buckner, R. L., & Carroll, D. C. (2007). Self-projection and the brain. Trends Cogn. Sci., 11(2), 49–57.
Buhot, M. C. (1997). Serotonin receptors in cognitive behaviors. Curr. Opin. Neurobiol., 7(2), 243–254.
Buhusi, C. V., & Meck, W. H. (2005). What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci., 6(10), 755–765.
Buonomano, D. V. (2007). The biology of time across different scales. Nat. Chem. Biol., 3(10), 594–597.
Cardinal, R. N. (2006). Neural systems implicated in delayed and probabilistic reinforcement. Neural Netw., 19(8), 1277–1301.
Cerutti, D. T., & Staddon, J. E. (2004). Immediacy versus anticipated delay in the time-left experiment: A test of the cognitive hypothesis. J. Exp. Psychol. Anim. Behav. Process., 30(1), 45–57.
Church, R. M. (1984). Properties of the internal clock. Ann. N.Y. Acad. Sci., 423, 566–582.
Cools, R., Roberts, A. C., & Robbins, T. W. (2008). Serotoninergic regulation of emotional and behavioural control processes. Trends Cogn. Sci., 12(1), 31–40.
Daw, N. D., Courville, A. C., & Touretzky, D. S. (2006). Representation and timing in theories of the dopamine system. Neural Comput., 18(7), 1637–1677.
Daw, N. D., Kakade, S., & Dayan, P. (2002). Opponent interactions between serotonin and dopamine. Neural Netw., 15(4–6), 603–616.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci., 8(12), 1704–1711.
Daw, N. D., & Touretzky, D. S. (2002). Long-term reward prediction in TD models of the dopamine system. Neural Comput., 14(11), 2567–2583.
Dayan, P. (2009). Prospective and retrospective temporal difference learning. Network, 20(1), 32–46.
Dayan, P., & Huys, Q. J. (2008). Serotonin, inhibition, and negative mood. PLoS Comput. Biol., 4(2), e4.
Dayan, P., & Niv, Y. (2008). Reinforcement learning: The good, the bad and the ugly. Curr. Opin. Neurobiol., 18(2), 185–196.
Delamater, A. R., & Oakeshott, S. (2007). Learning about multiple attributes of reward in Pavlovian conditioning. In B. Balleine, K. Doya, J. O'Doherty, & M. Sakagami (Eds.), Reward and decision making in corticobasal ganglia networks (pp. 1–20). New York: Wiley-Blackwell.
Denk, F., Walton, M. E., Jennings, K. A., Sharp, T., Rushworth, M. F., & Bannerman, D. M. (2005). Differential involvement of serotonin and dopamine systems in cost-benefit decisions about delay or effort. Psychopharmacology (Berl.), 179(3), 587–596.
Desmond, J. E., & Moore, J. W. (1988). Adaptive timing in neural networks: The conditioned response. Biol. Cybern., 58(6), 405–415.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Comput., 12(1), 219–245.
Doya, K. (2002). Metalearning and neuromodulation. Neural Netw., 15(4–6), 495–506.
Doya, K. (2008). Modulators of decision making. Nat. Neurosci., 11(4), 410–416.
Doya, K., Samejima, K., Katagiri, K., & Kawato, M. (2002). Multiple model-based reinforcement learning. Neural Comput., 14(6), 1347–1369.
Dragoi, V., Staddon, J. E., Palmer, R. G., & Buhusi, C. V. (2003). Interval timing as an emergent learning property. Psychological Review, 110(1), 126–144.
Eagleman, D. M. (2008). Human time perception and its illusions. Curr. Opin. Neurobiol., 18(2), 131–136.
Fiorillo, C. D., Newsome, W. T., & Schultz, W. (2008). The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci., 11(8), 966–973.
Frederick, S., Loewenstein, G., & O'Donoghue, T. (2002). Time discounting and time preference: A critical review. Journal of Economic Literature, 40(2), 351–401.
Fudenberg, D., & Levine, D. K. (2006). A dual-self model of impulse control. Am. Econ. Rev., 96(5), 1449–1476.
Gallistel, C. R., & Gibbon, J. (2000). Time, rate, and conditioning. Psychol. Rev., 107(2), 289–344.
Gallistel, C. R., & Gibbon, J. (2002). The symbolic foundations of conditioned behavior. London: Routledge.
Genovesio, A., Tsujimoto, S., & Wise, S. P. (2006). Neuronal activity related to elapsed time in prefrontal cortex. J. Neurophysiol., 95(5), 3281–3285.
Ghose, G. M., & Maunsell, J. H. (2002). Attentional modulation in visual cortex depends on task timing. Nature, 419(6907), 616–620.
Gibbon, J. (1977). Scalar expectancy theory and Weber's law in animal timing. Psychol. Rev., 84(3), 279–325.
Gibbon, J., & Church, R. M. (1981). Time left: Linear versus logarithmic subjective time. J. Exp. Psychol. Anim. Behav. Process., 7(2), 87–107.
Gibbon, J., Malapani, C., Dale, C. L., & Gallistel, C. (1997). Toward a neurobiology of temporal cognition: Advances and challenges. Curr. Opin. Neurobiol., 7(2), 170–184.
Gilbert, D. T., & Wilson, T. D. (2007). Prospection: Experiencing the future. Science, 317(5843), 1351–1354.
Gläscher, J., & Büchel, C. (2005). Formal learning theory dissociates brain regions with different temporal integration. Neuron, 47(2), 295–306.
Glimcher, P. W., Kable, J., & Louie, K. (2007). Neuroeconomic studies of impulsivity: Now or just as soon as possible? Am. Econ. Rev., 97(2), 142–147.
Gold, J. I., & Shadlen, M. N. (2007). The neural basis of decision making. Annu. Rev. Neurosci., 30, 535–574.
Grondin, S. (2001). From physical time to the first and second moments of psychological time. Psychological Bulletin, 127(1), 22–44.
Grossberg, S., & Schmajuk, N. (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Netw., 2(2), 79–102.
Hare, T. A., O'Doherty, J., Camerer, C. F., Schultz, W., & Rangel, A. (2008). Dissociating the role of the orbitofrontal cortex and the striatum in the computation of goal values and prediction errors. J. Neurosci., 28(22), 5623–5630.
Haruno, M., & Kawato, M. (2006). Different neural correlates of reward expectation and reward expectation error in the putamen and caudate nucleus during stimulus-action-reward association learning. J. Neurophysiol., 95(2), 948–959.
Hazy, T. E., Frank, M. J., & O'Reilly, R. C. (2010). Neural mechanisms of acquired phasic dopamine responses in learning. Neurosci. Biobehav. Rev., 34(5), 701–720.
Hikosaka, O., Nakamura, K., & Nakahara, H. (2006). Basal ganglia orient eyes to reward. J. Neurophysiol., 95(2), 567–584.
Ho, M. Y., Velázquez-Martínez, D. N., Bradshaw, C. M., & Szabadi, E. (2002). 5-hydroxytryptamine and interval timing behaviour. Pharmacol. Biochem. Behav., 71(4), 773–785.
Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Ivry, R. B., & Spencer, R. M. (2004). The neural representation of time. Curr. Opin. Neurobiol., 14(2), 225–232.
Izhikevich, E. M. (2007). Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb. Cortex, 17(10), 2443–2452.
Jacobs, B. L., & Azmitia, E. C. (1992). Structure and function of the brain serotonin system. Physiol. Rev., 72(1), 165–229.
Janssen, P., & Shadlen, M. N. (2005). A representation of the hazard rate of elapsed time in macaque area LIP. Nat. Neurosci., 8(2), 234–241.
Kable, J. W., & Glimcher, P. W. (2007). The neural correlates of subjective value during intertemporal choice. Nat. Neurosci., 10(12), 1625–1633.
Karmarkar, U. R., & Buonomano, D. V. (2007). Timing in the absence of clocks: Encoding time in neural network states. Neuron, 53(3), 427–438.
Killeen, P. R., & Fetterman, J. G. (1988). A behavioral theory of timing. Psychological Review, 95(2), 274–295.
Killeen, P. R., Fetterman, J. G., & Bizo, L. A. (1997). Time's causes. In C. M. Bradshaw & E. Szabadi (Eds.), Time and behaviour: Psychological and neurobehavioural analyses (pp. 79–132). Burlington, MA: Elsevier.
Kobayashi, S., & Schultz, W. (2008). Influence of reward delays on responses of dopamine neurons. J. Neurosci., 28(31), 7837–7846.
Kubota, Y., Liu, J., Hu, D., DeCoteau, W. E., Eden, U. T., Smith, A. C., et al. (2009). Stable encoding of task structure coexists with flexible coding of task events in sensorimotor striatum. J. Neurophysiol., 102(4), 2142–2160.
La Camera, G., & Richmond, B. J. (2008). Modeling the violation of reward maximization and invariance in reinforcement schedules. PLoS Comput. Biol., 4(8), e1000131.
Laibson, D. (1997). Golden eggs and hyperbolic discounting. Q. J. Econ., 112(2), 443–477.
Leon, M. I., & Shadlen, M. N. (2003). Representation of time by neurons in the posterior parietal cortex of the macaque. Neuron, 38(2), 317–327.
Lewis, P. A., & Miall, R. C. (2003). Distinct systems for automatic and cognitively controlled time measurement: Evidence from neuroimaging. Curr. Opin. Neurobiol., 13(2), 250–255.
Liberman, N., & Trope, Y. (2008). The psychology of transcending the here and now. Science, 322(5905), 1201–1205.
Loewenstein, G., & Prelec, D. (1992). Anomalies in intertemporal choice: Evidence and an interpretation. Q. J. Econ., 107(2), 573–597.
Lohrenz, T., McCabe, K., Camerer, C. F., & Montague, P. R. (2007). Neural signature of fictive learning signals in a sequential investment task. Proc. Natl. Acad. Sci. U.S.A., 104(22), 9493–9498.
Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Comput., 20(12), 3034–3054.
Luhmann, C. C., Chun, M. M., Yi, D. J., Lee, D., & Wang, X. J. (2008). Neural dissociation of delay and uncertainty in intertemporal choice. J. Neurosci., 28(53), 14459–14466.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1–3), 159–195.
Matell, M., Bateson, M., & Meck, W. (2006). Single-trial analyses demonstrate that increases in clock speed contribute to the methamphetamine-induced horizontal shifts in peak-interval timing functions. Psychopharmacology (Berl.), 188(2), 201–212.
Matell, M. S., & Meck, W. H. (2000). Neuropsychological mechanisms of interval timing behavior. Bioessays, 22(1), 94–103.
Mauk, M. D., & Buonomano, D. V. (2004). The neural basis of temporal processing. Annu. Rev. Neurosci., 27, 307–340.
Mazur, J. E. (2006). Mathematical models and the experimental analysis of behavior. J. Exp. Anal. Behav., 85(2), 275–291.
McClure, S. M., Ericson, K. M., Laibson, D. I., Loewenstein, G., & Cohen, J. D. (2007). Time discounting for primary rewards. J. Neurosci., 27(21), 5796–5804.
McClure, S. M., Laibson, D. I., Loewenstein, G., & Cohen, J. D. (2004). Separate neural systems value immediate and delayed monetary rewards. Science, 306(5695), 503–507.
Meck, W. H. (1996). Neuropharmacology of timing and time perception. Cogn. Brain Res., 3(3–4), 227–242.
Miyazaki, K., Miyazaki, K. W., & Doya, K. (2007). Increased serotonin efflux in the dorsal raphe of rats working for delayed rewards. Presented at the 37th Annual Meeting of the Society for Neuroscience, San Diego, CA.
Mizuhiki, K., Inaba, K., Toda, S., Ozaki, S., & Shidara, M. (2008). Single neurons in monkey dorsal raphe nucleus respond to reward schedules. Poster session presented at the 38th Annual Meeting of the Society for Neuroscience, Washington, DC.
Mobini, S., Chiang, T. J., Al-Ruwaitea, A. S., Ho, M. Y., Bradshaw, C. M., & Szabadi, E. (2000). Effect of central 5-hydroxytryptamine depletion on inter-temporal choice: A quantitative analysis. Psychopharmacology (Berl.), 149(3), 313–318.
Montague, P. R., & Berns, G. S. (2002). Neural economics and the biological substrates of valuation. Neuron, 36(2), 265–284.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci., 16(5), 1936–1947.
Montague, P. R., Hyman, S. E., & Cohen, J. D. (2004). Computational roles for dopamine in behavioural control. Nature, 431(7010), 760–767.
Montague, P. R., King-Casas, B., & Cohen, J. D. (2006). Imaging valuation models in human choice. Annu. Rev. Neurosci., 29, 417–448.
Morris, G., Nevet, A., Arkadir, D., Vaadia, E., & Bergman, H. (2006). Midbrain dopamine neurons encode decisions for future action. Nat. Neurosci., 9(8), 1057–1063.
Nakahara, H., Doya, K., & Hikosaka, O. (2001). Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuomotor sequences—a computational approach. J. Cogn. Neurosci., 13(5), 626–647.
Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y., & Hikosaka, O. (2004). Dopamine neurons can represent context-dependent prediction error. Neuron, 41(2), 269–280.
Nakahara, H., Nakamura, K., & Hikosaka, O. (2006). Extended LATER model can account for trial-by-trial variability of both pre- and post-processes. Neural Netw., 19(8), 1027–1046.
Nakamura, K., Matsumoto, M., & Hikosaka, O. (2008). Reward-dependent modulation of neuronal activity in the primate dorsal raphe nucleus. J. Neurosci., 28(20), 5331–5343.
Niv, Y., Daw, N., & Dayan, P. (2006). Tonic dopamine: Opportunity costs and the control of response vigor. Psychopharmacology (Berl.), 191(3), 507–520.
Nobre, A., Correa, A., & Coull, J. (2007). The hazards of time. Curr. Opin. Neurobiol., 17(4), 465–470.
O'Doherty, J. P., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference models and reward-related learning in the human brain. Neuron, 38(2), 329
337
.
O'Reilly
,
R. C.
,
Frank
,
M. J.
,
Hazy
,
T. E.
, &
Watz
,
B.
(
2007
).
PVLV: The primary value and learned value Pavlovian learning algorithm
.
Behav. Neurosci.
,
121
(
1
),
31
49
.
Pan
,
W. X.
,
Schmidt
,
R.
,
Wickens
,
J. R.
, &
Hyland
,
B. I.
(
2005
).
Dopamine cells respond to predicted events during classical conditioning: Evidence for eligibility traces in the reward-learning network
.
J. Neurosci.
,
25
(
26
),
6235
6242
.
Potjans
,
W.
,
Morrison
,
A.
, &
Diesmann
,
M.
(
2009
).
A spiking neural network model of an actor-critic learning agent
.
Neural Comput.
,
21
(
2
),
301
339
.
Preuschoff
,
K.
,
Bossaerts
,
P.
, &
Quartz
,
S. R.
(
2006
).
Neural differentiation of expected reward and risk in human subcortical structures
.
Neuron
,
51
(
3
),
381
390
.
Puterman
,
M. L.
(
1994
).
Markov decision processes: Discrete stochastic dynamic programming
.
Hoboken, NJ
:
Wiley
.
Rammsayer
,
T. H.
(
1999
).
Neuropharmacological evidence for different timing mechanisms in humans
.
Q. J. Exp. Psychol. B
,
52
(
3
),
273
286
.
Ranade
,
S. P.
, &
Mainen
,
Z. F.
(
2009
).
Transient firing of dorsal raphe neurons encodes diverse and specific sensory, motor, and reward events
.
J. Neurophysiol.
,
102
(
5
),
3026
3037
.
Redgrave
,
P.
, &
Gurney
,
K.
(
2006
).
The short-latency dopamine signal: A role in discovering novel actions?
Nat. Rev. Neurosci.
,
7
(
12
),
967
975
.
Roesch
,
M. R.
, &
Olson
,
C. R.
(
2005
).
Neuronal activity dependent on anticipated and elapsed delay in macaque prefrontal cortex, frontal and supplementary eye fields, and premotor cortex
.
J. Neurophysiol.
,
94
(
2
),
1469
1497
.
Rubinstein
,
A.
(
2003
).
Economics and psychology? The case of hyperbolic discounting
.
Int. Econ. Rev.
,
44
(
4
),
1207
1216
.
Samuelson
,
P. A.
(
1937
).
A note on measurement of utility
.
Rev. Econ. Studies
,
4
(
2
),
155
161
.
Schacter
,
D. L.
,
Addis
,
D. R.
, &
Buckner
,
R. L.
(
2007
).
Remembering the past to imagine the future: The prospective brain
.
Nat. Rev. Neurosci.
,
8
(
9
),
657
661
.
Schultz
,
W.
(
1998
).
Predictive reward signal of dopamine neurons
.
J. Neurophysiol.
,
80
(
1
),
1
27
.
Schultz
,
W.
,
Dayan
,
P.
, &
Montague
,
P. R.
(
1997
).
A neural substrate of prediction and reward
.
Science
,
275
(
5306
),
1593
1599
.
Schwartz
,
A.
(
1993
).
A reinforcement learning method for maximizing undiscounted rewards
. In
Proceedings of the Tenth International Conference in Machine Learning
(pp.
298
305
).
San Francisco
:
Morgan Kaufmann
.
Schweighofer
,
N.
,
Bertin
,
M.
,
Shishida
,
K.
,
Okamoto
,
Y.
,
Tanaka
,
S. C.
,
Yamawaki
,
S.
, et al
(
2008
).
Low-serotonin levels increase delayed reward discounting in humans
.
J. Neurosci.
,
28
(
17
),
4528
4532
.
Schweighofer
,
N.
,
Shishida
,
K.
,
Han
,
C. E.
,
Okamoto
,
Y.
,
Tanaka
,
S. C.
,
Yamawaki
,
S.
, et al
(
2006
).
Humans can adopt optimal discounting strategy under real-time constraints
.
PLoS Comput. Biol.
,
2
(
11
),
e152
.
Sershen
,
H.
,
Hashim
,
A.
, &
Lajtha
,
A.
(
2000
).
Serotonin-mediated striatal dopamine release involves the dopamine uptake site and the serotonin receptor
.
Brain Res. Bull.
,
53
(
3
),
353
357
.
Seymour
,
B.
,
Daw
,
N. D.
,
Dayan
,
P.
,
Singer
,
T.
, &
Dolan
,
R. J.
(
2007
).
Differential encoding of losses and gains in the human striatum
.
J. Neurosci.
,
27
(
18
),
4826
4831
.
Solomon
,
R. B.
,
Conover
,
K. L.
, &
Shizgal
,
P.
(
2007
).
Estimation of subjective opportunity costs in rats working for rewarding brain stimulation: Further progress
.
Presented at the 37th annual meeting of the Society for Neuroscience, San Diego, CA
.
Soltani
,
A.
, &
Wang
,
X. J.
(
2010
).
Synaptic computation underlying probabilistic inference
.
Nat. Neurosci.
,
13
(
1
),
112
119
.
Staddon
,
J. E.
, &
Higa
,
J. J.
(
1999
).
Time and memory: Towards a pacemaker-free theory of interval timing
.
J. Exp. Anal. Behav.
,
71
(
2
),
215
251
.
Suri
,
R.
(
2001
).
Anticipatory responses of dopamine neurons and cortical neurons reproduced by internal model
.
Exp. Brain Res.
,
140
(
2
),
234
240
.
Suri
,
R. E.
, &
Schultz
,
W.
(
2001
).
Temporal difference model reproduces anticipatory neural activity
.
Neural Comput.
,
13
(
4
),
841
862
.
Sutton
,
R. S.
(
1995
).
TD models: Modeling the world at a mixture of time scales
. In
A. Prieditis & S. Russell
(Eds.),
Proceedings of the Twelfth International Conference in Machine Learning
(pp.
531
539
).
San Francisco
:
Morgan Kaufmann
.
Sutton
,
R. S.
, &
Barto
,
A. G.
(
1990
).
Time-derivative models of Pavlovian reinforcement
. In
M. Gabriel & J. Moore
(Eds.),
Learning and computational neuroscience: Foundations of adaptive networks
(pp.
497
537
).
Cambridge, MA
:
MIT Press
.
Sutton
,
R. S.
, &
Barto
,
A. G.
(
1998
).
Reinforcement learning: An introduction
.
Cambridge, MA
:
MIT Press
.
Szpunar
,
K. K.
,
Watson
,
J. M.
, &
McDermott
,
K. B.
(
2007
).
Neural substrates of envisioning the future
.
Proc. Natl. Acad. Sci. U.S.A.
,
104
(
2
),
642
647
.
Takahashi
,
T.
(
2005
).
Loss of self-control in intertemporal choice may be attributable to logarithmic time-perception
.
Med. Hypotheses
,
65
(
4
),
691
693
.
Tan
,
C. O.
, &
Bullock
,
D.
(
2008
).
A dopamine-acetylcholine cascade: Simulating learned and lesion-induced behavior of striatal cholinergic interneurons
.
J. Neurophysiol.
,
100
(
4
),
2409
2421
.
Tanaka
,
S. C.
,
Doya
,
K.
,
Okada
,
G.
,
Ueda
,
K.
,
Okamoto
,
Y.
, &
Yamawaki
,
S.
(
2004
).
Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops
.
Nat. Neurosci.
,
7
(
8
),
887
893
.
Thaler
,
R. H.
, &
Shefrin
,
H. M.
(
1981
).
An economic theory of self-control
.
J. Polit. Econ.
,
89
(
2
),
392
406
.
Tsitsiklis
,
J. N.
, &
Van Roy
,
B.
(
2002
).
On average versus discounted reward temporal-difference learning
.
Machine Learning
,
49
(
2–3
),
179
191
.
Winstanley
,
C. A.
,
Eagle
,
D. M.
, &
Robbins
,
T. W.
(
2006
).
Behavioral models of impulsivity in relation to ADHD: Translation between clinical and preclinical studies
.
Clin. Psychol. Rev.
,
26
(
4
),
379
395
.
Wise
,
R. A.
(
2004
).
Dopamine, learning and motivation
.
Nat. Rev. Neurosci.
,
5
(
6
),
483
494
.
Wittmann
,
M.
, &
Paulus
,
M. P.
(
2008
).
Decision making, impulsivity and time perception
.
Trends Cogn. Sci.
,
12
(
1
),
7
12
.
Yamada
,
H.
,
Matsumoto
,
N.
, &
Kimura
,
M.
(
2007
).
History- and current instruction-based coding of forthcoming behavioral outcomes in the striatum
.
J. Neurophysiol.
,
98
(
6
),
3557
3567
.
Zeiler
,
M. D.
(
1998
).
On sundials, springs, and atoms
.
Behav. Processes.
,
44
(
2
),
89
99
.
Zhang
,
K.
(
1996
).
Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory
.
J. Neurosci.
,
16
(
6
),
2112
2126
.