## Abstract

The temporal difference (TD) learning framework is a major paradigm for understanding value-based decision making and related neural activities (e.g., dopamine activity). The representation of time in neural processes modeled by a TD framework, however, is poorly understood. To address this issue, we propose a TD formulation that separates the time of the operator (neural valuation processes), which we refer to as internal time, from the time of the observer (experiment), which we refer to as conventional time. We provide the formulation and theoretical characteristics of this TD model based on internal time, called internal-time TD, and explore the possible consequences of the use of this model in neural value-based decision making. Due to the separation of the two times, internal-time TD computations, such as TD error, are expressed differently, depending on both the time frame and time unit. We examine this operator-observer problem in relation to the time representation used in previous TD models. An internal-time TD value function exhibits the co-appearance of exponential and hyperbolic discounting at different delays in intertemporal choice tasks. We further examine the effects of internal time noise on TD error, the dynamic construction of internal time, and the modulation of internal time with the internal time hypothesis of serotonin function. We also relate the internal TD formulation to research on interval timing and subjective time.

## 1. Introduction

The framework of temporal difference (TD) learning is central to the reinforcement learning paradigm (Sutton & Barto, 1998) and has become a major platform for investigating the neural basis of value-based decision making and reward-oriented behavior. The TD framework has greatly improved our understanding of these neural functions and related neural activities, most notably that of dopamine (DA) neurons (Schultz, 1998; Montague, Hyman, & Cohen, 2004; Hikosaka, Nakamura, & Nakahara, 2006). The representation of time in TD learning models for neural value-based decision making, however, remains poorly understood (Gibbon, Malapani, Dale, & Gallistel, 1997; Dayan & Niv, 2008). To understand the effects of time representation on neural valuation, we propose separating the time used in experiments (the time of the observer) from the time of neural valuation processes (or TD model, the time of the operator). We reformulated the TD learning framework using the operator's time and found that this new framework helps to clarify several issues in neural valuation. Here we explain our motivation for conducting this study by describing issues related to our subject.

First, when TD models are applied to neural valuation, they usually use a discrete time formulation. Therefore, we must clarify the relation between discrete time and continuous time, because time by nature is continuous, and experiments are thus intrinsically performed in continuous time. Accordingly, in this study, we make a clear connection between discrete-time and continuous-time TD models.

Second, when discrete-time TD models are applied to neural valuation, time is treated in two essential roles: as an operational unit, as a duration unit, or as both. Later, we demonstrate that a discrete-time unit, once connected to continuous time, plays a dual role as a unit for both operation and duration (see section 2.2). Conceptually, once we view a TD model's operations in the light of Markov decision processes (MDPs) (Sutton & Barto, 1998), a discrete-time unit acts as an operational unit, because each increment of the unit induces almost all the operations involved in the TD model (e.g., state transition). In relation to continuous time, a discrete-time step may also act as a duration unit. A clear example is the tapped-delay-line representation for modeling conditioning behavior (e.g., Desmond & Moore, 1988; Sutton & Barto, 1990): all time steps are typically considered to have the same length in continuous time and are used as placeholders between externally salient events. These events are considered to evoke a series of inputs to the TD models, which act as inputs in the time steps between the events and thus become time representations in the steps. Markov properties are usually assumed with respect to these events (Sutton & Barto, 1990; Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997). Provided that this assumption is valid and the event-evoked inputs representing time have sufficient representational capacity, we can treat the times between events under MDPs for TD models.

With this understanding, we observe that for application to neural valuation, time evolution is not a generic part of the discrete-time TD formulation using MDPs (Daw, Courville, & Touretzky, 2006, being a notable exception). To address this issue, this study sets the time evolution of neural valuation processes directly in the TD formulation. For this purpose, it is important to note that time is a difficult concept to define, but at the very least, it must be an entity that is independent of the methods of measurement or coordinate systems. Given this invariant principle, we must mention that time, as we think we know it, is our own time system; it is indeed only one method of measuring time. We call this time system *conventional time*, and it is regarded as the time of the experiment or the observer, as we observe, record, and discuss experiments and their results mostly in conventional time.

Third, the discrete-time unit of neural valuation's operation process can be different from the conventional discrete-time unit. Results from a wide range of experiments are broadly consistent with this view, although they may be directly focused not on operations of neural valuation but only on neural operations in general. Neural time processing is not a unitary system but involves several neural systems (Ivry & Spencer, 2004; Mauk & Buonomano, 2004). Various systems differentially contribute to different timescales (Rammsayer, 1999; Buhusi & Meck, 2005; Buonomano, 2007). Time may be processed differently for different modalities, yielding perception, decision, or motor times (Lewis & Miall, 2003; Gold & Shadlen, 2007; Nobre, Correa, & Coull, 2007; Eagleman, 2008). Time processing often contains errors, characterized by Weber's law, the scalar property (Gibbon, 1977; Gallistel & Gibbon, 2000), and other such errors (Matell & Meck, 2000; Nakahara, Nakamura, & Hikosaka, 2006). Even the same time interval can be processed differently, depending on the initial neural activity of the process (Karmarkar & Buonomano, 2007). Furthermore, consider the prospective use of time on a long timescale. If the conventional-time unit is the operator's unit, it must always be processed in the same way at any time; a second—whether now or 1 month later—must be processed in the same way. This proposition, admittedly naively stated, is somewhat difficult to accept. Indeed, several studies point to different neural time processing of such prospections (e.g., mental time travel, constructing representations of future events, and similar mental activities regarding the future; Gilbert & Wilson, 2007; Szpunar, Watson, & McDermott, 2007; Buckner & Carroll, 2007; Arzy, Molnar-Szakacs, & Blanke, 2008; Boyer, 2008; Liberman & Trope, 2008).

Taking these considerations into account, in this study, we treat the time system of the neural valuation's operation process as distinct from the system of the observer, or the conventional time system. For simplicity and clarity, we call the time of the neural valuation's operation process in the TD formulation (or the operator's time) specifically *internal time*. This study investigates what we call the internal-time TD formulation (internal TD) that constructs a TD model using internal time, together with its theoretical characteristics and possible consequences in neural valuation. We concentrate on the most basic formulation of the TD model— the case of discounted rewards without an eligibility trace. We also contrast internal TD with what we call conventional-time TD (or conventional TD), which is formulated using conventional time. We are particularly interested in how internal TD behaves in discrete conventional time—the time in which experimental observations are most likely made. Two fundamental elements of TD models are extensively used when applied to neural valuation: one is exponential discounting, which is a long-timescale property of a TD model, and the other is reward prediction error (TD error), which is a short-timescale property of a TD model reflecting its nature as an online learning algorithm. We investigate both properties of the internal TD.

Clarification at this point may be helpful for relating this study to issues investigated in the interval timing literature (Gibbon et al., 1997; Buhusi & Meck, 2005). In our view, while the issues examined in this study and the interval timing literature are closely related, the focus of this study is distinct from that of interval timing. A central issue in the interval timing literature is how time intervals are subjectively timed (e.g., estimated, produced, or reproduced by subjects); accordingly, these studies often distinguish between objective and subjective time, whereby objective time is the time of the experiment/observer (what we call conventional time) and subjective time is considered different from objective time. Subjective time is used as a theoretical construct to explain the subjective timings or timing behaviors of the subjects. Several different mappings between the two time systems are proposed in the literature (e.g., Gibbon, 1977; Killeen & Fetterman, 1988; Staddon & Higa, 1999; Buhusi & Meck, 2005). Most timing models in the literature are used to examine these possible mappings and the implications for behavior through the perspective of timing behaviors (Church, 1984; Killeen & Fetterman, 1988; Staddon & Higa, 1999; Gallistel & Gibbon, 2000; Grondin, 2001; Dragoi, Staddon, Palmer, & Buhusi, 2003; Cerutti & Staddon, 2004; Buhusi & Meck, 2005). In other words, a distinction between subjective and objective time is the distinction between two different time systems through timing behaviors. On the other hand, such subjective timing or timing behavior is not a generic part of conventional TD models applied to neural valuation, because conventional TD models regard valuation and action selection as primary interests and treat time as only an auxiliary variable.
Thus, although studies of interval timing and studies using these TD models often examine related or similar experimental tasks and thus often ask closely related questions, they address different central questions (Montague et al., 1996; Gallistel & Gibbon, 2002), such as subjective timing versus valuation (Daw et al., 2006, being a notable exception). This study investigates the most basic TD formulation using internal time, which is distinct from conventional time, in that internal time is defined as the time of neural valuation processes to which TD models are applied. In this sense, the distinction between conventional and internal time is the one between two time systems through neural valuation. With this perspective, we consider that the two distinctions (objective versus subjective time and conventional versus internal time) are closely related because studies of timing models and TD models are often closely related to each other, and thus, if taken rather broadly, the two distinctions might appear to be the same. In this study, however, we consider it beneficial to maintain the different distinctions in a strict sense, because the two lines of studies approach central questions differently, and it is currently unclear how subjective timing should be mapped onto or treated as part of neural valuation or internal time under the TD formulation. In section 5, we consider how subjective timing can be included in the internal TD formulation for future research.

We first summarize the formulation of the internal-time TD (see section 2), including different discrete-time TD error expressions and what we call the operator-observer problem. In section 3, we investigate the implications of internal TD for neural valuation. One implication concerns the relationships of different TD error expressions to the time representations used in previous TD models (see section 3.1). In section 3.2, using intertemporal choice tasks, the internal TD is shown to exhibit the co-appearance of exponential and hyperbolic discounting at different delays, which accounts for the observed choice reversal, and also to be decomposable into multiple subsystems. Further consideration is given to the short-timescale properties of internal TD (see section 4), namely, the effect of ongoing noise in internal time on TD error (see section 4.1), dynamic internal time construction (see section 4.2), and internal time modulation, along with an internal-time hypothesis of serotonin neuronal functions (see section 4.3). Finally, section 5 contains the discussion.

## 2. Internal-Time TD Formulation

We begin in section 2.1 by clarifying the general relationship between two time systems and then the relationship of the discrete time used by TD models to continuous time. This provides a basis for comparing internal and conventional TDs in section 2.2. We then summarize different expressions of discrete-time internal TD error in section 2.3 and the operator-observer problem in section 2.4.

### 2.1. Preliminaries.

Consider two time systems, *y* and *t*, and their discrete-time constant units Δ*y* and Δ*t*, respectively. Although in later sections, we specify *y* and *t* as internal and conventional time, respectively, they can be regarded in this section as any time system. The fundamental relationship between any two time systems is schematically shown in Figure 1. It is critical to understand that a constant unit of one time system may have different lengths in another time system. This characteristic affects our understanding of neural valuation when the operator's time is distinct from the observer's time. To see this nature mathematically, let us define the relationship between the two time systems (*y*-system and *t*-system, respectively) by

*y* = *f*(*t*),   (2.1)

where *f* is assumed to be monotonically increasing and differentiable (this second condition is for simplicity, because it can be relaxed), thus yielding

d*y* = *f*′(*t*)d*t*, or equivalently, Δ*y* ≈ *f*′(*t*)Δ*t*.   (2.2)

Note that in general, to define a function *y* = *f*(*t*), we must first define the reference frame and origin (e.g., if the origin of *t*, or more generally if the reference point of *t*, changes, the form of *f*(·) changes). Hence, we also sometimes write *y* = *f*(*t*; *t*₀), where *t*₀ is the reference point (origin), but only when clarification is necessary in sections 3.2 and 4.2. When the *y*-system is represented by the *t*-system (i.e., when Δ*t* is constant; see Figure 1, bottom), Δ*y* becomes a function of *t*, Δ*y*(*t*), and thus equation 2.1 is read as Δ*y*(*t*) = *f*(*t*) − *f*(*t* − Δ*t*). Consequently, Δ*y*(*t*) varies according to *t*. In the opposite case, when Δ*y* is constant (when the *t*-system is represented by the *y*-system), equation 2.1 is read as Δ*t*(*y*) = *f*^{−1}(*y*) − *f*^{−1}(*y* − Δ*y*), so that now Δ*t*(*y*) varies according to *y*.

The continuous-time value function in the *y*-system is defined as

*V*(*y*) = ∫_{0}^{∞} e^{−*k*(*x*)} *r*(*y* + *x*)d*x*,   (2.3)

where *r* denotes the reward function and *k*(·) the discounting exponent, a scalar function; for the basic case of exponentially discounted rewards treated here, *k*(*x*) = κ*x* with a constant discount rate κ > 0. Differentiating equation 2.3 with respect to *y* (i.e., taking d*V*(*y*)/d*y*) yields d*V*(*y*)/d*y* = κ*V*(*y*) − *r*(*y*); moving all terms to one side, the continuous-time TD error is given as

δ(*y*) = *r*(*y*) − κ*V*(*y*) + d*V*(*y*)/d*y*.   (2.4)

Let us set Δ*y* as the discretizing unit. We substitute d*V*(*y*)/d*y* ≈ (*V*(*y* + Δ*y*) − *V*(*y*))/Δ*y* in equation 2.4, together with one increment of Δ*y*, and have

δ(*y* + Δ*y*)Δ*y* ≈ *r*(*y*)Δ*y* + e^{−κΔ*y*}*V*(*y* + Δ*y*) − *V*(*y*).   (2.5)

This equation indicates the standard relationship of the two TD errors. Comparing equation 2.4 with equation 2.5, we see that the intrinsic dimension of both the reward function *r*(*y*) and δ(*y*) is a density (per time), whereas *k*(*y*) and *V*(*y*) are scalar functions. This affects discrete-time TD error expressions.

In equation 2.5, the discretizing unit (Δ*y*) was specifically chosen to be the same as the unit of the time system used for the continuous-time value function in equation 2.3. It is generally possible, however, to choose a discretizing unit that is different from the value function's unit. In this case, using the discretizing unit from another time system, Δ*t*, and writing *y* = *y*(*t*) = *f*(*t*), we have

δ·Δ*y*(*t* + Δ*t*) ≈ *r*(*y*(*t*))Δ*y*(*t* + Δ*t*) + e^{−κΔ*y*(*t*+Δ*t*)}*V*(*y*(*t* + Δ*t*)) − *V*(*y*(*t*)).   (2.6)

This equation represents a more general relationship of the discrete-time TD error with the continuous-time value function or TD error. It is noteworthy that in equation 2.6, the discretizing unit Δ*y*(*t*) is now variable in the *y*-system.
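The two readings of equation 2.1 can be sketched numerically. In the following snippet, the concave mapping *f*(*t*) = log(1 + *t*) is purely an illustrative assumption (any monotonically increasing *f* would serve):

```python
import math

# Illustrative monotonic mapping between the t-system and the y-system
# (an assumption for this sketch, not a claim about actual internal time).
def f(t):
    return math.log(1.0 + t)        # y = f(t), equation 2.1

def f_inv(y):
    return math.exp(y) - 1.0        # t = f^{-1}(y)

# Reading 1: hold Delta-t constant; Delta-y(t) = f(t) - f(t - Delta-t) varies.
dt = 1.0
dys = [f(t) - f(t - dt) for t in (1.0, 5.0, 20.0)]
assert dys[0] > dys[1] > dys[2]     # the y-unit shrinks as t grows (f concave)

# Reading 2: hold Delta-y constant; Delta-t(y) = f^{-1}(y) - f^{-1}(y - Delta-y) varies.
dy = 0.5
dts = [f_inv(y) - f_inv(y - dy) for y in (1.0, 2.0, 3.0)]
assert dts[0] < dts[1] < dts[2]     # the t-length of one y-unit stretches with y
```

A constant unit of one system is thus variable when measured in the other, which is the asymmetry that the rest of the formulation builds on.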

### 2.2. Internal-Time and Conventional-Time TD Formulations.

Hereafter, we use *y* specifically to indicate internal time and *t* to indicate conventional time. To contrast the two TD models, we first briefly summarize conventional TD. Conventional TD does not distinguish *y* and *t*, thus effectively constructing its continuous-time value function in conventional time. When *y* = *t* is set in equation 2.3, the value function is

*V*(*t*) = ∫_{0}^{∞} e^{−κ*s*} *r*(*t* + *s*)d*s*,

where we used *s* inside the integral to indicate integration in the *t*-system, while reserving *x* for integration in the *y*-system (e.g., equation 2.3). When we set *y* = *t* and Δ*y* = Δ*t*, the discrete-time TD error is given by equation 2.5. To write it in a familiar form, we first define an index function *i*(*t*) that returns the index of the discrete time steps, given a continuous time *t* (in the *t*-system); *i*(*t*) = *t*/Δ*t*, and in the continuous time, *i*(*t*) refers to a period of continuous time [*t* − Δ*t*, *t*]. Then the TD error is given by

δ_{i(t + Δt)} = *r*(*t*)Δ*t* + γ*V*(*t* + Δ*t*) − *V*(*t*),   (2.7)

where we dropped Δ*t* on the left-hand side (i.e., from δ_{i(t + Δt)}Δ*t*) for simplicity and define the discount factor, γ, by

γ = e^{−κΔ*t*}.

Because Δ*t* is conventionally hidden, equation 2.7 is usually written as

δ_{i+1} = *r*_{i} + γ*V*_{i+1} − *V*_{i},   (2.8)

where we now have *r*_{i} ≡ *r*(*t*)Δ*t*, *V*_{i} ≡ *V*(*t*), and *V*_{i+1} ≡ *V*(*t* + Δ*t*).

Here Δ *t*, being a fixed constant, plays dual roles. First, it acts as a duration unit, thereby connecting continuous time to discrete time and permitting the use of a hiding convention. Second, it is also an iteration or operation unit of the discrete-time TD model, as almost all TD operations are defined with respect to iteration; moreover, being constant, it also makes the discount factor constant. The TD error expressed in the hiding convention, equation 2.8, hides these issues.
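These dual roles can be made concrete in a small sketch. The discount rate (`kappa`, standing for κ in the text), the unit `dt`, and the single-reward value function below are illustrative assumptions; the point is only that a constant Δ*t* yields a constant γ = e^{−κΔ*t*}, and that a value function consistent with its own prediction produces zero TD error:

```python
import math

# Sketch of the conventional discrete-time TD error (the hiding convention of
# equation 2.8). kappa (discount rate), dt, and the reward schedule are
# illustrative assumptions.
kappa, dt = 0.1, 1.0
gamma = math.exp(-kappa * dt)       # constant because dt is a fixed constant

def td_error(r_i, v_i, v_next):
    """delta_{i+1} = r_i + gamma * V_{i+1} - V_i."""
    return r_i + gamma * v_next - v_i

# A value function consistent with its own prediction yields zero TD error:
# a single reward R arrives n steps ahead, so V_i = gamma**(n - i) * R.
R, n = 1.0, 3
V = [gamma ** (n - i) * R for i in range(n + 1)]
errors = [td_error(0.0, V[i], V[i + 1]) for i in range(n - 1)]
assert all(abs(e) < 1e-12 for e in errors)
```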

In contrast, by dissociating *y* from *t*, the internal TD is constructed using the internal time system. Here we consider *y* ≠ *t* in general, or at the very least, that the conventional TD's assumption *y* = *t* does not always hold. The continuous-time internal TD value function is given by equation 2.3. It is rewritten in the *t*-system as

*V*(*f*(*t*)) = ∫_{0}^{∞} e^{−κ(*f*(*t*+*s*) − *f*(*t*))} *r*(*f*(*t* + *s*)) *f*′(*t* + *s*)d*s*.

Thus, *V*(*y*), an exponentially discounted function in the *y*-system, is no longer necessarily exponentially discounted in the *t*-system. This simple example restates the fact that even before discretization, a function constructed in one time system can be expressed differently in another time system.

For the discrete-time internal TD error, analogous to *i*(*t*), we define an index function *j*(*y*) using the *y*-system (i.e., Δ*y*), such that *j*(*y*) = *y*/Δ*y* and *j*(*y*) refers to the period [*y* − Δ*y*, *y*], and the discount factor is given by

γ_{y} = e^{−κΔ*y*}.

Then, the internal TD's discrete-time TD error is given by

δ_{j(y + Δy)} = *r*(*y*)Δ*y* + γ_{y}*V*(*y* + Δ*y*) − *V*(*y*),   (2.9)

where Δ*y* is dropped on the left-hand side. If we use the hiding convention, equation 2.9 is written as δ_{j+1} = *r*_{j} + γ_{y}*V*_{j+1} − *V*_{j}, which, superficially, is identical to equation 2.8. However, they can be different TD errors, depending on the function *f*.
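The difference the function *f* can make is easy to see numerically. Below, internal steps of constant size Δ*y* are mapped back to conventional time through an assumed concave *f* (again an illustration, not a specific claim about internal time): the operator's discount factor stays constant while the conventional-time duration of each iteration grows.

```python
import math

# Internal TD iterates in constant steps of dy; through t = f^{-1}(y), each
# step spans a different conventional-time duration Delta-t(y). The mapping
# f(t) = log(1 + t) is an illustrative assumption.
def f_inv(y):
    return math.exp(y) - 1.0

kappa, dy = 0.5, 0.25
gamma_y = math.exp(-kappa * dy)     # operator's discount factor: constant

# Conventional-time duration of the j-th internal step:
durations = [f_inv((j + 1) * dy) - f_inv(j * dy) for j in range(4)]
assert all(d2 > d1 for d1, d2 in zip(durations, durations[1:]))
# The update delta_{j+1} = r_j + gamma_y * V_{j+1} - V_j is identical in form
# to equation 2.8, yet each iteration spans a different amount of
# conventional time.
```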

### 2.3. Different Expressions of Discrete-Time Internal TD Error.

Here we summarize different expressions and meanings of discrete-time TD errors of the internal TD (see Table 1). The internal TD's discrete-time TD error in equation 2.9 is discretized by using Δ *y* and expressed in the *y*-system (see Table 1). This TD error may be regarded as a sort of proper discrete-time TD error of the internal TD, in the sense that the same time system (internal time) underlies both the value function construction and the discretizing unit. The same discrete-time TD error can still be expressed in the *t*-system, and it is given in Table 1 (bottom-left entry), which we can derive by using *t*=*t*(*y*)=*f*^{−1}(*y*) and Δ*t*(*y*), and noting that Δ*y* ≈ *f*′(*t*)Δ*t*(*y*) for any *t* (due to equation 2.2).

Table 1: Different expressions of the discrete-time internal TD error.

| Time System to Express | Duration (Operator's View: Unit Δ*y*) | Discount Factor (Operator's View) | Duration (Observer's View: Unit Δ*t*) | Discount Factor (Observer's View) |
|---|---|---|---|---|
| Internal time | Δ*y*, constant | γ_{y}, constant | Δ*y*(*t*), variable | variable |
| Conventional time | Δ*t*(*y*), variable | γ_{y}, constant | Δ*t*, constant | variable |


The continuous-time internal TD value function can also be discretized using the unit of the *t*-system, Δ *t*, as shown in equation 2.6. In this case, we use the index function for the *t*-system, *i*(*t*), and also note that Δ*y*(*t*) ≈ *f*′(*t*)Δ*t* (equation 2.2). Then equation 2.6 can be expressed in both the *t*- and *y*-systems as shown in Table 1 (right column).

### 2.4. Different TD Error Expressions and Operator-Observer Problems.

We found that the discrete-time TD error expression of the internal TD depends on both the time system in which it is expressed (*y*- or *t*-system) and the time unit by which it is discretized (Δ *y* or Δ *t*), leading to 2 × 2 expressions (see Table 1). In contrast, the conventional TD error (see equation 2.7) is a special case of the internal TD, when we do not distinguish between the operator's and the observer's times (i.e., equating the two time systems, *y*=*t*, or effectively a linear function, *y*=*ct*, where *c* is a constant; see the appendix). In this case, all TD error expressions become equivalent to one another (see equation 2.7).

Dissociating the two time systems raises several issues. Most importantly, this dissociation, together with the internal TD formulation, makes what we call the operator-observer problem explicit: we should clarify whether the model is constructed using the operator's or the observer's point of view. Taking the operator's view, the model is used to directly formulate neural valuation processes per se. In contrast, taking the observer's view, the model is used to construct a description of the processes but not necessarily to directly formulate the processes themselves.

The time of the operator or neural valuation processes is by definition internal time, and therefore, we construct value functions in internal time, not in conventional time, under our internal TD formulation. Given continuous-time internal TD value functions, the distinction between the operator's and the observer's views differentiates what discrete-time internal TD error expressions should be used for modeling. The internal time unit (Δ *y*) and the conventional time unit (Δ *t*) have specific meanings for the internal TD error expressions: the former is the operator's unit and the latter is the observer's unit. The TD error in the operator's view is expressed differently, depending on whether it is expressed in the *y*- or *t*-system (left-side entries in Table 1). The discount factor is the same as γ_{y} in both expressions, while the duration is constant (Δ *y*) in internal time but variable (Δ *t*(*y*); see equation 2.2) in conventional time. The TD error in the observer's view is shown on the right side of Table 1. The discount factor is no longer constant but is a function of *t*. Duration is constant (Δ *t*) in the *t*-system but variable (Δ *y*(*t*)) in the *y*-system. The difference in the changes over the two time systems between the two views comes from the fact that *r*(*y*) and *k*(*y*) intrinsically have different dimensions. Because experimental observations are usually made, and thus fitted by TD models, mostly in conventional time, it is useful to compare the TD error expressions of the operator's and the observer's views in conventional time (bottom entries in Table 1). A succinct summary (but noting that internal time also affects the value function) is that the effect of internal time is expressed by a variable Δ *t* (*y*) in the operator's view and by a variable discount factor in the observer's view. 
This characteristic is partly related to the interchangeability between the discrete-time unit and the discount factor, which exists even within the conventional TD (see the appendix).
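The interchangeability just mentioned can be sketched as follows (values are illustrative): refining the unit from Δ*t* to Δ*t*′ while replacing γ by γ^{Δ*t*′/Δ*t*} leaves the total discount over any fixed conventional duration unchanged.

```python
import math

# Sketch of the unit/discount-factor interchangeability within conventional
# TD. Refining dt while setting gamma' = gamma ** (dt'/dt) preserves the
# discount over a fixed conventional-time horizon. Numbers are illustrative.
kappa = 0.2
dt_coarse, dt_fine = 1.0, 0.1
gamma_coarse = math.exp(-kappa * dt_coarse)
gamma_fine = gamma_coarse ** (dt_fine / dt_coarse)   # = exp(-kappa * dt_fine)

T = 5.0                                              # fixed horizon in t
discount_coarse = gamma_coarse ** (T / dt_coarse)
discount_fine = gamma_fine ** (T / dt_fine)
assert abs(discount_coarse - discount_fine) < 1e-12
assert abs(discount_coarse - math.exp(-kappa * T)) < 1e-12
```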

## 3. Implications of Internal TD for Neural Valuation

### 3.1. Relationships of Different Time Representations of Previous TD Models to Different Internal TD Error Expressions.

#### 3.1.1. Previous TD Models Using Different Time Representations.

The internal TD formulation provides us with the flexibility to unify the different time representations of previous TD models used in neural valuation studies. Among these, we focus on three representative approaches: the equally dissected, event-dissected, and semi-Markov models (see Table 2), based on their characterizations in conventional time, because the distinction between internal and conventional time was not made in previous studies. The three approaches are summarized based on how they treat time as well as on their forms of TD error.

Table 2: Three time-representation approaches in previous TD models.

| Type | Duration | Discount Factor | TD Error |
|---|---|---|---|
| Equally dissected | Δ*t* | γ | δ_{i+1} = *r*_{i} + γ*V*_{i+1} − *V*_{i} |
| Event-dissected | Δ*t*(*y*) | γ_{y} | δ_{j+1} = *r*_{j} + γ_{y}*V*_{j+1} − *V*_{j} |
| Semi-Markov | Δ*t*(*y*) | γ^{Δ*t*_{k}/Δ*t*} | δ_{k+1} = *r*_{k} + γ^{Δ*t*_{k}/Δ*t*}*V*_{k+1} − *V*_{k} |


The equally dissected model uses an equally sized or constant unit for time representation in conventional time, thus maintaining a rigid relationship between a discrete-time step and a small continuous-time period. It thus uses Δ *y* = Δ *t* or, equivalently, *y* = *t*, and the corresponding TD error is given by equation 2.7 (see Table 2). A representative model in this approach is often called a tapped-delay-line model (e.g., Desmond & Moore, 1988; Sutton & Barto, 1990). The tapped-delay-line model treats time representation as a part of the stimulus representation and has been adopted for modeling DA activity by TD models since it was first proposed (Houk, Adams, & Barto, 1995; Montague et al., 1996; Schultz et al., 1997), and it has been extensively used in subsequent studies, including various neurophysiological and functional magnetic resonance imaging (fMRI) studies on neural valuation (Suri, 2001; Suri & Schultz, 2001; O'Doherty, Dayan, Friston, Critchley, & Dolan, 2003; Tanaka et al., 2004; Pan, Schmidt, Wickens, & Hyland, 2005).
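A minimal sketch of this approach (an illustrative toy, not a reimplementation of any cited model): after cue onset each discrete step activates the next tap, a reward arrives at a fixed tap, and TD(0) learning propagates value backward so that each tap's value converges to γ raised to the number of steps remaining until reward.

```python
# Minimal tapped-delay-line TD(0) sketch (an illustrative toy, not a
# reimplementation of any cited model). After cue onset, step i activates
# tap i; a reward of 1 arrives on the transition into tap `reward_tap`.
alpha, gamma = 0.2, 0.9
n_taps, reward_tap = 8, 5
V = [0.0] * (n_taps + 1)            # V[n_taps] is a terminal placeholder (0)
for _ in range(5000):               # repeated trials
    for i in range(n_taps):
        r = 1.0 if i + 1 == reward_tap else 0.0
        V[i] += alpha * (r + gamma * V[i + 1] - V[i])   # TD(0) update
# Value propagates backward: V[i] converges to gamma ** (reward_tap - 1 - i).
for i in range(reward_tap):
    assert abs(V[i] - gamma ** (reward_tap - 1 - i)) < 1e-3
```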

The event-dissected model dissects time by externally salient events, so that each discrete-time step corresponds to one inter-event interval, whose duration in conventional time may vary. Denoting the duration of the *k*th step, in the *t*-system, by Δ*t*_{k}, we can write Δ*t*_{k} = *t*_{k+1} − *t*_{k}, where *y*_{k} and *t*_{k} refer to the onset of an event *e*_{k} in the *y*- and *t*-systems, respectively. Then we can write *y*_{k+1} − *y*_{k} = *f*(*t*_{k+1}) − *f*(*t*_{k}) = Δ*y* as the constant operator's unit, and the corresponding TD error is then given as in Table 2.

In the semi-Markov model, each event, *e*_{k}, causes a state transition. At the transition, the duration of the state, Δ*t*_{k}, is probabilistically determined, where the underlying probability distribution is prespecified (possibly differently) with respect to each event *e*_{k}. Thus, Δ*t*_{k} is written as Δ*t*_{k} = *z*_{k}, where *z*_{k} is a sample of a random variable *Z*_{k} of the prespecified distribution *P*_{k}. The semi-Markov model is then expressed as an internal TD by setting the variable duration Δ*t*(*y*) of each step to *z*_{k}. In the case of discounted rewards, the discount factor is arranged to reflect the variable durations (Puterman, 1994; Bradtke & Duff, 1995) (and the reward as well, which we omit in the following) by using the constant unit Δ*t*, so the discount factor is re-expressed as γ^{Δ*t*_{k}/Δ*t*}, and thus the corresponding TD error (see Table 2) is given by

δ_{k+1} = *r*_{k} + γ^{Δ*t*_{k}/Δ*t*}*V*_{k+1} − *V*_{k}.

As mentioned earlier, Daw et al. (2006) did not use this TD error but rather a TD error of the average-reward TD formulation. This complicates the direct comparison of their semi-Markov average TD formulation with our internal TD formulation. Hence, we examine the semi-Markov approach under the discounted case in the subsequent sections (but also see section 5).

#### 3.1.2. Remarks on Different Expressions for Internal TD Error and on Time Representations in Previous TD Models.

The three models just discussed collectively represent the approaches most typically used for time representation when the operator's and the observer's times are not dissociated. If we forced the discrete-time conventional TD model to have a rigid relation to continuous time, it would be an equally dissected TD model. For example, the TD or related models used in fMRI studies are considered to be of this type if all sampled time points of the model have equal durations, including the intertrial interval (ITI). The model's time representation implies that any variability in the time events in an experiment affects the behavior of the TD model, possibly making it difficult for the model to account for neural data. A simple example can be seen in experiments with a variable ITI (Daw et al., 2006). This is usually avoided by using an episodic task schedule (Sutton & Barto, 1998), whereby the model's simulation is run ignoring the variable ITI. An equally dissected model with such an episodic schedule becomes similar to an event-dissected model once the ignored variable ITI is treated, as in the latter model, as part of the simulation with variable durations. Accordingly, the distinction between these two models is often blurred. If we let the discrete-time conventional TD model have a rather flexible relation to duration, it would become an event-dissected or semi-Markov model (see Table 2). It is possible to have an event-dissected model determine the variable duration in the same way as the semi-Markov model does and, conversely, to let the semi-Markov model determine the duration using the same method as the event-dissected model. Therefore, the most critical difference between the semi-Markov and event-dissected models is whether the discount factor is adjusted by each duration (measured by conventional time). The semi-Markov model treats a unit of operation (or duration) in the same way as the event-dissected TD does but computes the discount factor using conventional instead of internal time.

Note that the internal TD errors from the operator's view, expressed in internal time (see Table 1), may correspond to the TD error of the equally dissected internal TD (see Table 2). Once re-expressed in conventional time, the equally dissected internal TD error has the same form as the event-dissected TD error (see Table 2). This implies that the event-dissected TD model is a subclass of the equally dissected internal TD model, because it specifically defines the operator's unit as corresponding to duration among external events. Thus, this provides one justification, albeit only partially, for using the event-dissected TD model to examine neural valuation; the model makes an explicit choice using variable durations as units of the neural operation process.

### 3.2. Value Function Discounted Exponentially in Short Delay and Hyperbolically in Long Delay.

The TD framework (in the case of discounted rewards) suggests that a temporally distant reward is valued with exponential discounting (Sutton & Barto, 1998). More generally, exponential discounting has theoretically favorable properties and is often regarded as “rational” (Samuelson, 1937; Montague & Berns, 2002; Mazur, 2006). Studies in psychology, neuroscience, and economics, however, often indicate that the subject's behavior reveals not exponential but hyperbolic discounting (Ainslie, 1975; Thaler & Shefrin, 1981). Questions regarding whether exponential or hyperbolic discounting is appropriate have long been debated (Loewenstein & Prelec, 1992; Frederick, Loewenstein, & O'Donoghue, 2002; Berns, Laibson, & Loewenstein, 2007), and recent fMRI studies, in which investigators directly compare behavior with neural (BOLD) activity in different brain areas, have highlighted these issues (Montague, King-Casas, & Cohen, 2006).

A hallmark of hyperbolic discounting is choice reversal in intertemporal choice tasks (Ainslie, 1975; Thaler & Shefrin, 1981; Laibson, 1997; Frederick et al., 2002). The basic nature of hyperbolic discounting, compared to exponential discounting, is that the discount rate is relatively steep at short delays but shallow at long delays. This can account for choice reversal, as detailed below. On the other hand, Schweighofer et al. (2006) advanced an interesting, contrary viewpoint, in which they used a variant of an intertemporal choice task, but with a much shorter time range, and found that discounting is not hyperbolic but exponential. Given these studies, we first ask, regarding internal TD, whether it is possible to exhibit something similar to hyperbolic discounting in conventional time: more specifically, exponential discounting during a short delay but hyperbolic discounting during a long delay in conventional time. We show in section 3.2.2 that this is the case and also that choice reversal occurs.
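A toy calculation illustrates why this is plausible. With exponential discounting in internal time and an assumed logarithmic mapping *y* = log(1 + *t*) re-anchored at each decision point (both are illustrative assumptions, not the construction analyzed in section 3.2.2), the discount over a delay *t* from "now" becomes hyperbolic-like in conventional time, and choice reversal follows:

```python
import math

# Exponential discounting in internal time y, with a concave mapping
# y = f(t) = log(1 + t) re-anchored at the decision point "now" (both are
# illustrative assumptions). With internal discount rate kappa, the present
# value of a reward R delayed by `delay` is
#   R * exp(-kappa * (f(now + delay) - f(now))),
# which at now = 0 and kappa = 1 equals the hyperbolic R / (1 + delay).
kappa = 1.0

def present_value(R, delay, now=0.0):
    internal_distance = math.log(1.0 + now + delay) - math.log(1.0 + now)
    return R * math.exp(-kappa * internal_distance)

# Choice reversal: small-sooner (SS) vs. large-later (LL) reward.
SS, LL = (2.0, 1.0), (3.0, 3.0)     # (amount, delay in conventional time)
# Evaluated at now = 0, the sooner reward wins ...
assert present_value(*SS, now=0.0) > present_value(*LL, now=0.0)
# ... but evaluated well in advance, the preference reverses.
assert present_value(*SS, now=10.0) < present_value(*LL, now=10.0)
```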

At first glance, this finding may appear at odds with several fMRI studies investigating discounting functions (Tanaka et al., 2004; McClure, Laibson, Loewenstein, & Cohen, 2004; McClure, Ericson, Laibson, Loewenstein, & Cohen, 2007). Those studies showed that multiple brain areas generally play a role in such a decision and further suggested that different areas play a dominant role in decision making on different timescales: for example, two valuation subsystems, in which one is more involved in immediate reward and the other in delayed reward. This suggestion was made in relation to the view that even when each subsystem uses exponential discounting, hyperbolic discounting may occur if multiple subsystems are involved in the decision (Laibson, 1997; Montague et al., 2006; Fudenberg & Levine, 2006; Berns et al., 2007). They thus examined whether two subsystems, each using exponential discounting, can account for the behavior in an intertemporal choice task better than a single valuation system using exponential discounting (McClure et al., 2007; Glimcher, Kable, & Louie, 2007). In contrast, our findings (shown in section 3.2.2) indicate that under the internal TD formulation, single exponential discounting (in internal time) could induce a hyperbolically discounted behavior (in conventional time), suggesting that multiple subsystems are unnecessary. This is broadly consistent with an alternative view based on experimental findings (Kable & Glimcher, 2007), in which the authors claim that there are no differentially activated multiple systems but instead a single system, over several brain areas, that directly encodes a subjective or hyperbolically discounted value (but see also a recent criticism: Hare, O'Doherty, Camerer, Schultz, & Rangel, 2008). The internal TD might underlie the neural process of the single system.

Nevertheless, we consider that even when the internal TD is the underlying neural mechanism, our finding does not necessarily contradict the possible existence of multiple subsystems in neural valuation (Laibson, 1997; Tanaka et al., 2004; McClure et al., 2004, 2007; Fudenberg & Levine, 2006). To make this point, we address two questions in section 3.2.3, assuming that the internal TD underlies or generates the behavioral data. The first question is whether the summation of two exponential discounting functions produces a better fit to the data than single exponential discounting, as reported in previous experimental studies. Second, we present one way to decompose the internal TD into multiple subsystems in section 3.2.1 and ask whether the subsystems show differential dominant roles at different timescales, as reported in previous experimental studies.

#### 3.2.1. Mathematical Formulation.

The aim is to construct *k*(*f*(*t*)), which yields the discounting function discussed above. If *k*(*f*(*t*)) behaves approximately like equation 3.2, we obtain our desired *k*(*f*(*t*)). Here, it is assumed that the origin, when *t* is (nearly) equal to zero, is the time of valuation. A good approximation for such a construction (Zhang, 1996) is a mapping that behaves approximately linearly for a sufficiently small *t* and approximately logarithmically for a sufficiently large *t*. Hence, we expect that the following discounting function, with appropriate scaling and translating parameters *a* and *b*, exp(−*f*(*t*)/τ) = (1 + *bt*)^{−*a*/τ}, behaves similar to equation 3.2. The corresponding internal time *y* = *f*(*t*) is then given by *f*(*t*) = *a* log(1 + *bt*), which is used in both sections 3.2.2 and 3.2.3.

Consider *n* subsystems, each of which contributes to a single value function *V*(*y*), with a set of weighting functions, *w*_{i}(*y*), normalized by Σ_{i} *w*_{i}(*y*) = 1. Then the single value function *V*(*y*) can be decomposed as *V*(*y*) = Σ_{i} *V*_{i}(*y*), where the value function of each system *i* is given by *V*_{i}(*y*) = *w*_{i}(*y*)*V*(*y*). When *w*_{i}(*y*) varies depending on *y*, *V*_{i}(*y*) is expected to contribute to *V*(*y*) differentially at different timescales. Note that the TD error of each subsystem, denoted by δ_{i}, can be defined analogously to that of the whole system (equation 3.3), so that each subsystem can acquire its own value function using this TD error. This subsystem TD error is an interesting quantity in its own right and worthy of future investigation. As a specific example of the decomposition, we use a sort of softmax, letting *w*_{i}(*y*) ∝ exp(*v*_{i}(*y*)). In the example in section 3.2.3, we further simplify this equation using a single function *v*_{0} for *n* = 2. Below, we used the simplest choice, *v*_{0}(*y*) = *cy*, where *c* is a positive constant.
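As a numerical sketch of this construction, the snippet below assumes, purely for illustration, the mapping *f*(*t*) = *a* log(1 + *bt*) with arbitrary parameter values (*a* = 0.5, *b* = τ = 1); it checks that the resulting discounting is close to exponential at short delays and close to a shallow power law (hyperbolic-like) at long delays.

```python
import math

# Illustrative internal-time mapping: approximately linear for small t,
# logarithmic for large t (a and b are assumed, arbitrary parameters).
def f(t, a=0.5, b=1.0):
    return a * math.log(1.0 + b * t)

# Exponential discounting in internal time, read out in conventional time.
def V(t, tau=1.0):
    return math.exp(-f(t) / tau)

# Short delays: V(t) is close to the exponential discounting exp(-a*b*t).
short = abs(V(0.01) - math.exp(-0.5 * 0.01))

# Long delays: V(t) follows a shallow power law (hyperbolic-like),
# while the exponential extrapolation would have decayed to essentially zero.
long_ratio = V(1e4) / (1e4 ** -0.5)
print(short, long_ratio)
```

With these choices, V(*t*) = (1 + *t*)^{−0.5}, so the short-delay error is tiny while the long-delay value tracks the power law almost exactly.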

#### 3.2.2. Discount Function and Choice Reversal.

In the following, *y*_{1} indicates the time of reward, *y* indicates the time of valuation, and below we also use *y*_{0} and *t*_{0}, with *y*_{0} = *f*(*t*_{0}) as the origin. In this case, the value function becomes equivalent to the discounting function by setting *R* = 1, which is done here. The value function *V*(*y*; *y*_{1}) is reexpressed in conventional time as *V*(*f*(*t*); *f*(*t*_{1})) (equation 3.4), where a specific form of *f* is chosen based on mathematical considerations (see section 3.2.1).

In Figure 2A, the discounting function is shown (solid line) in conventional time. It is computed using *V*(*f*(*t*_{0}); *f*(*t*_{1})) as a function of *t*_{1}, where the time of origin *t*_{0} is the time of valuation. The discounting function has a steep decay over a short period, approximately exponentially discounted, but a slow decay over a long period, approximately hyperbolically discounted. In Figure 2B, *y* = *f*(*t*) is shown as a solid line; *y* is almost linearly proportional to *t* over a short period, whereas it becomes logarithmic over a long period. The slope d*f*/d*t* (dashed line) becomes increasingly small for large values of *t*. These effects underlie the coappearance of exponential and hyperbolic discounting at different delays in *V*(*y*) = *V*(*f*(*t*)).

Next, we show that choice reversal occurs with this value function in an intertemporal choice task. In this task, the subject is asked to choose one of two options: A or B. Each choice has a pair of reward magnitude and delay, denoted as (*R*, *d*_{t}); for example, choices A and B are set as (*R*, *d*_{t}) = (10, 50) and (50, 80), respectively (see Figure 2C). In this example, the time delay to the earlier choice (henceforth called the initial delay) is 50. Choice reversal refers to the behavioral phenomenon whereby the subject reverses the choice as the initial delay increases (Ainslie, 1975; Thaler & Shefrin, 1981). In contrast with the above example, consider another example where the initial delay is 0, so that choices A and B are set as (*R*, *d*_{t}) = (10, 0) and (50, 30), respectively. Let us call the first and second examples the “longer” and “shorter” cases, respectively, referring to the length of the initial delay. Suppose the subjects choose A in the shorter case. With exponential discounting (in the *t*-system), A is then also chosen in the longer case, and thus there is no choice reversal, due to the shift-invariant property of exponential discounting (i.e., *V*(*t*; *t*_{1}) = *V*(*t* − *c*; *t*_{1} − *c*), where *c* indicates a constant time shift in *t*). In contrast, hyperbolic discounting can account for choice reversal (see Figure 2C). In this discounting curve, the value of the curve at *t* is read as the discounted value when *t* is the time of valuation. Therefore, for the shorter case, we compare the discounted value magnitudes of choices A and B at *t* = 50 and see that A is chosen. For the longer case, at *t* = 0, B is now chosen, and thus the choice is reversed.

The discounting curve of the internal TD value function (see equation 3.4) is shown in conventional time (*t*-system) in Figure 2D. By reading the discounting curve in the same way as in Figure 2C, we see that choice reversal occurs between the two cases. We show the discounting curves of the same internal TD in internal time (*y*-system) for the longer case in Figure 2E and the shorter case in Figure 2F. The curves are simply exponentially discounted in internal time. Reading them at the time of valuation, *y* = 0, we see that B is chosen in the longer case, whereas A is chosen in the shorter case, indicating a choice reversal. In the *y*-system, the reversal occurs because the delays of the two choices differ considerably in internal time between the two cases, although their difference is constant in conventional time (i.e., 80 − 50 = 30 − 0 = 30).
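The reversal can also be checked numerically. The sketch below assumes, purely for illustration, exponential discounting in an internal time of the form *y* = *a* log(1 + *b*·delay); with the arbitrary parameters chosen here (*a* = *b* = τ = 1), the value reduces to *R*/(1 + delay).

```python
import math

def value(R, delay, a=1.0, b=1.0, tau=1.0):
    # Exponential discounting in internal time y = a*log(1 + b*delay);
    # with these illustrative parameters this reduces to R / (1 + delay).
    y = a * math.log(1.0 + b * delay)
    return R * math.exp(-y / tau)

def choose(initial_delay):
    v_a = value(10.0, initial_delay)         # choice A: smaller, earlier
    v_b = value(50.0, initial_delay + 30.0)  # choice B: larger, later
    return "A" if v_a > v_b else "B"

print(choose(0.0), choose(50.0))  # prints "A B": the preference reverses
```

The shorter case (initial delay 0) yields A, while the longer case (initial delay 50) yields B, reproducing the reversal with a single discounting function.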

The discounting curve in the *y*-system (see Figures 2E and 2F) is given by *V*(*y*; *y*_{1}) as a function of *y*, representing the time of valuation from *y*_{0} to *y*_{1} (or equivalently as a function of *y*_{1} − *y*). For the curve in the *t*-system (see Figure 2D), first recall that the mapping function *y* = *f*(*t*) is defined by assuming that the time of valuation is the origin. Hence, the curve is given by *V*(*f*(*t*_{0}); *f*(*t*_{1} − *t*)) as a function of *t*, so that we can directly read the discounted value from Figure 2D when the time of valuation is at *t*. This expression and *V*(*f*(*t*); *f*(*t*_{1})) can be made equivalent with an appropriate redefinition of the mapping; in general, however, *V*(*f*(*t*); *f*(*t*_{1})) ≠ *V*(*f*(*t*_{0}); *f*(*t*_{1} − *t*)). The inset of Figure 2E shows *V*(*f*(*t*); *f*(*t*_{1})), which is clearly different from *V*(*f*(*t*_{0}); *f*(*t*_{1} − *t*)) in Figure 2D. The inset of Figure 2F shows *V*(*f*(*t*_{0}); *f*(*t*_{1} − *t*)), which corresponds to the curve in the main panel of Figure 2F and is equivalent to a curve that is obtained by resetting *t*_{0} to 50 in Figure 2D. This serves as a good reminder that the time of valuation must be carefully considered when defining *y* = *f*(*t*) (see section 4.2).

#### 3.2.3. Discount Functions with Multiple Systems.

Here we address the two questions stated above, assuming that the internal TD discounting curve underlies behavior. First, can the internal TD be better approximated by two subsystems, each of which follows exponential discounting (in the *t*-system), than by a single exponential discounting system (in the *t*-system)? To compare the two fits, we used the Akaike information criterion, as the double-exponential model contains the single-exponential model as a special case. We found that the two subsystems with double exponential discounting fit the internal TD discounting curve significantly better than did the single exponential discounting (Figure 3A; see the legend for details of this fit). Thus, when the internal TD underlies the behavioral data and the data are examined as in the previous experiments, the discounting curve of the data could be judged as consistent with double rather than single exponential discounting (McClure et al., 2007; Glimcher et al., 2007).

Second, can the internal TD value function be constructed from multiple subsystems? If so, does each subsystem play a dominant role at a different timescale, as shown in the previous experiments? The internal TD value function was first decomposed into the value functions of two subsystems, V1 and V2 (see Figure 3B and section 3.2.1). V1 predominantly contributed to representing the value function at short delays, whereas V2 predominantly contributed at long delays (see Figures 3B and 3C). Thus, each subsystem contributed differentially to valuation at different timescales and behaved similarly to the previously described multiple subsystems (Laibson, 1997; Tanaka et al., 2004; McClure et al., 2004; Montague et al., 2006; Fudenberg & Levine, 2006; Berns et al., 2007; McClure et al., 2007). On the other hand, neither of the subsystems followed exponential discounting (the inset in Figure 3B), partly because we did not perform any fine-tuning. It would be interesting to see whether there is a decomposition in which each subsystem approximately follows exponential discounting. From another viewpoint, however, it is not necessary for each subsystem to follow exponential discounting, as this theoretically favored property is preserved by the internal TD as a whole in internal time. The division of the internal TD value function into subsystems may reflect the different neural properties of each subsystem (Berns et al., 2007). It is noteworthy that each subsystem can possibly learn its value function by directly using its own TD errors (see equation 3.3).
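A minimal sketch of such a two-subsystem decomposition, assuming *V*(*y*) = exp(−*y*/τ), softmax-style weights, and the simplest choice *v*_{0}(*y*) = *cy* (all parameter values illustrative):

```python
import math

TAU, C = 1.0, 1.0  # illustrative constants

def V(y):
    # single value function: exponential discounting in internal time
    return math.exp(-y / TAU)

def weights(y):
    # softmax-style pair of weights driven by v0(y) = C*y
    w2 = 1.0 / (1.0 + math.exp(-C * y))
    return 1.0 - w2, w2

def decompose(y):
    w1, w2 = weights(y)
    return w1 * V(y), w2 * V(y)  # V1(y) + V2(y) == V(y) by construction

for y in (0.1, 1.0, 5.0):
    v1, v2 = decompose(y)
    print(y, v1, v2)
```

With these choices, subsystem 1 contributes relatively more at short internal delays and subsystem 2 at long delays, while their sum preserves exponential discounting in internal time.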

## 4. Further Remarks on Internal TD's Implications

### 4.1. Effect of Internal Time's Noise on TD Error.

If the internal time process is confounded with noise, internal TD behaves differently from conventional TD, even when the units of the two times are the same (Δ *y* = Δ *t*). Indeed, recent experimental studies (Fiorillo, Newsome, & Schultz, 2008; Kobayashi & Schultz, 2008) indicate, contrary to the prediction of the conventional tapped-delay-line TD model, that phasic DA activity (putative TD error) appears at reward onset and grows as the time interval between a conditioned stimulus (CS) and reward onset (the interstimulus interval, ISI) increases, even when the ISI is fixed for a given CS. Conceptually, such a possibility has already been raised (Gallistel & Gibbon, 2000; Montague & Berns, 2002) with the suggestion that increases in the ISI induce more uncertainty in reward prediction. We now consider a situation in which only noise differentiates the *y*-system and the *t*-system. We first provide a mathematical formulation and then show simulation results. The purpose of the simulation is to show the basic phenomenon; we did not fine-tune its parameters.

#### 4.1.1. Mathematical Formulation.

Without noise, an internal time *y* can be uniquely mapped to a conventional time *t* by *t* = *f*^{−1}(*y*), but in the presence of noise, this mapping is no longer one-to-one. Let us denote such a noisy internal time by ỹ. If we regard ỹ as a function of *t* and let ỹ(*t*) follow a probability density *P*(ỹ | *t*), the value function is then given by the expectation of *V*(ỹ(*t*)) under *P*, which is different from the one without noise. Similarly, TD error (in the operator's view, expressed in the *t*-system) is defined from this value function.

In the simulation below, we let *f* be the identity, *y* = *t*. An additive noise, denoted by ε, is assumed, giving ỹ = *y* + ε, where ε is assumed to be independent and identically distributed (i.i.d.) with zero mean. The noise occurs in each increment of the discrete step and accumulates over all of the increments: denoting the number of increments at *t* by *i*(*t*), the accumulated noise at *t* is the sum of *i*(*t*) i.i.d. noise terms. The TD error in the observer's view, expressed in the *t*-system, is then computed with this noisy internal time.

#### 4.1.2. Simulation Results.

As an example, we simulate a simple Pavlovian task (Fiorillo et al., 2008; Kobayashi & Schultz, 2008) using episodic schedules (see Figure 4A). With noise, the internal time of reward in each trial, denoted by ỹ_{R}, varies from the actual conventional time of reward *y*_{R} (= *t*_{R}), and this induces TD errors (see Figure 4B). TD error appears at the time of reward, and the magnitude of the error increases as the ISI increases; the corresponding TD error at the time of the CS decreases. These results match the basic findings of recent studies (Fiorillo et al., 2008; Kobayashi & Schultz, 2008).

The distribution of ỹ_{R} is shown in Figure 4C. It is symmetrically unimodal due to the symmetric noise assumed in the simulation. As the ISI increases and the noise thus accumulates, the distribution of ỹ_{R} widens, resulting in a larger TD error at the time of reward (see Figure 4B). TD error also has suppressive dips just before and after the time of reward (see Figure 4B). This can be understood as the value function becoming more diffuse around the time of reward as the ISI increases (see Figure 4D), where the value function in the no-noise case is shown by the thick dashed line for comparison.

This simulation result is still primitive but clearly demonstrates that the internal TD model shows phasic DA activity as the ISI increases, even when reward is perfectly timed in conventional time. Temporal imprecision is speculated to be a cause of this DA activity in experimental studies (Fiorillo et al., 2008; Kobayashi & Schultz, 2008), and based on this notion, a mathematical description of this DA activity has been developed (Fiorillo et al., 2008). Related to this, although the units were the same in the two times (Δ *y* = Δ *t*) in this simulation, in reality the internal time unit might also increase at a longer ISI or later in the ISI (Δ *y* > Δ *t*) (Fiorillo et al., 2008). The suppressive dips of DA activity in the simulation have not been experimentally observed, to our knowledge, but a gradual decrease in the baseline DA activity before reward onset has been reported (see Figure 5 of Fiorillo et al., 2008). It would be interesting to extend the internal TD model by taking all of these features into account.
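In the spirit of this simulation (not a reproduction of the original; the noise model, learning rate, and all parameter values below are assumed), a tabular TD(0) sketch over internal-time states shows the same qualitative effect:

```python
import random

def mean_reward_td_error(isi, n_trials=3000, alpha=0.1, gamma=0.98,
                         step_noise=0.5, seed=0):
    """Average TD error at the moment of reward after learning, when the
    internal clock accumulates i.i.d. noise on each conventional time step."""
    rng = random.Random(seed)
    max_state = 3 * isi + 10
    V = [0.0] * (max_state + 1)
    recent = []
    for trial in range(n_trials):
        # internal step index at which the (fixed-conventional-time) reward arrives
        drift = sum(rng.gauss(0.0, step_noise) for _ in range(isi))
        i_r = max(1, min(max_state, round(isi + drift)))
        delta_r = 0.0
        for j in range(i_r):
            last = (j == i_r - 1)
            r = 1.0 if last else 0.0
            v_next = 0.0 if last else V[j + 1]
            delta = r + gamma * v_next - V[j]
            V[j] += alpha * delta
            if last:
                delta_r = delta
        if trial >= n_trials - 500:  # average over the final trials
            recent.append(delta_r)
    return sum(recent) / len(recent)

short = mean_reward_td_error(isi=4)
long_ = mean_reward_td_error(isi=40)
print(short, long_)
```

Because the accumulated noise grows with the ISI, the reward's internal time is more dispersed for the longer ISI, the learned value just before reward is smaller, and the average TD error at the time of reward is larger.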

### 4.2. Internal Time Construction.

So far, internal time has been introduced statically: *y* = *f*(*t*) is given, and the related quantities are derived from it accordingly. In this section, we discuss the basic notions for constructing internal time dynamically. In fact, the Pavlovian case in the previous section is such an example, although the only dynamic factor was the noise.

#### 4.2.1. Dynamic Construction of Internal Time.

Internal time can be constructed dynamically by accumulating its increments d*f*/d*t*, that is, *f*(*t*) = ∫_{0}^{*t*} (d*f*/d*s*) d*s* (equation 4.1). In this dynamic view, d*f*/d*t* is an active process that can potentially be modulated by several factors (e.g., external events, internal processes, or noise). “Present” time has a special status within all conceptions of time, given that all humans and animals live, by definition, in their present time. At least two components are needed for a dynamic construction: the construction of ongoing time and that of prospective, or future, time. The former constructs internal time concurrently with “present” time, so that *f*(*t*) can be constructed only up to or around the present time (denoted by *t*_{p}, or *y*_{p} = *f*(*t*_{p})). The latter is needed to construct internal time in the prospective use of time, for future time (*y* > *y*_{p} or *t* > *t*_{p}).

Suppose that *t*_{p} and *t*′_{p} are two temporal moments of the present time. Unless the two components of the dynamic internal time construction are exactly the same, we have *f*(*t*; *t*_{p}) ≠ *f*(*t*; *t*′_{p}), where *t*_{p} and *t*′_{p} are added to *f* to indicate the present time used for the internal time construction. Given this, it is conceivable that the internal time anticipated from *t*_{p} to *t*′_{p} (i.e., prospectively) can be different from the internal time elapsed from *t*_{p} to *t*′_{p} (i.e., retrospectively) (Roesch & Olson, 2005; Genovesio, Tsujimoto, & Wise, 2006; Doya, 2008), which raises interesting possibilities (see section 5).

#### 4.2.2. The Present Time and the Origin of Time.

Here we discuss potential relations between the present time and the origin of time. First, we note that if we change the origin, *f* changes. Indeed, the reference frame and origin of time must be defined so as to define the mapping *y* = *f*(*t*) in the first place (here the origin can be nonzero, being the corresponding point between *y* and *t*). There are two ends of a spectrum that can generally be used to define them. At one end is the absolute (time) framework, which sets the origin at one time and fixes it forever. Although the absolute framework is possible, its direct application to internal TD seems infeasible. At the other end is the relative (time) framework, which lets the origin change as time goes by. Let us write the relative framework as *f*(*t*; *t*_{o}), where *t*_{o} is the time (as defined by the absolute framework) used as the origin of the relative time framework. The simplest example is to set *t*_{o} at the present time (defined in the absolute framework), that is, *t*_{o} = *t*_{p}. But in general, *t*_{o} does not have to be equal to *t*_{p}, so we regard it as *t*_{o}(*t*_{p}) (i.e., a function of *t*_{p} but left unspecified; see section 5).

The present time (*t*_{p} and *y*_{p}) should be expressed in the relative framework *f*(*t*; *t*_{o}), where *t*_{o} denotes the origin time, so that it can act as the origin or corresponding point. Then we should examine d*f*/d*t* for the dynamic construction of ongoing time. First, if d*f*/d*t* is constant, internal TD is essentially the same as conventional TD (see the appendix). Next, there are two cases in which internal TD differs from conventional TD in the dynamic construction, that is, when d*f*/d*t* is no longer constant. The first case is that d*f*/d*t* becomes dependent on the origin *t*_{o}, so that the form of *f* may change over time (conventional time in the absolute framework). The second case is that the function *t*_{o}(*t*_{p}) may change; in this case, even if the form of *f*(*t*; *t*_{o}) is unchanged, the value of *f* differs as a function of *t*_{o}. Taking the two cases together, we have *f*(*t*) = *f*(*t*; *t*_{o}(*t*_{p})), so *f*(*t*) is now described in the absolute framework; in other sections, *f*(*t*) can also be understood in this way, although it was originally generated from a function in the relative framework. This dynamically constructed internal time may in principle have a variable duration in conventional time.

A similar argument applies to the dynamic construction of future time. The integral (similar to equation 4.1) may now go from the present time to a future time. This dynamic construction can be done only mentally, in a way perhaps similar to mental time travel (Arzy et al., 2008; Boyer, 2008) or to constructing a representation of future events (Szpunar et al., 2007; Liberman & Trope, 2008). Alternatively, the integral may be given in a static way, perhaps purified as a representation of prospective time through repeated experience of imagining the future. Either way, the unit of this internal time for future time may also have a variable duration in conventional time.

### 4.3. Internal Time Modulation and Serotonin's Internal Time Hypothesis.

Here we summarize the effect of internal time modulation on neural valuation and then discuss the internal time hypothesis of serotonergic neuronal function.

#### 4.3.1. Internal Time Modulation.

Internal time modulation can be expressed through a change or modulation in *f*(*t*) (long timescale) or, equivalently, in d*f*/d*t* (short timescale). For the long timescale, we can generally state that the value function *V*(*f*(*t*)) is discounted more heavily at time *t* when internal time goes faster, that is, when *f*(*t*) is larger (see Figure 5A).

Internal time modulation expressed as a change in d*f*/d*t* illustrates a short-timescale property. It changes the duration of one unit in the *t*-system, Δ *t*(*y*), in the operator's view or, interchangeably, changes the discount factor in the observer's view (see Table 1 and the appendix). When internal time goes faster (d*f*/d*t* is larger), more units of the process are needed for the same time period of the *t*-system (see Figure 5B, a) and, interchangeably, the discounting is stronger (the discount factor is smaller, which is partly dependent on the time constant τ; see Figure 5B, b). Suppose that d*f*/d*t* is modulated online in an experimental trial. Consider modulating d*f*/d*t* to be smaller while waiting for the reward, that is, while knowing that reward will eventually come but finding that it has yet to come (in the *t*-system); the smaller d*f*/d*t* leads to a longer unit of process in the *t*-system (a larger Δ *t*(*y*)), so that the internal TD can still reside in the same time step rather than advancing to the next step. Such a modulation makes the TD model's valuation more resistant to temporal variation of rewards and events. This simple example, although oversimplified (see section 5), reveals the advantage of online internal time modulation.
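The short-timescale effect can be sketched with a linear internal clock whose rate stands in for d*f*/d*t* (all values below are illustrative): a slower clock yields longer unit durations in the *t*-system and shallower discounting at a given conventional delay.

```python
import math

def unit_duration(rate, dy=1.0):
    # conventional-time duration of one internal unit dy, given rate = df/dt
    return dy / rate

def discount_at(t, rate, tau=1.0):
    # linear internal clock y = rate * t (an illustrative simplification)
    return math.exp(-(rate * t) / tau)

fast, slow = 2.0, 0.5  # hypothetical "faster" and "slower" internal clocks
print(unit_duration(slow), unit_duration(fast))
print(discount_at(5.0, slow), discount_at(5.0, fast))
```

Slowing the clock from 2.0 to 0.5 quadruples the unit duration and greatly weakens discounting at the same conventional delay.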

#### 4.3.2. Internal Time Hypothesis of Serotonin Functions.

Here, we propose an internal time hypothesis of serotonin function: serotonin neurons modulate internal time in neural valuation. A more restricted version of the hypothesis states that an increase in serotonin neural activity makes internal time go more slowly (leading to a smaller *f*(*t*) and a smaller d*f*/d*t*). Admittedly, it is nearly impossible to describe all serotonin functions with a single theory (Daw, Kakade, & Dayan, 2002), given the extremely complicated nature of serotonin activities (Jacobs & Azmitia, 1992; Buhot, 1997; Sershen, Hashim, & Lajtha, 2000; Cardinal, 2006; Cools, Roberts, & Robbins, 2008). Nevertheless, a computational theory helps to clarify our normative understanding, and in this spirit, several hypotheses regarding serotonin function have been proposed in relation to TD models (Daw et al., 2002; Doya, 2002; Dayan & Huys, 2008). The internal time hypothesis follows this approach, largely inspired by the discount factor hypothesis (Doya, 2002). The discount factor hypothesis rests on the experimental observations that the tendency to make impulsive decisions decreases when serotonin activity is higher (Mobini et al., 2000; Denk et al., 2005; Winstanley, Eagle, & Robbins, 2006; Schweighofer et al., 2008); it thus proposes, based on conventional discrete-time TD, that an increase in serotonin neural activity leads to an increase in the discount factor, thereby reducing impulsive decisions.

When *f*(*t*) is restricted to be linear, internal and conventional TDs become largely equivalent, except for issues of interchangeability between the unit and the discount factor (see the appendix), and this makes the internal time hypothesis (in its restricted version) equivalent to the discount factor hypothesis, at least in the observer's view. In this view, the internal time hypothesis predicts that elevated serotonin activity makes the discount factor larger, in the same manner as does the discount factor hypothesis. In contrast, under the operator's view, the internal time hypothesis predicts that elevated serotonin activity makes the unit duration in the *t*-system (Δ *t*(*y*)) larger. Indeed, this is consistent with several experimental observations that putative serotonin neural activity is tonically elevated while waiting for reward (Nakamura, Matsumoto, & Hikosaka, 2008; Miyazaki, Miyazaki, & Doya, 2007; Mizuhiki, Inaba, Toda, Ozaki, & Shidara, 2008). Behaviorally, not exponential but hyperbolic temporal discounting is often observed in normal subjects, implying that *f*(*t*) is nonlinear (see section 3.2); thus, the nonlinearity of *f*(*t*) is important for understanding the effects of serotonin activity on impulsivity. Such impulsivity might be due to a differential modulation of time at different delays (Wittmann & Paulus, 2008), implying that time duration (in the operator's view) or the discount factor (in the observer's view) is modulated differentially at different delays. Furthermore, this differential modulation may function differently in different circuits, such as different cortico-basal ganglia loops (Nakahara et al., 2001; Doya, 2002).

There is also an interesting connection of our hypothesis to another recent hypothesis (Dayan & Huys, 2008) suggesting that elevated serotonin activity prunes a decision tree specifically for negatively estimated outcome states and choices. Our hypothesis is closely related to this (except for the specificity for negative outcomes) if we move from a cache system (discussed as a TD model) to a decision tree system (as used in model-based TD learning), that is, if the internal time modulated by serotonin activity is regarded as the depth of a tree search in valuation rather than as the time unit of a cache system.

Finally, several caveats must be mentioned. First, a hypothesis of a single role for serotonin activity in decision making may be misleading if it is taken as excluding other functions, given the rich variety of serotonin functions. Other serotonin neurons appear to show rather tonically suppressed activity while waiting for reward (Nakamura et al., 2008). They also respond diversely to external events or behavioral correlates as recently reported (Ranade & Mainen, 2009). Second, some literature suggests that other neural modulators, such as DA and cholinergic activities, rather than serotonergic activities, are more involved in subjective interval timing (Buhusi & Meck, 2005; Matell, Bateson, & Meck, 2006; Meck, 1996). As discussed in section 1, it is not yet clear how we should map those findings onto the current internal TD formulation. Further studies are required to develop TD formulations that can explicitly include a subjective timing system with regard to the time of neural valuation processes and then to examine those findings in the TD formulations. Related to this issue, our hypothesized internal time modulation by serotonin neural activities can be viewed as a consequential rather than a causal effect (Ranade & Mainen, 2009; Ho, Velázquez-Martínez, Bradshaw, & Szabadi, 2002), as those activities may directly affect the temporal progression of neural activities that represent states in TD models. This possibility is of particular interest for future studies (see section 5.5).

## 5. Discussion

In this study, we have investigated the internal-time TD formulation that uses the operator's time, distinct from the observer's time, to construct a continuous-time value function. We called the operator's time internal time and the observer's time (the time usually used in experiments) conventional time. We focused on formulating an internal-time TD framework and investigating its consequences to better understand neural value-based decision making.

### 5.1. Operator-Observer Problem.

The internal TD formulation explicitly deals with the operator-observer problem, whether the TD model is used for modeling the operation processes of neural valuation or for providing a descriptive model of the processes in the observer's view (see section 2.4). The same issue regarding differential modeling in the operator's and the observer's views also applies to other types of TD algorithms (e.g., algorithms using action-value functions) if the time system of the operator is different from that of the observer. More broadly, it also applies to other reinforcement learning algorithms, as treating rewards with time delays is often central to these algorithms, and it also arises in processes other than neural valuation when the operator's and the observer's time are different. The approach taken in this study is thus potentially applicable to modeling many neural processes and, in a broader context, belongs to studies examining the effects induced by linking different time systems (see e.g., Acebron, Bonilla, Vicente, Ritort, & Spigler, 2005).

In neural valuation, the operator-observer problem does not have to be considered if the units of the operator and observer times (Δ *y* and Δ *t*) are always the same (with a fixed linear constant). In this case, the operations in that valuation are equivalent for the two views. Or if we assume that the two units are equivalent, conventional TD (more precisely, equally dissected conventional TD) can be used.

When the two units are different in some way, the operator and observer models are different in principle. If we choose to model a given process in the observer's view, the discrete conventional time unit can still be used. In this case, note that the discrete-time TD error in the observer's view is expressed using a discount factor that varies with the operation processes (see Table 1). This has several implications. For example, when we observe a DA response (putative TD error) in the discrete conventional time unit, it might be better modeled using variable discount factors. In this regard, it is interesting to note that DA activity was found to follow not exponential but hyperbolic discounting as the ISI increased (Kobayashi & Schultz, 2008). Also, an observation with a conventional time unit (Δ *t*) may correspond to observing multiple iterations of the operator's states concurrently (if Δ *t* > Δ *y*) or the same state repeatedly (if Δ *t* < Δ *y*). The conventional equally dissected TD model, which has been extensively applied to neural valuation, assumes that the unit of the neural operation is fixed in conventional time. This assumption might reasonably hold when the environment is quite stable and the time range (e.g., that of the ISI) is quite short.
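How a variable discount factor arises in the observer's view can be illustrated with an assumed logarithmic-like internal time (parameters arbitrary): the one-step discount factor measured in unit steps of conventional time then grows toward 1 as the delay increases, a hyperbolic-like signature.

```python
import math

def f(t, a=1.0, b=1.0):
    # assumed internal-time mapping: linear at short delays, logarithmic at long
    return a * math.log(1.0 + b * t)

def observed_gamma(t, tau=1.0):
    # one-step discount factor an observer measures with unit steps of t
    return math.exp(-(f(t + 1.0) - f(t)) / tau)

gammas = [observed_gamma(t) for t in (0.0, 10.0, 100.0)]
print(gammas)  # increases toward 1 as the delay increases
```

Even though the operator applies a fixed exponential discount in internal time, the observer working in conventional time sees a discount factor that changes with the delay.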

### 5.2. Temporal Discounting.

Temporal discounting represents the long-timescale property of TD models (in the case of discounted rewards). The internal TD preserves a theoretically favored discounting (exponential discounting) in internal time, but this discounting is expressed differently in conventional time when internal time is different from conventional time. From this perspective, traditional debates over either exponential or hyperbolic discounting using conventional time should be understood as based on the choice of a specific time system and thus on the observer's view. Describing the discounting in the observer's view has many advantages. We can certainly ask whether subjects behave “rationally” (exponentially) or “subjectively” (hyperbolically) in the observer's view. It should not, however, be taken for granted as a description from the operator's viewpoint, because the description depends on what time system is used in the operation process. It is possible that the subjects are evaluating “rationally” (i.e., exponentially discounting in internal time) but can still be observed as evaluating “subjectively” (hyperbolically discounting in conventional time), as we demonstrated with the occurrence of choice reversal (see section 3.2.2). Furthermore, it is also possible that the subjects exhibit exponential and hyperbolic discounting at different delays when observed in conventional time, even though they perform only exponential discounting in internal time.

For hyperbolically discounted behavior, it has been debated whether there is a single underlying neural system that directly encodes a hyperbolically discounted value (Kable & Glimcher, 2007) or multiple (often two) subsystems (e.g., a cognitive versus emotional, or rational versus subjective, system), whereby at least one system is usually regarded as obeying exponential discounting (Laibson, 1997; McClure et al., 2004, 2007; Montague et al., 2006; Fudenberg & Levine, 2006; Berns et al., 2007). Because the internal TD can induce choice reversal using only a single exponential discounting (in internal time), it is broadly consistent with the single-system theory. We also showed in section 3.2.3, however, that when the internal TD is decomposed into two subsystems, its discounting curve can still be consistent with the multiple-systems theory, based on previous experimental studies. In the decomposition, only the whole system preserves exponential discounting in internal time; each subsystem need not use exponential discounting (in either internal or conventional time). These notions call for future experimental examination in several directions, for example, whether the internal TD underlies “the” single system, or whether it is decomposed into multiple subsystems, none of which obeys exponential discounting in conventional time.
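Choice reversal itself is easy to reproduce with any hyperbolic-looking discount in conventional time (which, per section 3.2.2, may arise from a single exponential discounter in internal time). A minimal sketch with hypothetical reward sizes, delays, and discount rate:

```python
def hyperbolic_value(reward, delay, k=1.0):
    # Hyperbolic discounting as seen in conventional time (illustrative k)
    return reward / (1.0 + k * delay)

small_soon, large_late = 5.0, 10.0
gap = 5.0  # the larger reward always arrives 5 time units after the smaller one

# Far from both rewards, the larger, later reward is preferred ...
far = 20.0
prefer_large = hyperbolic_value(large_late, far + gap) > hyperbolic_value(small_soon, far)

# ... but close to the small reward, the preference reverses.
near = 1.0
prefer_small = hyperbolic_value(small_soon, near) > hyperbolic_value(large_late, near + gap)
```

Both comparisons come out true: preference flips from the large-late to the small-soon reward as the rewards draw near, the signature of choice reversal.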

The form of *f*(*t*) used in the demonstration of choice reversal was derived purely based on mathematical intuition; we do not have any direct experimental evidence supporting the use of *f*(*t*). Interestingly, though, a similar form of *f*(*t*) was recently suggested by Peter Shizgal and his colleagues in their research on opportunity cost (personal communication, November 2008; Solomon, Conover, & Shizgal, 2007). A pressing experimental question will be to examine what form of *f* the neural valuation's operation process really has. In such experiments, the time range must be considered carefully. In tasks like an intertemporal choice task, the time range can often be quite long—hours or months. The internal time construction can be fundamentally different in the ongoing and prospective use of time (see section 4.2), although there may be no clear-cut boundary between the two uses. For instance, prospection of a few seconds is perhaps more appropriately grouped with the ongoing use of time, whereas prospection of hours is definitely best grouped with the prospective use of time (Szpunar et al., 2007; Buckner & Carroll, 2007; Arzy et al., 2008). Separation of such a prospective time from ongoing time also implies that neural valuation can be free from conventional time, particularly in that time range. The relationship of time perception at such a timescale (e.g., logarithmic time perception) to types of discounting has been proposed (Takahashi, 2005; Gilbert & Wilson, 2007; Boyer, 2008; Wittmann & Paulus, 2008). Such a future time may be constructed either dynamically or statically (see section 4.2), but how it is done in neural systems specifically for discounting remains to be determined (Luhmann, Chun, Yi, Lee, & Wang, 2008). Future experiments should address these issues by explicitly probing the internal time construction with respect to the nature of discounting. 
Apart from the range of the timescale, anticipatory time and elapsed time can be different in the internal time construction, even when they are the same in conventional time (see section 4.2.1). Such a difference may affect neural valuation in both the ongoing and prospective use of time (Roesch & Olson, 2005; Genovesio et al., 2006), and it is also conceivable to apply an internal TD to valuation in retrospection, using the time construction for the time of the past, which is also worthy of future investigation (Schacter, Addis, & Buckner, 2007; La Camera & Richmond, 2008; Dayan, 2009).

This study demonstrated only one case of an internal TD, which has exponential and hyperbolic discounting at short and long delays, respectively. Certainly it is possible to construct a much wider variety of internal TDs, and thus the question about the form of *f* should be examined by combining experimental and theoretical work. At any rate, discounting for a temporally distant reward is a more complicated issue than can be covered within the scope of this study (Loewenstein & Prelec, 1992; Frederick et al., 2002; Luhmann, Chun, Yi, Lee, & Wang, 2008). The time unit of discounting does not appear to be invariant but rather dependent on the task (Frederick et al., 2002; McClure et al., 2007), and several different factors have been suggested to differentially influence the mechanisms involved in the discounting process, depending on the task (Loewenstein & Prelec, 1992; Frederick et al., 2002; Rubinstein, 2003; Berns et al., 2007). To fully incorporate these factors into the internal TD formulation, extensions of the formulation will be required (see section 5.5).

### 5.3. Internal Time Construction and Modulation.

Results from a wide range of experiments (see section 1) are broadly consistent with the notion that the unit of time specifically used for neural valuation (internal time) is different from that of conventional time. Nevertheless, it is currently unclear how internal time is constructed. Presumably internal time is constructed or affected by several different neural time systems. Even in Pavlovian conditioning, different systems operating at different timescales might be differentially involved, depending on the ISI. In terms of modality, the systems used for perception time (e.g., for estimating a time interval) are likely involved, but the systems for decision or motor time (e.g., for taking Pavlovian action, or instrumental action in instrumental tasks) might also be partially involved.

The modulation of internal time appears to be limited. Under ideal conditions, internal time would be perfectly adjusted in the presence of a variable ISI until a reward was obtained, but this is not the case, even with a relatively small range of variable ISIs (Fiorillo et al., 2008). Neurons in some areas modulate their activity to reflect an event's temporal probability distribution (Ghose & Maunsell, 2002; Leon & Shadlen, 2003; Janssen & Shadlen, 2005) (though an examination of this issue for DA activity indicated no such modulation, albeit with caveats; Fiorillo et al., 2008). These neural activities might modulate internal-time units and thus affect valuation, which is possibly related to issues such as motivation or opportunity cost (Wise, 2004; Niv, Daw, & Dayan, 2006).

### 5.4. Current Internal TD Formulation.

Although it certainly addresses the dissociation between internal and conventional time, we consider the current internal TD formulation to still be rather primitive. First, we formulated the internal TD only in the limited case of a TD(0) algorithm of the TD(λ) family for the case of discounted rewards (Sutton & Barto, 1998). Several previous proposals to extend discrete-time TD models may also be used for extending internal TD formulations, such as the TD(λ) family, as well as the model-based/multi-timescale or partially observable approach (Sutton, 1995; Sutton & Barto, 1998; Doya, Samejima, Katagiri, & Kawato, 2002; Daw, Niv, & Dayan, 2005; Daw et al., 2006). Also, we did not deal with the case of undiscounted, or averaged, rewards (Schwartz, 1993; Puterman, 1994; Mahadevan, 1996). A popular algorithm in this case, often called average TD, has a value function, called a bias (or relative) value (Mahadevan, 1996), that uses the average reward estimated over time as an additional term, beyond the temporal-difference terms of reward and value. The average reward is affected by how much time has passed. Thus, it is also affected by the distinction between internal and conventional time, for example, if it is measured using internal time rather than conventional time, although in this case the distinction has no effect on the discount factor. We are interested in examining the effects of an internal average TD on this issue. This would also help clarify the relation of the internal TD formulation to the semi-Markov approach using the conventional average TD (Daw et al., 2006). Furthermore, in this study, we dealt only with state-value functions for clarity of exposition. We did not make an explicit connection to actions that would cause state transitions.
A promising future avenue is to extend the internal TD formulation to cases involving action-value functions or state-value functions with an explicit action selection mechanism (e.g., an actor-critic algorithm). For example, under internal TD formulations, it is intriguing to include an action that would reset the origin of time (see section 5.5). The issues relating to actions must be examined together with the issue discussed below.

Second, in the current internal TD formulation, we chose to preserve the original theoretical formulation of a discrete-time TD, where the number of discrete-time steps (state iterations) is directly linked to the number of iterations of the (constant) discount factor (Sutton & Barto, 1998). Note that from the perspective of MDPs, states must first be defined before building discrete-time TD models, and a discrete-time step is then an auxiliary variable of the states. But we have not discussed states so far, primarily because this study focuses on the effects of dissociating the operator's and the observer's times in the TD formulation. When the original principle is applied to connecting continuous-time to discrete-time internal TD formulations, a rigid relationship between the two times is maintained, and the discrete-time unit plays a dual role as a unit for both operation and duration (which is also true for conventional-time TD as equally dissected conventional TD). Thus, in the current TD formulation, the equally dissected model in internal time is the operator's discrete internal TD model. The rigid relationship further implies that the unit of a neural operational process, equivalent to the discrete internal time unit (or the duration), also corresponds to a state. Accordingly, the rigid relationship is also maintained with states. In contrast, the conventional TD (equally dissected conventional TD) often specifies states only as external events (e.g., CS and US in Pavlovian conditioning), and event-triggered representations then act as placeholders for the time durations between the events, provided that Markov properties are assumed with respect to the events. Because of this assumption, times between events, although not called states, can still effectively act as states under MDPs for iterations of TD operations. This approach is often a useful simplification for constructing a succinct TD model for neural valuation.
It will not be sufficient for modeling, however, if the Markov assumption does not hold (e.g., Nakahara et al., 2004) or the neural valuation process does not use a discrete conventional unit. In these cases, we must address the nature of states, even at times between events, regardless of what approach we take, but the current internal TD is one approach that addresses this issue.

The equally dissected internal TD model can accommodate variable durations in conventional time (see Table 1), and the event-dissected conventional TD model is a special case of this internal TD model in which a discrete internal time unit is specifically assumed to be the duration between external events. Ever since the TD hypothesis of DA activity was proposed, it has been puzzling that there is no convincing experimental evidence that DA responses (putative TD error) propagate backward continuously over conventional-time units, from the time of reward onset to the time of the CS, as the number of trials increases (Pan et al., 2005). The internal TD approach appears promising for dealing with this issue because the TD error is propagated over iterations of internal-time units.
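The point that TD error propagates per iteration of the internal-time unit, however those units map onto conventional time, can be sketched with plain tabular TD(0). Here the chain of states stands for successive internal-time units between CS (state 0) and reward; all sizes and rates are illustrative assumptions:

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.5  # illustrative parameters
V = np.zeros(n_states)                # one value per internal-time unit (CS at state 0)

for trial in range(500):
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0         # reward on leaving the last unit
        v_next = V[s + 1] if s + 1 < n_states else 0.0
        delta = r + gamma * v_next - V[s]             # TD error per internal-time iteration
        V[s] += alpha * delta
```

After learning, the value at each state settles to gamma raised to the number of remaining internal-time iterations, and the early TD error has migrated back to the CS one internal unit at a time, regardless of how long each unit lasts in conventional time.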

### 5.5. Internal Time, Subjective Timing, and State.

As evident in the discussion, many issues remain to be investigated to extend the internal TD formulation. First, the question remains whether the rigid relationship among states, associated durations, and discount factors should always be maintained in the modeling of neural value-based decision making. Consider temporal discounting as an example. As discussed, a variety of factors can affect the nature of discounting. Some factors may be best understood in the context of their effects on internal time, for which application of the current internal TD would be promising, but other mechanisms may be required to deal with other factors. What is more fundamental, given the distinction between the operator's and the observer's times, is that as long as we attempt to model the operational process, we must identify the operator's unit, and then other parameters, such as a discount factor, must be determined with respect to that unit. Any change in the parameters must also be evaluated with respect to the operator's unit; that is, if the discount factor becomes variable, it should be modeled accordingly with respect to the unit. For example, the discount factor can be modulated by some factors (e.g., a function of duration in conventional time), and such a case would be a kind of semi-Markov internal TD. In this view, the semi-Markov conventional TD (see equation 3.1) belongs to this class in that it specifies duration as (probabilistically) equally dissected in internal time according to external events (event-dissected in conventional time) but the discount factor as an exponential function of duration in conventional time.
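The semi-Markov reading above, in which the per-transition discount factor is a function of the transition's duration in conventional time, can be sketched directly. The form γ(d) = exp(−κd) and the rate κ are illustrative assumptions:

```python
import math

kappa = 0.2  # hypothetical rate; per-transition discount gamma(d) = exp(-kappa * d)

def discount(d):
    """Duration-dependent discount factor for a transition lasting d conventional-time units."""
    return math.exp(-kappa * d)

# Event-dissected durations of three successive transitions (conventional time)
durations = [1.0, 3.5, 0.5]
total = math.prod(discount(d) for d in durations)
```

Because γ(d) is exponential in d, discounting composes multiplicatively: the product over segments equals `discount(sum(durations))`, which is exactly the exponential-function-of-duration specification noted for the semi-Markov conventional TD.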

Also remaining to be investigated is how the origin of internal time is actually set in neural systems, especially regarding the ongoing use of time. We suspect that the origin of time often shifts with the ongoing time, but may also be locked to externally salient events. Furthermore, it is also conceivable that the origin may change, triggered by statistical inferences made by other neural systems. With such origins, the progression (and modulation) of internal time may be affected by subsequent events or factors involved in task situations, such as context, hidden states, uncertainty, time estimation, and event anticipation (Suri & Schultz, 2001; Nakahara et al., 2004; Daw et al., 2005; Janssen & Shadlen, 2005; Preuschoff, Bossaerts, & Quartz, 2006; Morris, Nevet, Arkadir, Vaadia, & Bergman, 2006; Yamada, Matsumoto, & Kimura, 2007; Kubota et al., 2009).

The construction and modulation of internal time is intimately related to the construction and modulation of states, or to state transition in neural systems. Discussions of time construction and modulation can be directly translated to those of states. For example, possible differential contributions of perception and motor time to the construction of time can be translated to the question of how neural perception and motor functions contribute to or affect neural correlates that would correspond to states in the TD formalism. More generally, states in neural valuation would ultimately be a pattern of neural activity. Several proposals have been made for understanding the roles of DA activity in neural valuation from perspectives often quite different from that of a normative TD framework (Brown, Bullock, & Grossberg, 1999; Redgrave & Gurney, 2006; O'Reilly, Frank, Hazy, & Watz, 2007; Izhikevich, 2007; Tan & Bullock, 2008; Potjans, Morrison, & Diesmann, 2009; Soltani & Wang, 2010; Hazy, Frank, & O'Reilly, 2010). They provide their accounts more in relation to patterns of neural activity. Future research is needed to investigate how extensively these ideas can be integrated with the viewpoint of the internal TD formulation. We hope that in the long run, the current internal TD formulation helps to build a closer link between these types of accounts and the computational-level accounts usually provided by TD models. Neural correlates of valuation are important clues for future experiments regarding internal time units and states; neural correlates of values must be exponentially discounted with respect to iterations of internal-time units. This characteristic can be used in experimental probes; by manipulating experimental conditions, we can inspect how similarly the neural correlates and behavioral valuation change over those conditions. 
Related to these issues, a recent proposal in TD formulation suggests that external events provoke a set of temporally progressive inputs to the TD value function, as a more sophisticated temporal stimulus representation (Ludvig, Sutton, & Kehoe, 2008). Time is essentially embedded in the temporal progression (Grossberg & Schmajuk, 1989), which can also be considered as states progressing over time. Then, if some explicit mechanisms were to adjust this temporal progression, they would act as changing states and thereby also influence valuation.

Finally, such an additional modulating mechanism, together with mechanisms setting the origin of time in ongoing use, appears very promising for addressing a critical issue mentioned in section 1: How should subjective timing (timing behavior or subjective interval estimation) be mapped onto the internal TD formulation? Daw et al. (2006) made a step in this direction using a semi-Markov conventional TD formulation (a class of the internal TD model, whereby subjective timing was modeled as a probabilistic inference of durations between external events), further combined with partial observability as well as use of average TD. The internal TD formulation helps to carry these ideas further and connect them to the operational processes of neural valuation, as the dissociation between internal and conventional time makes it possible to explicitly model the effects of subjective timing in a more flexible manner. Subjective timing may directly influence the progression of internal time units, act as modulating or switching states (which are not necessarily locked with external events), and/or have a direct influence at the level of valuation or action selection (e.g., changing the so-called temperature parameter in the softmax function for the selection). The interdependence and differential contributions of different attributes (e.g., motivational and temporal) for learning associations between CS and US have been studied in Pavlovian conditioning (Delamater & Oakeshott, 2007). Changes in the internal time units might be differentially affected by what time systems (e.g., perception or motor time) are acting at a given moment. A rich literature on interval timing provides important clues toward resolving these issues (Gallistel & Gibbon, 2002). 
Conflicting views exist, however, about several of these issues, for example, whether there is a dedicated interval timing system and, if so, whether it is composed of a single system or multiple systems (Gibbon et al., 1997; Killeen, Fetterman, & Bizo, 1997; Zeiler, 1998; Dragoi et al., 2003; Buhusi & Meck, 2005); whether the property known as Weber's law or the scalar property holds and, if so, how (Gibbon, 1977; Grondin, 2001); and whether the passage of subjective time is linear or nonlinear to conventional (objective) time in relation to different experimental paradigms (Gibbon & Church, 1981; Staddon & Higa, 1999; Cerutti & Staddon, 2004). Also different models and mechanisms have been proposed for subjective interval timing (Gibbon, 1977; Church, 1984; Killeen & Fetterman, 1988; Staddon & Higa, 1999; Gallistel & Gibbon, 2000; Dragoi et al., 2003; Buhusi & Meck, 2005). Given these considerations, although extending the internal TD formulation to include issues of interval timing is of great interest, it may be best addressed by first explicitly formulating a simple modulatory effect of subjective timing progression on internal time units and states.
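One concrete entry point for the action-selection route mentioned above is temperature modulation of softmax choice. The link from subjective timing to the temperature parameter is a hypothesis here, and the numbers are illustrative:

```python
import numpy as np

def softmax(values, temperature):
    """Softmax choice probabilities; higher temperature yields more uniform (exploratory) choice."""
    z = np.asarray(values, dtype=float) / temperature
    z -= z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

values = [1.0, 2.0]                           # illustrative action values
p_sharp = softmax(values, temperature=0.1)    # low temperature: near-deterministic choice
p_flat = softmax(values, temperature=10.0)    # high temperature: near-uniform choice
```

A mechanism that raised the temperature when subjective timing is uncertain would thus flatten choice probabilities without touching the values themselves.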

## Appendix: Relation of Conventional and Internal TD When *f*(*t*) Is Linear

Suppose *f* is linear, *y* = *f*(*t*) = *ct* (*c* is a positive constant), with a single reward; the value function is then given by equation A.1. First, internal and conventional TDs are essentially equivalent in this case. *V*(*y*) = *V*(*f*(*t*)) can be understood as a conventional TD with the unit Δ*t* and a constant discount factor γ_{1}. Another way to regard *V*(*y*) as a conventional TD is to note that *y* can be viewed simply as a rescaled conventional time *t*, so that *V*(*y*) is the rescaled conventional TD with the unit Δ*y* and a constant discount factor γ_{2}. Second, interchangeability between the discrete time step and the discount factor arises once the hiding convention in the discrete-time formulation is used. Consider a period *T* to be modeled with conventional TD. The numbers of discrete steps are *n*_{1} and *n*_{2} under the first and second interpretations above, respectively; thus, we have *n*_{1} = *cn*_{2}. When we do not distinguish between Δ*t* and Δ*y* (denoting both by the same unit), as with the hiding convention, the two discount factors must satisfy γ_{1}^{n_1} = γ_{2}^{n_2}, that is, γ_{2} = γ_{1}^{c}. Thus, the same value function (see equation A.1) can be interpreted interchangeably with either *n*_{1} and γ_{1} or *n*_{2} and γ_{2}.
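The interchangeability in the linear case can be checked numerically. This is a minimal sketch under one consistent reading of the appendix: given *n*_{1} = *cn*_{2}, the two interpretations assign the same total discounting over the period exactly when γ_{2} = γ_{1}^{c}; the particular numbers are arbitrary:

```python
c = 3.0               # slope of the linear internal-time mapping y = c * t
gamma1 = 0.9          # discount factor under the first interpretation
gamma2 = gamma1 ** c  # discount factor under the second interpretation (gamma2 = gamma1^c)

n2 = 4                # steps under the second interpretation (arbitrary)
n1 = c * n2           # n1 = c * n2 under the first interpretation

# Total discounting over the period agrees under both readings
total1 = gamma1 ** n1
total2 = gamma2 ** n2
```

The same value function is thus described by either pair (n1, gamma1) or (n2, gamma2), which is the interchangeability the appendix states.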

## Acknowledgments

We are grateful to S. Amari, K. Morita, and S. Suzuki for their comments on an early draft of the manuscript; to P. Bossaerts for his comments on choice reversal during the development of this work; and to J. Teramae for his comments on some citations. We are also very grateful for the reviewers' insightful comments. This study was partially supported by JSPS KAKENHI grant 21300129 and MEXT KAKENHI grant 20020034.