Abstract

In a constantly changing world, animals must account for environmental volatility when making decisions. To appropriately discount older, irrelevant information, they need to learn the rate at which the environment changes. We develop an ideal observer model capable of inferring the present state of the environment along with its rate of change. Key to this computation is an update of the posterior probability of all possible change point counts. This computation can be challenging, as the number of possibilities grows rapidly with time. However, we show how the computations can be simplified in the continuum limit by a moment closure approximation. The resulting low-dimensional system can be used to infer the environmental state and change rate with accuracy comparable to the ideal observer. The approximate computations can be performed by a neural network model via a rate-correlation-based plasticity rule. We thus show how optimal observers accumulate evidence in changing environments and map this computation to reduced models that perform inference using plausible neural mechanisms.

1  Introduction

Animals continuously make decisions in order to find food, identify mates, and avoid predators. However, the world is seldom static. Information that was critical yesterday may be of little value now. Thus, when accumulating evidence to decide on a course of action, animals weight new evidence more strongly than old (Pearson, Heilbronner, Barack, Hayden, & Platt, 2011). The rate at which the world changes determines the rate at which an individual should discount previous information (Deneve, 2008; Veliz-Cuba, Kilpatrick, & Josić, 2016). For instance, when actively tracking prey, a predator may use visual information obtained only within the last second (Olberg, Worthington, & Venator, 2000; Portugues & Engert, 2009), while social insect colonies integrate evidence that can be hours to days old when deciding on a new home site (Franks, Pratt, Mallon, Britton, & Sumpter, 2002). These environmental state variables (e.g., prey location, best home site) are constantly changing, and the timescale on which the environment changes is unlikely to be known in advance. Thus, to make accurate decisions, animals must learn how rapidly their environment changes (Wilson, Nassar, & Gold, 2010).

Evidence accumulators are often used to model decision processes in static and fluctuating environments (Smith & Ratcliff, 2004; Bogacz, Brown, Moehlis, Holmes, & Cohen, 2006). These models show how noisy observations can be accumulated to provide a probability that one among multiple alternatives is correct (Gold & Shadlen, 2007; Beck et al., 2008). They explain a variety of behavioral data (Ratcliff & McKoon, 2008; Brunton, Botvinick, & Brody, 2013), and electrophysiological recordings suggest that neural activity can reflect the accumulation of evidence (Huk & Shadlen, 2005; Kira, Yang, & Shadlen, 2015). Since normative evidence accumulation models determine the belief of an ideal observer, they also show the best way to integrate noisy sensory measurements and can tell us if and how animals fail to use such information optimally (Bogacz et al., 2006; Beck et al., 2008).

Early decision-making models focused on decisions between two choices in a static environment (Wald & Wolfowitz, 1948; Gold & Shadlen, 2007). Recent studies have extended this work to more ecologically relevant situations, including multiple alternatives (Churchland, Kiani, & Shadlen, 2008; Krajbich & Rangel, 2011), multidimensional environments (Niv, Daniel, Geana, Gershman, Leong, & Radulescu, 2015), and cases where the correct choice (McGuire, Nassar, Gold, & Kable, 2014; Glaze, Kable, & Gold, 2015) or context (Shvartsman, Srivastava, & Cohen, 2015) changes in time. In these cases, normative models are more difficult to derive and analyze (Wilson & Niv, 2011), and their dynamics are more complex. However, methods of sequential and stochastic analysis are still useful in understanding their properties (Wilson et al., 2010; Veliz-Cuba et al., 2016).

We examine the case of a changing environment where an optimal observer discounts prior evidence at a rate determined by environmental volatility. In this work, a model performs optimally if it maximizes the likelihood of predicting the correct environmental state, given the noise in observations (Bogacz et al., 2006). Experiments suggest that humans learn the rate of environmental fluctuations to make choices nearly optimally (Glaze et al., 2015). During dynamic foraging experiments where the choice with the highest reward changes in time, monkeys also appear to use an evidence discounting strategy suited to the environmental change rate (Sugrue, Corrado, & Newsome, 2004).

However, most previous models have assumed that the rate of change of the environment is known ahead of time to the observer (Glaze et al., 2015; Veliz-Cuba et al., 2016). Wilson et al. (2010) developed a model of an observer that infers the rate of environmental change from observations. To do so, the observer computes a joint posterior probability of the state of the environment, the time since the last change in the environment, and a count of the number of times the environment has changed (the change point count). With more measurements, such observers improve their estimates of the change rate and are therefore better able to predict the environmental state. Inference of the change rate is most important when an observer makes fairly noisy measurements and cannot determine the current state from a single observation.

We extend previous accumulator models of decision making to the case of multiple, discrete choices with asymmetric, unknown transition rates between them. We assume that the observer is primarily interested in the current state of the environment, often referred to as the correct choice in decision-making models (Bogacz et al., 2006). Therefore, we show how an ideal observer can use sensory evidence to infer the rates at which the environment transitions between states and simultaneously use these inferred rates to discount old evidence and determine the present environmental state.

Related models have been studied (Wilson et al., 2010; Adams & MacKay, 2007). However, they relied on the assumption that after a change, the new state does not depend on the previous state. This excludes the possibility of a finite number of states. For example, in the case of two choices, knowledge of the present state determines with complete certainty the state after a change, and the two are thus not independent. For cases with a finite number of choices, our algorithm is simpler than previous ones. The observer only needs to compute a joint probability of the environmental state and the change point count.

The storage needed to implement our algorithms grows rapidly with the number of possible environmental states. However, we show that moment closure methods can be used to decrease the needed storage considerably, albeit at the expense of accuracy and the representation of higher-order statistics. Nonetheless, when measurement noise is not too large, these approximations can be used to estimate the most likely transition rate and the current state of the environment. This motivates a physiologically plausible neural implementation for the present computation. We show that a Hebbian learning rule that shapes interactions between multiple neural populations representing the different choices allows a network to integrate inputs nearly optimally. Our work therefore links statistical principles for optimal inference with stochastic neural rate models that can adapt to the environmental volatility to make near-optimal decisions in a changing environment.

2  Optimal Evidence Accumulation for Known Transition Rates

We start by revisiting the problem of inferring the current state of the environment from a sequence of noisy observations. We assume that the number of states is finite and the state of the environment changes at times unknown to the observer. We first review the case when the rate of these changes is known to the observer. In later sections, we assume that these rates must also be learned. Following Veliz-Cuba et al. (2016), we derive a recursive equation for the likelihoods of the different states and an approximating stochastic differential equation (SDE). Similar derivations were presented for decisions between two choices by Deneve (2008) and Glaze et al. (2015).

An ideal observer decides among a finite set of choices based on successive noisy observations. One of these choices is the correct one at each time, and the transition rates correspond to the known probabilities that the state changes between two consecutive observations. The observer makes measurements at a discrete sequence of times, with known conditional probability densities for the observations given each state. Here, and elsewhere, we assume that the observations are conditionally independent. We also abuse notation slightly by using the same symbol to denote a probability, or the value of a probability density function, depending on the argument. We use explicit notation for the probability density function when there is a potential for confusion.

We collect all observations up to the current time into a vector and consider the conditional probability of each state given these observations. To make a decision, the observer can compute the index that maximizes this posterior probability; the corresponding state is then the most probable one, given the observations.

A recursive equation for the update of each of these posterior probabilities after each observation has the form (Veliz-Cuba et al., 2016)
formula
2.1
Thus, the transition rates provide the weights of the previous probabilities in the update equation. Unless transition rates are large or observations very noisy, the probability of the true state grows and can be used to identify the present environmental state. However, with positive transition rates, the posterior probabilities tend to saturate at a value below unity. Strong observational evidence that contradicts an observer's current belief can cause the observer to subsequently change that belief. Such contradictory evidence typically arrives after a change in the environment.
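As a concrete illustration, the update in equation 2.1 can be sketched in a few lines. The snippet below is a minimal example, not the authors' implementation: it assumes two states, a known symmetric transition rate `eps`, and gaussian likelihoods with means `mu` and common standard deviation `sigma` (all illustrative choices).

```python
import numpy as np

def update_posterior(p_prev, xi, eps, mu=(-1.0, 1.0), sigma=1.0):
    """One step of the known-rate recursion (cf. equation 2.1):
    propagate the previous posterior through the transition probabilities,
    weight by the likelihood of the new observation, and renormalize."""
    T = np.array([[1 - eps, eps],
                  [eps, 1 - eps]])        # symmetric two-state transitions
    prior = T @ p_prev                    # predictive prior over states
    like = np.exp(-(xi - np.asarray(mu))**2 / (2 * sigma**2))
    post = like * prior
    return post / post.sum()

# Repeated observations near the second state's mean drive the posterior
# toward that state; positive eps keeps it saturated below 1.
p = np.array([0.5, 0.5])
for xi in [1.0] * 20:
    p = update_posterior(p, xi, eps=0.05)
print(p)
```

Note how the posterior saturates strictly below 1: the transition term keeps leaking probability mass to the other state, which is exactly the evidence-discounting effect described above.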
Following Veliz-Cuba et al. (2016), we take logarithms and denote the change in log probability due to each new observation. Assuming the time between observations is small, we can drop higher-order terms to obtain
formula
where the likelihood function may vary with time. Next, we apply this approximation and replace the discrete observation index with continuous time to write
formula
where the drift is the conditional expectation of the log-likelihood increment, given the true state of the environment at the current time, and the noise term follows a multivariate gaussian distribution with mean zero and covariance matrix given by
formula
Finally, taking the limit of vanishing time between observations, we can approximate the discrete process, equation 2.1, with the system of SDEs:
formula
2.2
where we assume the following limits hold:
formula

The nonlinear term in equation 2.2 implies that in the absence of noise, the system has a stable fixed point and older evidence is discounted. Such continuum models of evidence accumulation are useful because they are amenable to the methods of stochastic analysis (Bogacz et al., 2006). Linearization of the SDE provides insights into the system’s local dynamics (Glaze et al., 2015; Veliz-Cuba et al., 2016) and can be used to implement the inference process in model neural networks (Bogacz et al., 2006; Veliz-Cuba et al., 2016).
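For two alternatives, an equivalent scalar equation for the log odds with a nonlinear discounting term proportional to the hyperbolic sine of the log odds appears in Glaze et al. (2015) and Veliz-Cuba et al. (2016). The sketch below integrates such a scalar SDE, dy = (g − 2λ sinh y) dt + σ dW, with the Euler–Maruyama method; the drift `g`, noise amplitude `sigma`, and rate `lam` are illustrative parameter choices, not values from the text.

```python
import numpy as np

def simulate_log_odds(T=20.0, dt=0.01, lam=0.5, g=1.0, sigma=1.0, seed=0):
    """Euler-Maruyama integration of dy = (g - 2*lam*sinh(y)) dt + sigma dW.
    The -2*lam*sinh(y) term discounts old evidence: without noise, y relaxes
    to a stable fixed point instead of growing without bound."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    y = np.zeros(n + 1)
    for k in range(n):
        drift = g - 2.0 * lam * np.sinh(y[k])
        y[k + 1] = y[k] + drift * dt + sigma * np.sqrt(dt) * rng.normal()
    return y

y = simulate_log_odds()
# The noise-free fixed point satisfies g = 2*lam*sinh(y*), so y* = arcsinh(g/(2*lam)).
print(np.arcsinh(1.0 / (2 * 0.5)), y[-1])
```

Without noise (`sigma=0`), the trajectory settles at the stable fixed point, illustrating the bounded accumulation described above; with noise, it fluctuates around it.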

We next extend this approach to the case when the observer infers the transition rates from measurements.

3  Environments with Symmetric Transition Rates

We first derive the ideal observer model when the unknown transition rates are symmetric, that is, when all transition rates between distinct states are equal. This simplifies the derivation, since the observer only needs to estimate a single change-point count. The asymmetric case discussed in section 4 follows the same idea, but the derivation is more involved, since the observer must estimate multiple counts.

Our problem differs from previous studies (Adams & MacKay, 2007; Wilson et al., 2010) in two key ways. First, we assume the observer tries to identify the most likely state of the environment at the current time. To do so, the observer computes the joint conditional probability of the current state and the number of environmental changes since beginning the observations. Previous studies focused on obtaining the predictive distribution of future observations. The two distributions are closely related.

Second, and more important, Adams and MacKay (2007) and Wilson et al. (2010) implicitly assumed that only observations made since the last change point provide information about the current environmental state. That is, if the time since the last change point—the current run length—is known to the observer, then all observations before that time can be discarded:
formula
This follows from the assumption that the state after a change is conditionally independent of the state that preceded it. We assume that the number of environmental states is finite, so this independence assumption does not hold. Intuitively, if observations prior to a change point indicate one particular state is most likely, then the remaining states are more likely after the change point.

Adams and MacKay (2007) and Wilson et al. (2010) derive a probability update equation for the run length and the number of change points and use this equation to obtain the predictive distribution of future observations. We show that it is not necessary to compute run-length probabilities when the number of environmental states is finite. Instead, we derive a recursive equation for the joint probability of the current state and the number of change points. As a result, for a fixed number of environmental states, the total number of possible state and change-point-count pairs grows linearly with the number of observations, rather than quadratically as in Wilson et al. (2010).

3.1  Symmetric Two-State Process

We first derive a recursive equation for the probability of two alternatives in a changing environment, where the change process is memoryless and the change rate is symmetric and initially unknown to the observer (see Figure 1A). The most probable choice, given the observations up to the current time, can be obtained from the log of the posterior odds ratio. The sign of this log odds ratio indicates which option is more likely, and its magnitude indicates the strength of this evidence (Bogacz et al., 2006; Gold & Shadlen, 2007). Old evidence should be discounted according to the inferred environmental volatility. Since the volatility is unknown, an ideal observer computes a probability distribution over the change rate (see Figure 1C), along with the probability of the environmental states.

Figure 1:

Online inference of the change rate in a dynamic environment. (A) The environment alternates between states and with transition probabilities . We analyze the symmetric case () in section 3.1 and the asymmetric case () in section 4. The state of the environment determines , which we represent as gaussian densities. (B) A sample path of the environment (color bar) together with the first 10 values of the actual change-point count, and non-change-point count, . (C) Evolution of the conditional probabilities, (given by beta distributions), corresponding to the change-point count from panel B, until . The dashed red line indicates the value of in the simulation. The densities are scaled so that each equals 1 at the mode.

Let the change-point count be the number of change points, and the non-change-point count the number of times the state did not change, between the first and the current observation (see Figure 1B). The change-point count follows a pure birth process with birth rate given by the change rate. The observer assumes no changes occurred prior to the start of observation and must make at least two observations to detect a change.
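The birth process described above is easy to simulate. The sketch below is an illustration with hypothetical parameter values: it samples a symmetric two-state environment and confirms that the empirical change fraction is close to the change rate.

```python
import numpy as np

def sample_environment(n, eps, seed=0):
    """Sample a symmetric two-state environment over n observation times and
    return the states along with the running change-point count (a pure
    birth process whose birth rate is the change rate eps)."""
    rng = np.random.default_rng(seed)
    states = [0]
    for _ in range(n - 1):
        states.append(states[-1] ^ int(rng.random() < eps))
    changes = np.cumsum(np.diff(states) != 0)   # change-point count over time
    return np.array(states), changes

states, a = sample_environment(10_000, eps=0.1, seed=2)
print(a[-1] / (len(states) - 1))   # empirical change fraction, close to eps
```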

To develop an iterative equation for the joint conditional probability of the current state and change-point count, given all observations so far, we begin by marginalizing over these quantities at the time of the previous observation (see section A.1 for details):
formula
3.1
With two choices, we have the following relationships for all :
formula
3.2
The summand in equation 3.1 is therefore nonzero only if either the state and the change-point count are both unchanged between observations, or the state changes and the change-point count increases by one. That is, from a given joint state and count, the system can either transition to the other state while incrementing the count or remain as it is. This observation is central to the message-passing algorithm described in Adams and MacKay (2007) and Wilson et al. (2010), with probability mass flowing from lower to higher change-point counts according to a pure birth process (see Figure 2A). We can thus simplify equation 3.1, leaving only two terms in the double sum. Using a shorthand for these conditional probabilities, we have:
Figure 2:

Inference of the states, , and change rate, . (A) The joint posterior probability, is propagated along a directed graph according to equation 3.14. Only paths corresponding to the initial condition are shown. (B) A sample sequence of environmental states (color bar, top) together with the first 10 observations (blue dots), for . Superimposed in black (right -axis) is the log-posterior odds ratio as a function of time. (C) Evolution of the posterior over (gray scale). The posterior mean (red) converges to the expected number of change points (dashed line). (D) Evolution of the posterior over the change rate (gray scale). The posterior mean (red) converges to the true value (dashed line), and the variance diminishes with the number of observations.

formula
3.3
We must also specify initial conditions at the first observation and boundary values for these equations. At the first observation, no change points have yet occurred. Therefore,
formula
3.4
and the probability of all positive change-point counts is zero. Here the prior over the two choices is typically taken to be uniform. The probability of the first observation is unknown to the observer. However, similar to the normalization in equation 3.3, it acts as a constant that does not appear in the posterior odds ratio (see equation 3.15). Finally, at all future times, we have separate equations at the boundaries,
formula
3.5
and
formula
3.6
We next compute in equation 3.1, with , by marginalizing over all possible transition rates :
formula
3.7
Note that , so we need the distribution of , given change points, for all . We assume that prior to any change point observations—that is, at time —the rates follow a beta distribution with hyperparameters (see also sections 3.1 and 3.2 in Wilson et al., 2010),
formula
where denotes the probability density of the associated beta distribution and is the beta function. For any , the random variable follows a binomial distribution with parameters for which the beta distribution is a conjugate prior. The posterior over the change rate when the change-point count is known at time is therefore
formula
3.8
For simplicity, we assume that prior to any observations, the probability over the transition rates is uniform, , for all , and therefore (see Figure 1C).
We now return to equation 3.7 and use the definition of the transition rate, , (see Figure 1) to find
formula
3.9
Equation 3.7 can therefore be rewritten using two integrals, depending on the values of and ,
formula
3.10
and similarly for .
The mean of the beta distribution, for , can be expressed in terms of its two parameters:
formula
3.11
We denote this expected value by as it represents a point estimate of the change rate at time when the change-point count is , . Since , we have
formula
3.12
The expected transition rate is thus determined by the ratio between the previous change-point count and the number of time steps. Leaving the prior hyperparameters as free parameters gives the corresponding more general point estimate. Using the definition in equation 3.12, it follows from equation 3.10 that
formula
3.13a
formula
3.13b
Equations 3.13a and 3.13b, illustrated in Figure 2A, can in turn be substituted into equation 3.3 to yield, for all :
formula
3.14
The initial conditions and boundary equations for this recursive probability update have already been described in equations 3.4 to 3.6. Equation 3.14 is the equivalent of equation 3 in Adams and MacKay (2007) and equation 3.7 in Wilson et al. (2010). However, here the observer does not need to estimate the length of the interval since the last change point. We demonstrate the inference process defined by equation 3.14 in Figure 2.
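The full symmetric two-state recursion can be sketched compactly. The snippet below is an illustrative implementation in the spirit of equation 3.14, not the authors' code: it assumes gaussian likelihoods with means `mu`, a flat Beta(1, 1) prior on the change rate, and stores the joint posterior as a 2 × n array over states and change-point counts.

```python
import numpy as np

def infer_state_and_rate(xs, mu=(-1.0, 1.0), sigma=1.0):
    """Joint inference over (state, change-point count) for the symmetric
    two-state model, with a flat Beta(1, 1) prior on the change rate.
    P[i, a] is the joint posterior of state i and change-point count a."""
    n_obs = len(xs)
    mu = np.asarray(mu)
    like = lambda xi: np.exp(-(xi - mu)**2 / (2 * sigma**2))
    P = np.zeros((2, n_obs))
    P[:, 0] = 0.5 * like(xs[0])          # at the first observation, count is 0
    P[:, 0] /= P[:, 0].sum()
    for n in range(2, n_obs + 1):
        a = np.arange(n_obs)
        eps_hat = (1.0 + a) / n          # point estimate of eps given a changes
        newP = (1.0 - eps_hat) * P       # no change: same state, same count
        newP[:, 1:] += eps_hat[:-1] * P[::-1, :-1]   # change: flip state, a + 1
        newP *= like(xs[n - 1])[:, None]
        P = newP / newP.sum()
    state_post = P.sum(axis=1)           # marginal posterior over the state
    eps_mean = ((1.0 + np.arange(n_obs)) / (n_obs + 1) * P.sum(axis=0)).sum()
    return state_post, eps_mean

state_post, eps_mean = infer_state_and_rate([1.0] * 50)
print(state_post, eps_mean)   # favors the second state; low inferred rate
```

Marginalizing the joint array over counts gives the state posterior, and averaging the beta means weighted by the count posterior gives the expected change rate, mirroring the marginalizations described in the text.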
The observer can compute the posterior odds ratio by marginalizing over the change-point count:
formula
3.15
Here implies that is more likely than (see Figure 2B). Note that and need not be known to the observer to obtain the most likely choice.
A posterior distribution of the transition rate can also be derived from equation 3.14 by marginalizing over ,
formula
3.16
where is given by the beta distribution prior, equation 3.8. The expected rate is therefore
formula
3.17
Explicit knowledge of the transition rate is not used in the inference process described by equation 3.14. However, computing it allows us to evaluate how the observer's estimate converges to the true transition rate (see Figure 2D). We will also relate this estimate to the coupling strength between neural populations in the model described in section 6.

We conjecture that when measurements are noisy, the posterior distribution over the change rate does not converge to a point mass at the true rate in the limit of infinitely many observations; that is, the estimate of the change rate is not consistent. As we have shown, to infer the rate, we need to infer the parameter of a Bernoulli variable. It is easy to show that the posterior over this parameter converges to a point mass at the actual rate value if the probability of misclassifying the state is known to the observer (Djuric & Huang, 2000). However, when the misclassification probability is not known, the variance of the posterior remains positive even in the limit of infinitely many observations. In our case, when measurements are noisy, the observer does not know the exact number of change points at a finite time. Hence, the observer does not know exactly how to weight previous observations to make an inference about the current state. As a result, the probability of misclassifying the current state may not be known. We conjecture that this implies that even in the limit of infinitely many observations, the posterior over the change rate has positive variance (see Figure 2D).

In Figure 3, we compare the performance of this algorithm in three cases: when the observer knows the true rate (point mass prior over the true rate ), when the observer assumes a wrong rate (point mass prior over an erroneous ), and when the observer learns the rate from measurements (flat prior over ). We define performance as the probability of a correct decision.

Figure 3:

The performance of the inference algorithm. (A) Performance under the interrogation paradigm measured as the percentage of correct responses at the interrogation time. Here and in the next panel, and SNR. The black curve represents the performance of an ideal observer who infers the change rate from measurements. The green curves represent the performance of observers who assume a fixed change rate (0.3, 0.15, 0.05, 0.03 from darker to lighter; see equation 2.1). The solid green line corresponds to an observer who assumes the true rate and the dashed lines to erroneous rates. (B) The green curve represents the performance at interrogation time of an observer who assumes a fixed change rate. The red star marks the maximum of this curve, corresponding to the true change rate . The horizontal black curves represent the performance at times (from bottom to top) of the observer who learns the change rate. (C) The accuracy as a function of the average threshold hitting time in the free response protocol. Here and SNR = 0.75. See section  A.2 for details on numerical simulations. See also Figure 3 in Veliz-Cuba et al. (2016).

Under the interrogation protocol, the observer infers the state of the environment at a fixed time. As expected, performance increases with interrogation time and is highest if the observer uses the true rate (see Figure 3A and equation 2.1). Performance plateaus quickly when the observer assumes a fixed rate and more slowly if the rate is learned. The performance of observers who learn the rate slowly increases toward that of observers who know the true rate. In Figure 3B, we present the performance of the unknown-rate algorithm at four different times and compare it to the asymptotic values with different assumed rates (green curves).

Note that an observer who assumes an incorrect change rate can still perform near optimally (e.g., curve for 0.03 in Figure 3A), especially when the signal-to-noise ratio (SNR) is quite high. The SNR is the difference in means of the likelihoods divided by their common standard deviation. Change rate inference is more effective at lower SNR values, in which case multiple observations are needed for an accurate estimate of the present state. However, at very low SNR values, the observer will not be able to substantially reduce uncertainty about the change rate, resulting in high uncertainty about the state.

In the free response protocol, the observer makes a decision when the log-odds ratio reaches a predefined threshold. In Figure 3C, we present simulation results for this protocol in a format similar to Figure 3A, with empirical performance as a function of average hitting time. Each performance level corresponds to a unique log-odds threshold. As in the interrogation protocol (see Figure 3A), the performance of an observer who fixes the change rate estimate saturates much more quickly than that of one who infers this rate over time.
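A free-response trial with a known change rate can be sketched as follows. The log-odds update uses the discrete known-rate discounting step for two alternatives; all parameter values (`eps`, `theta`, the gaussian means) are illustrative choices, not the values used in the figures.

```python
import numpy as np

def free_response_trial(eps, theta, mu=1.0, sigma=1.0, seed=0, max_n=10_000):
    """Run one free-response trial: a symmetric two-state environment with
    known change rate eps, gaussian observations with means +/- mu, and a
    decision once the log odds |y| crosses the threshold theta. Returns
    whether the decision matched the true state, and the hitting time."""
    rng = np.random.default_rng(seed)
    H, y = 1, 0.0
    for n in range(1, max_n + 1):
        if rng.random() < eps:
            H = -H                                   # environment switches
        xi = rng.normal(H * mu, sigma)
        llr = 2 * mu * xi / sigma**2                 # log-likelihood ratio
        # known-rate discounting of the prior log odds, then add new evidence
        y = llr + np.log(((1 - eps) * np.exp(y) + eps) /
                         ((1 - eps) + eps * np.exp(y)))
        if abs(y) >= theta:
            return np.sign(y) == H, n
    return np.sign(y) == H, max_n

trials = [free_response_trial(eps=0.05, theta=3.0, seed=s) for s in range(200)]
print(np.mean([c for c, _ in trials]), np.mean([n for _, n in trials]))
```

Sweeping the threshold trades decision speed against accuracy, which is the relationship plotted in Figure 3C.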

3.2  Symmetric Multistate Process

We next consider evidence accumulation in an environment with an arbitrary number of states and symmetric transition probabilities between distinct states, so the probability of remaining in the same state is the same for all states. The symmetry in transition rates means that an observer still needs only to track the total number of change points, as in section 3.1.

Equations 3.1 and 3.2 remain valid with an arbitrary number of possible choices. In this case, the double sum in equation 3.1 simplifies to
formula
As in section 3.1, we have and for , where describes the observer’s belief prior to any observations. At all future times, , we have at the boundaries for all
formula
and
formula
Equation 3.7 remains unchanged, and we still have . Furthermore, assuming a beta prior on the change rate, equation 3.8, remains valid, and equation 3.9 is replaced by
formula
The integral from equation 3.7 gives, once again, the mean of the beta distribution, defined in equations 3.11 and 3.12. As in section 3.1, is a point estimate of the change rate at time when the change-point count is . We have
formula
3.18
and the main probability update equation is now
formula
The observer can infer the most likely state of the environment by computing the index that maximizes the posterior probability, marginalizing over all change-point counts,
formula
The observer can also compute the posterior probability of the transition rate by marginalizing over all states and change-point counts, as in equation 3.16. Furthermore, a point estimate of the rate is given by the mean of this posterior, as in equation 3.17.
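The multistate update can be sketched by generalizing the two-state code: the probability mass that leaves a state is spread over the remaining N − 1 states. This uniform reallocation after a change, the gaussian likelihoods, and all parameter values are assumptions of this sketch rather than prescriptions from the text.

```python
import numpy as np

def multistate_update(P, like, n):
    """One step of a symmetric N-state update (in the spirit of section 3.2).
    P[i, a] is the joint posterior of state i and change-point count a after
    n - 1 observations; `like` holds the likelihoods of the n-th observation.
    After a change, each of the other N - 1 states is taken as equally likely."""
    N = P.shape[0]
    a = np.arange(P.shape[1])
    eps_hat = (1.0 + a) / n                  # flat-prior point estimate of eps
    newP = (1.0 - eps_hat) * P               # no change: same state and count
    others = (P.sum(axis=0) - P) / (N - 1)   # mass arriving from other states
    newP[:, 1:] += eps_hat[:-1] * others[:, :-1]
    newP *= like[:, None]
    return newP / newP.sum()

# Example: three states with gaussian observation densities; observations
# fall repeatedly near the third state's mean.
mu, sigma = np.array([-1.0, 0.0, 1.0]), 1.0
xs = [1.0] * 30
P = np.zeros((3, len(xs)))
P[:, 0] = np.exp(-(xs[0] - mu)**2 / (2 * sigma**2)) / 3
P[:, 0] /= P[:, 0].sum()
for n in range(2, len(xs) + 1):
    like = np.exp(-(xs[n - 1] - mu)**2 / (2 * sigma**2))
    P = multistate_update(P, like, n)
print(P.sum(axis=1))   # marginal over counts; favors the third state
```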

4  Environments with Asymmetric Transition Rates

In this section, we depart from the framework of Adams and MacKay (2007) and Wilson et al. (2010) and consider unequal transition rates between states. This includes the possibility that some transitions are not allowed. We consider an arbitrary number of states with unknown transition rates between them. The switching process between the states is again memoryless, so the state follows a stationary, discrete-time Markov chain with finite state space. We write the (unknown) transition matrix for this chain as a left stochastic matrix,
formula
where , with . We denote by the th column of the matrix , and similarly for other matrices. Each such column sums to 1. We define the change-point counts matrix at time as
formula
where is the number of transitions from state to state up to time . There can be a maximum of transitions at time . For a fixed , all entries in are nonnegative and sum to , that is, . As in the symmetric case, the change-point matrix at time must be the zero matrix, .

We will show that our inference algorithm assigns positive probability only to change-point matrices that correspond to possible transition paths between the states. Many nonnegative integer matrices with appropriately summing entries are not possible change-point matrices. A combinatorial argument shows that for two states, the number of possible state and change-point-matrix pairs grows quadratically with the number of steps, to leading order. It can also be shown that the growth is polynomial for more states, although we do not know the growth rate in general (see Figure 4B). An ideal observer has to assign a probability to each of these pairs, which is much more demanding than in the symmetric-rate case, where the number of possibilities grows linearly with the number of observations.
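The growth of the number of allowable pairs can be checked by brute force for small numbers of observations. The sketch below enumerates all state paths and counts the distinct (final state, change-point matrix) pairs; it is an illustration for small N and n, not an efficient algorithm.

```python
from itertools import product

def count_pairs(n, N=2):
    """Count realizable (final state, change-point matrix) pairs after n
    observations by brute force: enumerate all N**n state paths and collect
    the distinct (last state, transition-count matrix) combinations."""
    pairs = set()
    for path in product(range(N), repeat=n):
        counts = [[0] * N for _ in range(N)]
        for s, t in zip(path, path[1:]):
            counts[t][s] += 1            # column s counts exits from state s
        pairs.add((path[-1], tuple(map(tuple, counts))))
    return len(pairs)

# Growth is far slower than the 2**n possible paths: quadratic to leading
# order for N = 2, consistent with the combinatorial argument above.
print([count_pairs(n) for n in range(1, 9)])
```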

Figure 4:

Evidence accumulation and change rates inference in a two-state asymmetric system. (A) Sample path (color bar, top) of the environment between times and (same simulation as in panels C–E) with corresponding observations (blue dots) and log-posterior odds ratio (black step function). Here and in panels C–E, , SNR. (B) The number of allowable change-point matrices as a function of observation number, , for (blue circles) and (blue triangles). (C–E) Color plots (gray scale) of the joint density, with mean value (red star) approaching the true transition rates (green circle).

We next derive an iterative equation for the joint probability of the current state and an allowable combination of the change-point counts (off-diagonal terms of the change-point matrix) and non-change-point counts (diagonal terms). The derivation is similar to the symmetric case. We first marginalize over the previous state and change-point matrix,
formula
where the sum is over all previous states and all possible values of the change-point matrix.
Using the fact that each observation depends only on the current state and applying Bayes’ rule to write
formula
gives
formula
4.1

We compute the conditional probability by marginalizing over all possible transition matrices. To do so, we relate the probabilities of successive change-point matrices. Note that if the observer assumes the columns of the transition matrix are independent prior to any observations, then the exit rates conditioned on the change-point counts are independent across all states.

To motivate the derivation, we first consider a single state and assume that the environmental state has been observed perfectly over all time steps, but the transition rates are unknown. Therefore, the sequence of states is known to the observer, but the transition probabilities are not. The state of the system at a given time, given its state at the previous time, is a categorical random variable. The observed transitions are thus independent samples from a categorical distribution with unknown parameters.

The conjugate prior to the categorical distribution is the Dirichlet distribution, and we therefore use it as a prior on the change-point probabilities. For simplicity, we again assume a flat prior, proportional to the indicator function on the standard simplex.

Denote the sequence of states to which the environment transitioned whenever it occupied a given state at the previous time step. This is a sequence of elements of the state space, and counting its entries with indicator functions recovers the corresponding column of the change-point matrix. The posterior distribution for the transition probabilities given the change-point counts is then
formula
Here the change-point counts should be interpreted as incremented by 1, the normalization involves the gamma function, and the result is the probability density function of a Dirichlet distribution.
The same argument applies to all initial states. We assume that the transition rates are conditionally independent, so that
formula
4.2
Using this observation, the transition probability between two states can be computed by marginalizing over all possible transition matrices, conditioned on ,
formula
4.3
where the integral is taken over the space of all left stochastic matrices, each column of which lies in a probability simplex.
Now consider the matrix containing a 1 as a single entry and 0 everywhere else; successive change-point matrices differ by such a matrix. For all state pairs, we have
formula
4.4
Implicit in equation 4.4 is the requirement that the environment must have been in the originating state in order for the transition to have occurred between consecutive times. This ensures that the change-point matrices assigned nonzero probability correspond to admissible paths through the states. Applying equation 4.4, we can compute the integrals in equation 4.3 for all state pairs. Simplifying notation, we find
formula
4.5
As in the point estimate of the rate in equation 3.12, each estimate is a ratio with the number of transitions between a given pair of states in the numerator and the total number of transitions out of the originating state in the denominator. Thus, the estimated transition rate increases with the number of observed transitions. Furthermore, each column sums to unity:
formula
so the point estimates for the transition rates out of each state provide an empirical probability mass function along each column. However, as in the symmetric case, these estimates are biased toward the interior of the domain. This is a consequence of the hyperparameters we have chosen for our prior density.
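The point estimates above can be sketched numerically. Assuming a flat Dirichlet prior, the posterior mean adds one pseudo-count to each entry and normalizes each column; the function name and example counts below are our own.

```python
import numpy as np

def rate_point_estimates(change_counts):
    """Posterior-mean transition probabilities under a flat Dirichlet prior.

    `change_counts[i, j]` counts observed transitions from state j to
    state i (columns index the originating state). The flat prior adds
    one pseudo-count per destination state, which biases the estimates
    toward the interior of the simplex.
    """
    counts = np.asarray(change_counts, dtype=float)
    num_states = counts.shape[0]
    return (counts + 1.0) / (counts.sum(axis=0) + num_states)

# Ten stays in state 0, two 0->1 transitions, nothing yet observed out of state 1.
A = np.array([[10, 0],
              [2, 0]])
est = rate_point_estimates(A)
print(est)        # each column sums to 1
print(est[1, 0])  # estimated 0->1 rate: (2+1)/(12+2) ≈ 0.214
```

With no data in a column, the estimate is uniform over destinations, reflecting the flat prior.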
Therefore, the probability update equation 4.1 in the case of asymmetric transition rates is given by
formula
4.6
The point estimates of the transition rates are defined in equation 4.5. As before, the initial condition assigns all probability to the zero change-point matrix. At future times, it is only possible to obtain change-point matrices whose entries sum to the number of elapsed steps; successive change-point matrices must be related as noted in equation 4.4. This considerably reduces the number of terms in the sum in equation 4.6.
The observer can find the most likely state of the environment by maximizing the posterior probability after marginalizing over the change-point counts,
formula
The posterior over the transition rate matrix can also be computed by marginalizing across all possible states and change-point count matrices,
formula
where the integrand includes the product of probability density functions given in equation 4.2. The mean of this distribution is given by
formula
4.7
where the point estimate defined in equation 4.5 is a conditional expectation over each possible change-point matrix.
Equation 4.6 is easier to interpret in the two-state case. Using equation 4.5, we find
formula
and the remaining point estimates can be expressed analogously. Expanding the sum in equation 4.6, we have
formula
4.8a
formula
4.8b
The boundary and initial conditions are given as above, and the mean inferred transition matrix is given by equation 4.7. Importantly, the inference process described by equations 4.8a and 4.8b allows for both asymmetric change-point matrices and asymmetric inferred transition rate matrices, unlike the process in equation 3.14. However, the variance of the posteriors over the rates will decrease more slowly, as fewer transitions out of each particular state will be observed.

This algorithm can be used to infer unequal transition rates, as shown in Figure 4. Figures 4C through 4E show that the mode of the joint posterior distribution approaches the correct rates while its variance decreases. As in section 3.1, we conjecture that this joint density does not converge to a point mass at the true rate values unless the SNR is infinite.
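A minimal sketch of this recursion for two states follows. The Gaussian observation likelihood, the parameter values, and all names are our assumptions, but the propagation of each (state, change-point matrix) hypothesis with its own rate point estimate mirrors the structure of equations 4.4 through 4.6.

```python
import math
from collections import defaultdict

def update_joint(prior, observation, mu=(-1.0, 1.0), sigma=1.0):
    """One step of the joint update over (state, change-point matrix).

    `prior` maps (state, A) to a probability, where A is a 2x2 tuple of
    transition counts with columns indexing the originating state. Each
    predecessor hypothesis is propagated through every transition, using
    the posterior-mean rate estimate implied by its own counts, then
    reweighted by a Gaussian likelihood of the observation (an
    illustrative choice of observation model).
    """
    posterior = defaultdict(float)
    for (j, A), p in prior.items():
        col_total = A[0][j] + A[1][j]
        for i in (0, 1):
            eps_hat = (A[i][j] + 1.0) / (col_total + 2.0)  # point estimate
            B = [list(row) for row in A]
            B[i][j] += 1  # record the j -> i transition (or dwell, if i == j)
            B = tuple(tuple(row) for row in B)
            lik = math.exp(-(observation - mu[i]) ** 2 / (2 * sigma ** 2))
            posterior[(i, B)] += lik * eps_hat * p
    total = sum(posterior.values())
    return {k: v / total for k, v in posterior.items()}

zero = ((0, 0), (0, 0))
belief = {(0, zero): 0.5, (1, zero): 0.5}
for obs in [1.2, 0.8, -0.9]:
    belief = update_joint(belief, obs)
print(sum(belief.values()))  # normalized: 1.0
```

Only matrices reachable by admissible paths ever receive probability, as the text argues, because each update increments exactly one entry of the predecessor matrix.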

5  Continuum Limits and Stochastic Differential Equation Models

We next derive continuum limits of the discrete probability update equations for the symmetric case discussed in section 3. We assume that observers make measurements rapidly, so we can derive a stochastic differential equation (SDE) that models the update of an ideal observer’s belief (Gold & Shadlen, 2007). SDEs are generally easier to analyze than their discrete counterparts (Gardiner, 2004). For example, response times can be studied by examining mean first passage times of log-likelihood ratios (Bogacz et al., 2006) or log likelihoods (McMillen & Holmes, 2006), which is much more easily done in the continuum limit (Redner, 2001). For simplicity, we begin with an analysis of the two-state process and then extend our results to the multistate case. The full inference model (see Figure 5A) in the two-state case can be reduced using moment closure to truncate the resulting infinite system of SDEs to an approximate finite system (see Figure 5B). This both saves computation time and suggests a potential mechanism for learning the rate of environmental change. We map this approximation to a neural population model in section 6 (see Figure 5C). This model consists of populations that track the environmental state and synaptic weights that learn the transition rate.

Figure 5:

Schematic showing the reduction of the full inference model, equation 5.7, for a two-state symmetric environment, carried out in sections 5 and 6. (A) Observations arrive continuously in time and are used to update the probabilities that the environment is in each state after a given number of change points. (B) Red and pink arrows from panels A to B represent, respectively, the summation and averaging over change-point counts that yield the equations for the zeroth and first moments in section 5.2. Some arrows have been omitted for clarity. (C) Moment equations are converted to a neural population model, equation 6.7, by assigning the probabilities to population variables and the ratio of first and zeroth moments to synaptic weights. Orange arrows from panel B demonstrate this mapping. The transition rate is learned by changes in the weights. If the observer assumes or knows the rate ahead of time, the weights remain fixed.

5.1  Derivation of the Continuum Limit

5.1.1  Two-State Symmetric Process

We first assume that the state of the environment is a homogeneous continuous-time Markov chain with two states. The probability of transitions between the two states is symmetric, and the number of change points up to a given time is a Poisson process with the corresponding rate. An observer infers the present state from a sequence of observations made at equally spaced times. Each observation has a probability determined by the current environmental state (see Veliz-Cuba et al., 2016, for more details). We again index each observation by the time at which it is made.

As in the previous sections, an estimate of the rate parameter is obtained from the posterior distribution over the change-point count at the time of each observation. For simplicity, we assume a gamma prior over the rate. By assumption, the change-point count follows a Poisson distribution whose parameter is the rate multiplied by the elapsed time. Therefore, once a given number of change points has been observed, the posterior over the rate is again a gamma distribution, that is,
formula
5.1
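The gamma-Poisson conjugacy behind equation 5.1 can be verified numerically: multiplying a gamma prior by a Poisson likelihood and normalizing yields a gamma posterior whose shape adds the change-point count and whose rate adds the elapsed time. The hyperparameter values below are arbitrary.

```python
import numpy as np

# Gamma prior (shape alpha, rate beta) on the change rate; Poisson
# likelihood for observing `a` change points over elapsed time `t`.
# Conjugacy predicts a Gamma(alpha + a, beta + t) posterior.
alpha, beta = 2.0, 1.0   # prior hyperparameters (assumed)
a, t = 5, 10.0           # observed change points and elapsed time

lam = np.linspace(1e-6, 10.0, 200_001)
dlam = lam[1] - lam[0]
prior = lam ** (alpha - 1) * np.exp(-beta * lam)
likelihood = (lam * t) ** a * np.exp(-lam * t)
post = prior * likelihood
post /= post.sum() * dlam  # normalize on the grid

numeric_mean = (lam * post).sum() * dlam
print(numeric_mean, (alpha + a) / (beta + t))  # both ≈ 0.636
```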
We can substitute equation 5.1 into equation 3.7, which describes the probability of transitions between consecutive observation times, to find
formula
5.2
where the integrand involves the density of the gamma distribution. Using the definition of the transition rate, we can relate it to the first conditional probability in the integral of equation 5.2 via
formula
5.3
We have dropped higher-order terms in the time step, as we are interested in the continuum limit.
Using equation 5.3 and properties of the gamma distribution, we can evaluate the integral in equation 5.2 to obtain
formula
5.4
We can use equation 5.4 in the update equation, equation 3.1, to obtain the probabilities of each state and change-point count given the observations. As before, only terms involving the current and adjacent change-point counts remain in the sum. Using the same notational convention as in previous sections, we obtain
formula
5.5
Note that equation 5.5 is similar to the update equation 3.14 we derived in section 3, with the discrete time index replaced by continuous time up to higher-order terms. Also, since we have used a gamma instead of a beta distribution as a prior, the point estimate of the transition rate is slightly different (see equation 3.12). As in the discrete-time case, a point estimate of the transition rate is required even before the first change point can be observed. We therefore cannot use an improper prior, as the rate point estimate would be undefined.
To take the limit of equation 5.5 as the time step vanishes, we proceed as in Bogacz et al. (2006) and Veliz-Cuba et al. (2016), working with logarithms of the probabilities. Dividing equation 5.5 by the time step, taking logarithms of both sides, and introducing log-probability variables, we obtain
formula
Using the approximation ln(1 + x) ≈ x for small x yields
formula
Since the proportionality constant is the same for all states and change-point counts, we drop it in the SDE for the log likelihood (see Veliz-Cuba et al., 2016, for details of the derivation),
formula
5.6
where the drift is set by the observation statistics and the noise term is an increment of a Wiener process.
Note that equation 5.6 is an infinite set of differential equations, one for each pair of environmental state and change-point count. To be consistent with the gamma prior over the rate, we can choose a Poisson prior over the change-point count, which determines the initial conditions for equation 5.6. Note also that equation 5.6 at the boundary of zero change points is a special case: since there is no influx of probability from lower change-point counts, equation 5.6 reduces to
formula
Finally, note that we can obtain evolution equations for the likelihoods by exponentiating the log likelihoods. Itô’s change of coordinates rules (Gardiner, 2004) imply that equation 5.6 is equivalent to
formula
5.7
where now the initial conditions are given directly by the prior over the change-point count. We will compare the full system, equation 5.7, with an approximation using a moment expansion in section 5.2 (see also Figure 5).
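For intuition about such continuum limits, the sketch below simulates a related, simpler SDE: the log posterior odds for a two-state symmetric environment with a known change rate, which carries a hyperbolic-sine discounting term (cf. Veliz-Cuba et al., 2016). The Euler–Maruyama scheme and all parameter values are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama simulation of the known-rate log-posterior-odds SDE
# dy = g(t) dt + sigma dW - 2*eps*sinh(y) dt.
# The sinh term discounts old evidence at a speed set by eps; all
# parameter values below are illustrative.
eps, g, sigma = 0.5, 1.0, 1.0   # change rate, drift, noise amplitude
dt, T = 1e-3, 20.0
n = int(T / dt)
state = 1.0                      # true environmental state: +1 or -1

y = np.zeros(n)
for k in range(1, n):
    if rng.random() < eps * dt:  # environment switches
        state *= -1
    drift = g * state - 2 * eps * np.sinh(y[k - 1])
    y[k] = y[k - 1] + drift * dt + sigma * np.sqrt(dt) * rng.normal()

print(np.abs(y).max())  # log odds saturate rather than grow without bound
```

The nonlinear leak keeps the belief bounded, in contrast to pure drift-diffusion in a static environment, where the log odds grow linearly in time.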

5.1.2  Two States with Asymmetric Rates

Next we consider the case where the state of the environment is still a two-state continuous-time Markov chain, but the probabilities of transition between the two states are asymmetric. Thus, we must separately enumerate the change points in each direction to obtain estimates of the two rates. In addition, we will rescale the enumeration of non-change-points, in anticipation of the divergence of the non-change-point counts in the continuum limit. This means the total dwell time in each state will be continuous, while the change-point counts remain discrete. These quantities are then placed into a modified change-point matrix. Note that if the number of change points out of a state and the total dwell time in that state were known, the change rate out of that state could be estimated as their ratio.
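The ratio estimator just described (change points divided by dwell time) can be checked on simulated dwell times; the rates and function below are our own example, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_rates(lam01, lam10, num_switches):
    """Estimate asymmetric switching rates from simulated dwell times.

    Dwell times in state 0 are Exponential(lam01) and dwell times in
    state 1 are Exponential(lam10); each rate is estimated as the number
    of change points out of a state divided by the total time spent there.
    """
    dwell0 = rng.exponential(1.0 / lam01, size=num_switches)
    dwell1 = rng.exponential(1.0 / lam10, size=num_switches)
    return num_switches / dwell0.sum(), num_switches / dwell1.sum()

est01, est10 = estimate_rates(0.2, 0.8, num_switches=5000)
print(est01, est10)  # close to the true rates 0.2 and 0.8
```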

As before, we will estimate the rate parameters using the posterior probability of the change-point matrix. We assume gamma priors on each rate. By assumption, each change-point count follows a Poisson distribution whose parameter is the corresponding rate multiplied by the dwell time. Therefore, once the change points and dwell time have been observed, the posterior over each rate is again a gamma distribution, so
formula
5.8
We now derive the continuum limit of equation 4.8. One key step of the derivation is a change of variables in the change-point matrix, where we replace the non-change-point counts with dwell times. This is necessary because the non-change-point counts diverge in the continuum limit. In this limit, the modified change-point matrix becomes
formula
where the off-diagonal entries are the change-point counts, while the diagonal entries are the dwell times in each state. Thus, taking logarithms, linearizing, and taking the continuum limit, we obtain the following system of stochastic partial differential equations (SPDEs) for the log likelihoods:
formula
5.9a
formula
5.9b
where the drift and noise are defined as before (for details, see section A.3). Note that the flux terms account for the flow of probability to longer dwell times. For example, the SPDE for the likelihood of the first state has a flux term for the linear increase of the dwell time in that state, since this represents the environment remaining there. These flux terms simply propagate the probability densities over dwell times, causing no net change in the probability of residing in either state.
Equation 5.9 generalizes equation 4.8 as an infinite set of SPDEs, indexed by the discrete change-point counts. Each SPDE is posed over the space of dwell times. Initial conditions are determined by the priors: for consistency with the gamma priors on the rates, we choose Poisson priors over the change-point counts and Dirac delta distribution priors over the dwell times,
formula
5.10
As before, equation 5.9 at the boundaries of zero change-point counts is a special case, since there is no influx of probability from lower counts.
As in the symmetric case, we can convert equation 5.9 to equations describing the evolution of the likelihoods. Exponentiating the log likelihoods, we find
formula
5.11a
formula
5.11b
where now the initial conditions are given directly by the priors.

5.1.3  Multiple States with Symmetric Rates

The continuum limit in the case of multiple states with symmetric transition rates can be derived as in the two-state case (see section A.4 for details). Again, the rate of switching from any one state to any other state is the same.

Assuming again a gamma prior on the transition rate and proceeding as above, we obtain the SDE
formula
5.12
where the drift and noise terms are defined as in the two-state case.

Equation 5.12 is again an infinite set of stochastic differential equations, one for each pair of environmental state and change-point count. We have some freedom in choosing initial conditions. For example, we can use the Poisson distribution discussed in the case of two states.

The posterior over the transition rate is
formula
where the conditional density is the gamma distribution given by equation 4.9. Similar to equation 3.17, the expected rate is
formula
An equivalent argument can be used to obtain the posterior over the rates in the asymmetric case with multiple states.
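Because the posterior over the rate is a mixture of gamma distributions, the expected rate is a probability-weighted average of conditional gamma means. A sketch, with an assumed posterior over change-point counts and arbitrary hyperparameters:

```python
import numpy as np

def expected_rate(count_probs, alpha, beta, t):
    """Posterior-mean change rate, marginalized over change-point counts.

    `count_probs[a]` is the posterior probability of having seen `a`
    change points by time `t`. Conditioned on `a`, the rate posterior is
    Gamma(alpha + a, beta + t), so the overall expected rate is the
    weighted average of the conditional means. Values are illustrative.
    """
    a = np.arange(len(count_probs))
    return np.dot(count_probs, (alpha + a) / (beta + t))

probs = np.array([0.1, 0.3, 0.4, 0.2])  # assumed posterior over 0-3 change points
print(expected_rate(probs, alpha=1.0, beta=1.0, t=10.0))
```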

5.2  Moment Hierarchy for the Two-State Process

In the previous section, we approximated the evolution of the joint probabilities of environmental states and change-point counts. The result, in the symmetric case, was an infinite set of SDEs, one for each combination of state and change-point count. However, an observer is mainly concerned with the current state of the environment. The change-point count is important for this inference but may not be of direct interest itself. We next derive simpler, approximate models that do not track the entire joint distribution over all change-point counts, only essential aspects of this distribution. We do so by deriving a hierarchy of iterative equations for the moments of the distribution of change-point counts, focusing specifically on the two-state symmetric case.

Our goal in deriving moment equations is to have a low-dimensional, and reasonably tractable, system of SDEs. Similar to previous studies of sequential decision-making algorithms (Bogacz et al., 2006), such low-dimensional systems can be used to inform neurophysiologically relevant population rate models of the evidence accumulation process. To begin, we consider the infinite system of SDEs given in the two-state symmetric case, equation 5.7. Our reduction then proceeds by computing the SDEs associated with the lower-order (zeroth, first, and second) moments over the change-point count:
formula
5.13
We denote the moments using bars and, below, when we discuss cumulants, we will represent them using hats. Note that the “zeroth” moments are the marginal probabilities of each environmental state.
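The moment definitions in equation 5.13 can be sketched directly; the joint distribution below is an assumed example, not one from the paper.

```python
import numpy as np

def moments(p, order):
    """Moment of the change-point count for each environmental state.

    `p[s, a]` approximates the joint likelihood of state s and
    change-point count a; the order-n moment weights each entry by a**n
    and sums over a (cf. equation 5.13). The zeroth moments are the
    marginal probabilities of each state.
    """
    a = np.arange(p.shape[1])
    return p @ (a ** order)

# A small assumed joint distribution over 2 states and counts 0..3.
p = np.array([[0.30, 0.15, 0.05, 0.00],
              [0.20, 0.20, 0.08, 0.02]])
zeroth = moments(p, 0)   # marginal state probabilities: [0.5, 0.5]
first = moments(p, 1)    # unnormalized mean count per state
print(zeroth, first)
```

The ratio of first to zeroth moments gives the mean change-point count conditioned on each state, the quantity that later maps onto synaptic weights.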
We begin by summing equation 5.7 over all change-point counts and applying equation 5.13; this generates an SDE for the evolution of the moments, given by
formula
5.14
where we have used the fact that
formula
The SDE given by equation 5.14 for the zeroth moment,