## Abstract

Adaptively optimizing experiments has the potential to significantly reduce the number of trials needed to build parametric statistical models of neural systems. However, application of adaptive methods to neurophysiology has been limited by severe computational challenges. Since most neurons are high-dimensional systems, optimizing neurophysiology experiments requires computing high-dimensional integrations and optimizations in real time. Here we present a fast algorithm for choosing the most informative stimulus by maximizing the mutual information between the data and the unknown parameters of a generalized linear model (GLM) that we want to fit to the neuron's activity. We rely on important log concavity and asymptotic normality properties of the posterior to facilitate the required computations. Our algorithm requires only low-rank matrix manipulations and a two-dimensional search to choose the optimal stimulus. The average running time of these operations scales quadratically with the dimensionality of the GLM, making real-time adaptive experimental design feasible even for high-dimensional stimulus and parameter spaces. For example, we require roughly 10 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Despite the approximations needed to make the algorithm efficient, it asymptotically decreases the uncertainty about the model parameters at a rate equal to the maximum rate predicted by an asymptotic analysis. Simulation results show that picking stimuli by maximizing the mutual information can speed up convergence to the optimal values of the parameters by an order of magnitude compared to using random (nonadaptive) stimuli.
Finally, applying our design procedure to real neurophysiology experiments requires addressing the nonstationarities that we would expect to see in neural responses; our algorithm can efficiently handle both fast adaptation due to spike history effects and slow, nonsystematic drifts in a neuron's activity.

## 1. Introduction

In most neurophysiology experiments, data are collected according to a design that is finalized before the experiment begins. During the experiment, the data already collected are rarely analyzed to evaluate the quality of the design. These data, however, often contain information that could be used to redesign experiments to better test hypotheses (Fedorov, 1972; Chaloner & Verdinelli, 1995; Kontsevich & Tyler, 1999; Warmuth et al., 2003; Roy, Ghosal, & Rosenberger in press). Adaptive experimental designs are particularly valuable in domains where data are expensive or limited. In neuroscience, experiments often require training and caring for animals, which can be time-consuming and costly. As a result of these costs, neuroscientists are often unable to conduct large numbers of trials using different subjects. The inability to collect enough data makes it difficult for them to investigate high-dimensional, complex neural systems. By using adaptive experimental designs, neuroscientists could potentially collect data more efficiently. In this article, we develop an efficient algorithm for optimally adapting the experimental design in one class of neurophysiology experiments.

Let **s**_{t} and *r*_{t} denote the stimulus and firing rate at time *t*, respectively. When we optimize the input for time *t* + 1, we can control only **s**_{t+1}, as the rest of the components of the input (i.e., past stimuli and responses) are fixed. To distinguish the controllable and fixed components of the input, we use the subscripts *x* and *f*: **x**_{t} is the full input at time *t*, and **x**_{f,t} is a vector comprising the past stimuli and responses on which the response at time *t* depends. *t*_{k} and *t*_{a} are how far back in time the dependence on the stimuli and responses stretches (i.e., if *t*_{k} = 0 and *t*_{a} = 0, then **x**_{t} = **s**_{t}). Not all models will include a dependence on past stimuli or responses; the values of *t*_{k} and *t*_{a} will depend on the model adopted for a particular experiment.

We can describe a model that incorporates all of these features by specifying the conditional distribution of the responses given the input. This distribution gives the probability of observing response *r*_{t} at time *t* given the input **x**_{t}. We use a distribution as opposed to a deterministic function to specify the relationship between *r*_{t} and **x**_{t} because a neuron's response varies for repeated presentations of a stimulus. To simplify the model, we restrict our consideration to parametric distributions that lie in some space Θ. Each vector θ ∈ Θ denotes a particular model in this space. To fit a model, p(*r*_{t} | **x**_{t}, θ), to a neuron, we need to find the best value of θ.

We estimate θ by observing the neuron's response to various stimuli. For these experiments, the design is a procedure for picking the stimulus on each trial. The design can be specified as a probability distribution, p(**s**_{t+1}), from which we sample the stimulus on each trial. Nonrandom designs can be specified by putting all the probability mass on a single stimulus. A sequential design modifies this distribution after each observation. In contrast, the standard nonsequential approach is to fix this distribution before the experiment starts, and then select the stimulus on each trial by drawing independent and identically distributed (i.i.d.) samples from it. Figure 1 provides a schematic of the sequential approach we want to implement, as well as a diagram of the typical i.i.d. design.

Designing experiments by maximizing the mutual information is computationally challenging. The information we expect to gain from an experiment depends on what we have already learned from past observations. To extract the information from past observations, we need to compute the posterior distribution after each trial. Once we have updated the posterior, we need to use it to compute the expected information gain from future experiments; this requires a high-dimensional integration over the space Θ. Maximizing this integral with respect to the design requires a nonlinear search over the high-dimensional stimulus space. In sensory neurophysiology, the stimulus space is high-dimensional because the stimuli tend to be complex, spatiotemporal signals like movies and sounds. The challenge of evaluating this high-dimensional integral and solving the resulting nonlinear optimization has impeded the application of adaptive experimental design to neurophysiology. In the worst case, the complexity of these operations will grow exponentially with the dimensionality of the stimulus and parameter spaces. For even moderately sized spaces, direct computation will therefore be intractable, particularly if we wish to adapt the design in a real-time application.

The main contribution of this article is to show how these computations can be performed efficiently when Θ is the space of generalized linear models (GLM) and the posterior distribution on θ is approximated as a gaussian. Our solution depends on some important log-concavity and rank-one properties of our model. These properties justify the gaussian approximation of the posterior distribution and permit a rapid update after each trial. These properties also allow optimization of the mutual information to be approximated by a tractable two-dimensional problem that can be solved numerically. The solution to this 2D optimization problem depends on the stimulus domain. When the stimulus domain is defined by a power constraint, we can easily find the nearly optimal design. For arbitrary stimulus domains, we present a general algorithm for selecting the optimal stimulus from a finite subset of stimuli in the domain. Our analysis leads to efficient heuristics for constructing this subset to ensure the resulting design is close to the optimal design.

Our algorithm facilitates estimation of high-dimensional systems because picking more informative designs leads to faster convergence to the best model of the neuron. In our simulations (see section 5.4), the optimal design converges more than an order of magnitude faster than an i.i.d. design. Our algorithm can be applied to high-dimensional, real-time applications because it reduces the running time with respect to dimensionality from exponential to, on average, quadratic.

This article is organized as follows. In section 2, we present the GLM of neural systems. In section 3, we present an online method for computing a gaussian approximation of the posterior distribution on the GLM's parameters. In section 4, we show how the mutual information can be approximated by a much simpler, low-dimensional function. In section 5, we present the procedure for picking optimal stimuli and show some simulation results. In section 6, we generalize our basic methods to some important extensions of the GLM needed to handle more complicated experiments. In section 7, we show that our algorithm asymptotically decreases the uncertainty about θ at a rate nearly equal to the optimal rate predicted by a general theorem on the rate of convergence of information maximizing designs (Paninski, 2005). We therefore conclude that this efficient (albeit approximate) implementation produces designs that are in fact asymptotically optimal. Simulations investigating the issue of model misspecification are presented in section 8. Finally, we discuss some limitations and directions for future work in section 9. To help the reader, we summarize in Table 1 the notation that we will use in the rest of the article.

## 2. The Parametric Model

For the model space, Θ, we choose the set of generalized linear models (GLM) (see Figure 2). The GLM is a tractable and flexible parametric family that has proven useful in neurophysiology (McCullagh & Nelder, 1989; Simoncelli, Paninski, Pillow, & Schwartz, 2004; Paninski, 2004; Truccolo et al., 2005; Paninski, Pillow, & Lewi, 2007). GLMs are fairly natural from a physiological point of view, with close connections to biophysical models such as the integrate-and-fire cell. Consequently, they have been applied in a wide variety of experimental settings (Brillinger, 1988, 1992; Chichilnisky, 2001; Theunissen et al., 2001; Paninski, Shoham, Fellows, Hatsopoulos, & Donoghue, 2004).

In the GLM, the response is a Poisson random variable whose mean firing rate is a nonlinear function of the input. As noted earlier, the response at time *t* depends on the current stimulus, **s**_{t}, as well as past stimuli and responses. The inclusion of spike history in the input means we can account for refractory effects, burstiness, and firing-rate adaptation (Berry & Meister, 1998; Keat et al., 2001; Paninski, 2004; Truccolo et al., 2005). As noted earlier, we use subscripts to distinguish the components that we can control from those that are fixed (see Table 1).

The firing rate is obtained by passing a linear projection of the input through a nonlinearity *f*(), known as the link function. The input-output relationship of the neuron is fully specified by the log likelihood of the response given the input and θ,

log p(*r*_{t} | **x**_{t}, θ) = *r*_{t} log *f*(ρ_{t}) − *f*(ρ_{t})*dt* + const,

where *dt* is the length of the time window over which we measure the firing rate, *r*_{t}. The constant term is constant with respect to θ but not *r*_{t}. In this article, we always use a Poisson distribution for the conditional likelihood, p(*r*_{t} | **x**_{t}, θ), because it is a natural model for spiking neurons. However, by making some minor modifications to our algorithm, we can use it with other distributions in the exponential family (Lewi, Butera, & Paninski, 2007).

An important property of the GLM is that the log likelihood is guaranteed to be concave in θ provided *f*() is a convex and log concave function (Wedderburn, 1976; Haberman, 1977; McCullagh & Nelder, 1989; Paninski, 2004). A multidimensional function *f*() can be convex and log concave only if its contours are linear. When the contours are linear, we can, without loss of generality, assume that *f*() is a function of a scalar variable, ρ_{t}. ρ_{t} is the result of applying the linear filter of the GLM to the input, ρ_{t} = θ^{T}**x**_{t}. Since ρ_{t} is a scalar, θ must be a vector and not a matrix. Convexity of *f*() also guarantees that the nonlinearity is monotonic. Since we can always multiply θ by negative 1 (i.e., flip our coordinate system), we can without loss of generality assume that *f* is increasing. Furthermore, we assume *f*() is known, although this condition could potentially be relaxed. Knowing *f*() exactly is not essential because previous work (Li & Duan, 1989; Paninski, 2004) and our own results (see section 8) indicate that the parameters of a GLM can often be estimated, at least up to a scaling factor, even if the link function is incorrect.
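As a concrete illustration, the generative model above can be sketched in a few lines of NumPy. This is a minimal sketch, not the article's code: the exponential link, filter lengths, time step, and function name are illustrative assumptions.

```python
import numpy as np

def simulate_glm(theta_stim, theta_hist, stimuli, dt=0.01, rng=None):
    """Sample spike counts from a Poisson GLM with an exponential link.

    theta_stim : stimulus filter (length d)
    theta_hist : spike-history filter (length t_a); most recent bin first
    stimuli    : array of shape (T, d), one stimulus per time bin
    """
    if rng is None:
        rng = np.random.default_rng(0)
    t_a = len(theta_hist)
    counts = np.zeros(len(stimuli))
    for t in range(len(stimuli)):
        # linear projection rho_t = theta' x_t (current stimulus + spike history)
        hist = counts[max(0, t - t_a):t][::-1]
        rho = stimuli[t] @ theta_stim + hist @ theta_hist[:len(hist)]
        # conditional mean is f(rho) * dt with f = exp
        counts[t] = rng.poisson(np.exp(rho) * dt)
    return counts
```

A negative spike-history filter, for example, produces the refractory suppression mentioned above.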

## 3. Representing and Updating the Posterior

Our first computational challenge is representing and updating the posterior distribution on the parameters, p(θ | **x**_{1:t}, *r*_{1:t}). We use a fast, sequential procedure for constructing a gaussian approximation of the posterior (see Figure 3). This gaussian approximation leads to an update that is both efficient and accurate enough to be used online for picking optimal stimuli.

A gaussian approximation of the posterior is justified by the fact that the posterior is the product of two smooth, log-concave terms—the GLM likelihood function and the prior (which we assume to be gaussian, for simplicity). As a result, the log posterior is concave (i.e., it always curves downward) and can be well approximated by the quadratic expression for the log of a gaussian. Furthermore, the main result of Paninski (2005) is a central limit-like theorem for optimal experiments based on maximizing the mutual information. This theorem guarantees that asymptotically, the gaussian approximation of the posterior will be accurate.

Since the gaussian approximation of the posterior after trial *t* − 1, N(μ_{t−1}, *C*_{t−1}), summarizes the information in the first *t* − 1 trials, we can use this distribution to approximate the log posterior after the *t*th trial,

log p(θ | **x**_{1:t}, *r*_{1:t}) ≈ log N(θ; μ_{t−1}, *C*_{t−1}) + log p(*r*_{t} | **x**_{t}, θ) + const.    (3.2)

We fit the log of a gaussian to the approximation of the log posterior in equation 3.2, using the Laplace method (Berger, 1985; MacKay, 2003). This recursive approach is much faster, albeit slightly less accurate, than using the Laplace method to fit a gaussian distribution to the true posterior. The running time of this recursive update is *O*(*d*^{2}), whereas fitting a gaussian distribution to the true posterior is *O*(*td*^{3}). Since *t* and *d* are large, easily *O*(10^{3}), the computational savings of the recursive approach are well worth the slight loss of accuracy. If the dimensionality is low, *d* = *O*(10), we can measure the error by using Monte Carlo methods to compute the Kullback-Leibler distance between the true posterior and our gaussian approximation. This analysis (results not shown) reveals that the error is small and rapidly converges to zero.

Updating the mean, μ_{t}, requires finding the peak of equation 3.2. Because the log likelihood depends on θ only through the projection ρ_{t} = θ^{T}**x**_{t}, provided *C*_{t−1} is nonsingular, the difference μ_{t} − μ_{t−1} must be parallel to *C*_{t−1}**x**_{t}. Δ_{t} is a scalar that measures the magnitude of the difference, μ_{t} − μ_{t−1}. We find Δ_{t} by solving a one-dimensional equation (equation 3.7) using Newton's method. This equation defines the location of the peak of the log posterior in the direction *C*_{t−1}**x**_{t}. Since the log posterior is concave, equation 3.7 is the solution to a one-dimensional concave optimization problem. Equation 3.7 is therefore guaranteed to have a single, unique solution. Solving this one-dimensional problem involves a single matrix-vector multiplication that requires *O*(*d*^{2}) time.
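For concreteness, the one-dimensional Newton search can be written out for the exponential link, *f*(ρ) = exp(ρ). With *s* = **x**^{T}_{t}*C*_{t−1}**x**_{t} and *m* = μ^{T}_{t−1}**x**_{t}, setting the derivative of the log posterior along *C*_{t−1}**x**_{t} to zero reduces to a scalar root-finding problem. This is a sketch under those assumptions; the function name and parameterization are ours, not the article's.

```python
import numpy as np

def solve_delta(r, m, s, dt, n_iter=30):
    """Newton's method for the 1D concave problem defining Delta_t.

    For f(rho) = exp(rho), the stationarity condition along C_{t-1} x_t
    reduces to solving
        g(Delta) = r - Delta - exp(m + Delta * s) * dt = 0,
    where m = mu_{t-1}' x_t and s = x_t' C_{t-1} x_t.
    g is strictly decreasing, so the root is unique.
    """
    delta = 0.0
    for _ in range(n_iter):
        g = r - delta - np.exp(m + delta * s) * dt
        gprime = -1.0 - s * np.exp(m + delta * s) * dt
        delta -= g / gprime
    return delta
```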

We compute the covariance matrix *C*_{t} of the posterior by forming a second-order Taylor approximation of equation 3.2 about μ_{t}. The Laplace method uses the curvature of the log posterior as an estimate of the inverse covariance matrix. The larger the curvature, the more certain we are that our estimate is close to the true parameters. The curvature, as measured by the second derivative, is the sum of two terms, equation 3.10. The first term approximates the information provided by the first *t* − 1 observations. The second term measures the information in our latest observation, *r*_{t}. The second term is proportional to the Fisher information. By definition, the Fisher information is the negative of the second derivative of the log likelihood (Berger, 1985). The second derivative of the log likelihood provides an intuitive metric for the informativeness of an observation because a larger second derivative means that small differences in θ produce large deviations in the responses. Hence, a large Fisher information means we can infer the parameters with more confidence.

The updated inverse covariance matrix is the sum of *C*^{−1}_{t−1} and the Hessian of the log likelihood. The Hessian of the log likelihood is a rank 1 matrix. We can therefore efficiently invert the Hessian of the updated log posterior in *O*(*d*^{2}) time using the Woodbury matrix lemma (Henderson & Searle, 1981; Seeger, 2007). Evaluating the derivatives in equation 3.10 and using the Woodbury lemma yields

*C*_{t} = *C*_{t−1} − [*D*(*r*_{t}, ρ_{t}) / (1 + *D*(*r*_{t}, ρ_{t})**x**^{T}_{t}*C*_{t−1}**x**_{t})] (*C*_{t−1}**x**_{t})(*C*_{t−1}**x**_{t})^{T},

where *D*(*r*_{t}, ρ_{t}) is the one-dimensional Fisher information, that is, the negative of the second derivative of the log likelihood with respect to ρ_{t}. In this equation, ρ_{t} depends on the unknown parameters, θ, because we would like to compute the Fisher information for the true parameters. That is, we would like to expand our approximation of the log posterior about the true θ. Since θ is unknown, we use the approximation ρ_{t} ≈ μ^{T}_{t}**x**_{t} to compute the new covariance matrix. Since computing the covariance matrix is just a rank one update, computing the updated gaussian approximation requires only *O*(*d*^{2}) computations. A slower but potentially more accurate update for small *t* would be to construct our gaussian by matching the first and second moments of the true posterior distribution using the expectation propagation algorithm (Minka, 2001; Seeger, Gerwinn, & Bethge, 2007).

Asymptotically under suitable regularity conditions, the mean of our gaussian is guaranteed to converge to the true θ. Consistency can be established by applying theorems for the consistency of estimators based on stochastic gradient descent (Fabian, 1978; Sharia, 2007). We used numerical simulations (data not shown) to verify the predictions of these theorems. To apply these theorems to our update, we must be able to restrict θ to a closed and bounded space. Since all θ corresponding to neural models would naturally be bounded, this constraint is satisfied for all biologically reasonable GLMs.

Our update uses the Woodbury lemma, which is unstable when *C*_{t} is close to being singular. When optimizing under a power constraint (see section 5.2), we can avoid using the Woodbury lemma by computing the eigendecomposition of the covariance matrix. Since we need to compute the eigendecomposition in order to optimize the stimulus, no additional computation is required in this case. When the eigendecomposition was not needed for optimization, we usually found that the Woodbury lemma was sufficiently stable. However, a more stable solution in this case would have been to compute and maintain the Cholesky decomposition of the covariance matrix (Seeger, Steinke, & Tsuda, 2007).
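The rank-one covariance update can be sketched directly from the Woodbury lemma. This is a minimal illustration, assuming the Fisher information *D* has already been evaluated at the point estimate; the function name is ours.

```python
import numpy as np

def update_covariance(C, x, D):
    """Rank-one posterior covariance update via the Woodbury lemma.

    Computes the inverse of (C^{-1} + D x x') in O(d^2) time:
        C_t = C_{t-1} - D (C x)(C x)' / (1 + D x' C x).
    """
    Cx = C @ x
    return C - np.outer(Cx, Cx) * (D / (1.0 + D * (x @ Cx)))
```

For comparison, the naive route, `np.linalg.inv(np.linalg.inv(C) + D * np.outer(x, x))`, costs *O*(*d*^{3}) per trial.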

## 4. Computing the Mutual Information

Before diving into a detailed mathematical computation, we want to provide a less technical explanation of our approach. Before we conduct any trials, we have a set, Θ, of possible models. For any stimulus, each model in Θ makes a prediction of the response. To identify the best model, we should pick a stimulus that maximizes the disagreement between the predictions of the different models. In theory, we could measure the disagreement for any stimulus by computing the predicted response for each model. However, since the number of possible models is large, explicitly computing the response for each model is rarely possible.

We can compute the mutual information efficiently because picking a stimulus partitions the model space, Θ, into sets that are equivalent with respect to the predicted response. Once we fix **x**_{t+1}, the likelihood of the responses varies only with the projection ρ_{t+1} = θ^{T}**x**_{t+1}. Hence, all models with the same value for ρ_{t+1} make the same prediction. Therefore, instead of computing the disagreement among all models in Θ, we only have to compute the disagreement between the models in these different subspaces; that is, at most, we have to determine the response for one model in each of the subspaces defined by ρ_{t+1} = *const*.

Of course, the mutual information also depends on what we already know about the fitness of the different models. Since our experiment provides no information about θ in directions orthogonal to **x**_{t+1}, our uncertainty in these directions will be unchanged. Therefore, the mutual information will only depend on the information we have about θ in the direction **x**_{t+1}; that is, it depends only on the marginal posterior of ρ_{t+1} instead of our full posterior on θ.

Furthermore, we have to evaluate the mutual information only for nonrandom designs because any optimal design must place all of its mass on the stimulus that maximizes the conditional mutual information (MacKay, 1992; Paninski, 2005). This property means we can focus on the simpler problem of efficiently evaluating the conditional mutual information as a function of the input **x**_{t+1}.

The mutual information is the difference between our uncertainty about θ at time *t* and our expected uncertainty at time *t* + 1 (equation 4.2). Since the entropy at time *t* is independent of **x**_{t+1}, we just need to minimize the second term, which measures how uncertain about θ we expect to be after the next trial. Our uncertainty at time *t* + 1 depends on the response to the stimulus. Since *r*_{t+1} is unknown, we compute the expected entropy of the posterior as a function of *r*_{t+1} and then take the average over *r*_{t+1}, using our GLM to compute the likelihood of each *r*_{t+1} (MacKay, 1992; Chaloner & Verdinelli, 1995). Since the likelihood of *r*_{t+1} depends on the unknown model parameters, we also need to take an expectation over θ. To evaluate the probability of the different θ, we use our current posterior, p(θ | **x**_{1:t}, *r*_{1:t}).

We can compute the expected entropy as a function of *r*_{t+1} by first approximating the posterior at time *t* + 1 as gaussian. The entropy of a gaussian is easy to compute (Cover & Thomas, 1991): it is proportional to the log determinant of the covariance matrix, plus a constant. According to our update rule, the covariance at time *t* + 1 depends on the Fisher information of the next observation. As discussed in the previous section, the Fisher information depends on the unknown parameters. To compute the entropy, we treat the Fisher information as a random variable since it is a function of θ. We then estimate our expected uncertainty as the expectation of the posterior entropy with respect to θ using the posterior at time *t*. The mutual information, equation 4.2, already entails computing an average over θ, so we do not need to introduce another integration.

When updating the posterior after trial *t*, we use the point estimate ρ_{t} ≈ μ^{T}_{t}**x**_{t} to estimate the Fisher information of the observation at time *t*. We could apply the same principle to compute the expected posterior entropy by using the approximation ρ_{t+1} ≈ μ^{T}_{t+1}**x**_{t+1}, where μ_{t+1} is computed using equations 3.6 and 3.7. Using this approximation of ρ_{t+1} is intractable because we would need to solve for μ_{t+1} numerically for each value of *r*_{t+1}. We could solve this problem by using the point approximation ρ_{t+1} ≈ μ^{T}_{t}**x**_{t+1}, which we can easily compute since μ_{t} is known (MacKay, 1992; Chaudhuri & Mykland, 1993; Cohn, 1994). This point approximation means we estimate the Fisher information for each possible input using the assumption that θ = μ_{t}. Unless μ_{t} happens to be close to θ, there is no reason that the Fisher information computed assuming θ = μ_{t} should be close to the Fisher information evaluated at the true parameters. In particular, at the start of an experiment when μ_{t} is highly inaccurate, we would expect this point approximation to lead to poor estimates of the Fisher information. Similarly, we would expect this point approximation to fail for time-varying systems as the posterior covariance may no longer converge to zero asymptotically (see section 6.2). In contrast to using a point approximation, our approach of averaging the Fisher information with respect to θ should provide much better estimates of the Fisher information when our uncertainty about θ is high or when θ is changing (Lindley, 1956; Chaloner & Verdinelli, 1995). Averaging the expected information with respect to our posterior leads to an objective function that takes into account all possible models. In particular, it means we favor inputs that are informative under all models with high probability as opposed to inputs that are informative only if θ = μ_{t}.

*C*_{t+1} is a rank 1 update of *C*_{t}. Hence, we can use the matrix determinant identity, det(*C* + **uv**^{T}) = det(*C*)(1 + **v**^{T}*C*^{−1}**u**), to evaluate the entropy at time *t* + 1: the entropy decreases by (1/2)log(1 + *D*(*r*_{t+1}, ρ_{t+1})**x**^{T}_{t+1}*C*_{t}**x**_{t+1}). Consequently, maximizing the expected information gain reduces to maximizing the expectation of this quantity, equation 4.12. We can evaluate equation 4.12 without doing any high-dimensional integration because the likelihood of the responses depends only on ρ_{t+1}. As a result, the expectation over θ reduces to a one-dimensional expectation over ρ_{t+1}. Since ρ_{t+1} = θ^{T}**x**_{t+1} and the posterior on θ is gaussian, ρ_{t+1} is a one-dimensional gaussian variable with mean μ_{ρ} = μ^{T}_{t}**x**_{t+1} and variance σ^{2}_{ρ} = **x**^{T}_{t+1}*C*_{t}**x**_{t+1}. The final result is a very simple, two-dimensional expression for our objective function,

I(**x**_{t+1}) ≈ (1/2) E_{ρ_{t+1}} E_{*r*_{t+1}|ρ_{t+1}} log(1 + *D*(*r*_{t+1}, ρ_{t+1})σ^{2}_{ρ}).    (4.14)

The right-hand side of equation 4.14 is an approximation of the mutual information because the posterior is not in fact gaussian.

Equation 4.14 is a fairly intuitive metric for rating the informativeness of different designs. To distinguish among models, we want the response to be sensitive to . The information increases with the sensitivity because as the sensitivity increases, small differences in produce larger differences in the response, making it easier to identify the correct model. The information, however, also depends on the variability of the responses. As the variability of the responses increases, the information decreases because it is harder to determine which model is more accurate. The Fisher information, *D*(*r*_{t+1}, ρ_{t+1}), takes into account both the sensitivity and the variability. As the sensitivity increases, the second derivative of the log likelihood increases because the peak of the log likelihood becomes sharper. Conversely, as the variability increases, the log likelihood becomes flatter, and the Fisher information decreases. Hence, *D*(*r*_{t+1}, ρ_{t+1}) measures the informativeness of a particular response. However, information is valuable only if it tells us something we do not already know. In our objective function, σ^{2}_{ρ} measures our uncertainty about the model. Since our objective function depends on the product of the Fisher information and our uncertainty, our algorithm will favor experiments providing large amounts of new information.

In equation 4.14 we have reduced the mutual information to a two-dimensional integration over ρ_{t+1} and *r*_{t+1}, which depends on (μ_{ρ}, σ^{2}_{ρ}). While 2D numerical integration is quite tractable, it could potentially be too slow for real-time applications. A simple solution is to precompute this function before training begins on a suitable 2D region of (μ_{ρ}, σ^{2}_{ρ}) and then use a lookup table during our experiments.
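The precomputation can be sketched as follows. For concreteness we assume the exponential-Poisson case, in which *D*(*r*, ρ) = exp(ρ)*dt* does not depend on *r*, so the inner expectation over *r*_{t+1} drops out; for other link functions an inner sum over *r* would be needed. The grid, time step, and quadrature order are illustrative assumptions.

```python
import numpy as np

def info_table(mus, sigma2s, dt=0.01, n_quad=40):
    """Tabulate E_rho[log(1 + D(rho) * sigma2_rho)] on a (mu, sigma2) grid.

    rho ~ N(mu, sigma2); D(rho) = exp(rho) * dt for the exponential link.
    Uses Gauss-Hermite quadrature (probabilists' convention) over rho.
    """
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    weights = weights / weights.sum()       # normalize to a standard-normal average
    table = np.empty((len(mus), len(sigma2s)))
    for i, mu in enumerate(mus):
        for j, s2 in enumerate(sigma2s):
            rho = mu + np.sqrt(s2) * nodes  # quadrature nodes for N(mu, s2)
            table[i, j] = weights @ np.log1p(np.exp(rho) * dt * s2)
    return table
```

At run time, each candidate stimulus is reduced to (μ_{ρ}, σ^{2}_{ρ}) and scored by bilinear interpolation into such a table.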

To derive an even simpler objective function, we can use the expansion log(1 + *x*) = *x* + *o*(*x*), which is valid when *D*(*r*_{t+1}, ρ_{t+1})σ^{2}_{ρ} is sufficiently small. Using this linear approximation, we can simplify equation 4.14 to the expected Fisher information scaled by our uncertainty, E[*D*(*r*_{t+1}, ρ_{t+1})]σ^{2}_{ρ}, which may be evaluated analytically in some special cases (see below). If θ is constant, then this approximation is always justified asymptotically because the variance in all directions asymptotically converges to zero (see section 7). Consequently, σ^{2}_{ρ} → 0 as *t* → ∞. Therefore, if *D*(*r*_{t+1}, ρ_{t+1}) is bounded, then asymptotically *D*(*r*_{t+1}, ρ_{t+1})σ^{2}_{ρ} → 0.

### 4.1. Special Case: Exponential Nonlinearity.

When *f*() is the exponential function, we can derive an analytical approximation for the mutual information, equation 4.14, because the Fisher information is independent of the observation. This special case is worth considering because the exponential nonlinearity has proved adequate for modeling several types of neurons in the visual system (Chichilnisky, 2001; Pillow, Paninski, Uzzell, Simoncelli, & Chichilnisky, 2005; Rust, Mante, Simoncelli, & Movshon, 2006). As noted in the previous section, the Fisher information depends on the variability and sensitivity of the responses to the model parameters. In general, the Fisher information depends on the response because we can use it to estimate the variability and sensitivity of the neuron's responses. For the Poisson model with convex and increasing *f*(),^{1} a larger response indicates more variability but also more sensitivity of the response to ρ_{t+1}. For the exponential nonlinearity, the decrease in information due to increased variability and the increase in information due to increased sensitivity cancel out, making the Fisher information independent of the response. Mathematically, this means the second derivative of the log likelihood with respect to ρ_{t+1} is independent of *r*_{t+1},

*D*(*r*_{t+1}, ρ_{t+1}) = exp(ρ_{t+1})*dt*.

By eliminating the expectation over *r*_{t+1} and using the linear approximation log(1 + *x*) = *x* + *o*(*x*), we can simplify equation 4.14 to a quantity proportional to E_{ρ_{t+1}}[exp(ρ_{t+1})]σ^{2}_{ρ}*dt*. We can use the moment-generating function of a gaussian distribution to evaluate this expectation over ρ_{t+1}:

E_{ρ_{t+1}}[exp(ρ_{t+1})] = exp(μ_{ρ} + σ^{2}_{ρ}/2).

Our objective function is increasing in μ_{ρ} and σ^{2}_{ρ}. In section 5.2, we show that this property makes optimizing the design for an exponential nonlinearity particularly tractable.
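The moment-generating-function step is easy to check numerically: for ρ ~ N(μ_{ρ}, σ^{2}_{ρ}), E[exp(ρ)] = exp(μ_{ρ} + σ^{2}_{ρ}/2). A quick Monte Carlo sketch (the parameter values and sample size are arbitrary):

```python
import numpy as np

mu_rho, sigma2_rho = 0.5, 0.64
rng = np.random.default_rng(0)
rho = rng.normal(mu_rho, np.sqrt(sigma2_rho), size=200_000)

analytic = np.exp(mu_rho + sigma2_rho / 2.0)  # gaussian MGF evaluated at 1
monte_carlo = np.exp(rho).mean()
assert abs(monte_carlo / analytic - 1.0) < 0.02
```

The resulting objective, exp(μ_{ρ} + σ^{2}_{ρ}/2)σ^{2}_{ρ}, is then manifestly increasing in both arguments.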

### 4.2. Linear Model.

The optimal design for minimizing the posterior entropy of θ for the standard linear model is a well-known result in the statistics and experimental design literature (MacKay, 1992; Chaloner & Verdinelli, 1995). It is enlightening to rederive these results using the methods we have introduced so far and to point out some special features of the standard linear case.

For the linear model, the observation noise variance, σ^{2}, is constant. Furthermore, the sensitivity of the responses to the input and the model parameters is also constant. Consequently, the Fisher information is independent of both the response and the input (Chaudhuri & Mykland, 1993). Mathematically, this means that the observed Fisher information *D*(*r*_{t+1}, ρ_{t+1}) is a constant equal to the reciprocal of the variance,

*D*(*r*_{t+1}, ρ_{t+1}) = 1/σ^{2}.

Plugging *D*(*r*_{t+1}, ρ_{t+1}) into equation 4.14, we obtain the simple result that the objective is proportional to log(1 + σ^{2}_{ρ}/σ^{2}). Since σ^{2} is a constant, we can increase the mutual information only by picking stimuli for which σ^{2}_{ρ} is maximized. Under the power constraint, σ^{2}_{ρ} is maximized when all the stimulus energy is parallel to the maximum eigenvector of *C*_{t}, the direction of maximum uncertainty. μ_{ρ} does not affect the optimization at all. This property distinguishes the linear model from the exponential Poisson case described above. Furthermore, the covariance matrix *C*_{t} is independent of past responses because the true posterior is gaussian with covariance matrix

*C*_{t} = (*C*^{−1}_{0} + (1/σ^{2}) Σ^{t}_{i=1} **x**_{i}**x**^{T}_{i})^{−1}.

Consequently, the optimal sampling strategy can be determined a priori, without having to observe *r*_{t} or to make any corresponding adjustments in our sampling strategy (MacKay, 1992).

Like the Poisson model with an exponential link function, the linear model's Fisher information is independent of the response. However, for the linear model, the Fisher information is also independent of the model parameters. Since the Fisher information is independent of the parameters, an adaptive design offers no benefit because we do not need to know the parameters to select the optimal input. In contrast, for the Poisson distribution with an exponential link function, the Fisher information depends on the parameters and the input, even though it is independent of the responses. As a result, we can improve our design by adapting it as our estimate of θ improves.
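The linear-model argument can be verified directly: under a unit power constraint, σ^{2}_{ρ} = **x**^{T}*C*_{t}**x** is maximized by the top eigenvector of *C*_{t} (a Rayleigh-quotient fact). A small sketch, with an arbitrary example covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
C = A @ A.T + np.eye(6)        # stand-in posterior covariance (SPD)

evals, evecs = np.linalg.eigh(C)
x_opt = evecs[:, -1]           # top eigenvector; eigh sorts eigenvalues ascending

# sigma^2_rho for the top eigenvector equals the largest eigenvalue,
# and no other unit-power stimulus does better.
assert np.isclose(x_opt @ C @ x_opt, evals[-1])
for _ in range(100):
    x = rng.normal(size=6)
    x /= np.linalg.norm(x)
    assert x @ C @ x <= evals[-1] + 1e-9
```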

## 5. Choosing the Optimal Stimulus

Having reduced the mutual information to a function of (μ_{ρ}, σ^{2}_{ρ}), choosing the optimal stimulus requires maximizing this two-dimensional function over the feasible values of (μ_{ρ}, σ^{2}_{ρ}) and then mapping the optimal (μ_{ρ}, σ^{2}_{ρ}) back into the input space. In general, computing the feasible set of (μ_{ρ}, σ^{2}_{ρ}) for arbitrary stimulus domains is the hardest step.

We first present a general procedure for handling arbitrary stimulus domains. This procedure selects the optimal stimulus from a finite set of candidate stimuli drawn from the full stimulus domain. Picking the optimal input in this candidate set is easy: we simply compute (μ_{ρ}, σ^{2}_{ρ}) for each candidate and evaluate the objective function.

Picking the optimal stimulus in a finite set is flexible and straightforward. The informativeness of the resulting design, however, is highly dependent on how the candidate set is constructed. In particular, we want to ensure that with high probability, the set contains inputs that are nearly optimal over the full domain. If we could compute the optimal stimulus directly, then we could avoid the problem of picking a good candidate set. One case in which we can compute the optimal stimulus is when the domain is defined by a power constraint; that is, the domain is a sphere. Since we can compute the optimal stimulus in this case, we can optimize the input over its full domain. Unfortunately, our method for computing it cannot be applied to arbitrary input domains.

### 5.1. Optimizing over a Finite Set of Stimuli.

Our first method simultaneously addresses two issues: how to deal with arbitrary stimulus domains and what to do if the stimulus domain is ill defined. In general, we expect that more efficient procedures for mapping a stimulus domain into (μ_{ρ}, σ^{2}_{ρ}) could be developed by taking into account the actual stimulus domain. However, a generalized procedure is needed because efficient algorithms for a particular stimulus domain may not exist, or their development may be complex and time-consuming. Furthermore, for many stimulus domains (e.g., natural images), we have many examples of the stimuli but no quantitative constraints that define the domain. An obvious solution to both problems is to simply choose the best stimulus from a finite subset of examples.

The challenge with this approach is picking the candidate set. For the optimization to be fast, the set needs to be sufficiently small. However, we also want to ensure that it contains an optimal or nearly optimal input. In principle, this second criterion means the set should contain a large number of stimuli evenly dispersed over the stimulus domain. We can in fact satisfy both requirements because the informativeness of a stimulus depends only on (μ_{ρ}, σ^{2}_{ρ}). Consequently, we can partition the stimulus domain into sets of equally informative experiments based on the value of (μ_{ρ}, σ^{2}_{ρ}). When constructing the candidate set, there is no reason to include more than one input for each value of (μ_{ρ}, σ^{2}_{ρ}) because all of these inputs are equally informative. Hence, to ensure that the set contains a nearly optimal input, we just need its stimuli to span the two-dimensional space of (μ_{ρ}, σ^{2}_{ρ}) and not the much higher-dimensional stimulus space.

Although the posterior mean and covariance *C*_{t} change with time, these quantities are known when optimizing the next stimulus. Hence, the mapping from a stimulus to (μ_{ρ}, σ^{2}_{ρ}) is known and easy to evaluate for any stimulus. We can use this knowledge to develop simple heuristics for selecting inputs that tend to be dispersed throughout the feasible region. We delay until sections 5.3 and 6.1 the presentation of the heuristics that we used in our simulations so that we can first introduce the specific problems and stimulus domains for which these heuristics are suited.

### 5.2. Power Constraint.

Ideally, we would like to optimize the input over its full domain rather than restricting ourselves to a subset of inputs. Here we present a method for computing the feasible region of (μ_{ρ}, σ^{2}_{ρ}) when the stimulus domain is defined by a power constraint.^{2} This is an important stimulus domain because of its connection to white noise, which is often used to study sensory systems (Eggermont, 1993; Cottaris & De Valois, 1998; Chichilnisky, 2001; Dayan & Abbot, 2001; Wu, David, & Gallant, 2006). Under an i.i.d. design, stimuli sampled from this domain resemble white noise. The primary difference is that we strictly enforce the power constraint, whereas for white noise, the power constraint applies only to the average power of the input. This domain is also worth considering because it defines a large space that includes many important subsets of stimuli such as random dot patterns (DiCarlo, Johnson, & Hsiao, 1998).

For any fixed value of μ_{ρ}, the attainable values of σ^{2}_{ρ} form an interval: σ^{2}_{ρ} is continuous on the interval between its maximum and minimum values. These properties mean we can compute the boundary of the feasible region by maximizing and minimizing σ^{2}_{ρ} as a function of μ_{ρ}; the feasible region consists of all points on this boundary as well as the points enclosed by this curve (Berkes & Wiskott, 2005):

σ^{2}_{ρ ,max}(μ_{ρ}) = max xᵀ*C*_{t}x subject to μᵀ_{t}x = μ_{ρ} and xᵀx ≤ *m*^{2},   (5.6)

σ^{2}_{ρ ,min}(μ_{ρ}) = min xᵀ*C*_{t}x subject to μᵀ_{t}x = μ_{ρ} and xᵀx ≤ *m*^{2}.   (5.7)

By solving equations 5.6 and 5.7, we can walk along the curves that define the upper and lower boundaries of the feasible region as a function of μ_{ρ}. To move along these curves, we simply adjust the value μ_{ρ} of the linear constraint. As we walk along these curves, the quadratic constraint ensures that we do not violate the power constraint that defines the stimulus domain.

To eliminate the linear constraint, we rewrite σ^{2}_{ρ} in terms of a modified quadratic expression, equation 5.8. Here we discuss only the most important points regarding equation 5.8; the derivation and definition of the terms are provided in appendix A. The linear term of this modified quadratic expression ensures that its value is independent of the projection of the stimulus on the posterior mean. The constant term ensures that its value equals the value of σ^{2}_{ρ} if we forced the projection of the stimulus on the posterior mean to equal μ_{ρ}. Maximizing and minimizing σ^{2}_{ρ} subject to linear and quadratic constraints is therefore equivalent to maximizing and minimizing this modified quadratic expression with just the quadratic constraint.

To maximize and minimize equation 5.8 subject to the quadratic constraint xᵀx ≤ *m*^{2}, we use the Karush-Kuhn-Tucker (KKT) conditions. For these optimization problems, it can be proved that the KKT conditions are necessary and sufficient (Fortin, 2000). To compute the boundary of the feasible region as a function of μ_{ρ}, we need to solve the KKT conditions for each value of μ_{ρ}. This approach is computationally expensive because for each value of μ_{ρ}, we need to find the value of the Lagrange multiplier by finding the root of a nonlinear function. We have devised a much faster solution based on computing μ_{ρ} as a function of the Lagrange multiplier; the details are in appendix A. This approach is faster because to compute μ_{ρ} as a function of the Lagrange multiplier, we need only find the root of a one-dimensional quadratic expression.
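The flavor of this computation can be seen in a generic solver for maximizing a quadratic over a sphere (our sketch, not the paper's equation A.17; it works in the eigenbasis, bisects on the Lagrange multiplier, and assumes the linear term has a nonzero component along the top eigenvector):

```python
import numpy as np

def max_quadratic_on_sphere(d_eig, b, m, iters=200):
    """Maximize x^T diag(d_eig) x + b^T x subject to ||x||^2 = m^2.

    Stationarity in the eigenbasis gives x_i(lam) = b_i / (2*(lam - d_i))
    for a Lagrange multiplier lam; for the maximizer, lam > max(d_eig) and
    ||x(lam)||^2 decreases monotonically in lam, so we bisect until the
    power constraint is met.
    """
    lo = np.max(d_eig) + 1e-12
    hi = np.max(d_eig) + np.linalg.norm(b) / (2.0 * m) + 1.0  # ||x(hi)|| < m
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        x = b / (2.0 * (lam - d_eig))
        if x @ x > m * m:
            lo = lam   # too much power: increase lam
        else:
            hi = lam
    return x
```

Each evaluation of the constraint sums *d* terms, which is why walking the boundary this way costs only O(*d*) per boundary point once the eigendecomposition is available.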

To solve the KKT conditions, we need the eigendecomposition of *A*. Computing this eigendecomposition is the most expensive operation and, in the worst case, requires *O*(*d*^{3}) operations. *A*, however, is a rank 2 perturbation of *C*_{t}, equation A.11. When these perturbations are orthogonal to some of the eigenvectors of *C*_{t}, we can reduce the number of computations needed to compute the eigendecomposition of *A* by using the Gu-Eisenstat algorithm (Gu & Eisenstat, 1994), as discussed in the next section. The key point is that we can on average compute the eigendecomposition in *O*(*d*^{2}) time.

Solving these optimization problems traces out the boundary of the feasible region. A search over this region then identifies the pair (μ*_{ρ}, σ^{2}*_{ρ}) which maximizes the mutual information, thereby completing step 1 in our program. To finish the program, we need to find an input that maps to this optimal pair. We can easily find one solution by solving a one-dimensional quadratic equation. Let x_{min} and x_{max} denote the inputs corresponding to (μ*_{ρ}, σ^{2}_{ρ min}) and (μ*_{ρ}, σ^{2}_{ρ max}), respectively. These inputs are automatically computed when we compute the boundary of the feasible region. To find a suitable input, we find a linear combination of these two vectors, x = γx_{max} + (1 − γ)x_{min}, that yields σ^{2}*_{ρ}. All such inputs necessarily satisfy the power constraint because the power constraint defines a convex set, and x is a linear combination of two stimuli in this set. Similar reasoning guarantees that x has projection μ*_{ρ} on the posterior mean. Although this input maximizes the mutual information with respect to the full stimulus domain under the power constraint, the solution may not be unique. Finding γ completes the optimization of the input under the power constraint.
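Solving for γ is a one-line quadratic. A sketch with our own naming (`x_min` and `x_max` are assumed to share the optimal projection and to bracket the target variance):

```python
import numpy as np

def interpolate_to_variance(x_min, x_max, C, target):
    """Return x = gamma*x_max + (1-gamma)*x_min with x^T C x = target.

    sigma2_rho along the segment is a quadratic in gamma, so gamma is a
    root of a one-dimensional quadratic; by continuity a root in [0, 1]
    exists whenever target lies between the variances of x_min and x_max.
    """
    d = x_max - x_min
    a = d @ C @ d
    b = 2.0 * (x_min @ C @ d)
    c = x_min @ C @ x_min - target
    roots = np.roots([a, b, c])
    roots = roots[np.isreal(roots)].real
    gamma = roots[(roots >= -1e-12) & (roots <= 1 + 1e-12)][0]
    return gamma * x_max + (1.0 - gamma) * x_min
```

Because γ ∈ [0, 1] and both endpoints share the same projection on the posterior mean, the result automatically satisfies the power constraint and has projection μ*_{ρ}.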

If the mutual information is increasing in σ^{2}_{ρ}, then we need to consider only σ^{2}_{ρ ,max}(μ_{ρ}) for each possible value of μ_{ρ}. Consequently, a one-dimensional search over σ^{2}_{ρ ,max}(μ_{ρ}) is sufficient for finding the optimal input. A sufficient condition for guaranteeing that the mutual information increases with σ^{2}_{ρ} is convexity of the Fisher information in ρ_{t+1} (see appendix B). An important example satisfying this condition is *f*(ρ_{t+1}) = exp(ρ_{t+1}), which satisfies the convexity condition because in this case the Fisher information is itself proportional to exp(ρ_{t+1}), which is convex.

### 5.3. Heuristics for the Power Constraint.

Although we can compute the feasible region when the stimulus domain is defined by a power constraint, efficient heuristics for picking subsets of stimuli are still worth considering. If the size of the candidate set is small enough, then computing (μ_{ρ}, σ^{2}_{ρ}) for each stimulus in the set is usually faster than computing the feasible region for the entire stimulus domain. Since we can set the size of the set to any positive integer, by decreasing the size of the set we can sacrifice accuracy, in terms of finding the optimal stimulus, for speed.

We developed a simple heuristic for constructing finite subsets of the stimulus domain by taking linear combinations of the mean and the maximum eigenvector. To construct a subset of the closed ball of radius *m*, we use the following procedure:

1. Generate a random number, ω, uniformly from the interval [−*m*, *m*], where *m*^{2} is the stimulus power.
2. Generate a random number, ϕ, uniformly from the interval [0, (*m*^{2} − ω^{2})^{1/2}].
3. Add the input ω·(μ_{t}/‖μ_{t}‖) + ϕ·e_{max} to the set, where e_{max} is the maximum eigenvector of *C*_{x,t}.

This procedure tends to produce a set of stimuli that are dispersed throughout the feasible region. By varying the projection of each stimulus along the MAP, the heuristic tries to construct a set of stimuli for which the values of μ_{ρ} are uniformly distributed on the valid interval. Similarly, by varying the projection of each stimulus along the maximum eigenvector, we can adjust the value of σ^{2}_{ρ} for each stimulus. Unfortunately, the subspace of the stimulus domain spanned by the mean and maximum eigenvector may not contain the stimuli that map to the boundaries of the feasible region. Nonetheless, since this heuristic produces stimuli that tend to be dispersed throughout the feasible region, we can usually find a stimulus in the set that is close to being optimal.
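A sketch of such a heuristic (our reading, not a verbatim transcription: each candidate mixes the MAP direction with the top eigenvector orthogonalized against it, which keeps every candidate inside the power constraint):

```python
import numpy as np

def heuristic_candidates(mu, e_max, m, n, rng):
    """Candidate stimuli spread over (mu_rho, sigma2_rho): each candidate
    mixes the MAP direction with the top eigenvector of the covariance.

    omega sets the projection along the MAP and phi the energy along the
    (orthogonalized) max eigenvector, so omega^2 + phi^2 <= m^2 guarantees
    every candidate satisfies the power constraint ||x||^2 <= m^2.
    """
    mu_hat = mu / np.linalg.norm(mu)
    e_perp = e_max - (e_max @ mu_hat) * mu_hat   # orthogonalize against the MAP
    e_perp = e_perp / np.linalg.norm(e_perp)
    omega = rng.uniform(-m, m, size=n)
    phi = rng.uniform(0.0, np.sqrt(m**2 - omega**2))
    return omega[:, None] * mu_hat + phi[:, None] * e_perp
```

For the sphere variant, one would instead set ϕ = (m² − ω²)^{1/2} so that every candidate has power exactly m².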

When the mutual information is increasing with σ^{2}_{ρ}, we can easily improve this heuristic. In this case, the optimal stimulus always lies on the sphere of radius *m*. Therefore, when constructing the stimuli in a finite set, we should pick only stimuli that are on this sphere. To construct such a subset, we use the heuristic above except that we choose ϕ so that each stimulus has power exactly *m*^{2}. Since the mutual information for the exponential Poisson model is increasing with σ^{2}_{ρ}, our simulations for this model always use the sphere-based set rather than the ball-based set.

We could also have constructed subsets of the stimulus domain by uniformly sampling the ball or sphere. Unfortunately, this process produces sets that rarely contain highly informative stimuli, particularly in high dimensions. Since the uniform distribution on the sphere is radially symmetric, a sampled stimulus has mean zero and a diagonal covariance matrix with entries *m*^{2}/*d*. As a result, the variance of μ_{ρ} decreases as 1/*d*, ensuring that for high-dimensional systems, the sampled stimuli have μ_{ρ} close to zero with high probability (see Figure 4). Uniformly sampling the ball or sphere therefore does a poor job of selecting stimuli that are dispersed throughout the feasible region. As a result, the sampled set is unlikely to contain stimuli close to being maximally informative.
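This concentration effect is easy to check numerically (a sketch; the dimension and sample sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def uniform_sphere(d, m, n, rng):
    """Sample n points uniformly from the sphere of radius m in R^d
    by normalizing gaussian draws."""
    g = rng.standard_normal((n, d))
    return m * g / np.linalg.norm(g, axis=1, keepdims=True)

d = 400
mu_hat = np.zeros(d)
mu_hat[0] = 1.0                      # any fixed unit direction
proj = uniform_sphere(d, 1.0, 20000, rng) @ mu_hat
# By symmetry the projections have mean 0 and variance m^2/d = 1/400,
# so mu_rho concentrates near zero in high dimensions.
```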

### 5.4. Simulation Results.

We tested our algorithm using computer simulations that roughly emulated typical neurophysiology experiments. The main conclusion of our simulations is that using our information-maximizing (infomax) design, we can reduce by an order of magnitude the number of trials needed to estimate the parameters (Paninski, 2005). This means we can increase the complexity of neural models without having to increase the number of data points needed to estimate the parameters of these higher-dimensional models. Furthermore, our results show that we can perform the computations fast enough (between 10 ms and 1 s, depending on the dimensionality) that our algorithm could be used online, during an experiment, without requiring expensive or custom hardware.

Our first simulation used our algorithm to learn the receptive field of a visually sensitive neuron, testing the performance of our algorithm in a high-dimensional input space. We took the neuron's receptive field to be a Gabor function, as a proxy model of a V1 simple cell (Ringach, 2002). We generated synthetic responses by sampling equation 2.3 with the parameter vector set to a 40 × 40 Gabor patch. The nonlinearity was the exponential function.

Plots of the posterior means (recall these are equivalent to the MAP estimates) for several designs are shown in Figure 5. The results show that all infomax designs do better than an i.i.d. design, and an infomax design that optimizes over the full domain of the input does much better than choosing the best stimulus in a subset constructed by uniformly sampling the sphere.

The results in Figures 5 and 6 show that if we choose the optimal stimulus from a finite set, then intelligently constructing the set is critical to achieving good performance. We compared two approaches for creating the set when the domain is defined by a power constraint. The first approach selected a single fixed set of stimuli by uniformly sampling the sphere. The second approach constructed a new set for each trial using the heuristic presented in section 5.3. Picking the optimal stimulus in the heuristic set produced much better estimates of the parameters than picking the optimal stimulus in the fixed set. In particular, the design using the heuristic set converged nearly as fast as the design that optimized over the full stimulus domain. These results show that constructing a new set on each trial is more efficient than reusing the same set of stimuli for all trials. To achieve comparable results using a fixed set, we would have to increase the number of stimuli by several orders of magnitude. Consequently, the added cost of constructing a new stimulus set after each trial is more than offset by our ability to use far fewer stimuli than a constant set would require.

We also compared the infomax designs to the limiting cases where we put all stimulus energy along the mean or the maximum eigenvector (see Figures 5 and 6). Putting all energy along the maximum eigenvector performs nearly as well as an i.i.d. design. Our update, equation 3.12, ensures that if the stimulus is an eigenvector of *C*_{t}, the updated covariance matrix is the result of shrinking the eigenvalue corresponding to that eigenvector. Consequently, setting the stimulus to the maximum eigenvector ends up scanning through the different eigenvectors on successive trials. The resulting sequence of stimuli is statistically similar to that of an i.i.d. design because the stimuli are highly uncorrelated with each other and with the parameter vector. As a result, both methods generate similar marginal distributions of μ_{ρ}, with sharp peaks at 0. Since the Fisher information of a stimulus under the power constraint varies only with (μ_{ρ}, σ^{2}_{ρ}), both methods pick stimuli that are roughly equally informative. Consequently, both designs end up shrinking the posterior entropy at similar rates.

In contrast, making the stimulus on each trial parallel to the mean leads to a much slower initial decrease of the posterior entropy. Since our initial guess of the mean is highly inaccurate, μ_{ρ} is close to zero, resulting in a small value of the Fisher information. Furthermore, sequential stimuli end up being highly correlated. As a result, we converge very slowly to the true parameters.

We also evaluated a design that used sinusoidal gratings as the stimuli. In Figure 5, this design produces an estimate that already shows the basic inhibitory and excitatory pattern of the receptive field after just 1000 trials. However, on the remaining trials the estimate improves very little. Figure 6 shows that this design decreases the entropy at roughly the same rate as the i.i.d. design. The reason the coarse structure of the receptive field appears after so few trials is that the stimuli have a large amount of spatial correlation. This spatial correlation among the stimuli induces a similar correlation among the components of the MAP, which explains why the coarse inhibitory and excitatory pattern of the receptive field appears so quickly. However, it also makes it difficult to estimate the higher-resolution features of the receptive field, which is why the estimate does not improve much between 1000 and 5000 trials.

Paninski (2005) obtained results similar to those in Figure 5 using a brute-force computation and optimization of the mutual information. That computation was possible only because the receptive field was assumed to be a Gabor function specified by just three parameters (the 2D location of its center and its orientation). Similarly, the stimuli were constrained to be Gabor functions. Our simulations did not assume that the receptive field or the stimuli were Gabor functions; the stimulus could have been any 40 × 40 image with power *m*^{2}. Attempting to use brute force in this high-dimensional space would have been hopeless. Our results show that a sequential optimal design allows us to perform system identification in high-dimensional spaces that might otherwise be tractable only by making strong assumptions about the system.

The fact that we can pick the stimulus to increase the information about the parameters that determine the dependence of the firing rate on the stimulus is unsurprising. Since we are free to pick any stimulus, by choosing an appropriate stimulus we can distinguish among different values of these parameters. Our GLM, however, can also include spike history terms. Since we cannot fully control the spike history, a reasonable question is whether infomax can improve our estimates of the spike history coefficients. Figure 7 shows the results of a simulation characterizing the receptive field of a neuron whose response depends on its past spiking. The unknown parameter vector consists of the stimulus coefficients, which were a 1D Gabor function, and the spike history coefficients, which were inhibitory and followed an exponential function. The nonlinearity was the exponential function.

The results in Figure 7 show that an infomax design leads to better estimates of both the stimulus and spike history coefficients. Figure 7 shows the MAPs of both methods on different trials, as well as the mean squared error (MSE) on all trials. In Figure 7, the MSE increases on roughly the first 100 trials because the mean of the prior is zero. The data collected on these early trials tend to increase the magnitude of the estimated parameters. Since the true direction of the parameter vector is still largely unknown, this increase in magnitude tends to increase the MSE.

By converging more rapidly to the stimulus coefficients, the infomax design produces a better estimate of how much of the response is due to the stimulus, which in turn leads to better estimates of the spike history coefficients. The size of this effect is measured by the correlation between the two sets of coefficients, which is given by *C*_{x,f} in equation A.3. Consider a simple example in which the first entry of *C*_{x,f} is negative and the remaining entries are zero. In this example, the first stimulus coefficient and the first spike history coefficient would be anticorrelated. This value of *C*_{x,f} roughly means that the log posterior remains relatively constant if we increase the first stimulus coefficient while decreasing the first spike history coefficient. If we knew the value of the first stimulus coefficient, then we would know where along this line of equal probability the true parameters were located. As a result, increasing our knowledge about the stimulus coefficients also reduces our uncertainty about the spike history coefficients.

#### 5.4.1. Running Time.

Our algorithm is suited to high-dimensional, real-time applications because it reduces the exponential complexity of choosing the optimal design to on average quadratic and at worst cubic running time. We verified this claim empirically by measuring the running time of each step of the algorithm as a function of the dimensionality of the parameter space (see Figure 8a).^{3} These simulations used a GLM with an exponential link function. This nonlinearity leads to a special case of our algorithm because we can derive an analytical approximation of our objective function, equation 4.20, and only a one-dimensional search along the upper boundary of the feasible region is required to find the optimal input. These properties simplify the implementation but do not affect the complexity of the algorithm with respect to *d*. Using a lookup table instead of an analytical expression to estimate the mutual information as a function of (μ_{ρ}, σ^{2}_{ρ}) would not change the running time with respect to *d* because the feasible region is always two-dimensional. Similarly, the increased cost of a full 2D search compared to a 1D search is independent of *d*.

The main conclusion of Figure 8a is that the complexity of our algorithm on average grows quadratically with the dimensionality. The solid black line shows a polynomial of degree 2 fitted to the total running time. We also measured the running time of the four steps that make up our algorithm: (1) updating the posterior, (2) computing the eigendecomposition of the covariance matrix, (3) modifying the quadratic form for σ^{2}_{ρ} to eliminate the linear constraint (i.e., finding the eigendecomposition of *A* in equation A.11), and (4) finding the optimal stimulus. The solid lines indicate fitted polynomials of degree 1 for optimizing the stimulus and degree 2 for the remaining curves. Optimizing the stimulus entails searching along the upper boundary of the feasible region for the optimal pair (μ*_{ρ}, σ^{2}*_{ρ}) and then finding an input that maps to that pair. The running time of these operations scales as *O*(*d*) because computing σ^{2}_{ρ ,max} as a function of λ requires summing *d* terms, equation A.17. When the stimulus was 100-dimensional, the total running time was about 10 ms, which is within the range of tolerable latencies for many experiments. Consequently, these results support our conclusion that our algorithm can be used in high-dimensional, real-time applications.

In the worst case, computing the eigendecomposition requires *O*(*d*^{3}) operations. Figure 8a, however, shows that the average running time of the eigendecomposition grows only quadratically with the dimensionality. The average running time grows as *O*(*d*^{2}) because most of the eigenvectors remain unchanged after each trial. The covariance matrix after each trial is a rank 1 perturbation of the covariance matrix from the previous trial, and every eigenvector orthogonal to the perturbation remains unchanged. A rank 1 update can be written as *M*′ = *M* + αvvᵀ, where *M* and *M*′ are the old and perturbed matrices, respectively. Clearly, any eigenvector, u, of *M* orthogonal to the perturbation, v, is also an eigenvector of *M*′ because *M*′u = *M*u + αv(vᵀu) = *M*u = *c*u, where *c* is the eigenvalue corresponding to u.

If the perturbation leaves most of our eigenvectors and eigenvalues unchanged, then we can use the Gu-Eisenstat algorithm to compute fewer than *d* eigenvalues and eigenvectors, thereby achieving on average quadratic running time (Gu & Eisenstat, 1994; Demmel, 1997; Seeger, 2007). Asymptotically, we can prove that the perturbation is correlated with at most two eigenvectors (see section 7). Consequently, asymptotically we need to compute at most two new eigenvectors on each trial. These asymptotic results, however, are not as relevant for the actual running time as empirical results. In Figure 8b, we plot, for one simulation, the number of eigenvectors that are perturbed by the rank 1 modification. On most trials, fewer than *d* eigenvectors are perturbed by the update. These results rely to some extent on the fact that our prior covariance matrix was white and hence had only one distinct eigenvalue. On each subsequent iteration, we can reduce the multiplicity of this eigenvalue by at most one. Our choice of prior covariance matrix therefore helps us manage the complexity of the eigendecomposition.
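The invariance that makes the low-rank update cheap is easy to verify numerically (a sketch with an arbitrary 5 × 5 example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Build a symmetric matrix with a known eigenbasis Q and eigenvalues 5..1.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
eigvals = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
M = Q @ np.diag(eigvals) @ Q.T

# Rank 1 perturbation along the top eigenvector only.
v = Q[:, 0]
M2 = M + 0.5 * np.outer(v, v)
# Every eigenvector orthogonal to v is untouched (same eigenvalue), and the
# perturbed eigenvalue shifts from 5.0 to 5.5; only the part of the spectrum
# correlated with the perturbation must be recomputed, which is what the
# Gu-Eisenstat algorithm exploits.
```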

## 6. Important Extensions

In this section we consider two extensions of the basic GLM that expand the range of neurophysiology experiments to which we can apply our algorithm: handling nonlinear transformations of the input and dealing with time-varying parameters. In both cases, our method for picking the optimal stimulus from a finite set requires only slight modifications. Unfortunately, our procedure for picking the stimulus under a power constraint will not work if the input is pushed through a nonlinearity.

### 6.1. Input Nonlinearities.

To model input nonlinearities, we pass the stimulus through a fixed set of functions, *W*_{i}, *i* = 1, …, *n*. These functions map the stimulus into feature space, a simple example being the case where the functions *W*_{i} represent a filter bank. Here *n* denotes the number of nonlinear basis functions used to transform the input. For convenience, we denote the output of these transformations as the feature vector. As before, our objective is picking the stimulus that maximizes the mutual information about the parameters. For simplicity, we have assumed that the response does not depend on past stimuli, but this assumption could easily be dropped.

NLN models are frequently used to explain how sensory systems process information. In vision, for example, MT cells can be modeled as a GLM whose input is the output of a population of V1 cells (Rust et al., 2006). In this model, V1 is modeled as a population of tuning curves whose output is divisively normalized. Similarly, in audition, cochlear processing is often represented as a spectral decomposition using gammatone filters (de Boer & de Jongh, 1978; Patterson et al., 1992; Lewicki, 2002; Smith & Lewicki, 2006). NLN models can be used to model this spectral decomposition of the auditory input, as well as the subsequent integration of information across frequency (Gollisch, Schutze, Benda, & Herz, 2002). One of the most important NLN models in neuroscience is the energy model. In vision, energy models are used to explain the spatial invariance of complex cells in V1 (Adelson & Bergen, 1985; Dayan & Abbot, 2001). In audition, energy models are used to explain frequency integration and phase insensitivity in auditory processing (Gollisch et al., 2002; Carlyon & Shamma, 2003).

The energy model can be expressed as a quadratic form in the stimulus defined by a matrix *Q*. The right-hand side of this expression has more degrees of freedom than our original energy model unless we restrict each *Q*^{i} to be a rank 1 matrix. Letting *Q* = ∑_{i} *Q*^{i}, we can write the energy model as a sum of the coefficients *Q*_{i,j} multiplied by the products *x*_{i,t}*x*_{j,t}, where *x*_{i,t} denotes the *i*th component of the stimulus. This model is linear in the matrix coefficients *Q*_{i,j} and the products of the stimulus components *x*_{i,t}*x*_{j,t}. To obtain a GLM, we use the input nonlinearity to map the stimulus to the vector [*x*_{1,t}*x*_{1,t}, …, *x*_{i,t}*x*_{j,t}, …]^{T}. The parameter vector for the energy model is the matrix *Q* rearranged as a vector, which acts on feature space, not stimulus space.

Using the functions *W*_{i} to project the input into feature space does not affect our strategy for picking the optimal stimulus from a finite set. We simply have to compute the feature vector for each stimulus before projecting it and computing the mutual information. Our solution for optimizing the stimulus under a power constraint, however, no longer works, for two reasons. First, a power constraint on the stimulus does not in general translate into a power constraint on the feature vector. As a result, we cannot use the algorithm of section 5.2 to find the optimal feature values. Second, even if we could find the optimal feature values, we would need to invert the input nonlinearity to find the actual stimulus. For many nonlinearities, the energy model being one example, this mapping is not invertible.

To estimate the parameters of an energy model, we use our existing update method to construct a gaussian approximation of the posterior in feature space. We can then use the MAP to estimate the input filters. The first step is to rearrange the terms of the posterior mean as a matrix. We then estimate the input filters by computing the singular value decomposition (SVD) of this matrix. If the estimate converges to the true value, then the subspace corresponding to its nonzero singular values should equal the subspace spanned by the true filters.
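A small numerical check of this recovery step (a sketch; the 20-dimensional orthonormal vectors here are stand-ins for the true filters):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two orthonormal "input filters" in 20 dimensions.
W = np.linalg.qr(rng.standard_normal((20, 2)))[0]

# If the posterior mean converges, the rearranged matrix approaches
# Q = w1 w1^T + w2 w2^T (symmetric, rank 2).
Q_hat = W @ W.T

U, s, _ = np.linalg.svd(Q_hat)
# The number of nonzero singular values reveals the number of filters, and
# the top singular subspace equals the span of the true filters.
P_svd = U[:, :2] @ U[:, :2].T
P_true = W @ W.T
```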

Here *x*_{i,t} is the *i*th component of the stimulus. The response in this example has no dependence on past responses; hence, we do not need to sum over past responses to compute μ_{ρ}. The matrix used by the heuristic is just the MAP rearranged as a matrix. We construct each stimulus in the candidate set as follows:

1. We randomly pick an eigenvector of this matrix, with the probability of picking each eigenvector being proportional to the relative energy of the corresponding eigenvalue.
2. We pick a random number, ω, by uniformly sampling the interval [−*m*, *m*], where *m*^{2} is the maximum allowed stimulus power.
3. We choose a direction orthogonal to the chosen eigenvector by uniformly sampling the *d* − 1 unit sphere orthogonal to it.
4. We add to the set the stimulus formed by combining ω times the chosen eigenvector with the orthogonal direction, scaled so that the stimulus has power *m*^{2}.

This heuristic works because for the energy model, ρ_{t+1} measures the energy of the stimulus in feature space. For this model, feature space is defined by the eigenvectors of *Q*. Naturally, if we want to increase ρ_{t+1}, we should increase the energy of the stimulus along one of the basis vectors of feature space. The eigenvectors of the estimated matrix are our best estimate of the basis vectors of feature space. Hence, μ_{ρ}, the expected value of ρ_{t+1}, varies linearly with the energy of the input along each eigenvector of the estimated matrix, equation 6.7.

The effectiveness of our heuristic is illustrated in Figure 10. This figure illustrates the mapping of stimuli into (μ_{ρ}, σ^{2}_{ρ}) space for stimulus sets constructed using our heuristic and for stimulus sets produced by uniformly sampling the sphere. Our heuristic produces a set of stimuli that is more spread out over the range of μ_{ρ}. As a result, the heuristic set contains more informative stimuli than the uniformly sampled set.

#### 6.1.1. Auditory Simulation.

We applied these estimation and optimization procedures to a simulation of an auditory neuron. We modeled the neuron using an energy model. For simplicity, our hypothetical neuron received input from just two neurons in earlier layers. We modeled these input neurons as gammatone filters that were identical except for a 90 degree difference in phase (de Boer & de Jongh, 1978; Patterson et al., 1992). We generated spikes by sampling a conditional Poisson process whose instantaneous, conditional firing rate was set by equation 6.2, with the two input filters being the gammatone filters and *f*(ρ_{t+1}) = exp(ρ_{t+1}). We estimated the parameters, *Q*, using an i.i.d. design and two infomax designs. The i.i.d. design uniformly sampled the stimulus from the sphere of radius *m*. The two infomax designs picked the optimal stimulus in a subset of stimuli drawn from the sphere. In one case, this set was constructed using our heuristic, while in the other case, it was constructed by uniformly sampling the sphere.

The results of our simulations are shown in Figure 11. When finding the MAP, we restricted the parameters such that the corresponding matrices were symmetric but not necessarily rank 2. The rank 2 restriction is unnecessary because the number of linear filters can be recovered from the number of nonzero singular values of the estimated matrix. To show how well the true gammatone filters can be estimated from the principal components of the estimated matrix, we show in Figure 11 the reconstruction of the two filters using the first two principal components, that is, the linear combination of the projections of each filter along the first two principal components.

Figures 11 and 12 show that when the optimal stimulus in the heuristic set is picked, the MAP converges more rapidly to the true gammatone filters. In Figure 11, the design that uses pure tones as the inputs appears to produce good estimates of the filters. These results, however, are somewhat misleading. Since the inputs are restricted to tones, the inputs that cause the neuron to fire are highly correlated. As a result, the estimated receptive field is biased by the correlations in the input. Since gammatone filters are similar to sine waves, in some sense this bias means that using pure tones will rapidly produce a coarse estimate of the gammatone filters. However, since the pure tones are highly correlated, it is difficult to remove these correlations from the estimated receptive field and resolve the finer structure of the filters. This behavior is evident in Figure 12, which shows that after 1000 trials, the MSE for the pure tones design does not decrease as fast as for the alternative designs.

Also evident in the infomax results is the exploitation-exploration trade-off (Kaelbling, Littman, & Moore, 1996). To increase the information about one of the filters, we need to pick stimuli that are correlated with that filter. Since the input filters are orthogonal and the stimulus power is constrained, we can effectively probe only one filter at a time. This trade-off explains why on trials 100 to 500, the estimate of the first filter improves much more than that of the second filter. On these trials, the algorithm exploits its knowledge of the first filter rather than searching for other filters. After roughly 500 trials, exploring becomes more rewarding than exploiting our estimate of the first filter. Hence, the infomax design picks stimuli orthogonal to the first gammatone filter, which eventually leads to finding the second filter.

### 6.2. Time-Varying Parameters.

To update the posterior and choose the optimal stimulus, we use the procedures described in sections 3 and 5. The only difference due to time-varying parameters is that the covariance matrix on the next trial is in general no longer just a rank 1 modification of the covariance matrix on the current trial. Therefore, we cannot use the rank 1 update to compute the eigendecomposition. However, since we may not have any a priori knowledge about the direction of changes in the parameters, it is often reasonable to assume the change has mean zero and white covariance matrix, Π = *cI*. In this case, the eigenvectors of *C*_{t} + Π are those of *C*_{t}, and the eigenvalues are *c*_{i} + *c*, where *c*_{i} is the *i*th eigenvalue of *C*_{t}; our methods may therefore be applied without modification. In cases where we expect the parameters to vary systematically, we could try to model those dynamics more accurately by selecting an appropriate mean and covariance matrix for the change.
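The eigenvalue shift is immediate to verify (a sketch with an arbitrary 4 × 4 covariance):

```python
import numpy as np

rng = np.random.default_rng(4)

# Posterior covariance with a known eigenbasis.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
eigvals = np.array([4.0, 3.0, 2.0, 1.0])
C = Q @ np.diag(eigvals) @ Q.T

# White drift covariance Pi = c*I: adding it shifts every eigenvalue by c
# but leaves the eigenvectors unchanged, so no new eigendecomposition is needed.
c = 0.1
C_drift = C + c * np.eye(4)
```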

Figure 13 shows the results of using an infomax design to fit a GLM to a neuron whose receptive field drifts nonsystematically with time. The receptive field was a one-dimensional Gabor function whose center moved according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). Although only the center of the receptive field moved, we still modeled changes in the parameters using equation 6.10. The results demonstrate the benefits of using an infomax design to estimate a time-varying receptive field. Although we cannot reduce our uncertainty below a level determined by Π, the infomax design can still improve our estimate compared to using random stimuli.