Abstract

Adaptively optimizing experiments has the potential to significantly reduce the number of trials needed to build parametric statistical models of neural systems. However, application of adaptive methods to neurophysiology has been limited by severe computational challenges. Since most neurons are high-dimensional systems, optimizing neurophysiology experiments requires evaluating high-dimensional integrals and solving high-dimensional optimization problems in real time. Here we present a fast algorithm for choosing the most informative stimulus by maximizing the mutual information between the data and the unknown parameters of a generalized linear model (GLM) that we want to fit to the neuron's activity. We rely on important log-concavity and asymptotic normality properties of the posterior to facilitate the required computations. Our algorithm requires only low-rank matrix manipulations and a two-dimensional search to choose the optimal stimulus. The average running time of these operations scales quadratically with the dimensionality of the GLM, making real-time adaptive experimental design feasible even for high-dimensional stimulus and parameter spaces. For example, we require roughly 10 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Despite the approximations required for efficiency, our algorithm asymptotically decreases the uncertainty about the model parameters at a rate equal to the maximum rate predicted by an asymptotic analysis. Simulation results show that picking stimuli by maximizing the mutual information can speed up convergence to the optimal values of the parameters by an order of magnitude compared to using random (nonadaptive) stimuli.
Finally, applying our design procedure to real neurophysiology experiments requires addressing the nonstationarities that we would expect to see in neural responses; our algorithm can efficiently handle both fast adaptation due to spike history effects and slow, nonsystematic drifts in a neuron's activity.

1.  Introduction

In most neurophysiology experiments, data are collected according to a design that is finalized before the experiment begins. During the experiment, the data already collected are rarely analyzed to evaluate the quality of the design. These data, however, often contain information that could be used to redesign experiments to better test hypotheses (Fedorov, 1972; Chaloner & Verdinelli, 1995; Kontsevich & Tyler, 1999; Warmuth et al., 2003; Roy, Ghosal, & Rosenberger, in press). Adaptive experimental designs are particularly valuable in domains where data are expensive or limited. In neuroscience, experiments often require training and caring for animals, which can be time-consuming and costly. As a result of these costs, neuroscientists are often unable to conduct large numbers of trials using different subjects. The inability to collect enough data makes it difficult for them to investigate high-dimensional, complex neural systems. By using adaptive experimental designs, neuroscientists could potentially collect data more efficiently. In this article, we develop an efficient algorithm for optimally adapting the experimental design in one class of neurophysiology experiments.

A central question in neuroscience is understanding how neural systems respond to different inputs. For sensory neurons, the input might be sounds or images transduced by the organism's receptors. More generally, the stimulus could be a chemical or electrical signal applied directly to the neuron. Neurons often respond nonlinearly to these stimuli because their activity will typically adapt or saturate. We can model these nonlinearities by viewing a neuron's firing rate as a variable dependent on its past activity in addition to recent stimuli. To model the dependence on past stimuli and responses, we define the input as a vector, s_t, comprising the current and recent stimuli, x_t, x_{t−1}, …, as well as the neuron's recent activity, r_{t−1}, r_{t−2}, … (Keat, Reinagel, Reid, & Meister, 2001; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005). x_t and r_t denote the stimulus and firing rate at time t, respectively. When we optimize the input for time t + 1, we can control only the current stimulus, x_{t+1}, as the rest of the components of the input (i.e., past stimuli and responses) are fixed. To distinguish the controllable and fixed components of the input, we use the subscripts x and f:
s_t = [x_t; f_t]    (1.1)

x_t = the stimulus at time t (the component we control)    (1.2)

f_t = [x_{t−1}; … ; x_{t−t_k}; r_{t−1}; … ; r_{t−t_a}]    (1.3)
s_t is the input at time t. f_t is a vector comprising the past stimuli and responses on which the response at time t depends. t_k and t_a specify how far back in time the dependence on the stimuli and responses stretches (i.e., if t_k = 0 and t_a = 0, then s_t = x_t). Not all models will include a dependence on past stimuli or responses; the values of t_k and t_a will depend on the model adopted for a particular experiment.
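As a concrete illustration, the assembly of the input vector from the recent stimulus and response history can be sketched in Python. This is a minimal sketch; the function name `build_input` and the list-based history buffers are our own illustrative choices, not part of the original presentation.

```python
import numpy as np

def build_input(stim_history, resp_history, t_k, t_a):
    """Assemble s_t = [x_t; f_t]: x_t is the current stimulus, and f_t holds
    the t_k most recent past stimuli and the t_a most recent responses.
    Histories are lists with the most recent entry last."""
    x_t = np.asarray(stim_history[-1], dtype=float)
    past_stim = stim_history[-1 - t_k:-1] if t_k > 0 else []
    past_resp = resp_history[-t_a:] if t_a > 0 else []
    parts = [np.asarray(s, dtype=float) for s in past_stim]
    parts.append(np.asarray(past_resp, dtype=float))
    f_t = np.concatenate(parts) if parts else np.empty(0)
    return np.concatenate([x_t, f_t])  # if t_k = t_a = 0, s_t = x_t
```

Note that when t_k = t_a = 0 the fixed component f_t is empty and the input reduces to the current stimulus alone, matching the special case above.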

We can describe a model that incorporates all of these features by specifying the conditional distribution of the responses given the input. This distribution gives the probability of observing response r_t at time t given the input s_t. We use a distribution, as opposed to a deterministic function, to specify the relationship between r_t and s_t because a neuron's response varies across repeated presentations of a stimulus. To simplify the model, we restrict our consideration to parametric distributions that lie in some space Θ. Each vector θ ∈ Θ denotes a particular model in this space. To fit a model to a neuron, we need to find the best value of θ.

We estimate θ by observing the neuron's response to various stimuli. For these experiments, the design is a procedure for picking the stimulus on each trial. The design can be specified as a probability distribution, p(x_t), from which we sample the stimulus on each trial. Nonrandom designs can be specified by putting all the probability mass on a single stimulus. A sequential design modifies this distribution after each observation. In contrast, the standard nonsequential approach is to fix this distribution before the experiment starts and then select the stimulus on each trial by drawing independent and identically distributed (i.i.d.) samples from p(x_t). Figure 1 provides a schematic of the sequential approach we want to implement, as well as a diagram of the typical i.i.d. design.

Figure 1:

(a) Schematic of the process for designing information-maximizing (infomax) experiments. Stimuli are chosen by maximizing the mutual information between the data and the parameters. Since the mutual information depends on the posterior distribution on θ, the infomax algorithm updates the posterior after each trial. (b) Schematic of the typical independent and identically distributed (i.i.d.) design of experiments. Stimuli are selected by drawing i.i.d. samples from a distribution that is chosen before the experiment starts. An i.i.d. design does not use the posterior distribution to choose stimuli.

We want to design our experiments to facilitate identification of the best model in Θ. Based on this objective, we define the optimal design for each trial as the design that provides the most information about θ. A natural metric for the informativeness of a design is the mutual information between the data and the model (Lindley, 1956; Bernardo, 1979; Watson & Pelli, 1983; Cover & Thomas, 1991; MacKay, 1992; Chaloner & Verdinelli, 1995; Paninski, 2005),
I({s_t, r_t}; θ) = ∫∫ p({s_t, r_t}, θ) log [ p({s_t, r_t}, θ) / ( p({s_t, r_t}) p(θ) ) ] d{s_t, r_t} dθ.    (1.4)
The mutual information measures how much we expect the experimental data to reduce our uncertainty about θ. The mutual information is a function of the design because it depends on the joint probability of the data, p({s_t, r_t}), which obviously depends on how we pick the stimuli. We can determine the optimal design by maximizing the mutual information with respect to the marginal distribution p(x_t).

Designing experiments by maximizing the mutual information is computationally challenging. The information we expect to gain from an experiment depends on what we have already learned from past observations. To extract the information from past observations, we need to compute the posterior distribution after each trial. Once we have updated the posterior, we need to use it to compute the expected information gain from future experiments; this requires a high-dimensional integration over the space Θ. Maximizing this integral with respect to the design requires a nonlinear search over the high-dimensional stimulus space. In sensory neurophysiology, the stimulus space is high-dimensional because the stimuli tend to be complex, spatiotemporal signals like movies and sounds. The challenge of evaluating this high-dimensional integral and solving the resulting nonlinear optimization has impeded the application of adaptive experimental design to neurophysiology. In the worst case, the complexity of these operations will grow exponentially with the dimensionality of x_t and θ. For even moderately sized spaces, direct computation will therefore be intractable, particularly if we wish to adapt the design in a real-time application.

The main contribution of this article is to show how these computations can be performed efficiently when Θ is the space of generalized linear models (GLM) and the posterior distribution on θ is approximated as a gaussian. Our solution depends on some important log-concavity and rank-one properties of our model. These properties justify the gaussian approximation of the posterior distribution and permit a rapid update after each trial. These properties also allow optimization of the mutual information to be approximated by a tractable two-dimensional problem that can be solved numerically. The solution to this 2D optimization problem depends on the stimulus domain. When the stimulus domain is defined by a power constraint, we can easily find the nearly optimal design. For arbitrary stimulus domains, we present a general algorithm for selecting the optimal stimulus from a finite subset of stimuli in the domain. Our analysis leads to efficient heuristics for constructing this subset to ensure the resulting design is close to the optimal design.

Our algorithm facilitates estimation of high-dimensional systems because picking more informative designs leads to faster convergence to the best model of the neuron. In our simulations (see section 5.4), the optimal design converges more than an order of magnitude faster than an i.i.d. design. Our algorithm can be applied to high-dimensional, real-time applications because it reduces the complexity with respect to dimensionality from exponential to, on average, quadratic running time.

This article is organized as follows. In section 2, we present the GLM of neural systems. In section 3, we present an online method for computing a gaussian approximation of the posterior distribution on the GLM's parameters. In section 4, we show how the mutual information can be approximated by a much simpler, low-dimensional function. In section 5, we present the procedure for picking optimal stimuli and show some simulation results. In section 6, we generalize our basic methods to some important extensions of the GLM needed to handle more complicated experiments. In section 7, we show that our algorithm asymptotically decreases the uncertainty about θ at a rate nearly equal to the optimal rate predicted by a general theorem on the rate of convergence of information-maximizing designs (Paninski, 2005). We therefore conclude that this efficient (albeit approximate) implementation produces designs that are in fact asymptotically optimal. Simulations investigating the issue of model misspecification are presented in section 8. Finally, we discuss some limitations and directions for future work in section 9. To help the reader, we summarize in Table 1 the notation that we will use in the rest of the article.

Table 1:
Definitions of Symbols and Conventions Used Throughout the Article.

2.  The Parametric Model

For the model space, Θ, we choose the set of generalized linear models (GLM) (see Figure 2). The GLM is a tractable and flexible parametric family that has proven useful in neurophysiology (McCullagh & Nelder, 1989; Simoncelli, Paninski, Pillow, & Schwartz, 2004; Paninski, 2004; Truccolo et al., 2005; Paninski, Pillow, & Lewi, 2007). GLMs are fairly natural from a physiological point of view, with close connections to biophysical models such as the integrate-and-fire cell. Consequently, they have been applied in a wide variety of experimental settings (Brillinger, 1988, 1992; Chichilnisky, 2001; Theunissen et al., 2001; Paninski, Shoham, Fellows, Hatsopoulos, & Donoghue, 2004).

Figure 2:

Diagram of a generalized linear model of a neuron. A GLM consists of a linear filter followed by a static nonlinearity. The output of this cascade is the estimated, instantaneous firing rate of a neuron. The unknown parameters are the linear filters applied to the stimulus and spike history.

A GLM represents a spiking neuron as a point process. The likelihood of the response (the number of spikes) depends on the firing rate, λ_t, which is a nonlinear function of the input,
λ_t = E(r_t | s_t, θ) = f(θ^T s_t).    (2.1)
As noted earlier, the response at time t depends on the current stimulus, x_t, as well as past stimuli and responses. The inclusion of spike history in the input means we can account for refractory effects, burstiness, and firing-rate adaptation (Berry & Meister, 1998; Keat et al., 2001; Paninski, 2004; Truccolo et al., 2005). As before, we use subscripts to distinguish the components that we can control from those that are fixed (see Table 1).
The parameters of the GLM are the coefficients of the filter, θ, applied to the input. θ can be separated into two filters, θ_x and θ_f, which are applied to the variable and fixed components of the input, respectively. After filtering the input by θ, the output of the filter is pushed through a static nonlinearity, f(), known as the link function. The input-output relationship of the neuron is fully specified by the log likelihood of the response given the input and θ,
log p(r_t | s_t, θ) = r_t log(λ_t dt) − λ_t dt − log(r_t!)    (2.2)

= r_t log f(θ^T s_t) − f(θ^T s_t) dt + const.    (2.3)
dt is the length of the time window over which we measure the firing rate, r_t. The constant term is constant with respect to θ but not r_t. In this article, we always use a Poisson distribution for the conditional likelihood, p(r_t | s_t, θ), because it is a natural model for spike counts. However, by making some minor modifications to our algorithm, we can use it with other distributions in the exponential family (Lewi, Butera, & Paninski, 2007).
To ensure that the maximum a posteriori (MAP) estimate of θ is unique, we restrict the GLM so that the log likelihood is always concave. When p(r_t | s_t, θ) is a Poisson distribution, a sufficient condition for concavity of the log likelihood is that the nonlinearity f() is a convex and log-concave function (Wedderburn, 1976; Haberman, 1977; McCullagh & Nelder, 1989; Paninski, 2004). f() can be convex and log concave only if its contours are linear. When the contours are linear, we can, without loss of generality, assume that f() is a function of a scalar variable, ρ_t. ρ_t is the result of applying the linear filter of the GLM to the input,
ρ_t = θ^T s_t.    (2.4)
Since ρ_t is a scalar, θ must be a vector and not a matrix. Convexity of f() also guarantees that the nonlinearity is monotonic. Since we can always multiply θ by −1 (i.e., flip our coordinate system), we can without loss of generality assume that f is increasing. Furthermore, we assume f() is known, although this condition could potentially be relaxed. Knowing f() exactly is not essential because previous work (Li & Duan, 1989; Paninski, 2004) and our own results (see section 8) indicate that the parameters of a GLM can often be estimated, at least up to a scaling factor, even if the link function is incorrect.
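To make the model concrete, the conditional intensity and Poisson log likelihood of equations 2.1 through 2.3 can be sketched in a few lines. We assume here the common exponential link, f(ρ) = exp(ρ), which is convex and log-concave; the function names are our own illustrative choices.

```python
import numpy as np
from math import lgamma

def glm_rate(theta, s):
    """Conditional intensity lambda_t = f(theta . s_t), with the
    exponential link f(rho) = exp(rho)."""
    return np.exp(theta @ s)

def poisson_loglik(r, theta, s, dt=0.01):
    """log p(r | s, theta) = r*log(lambda*dt) - lambda*dt - log(r!)."""
    lam = glm_rate(theta, s)
    return r * np.log(lam * dt) - lam * dt - lgamma(r + 1)
```

With theta = 0 the intensity is 1 spike per unit time, so an observed count of zero in a window of length dt has log likelihood exactly −dt.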

3.  Representing and Updating the Posterior

Our first computational challenge is representing and updating the posterior distribution on the parameters, p(θ | s_{1:t}, r_{1:t}). We use a fast, sequential procedure for constructing a gaussian approximation of this posterior (see Figure 3). This gaussian approximation leads to an update that is both efficient and accurate enough to be used online for picking optimal stimuli.

Figure 3:

Schematic illustrating the procedure for recursively constructing the gaussian approximation of the true posterior. The images are contour plots of the log prior, log likelihoods, log posterior, and log of the gaussian approximation of the posterior (see text for details). The key point is that since the log likelihood is one-dimensional with respect to θ, when we approximate the log posterior at time t using our gaussian approximation at time t − 1, we need to do only a one-dimensional search to find the peak of the log posterior at time t. The gray and black dots in the figure illustrate the locations of μ_{t−1} and μ_t, respectively.

A gaussian approximation of the posterior is justified by the fact that the posterior is the product of two smooth, log-concave terms—the GLM likelihood function and the prior (which we assume to be gaussian, for simplicity). As a result, the log posterior is concave (i.e., it always curves downward) and can be well approximated by the quadratic expression for the log of a gaussian. Furthermore, the main result of Paninski (2005) is a central limit-like theorem for optimal experiments based on maximizing the mutual information. This theorem guarantees that asymptotically, the gaussian approximation of the posterior will be accurate.

We recursively construct a gaussian approximation to the posterior by using the approximation from the previous trial (see Figure 3). Since the gaussian approximation of the posterior at time t − 1, N(θ; μ_{t−1}, C_{t−1}), summarizes the information in the first t − 1 trials, we can use this distribution to approximate the log posterior after the tth trial,
log p(θ | s_{1:t}, r_{1:t}) = log p(θ | s_{1:t−1}, r_{1:t−1}) + log p(r_t | s_t, θ) + const    (3.1)

≈ log N(θ; μ_{t−1}, C_{t−1}) + log p(r_t | s_t, θ) + const    (3.2)

= −(1/2)(θ − μ_{t−1})^T C_{t−1}^{−1} (θ − μ_{t−1}) + log p(r_t | ρ_t) + const.    (3.3)

We fit the log of a gaussian to the approximation of the log posterior in equation 3.2 using the Laplace method (Berger, 1985; MacKay, 2003). This recursive approach is much faster, albeit slightly less accurate, than using the Laplace method to fit a gaussian distribution to the true posterior. The running time of this recursive update is O(d²), whereas fitting a gaussian distribution to the true posterior is O(td³). Since t and d are large, easily O(10³), the computational savings of the recursive approach are well worth the slight loss of accuracy. If the dimensionality is low, d = O(10), we can measure the error by using Monte Carlo methods to compute the Kullback-Leibler divergence between the true posterior and our gaussian approximation. This analysis (results not shown) reveals that the error is small and rapidly converges to zero.

The mean of our gaussian approximation, μ_t, is the peak of equation 3.2. The key to rapidly updating our posterior is that we can easily compute the direction in which the peak of equation 3.2 lies relative to μ_{t−1}. Once we know the direction in which μ_t lies, we just need to perform a one-dimensional search to find the actual peak. To compute the direction of μ_t − μ_{t−1}, we write out the gradient of equation 3.2,
∇_θ log p(θ | s_{1:t}, r_{1:t}) ≈ ∇_θ log N(θ; μ_{t−1}, C_{t−1}) + ∇_θ log p(r_t | s_t, θ)    (3.4)

= −C_{t−1}^{−1}(θ − μ_{t−1}) + [∂ log p(r_t | ρ_t)/∂ρ_t] s_t.    (3.5)
At the peak of the log posterior, the gradient equals zero, which means the first term in equation 3.5 must be parallel to s_t. Since C_{t−1} is nonsingular, μ_t − μ_{t−1} must be parallel to C_{t−1} s_t,
μ_t = μ_{t−1} + Δ_t C_{t−1} s_t.    (3.6)
Δ_t is a scalar that measures the magnitude of the difference μ_t − μ_{t−1}. We find Δ_t by solving the following one-dimensional equation using Newton's method:
Δ_t = ∂ log p(r_t | ρ_t)/∂ρ_t |_{ρ_t = s_t^T μ_{t−1} + Δ_t s_t^T C_{t−1} s_t}.    (3.7)
This equation defines the location of the peak of the log posterior in the direction C_{t−1} s_t. Since the log posterior is concave, equation 3.7 is the first-order condition of a one-dimensional concave optimization problem and is therefore guaranteed to have a single, unique solution. Solving this one-dimensional problem involves a single matrix-vector multiplication that requires O(d²) time.
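The Newton solve for Δ_t and the resulting mean update can be sketched as follows. This is an illustrative sketch assuming the exponential link, for which ∂ log p/∂ρ = r − exp(ρ) dt; the function names are our own.

```python
import numpy as np

def solve_delta(r, mu_prev, C_prev, s, dt=0.01, tol=1e-12, max_iter=100):
    """Newton solve for the scalar Delta_t in mu_t = mu_{t-1} + Delta_t C_{t-1} s_t.
    With the exponential link, the stationarity condition (equation 3.7) reads
    Delta = r - exp(a + b*Delta)*dt, where a = s . mu_{t-1} and
    b = s^T C_{t-1} s (the single matrix-vector product needed)."""
    a = float(mu_prev @ s)
    b = float(s @ (C_prev @ s))
    delta = 0.0
    for _ in range(max_iter):
        g = delta - (r - np.exp(a + b * delta) * dt)
        gp = 1.0 + b * np.exp(a + b * delta) * dt   # always > 0: unique root
        step = g / gp
        delta -= step
        if abs(step) < tol:
            break
    return delta

def update_mean(r, mu_prev, C_prev, s, dt=0.01):
    """One O(d^2) mean update: mu_t = mu_{t-1} + Delta_t * (C_{t-1} s_t)."""
    return mu_prev + solve_delta(r, mu_prev, C_prev, s, dt) * (C_prev @ s)
```

Because the residual g(Δ) is increasing and convex in Δ, Newton's method converges to the unique root from any starting point, mirroring the concavity argument in the text.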
Having found μ_t, we estimate the covariance matrix C_t of the posterior by forming the Taylor approximation of equation 3.2 about μ_t:
log p(θ | s_{1:t}, r_{1:t}) ≈ const − (1/2)(θ − μ_t)^T C_t^{−1} (θ − μ_t)    (3.8)

C_t^{−1} = −∂² log p(θ | s_{1:t}, r_{1:t}) / ∂θ² |_{θ = μ_t}    (3.9)

= C_{t−1}^{−1} − [∂² log p(r_t | ρ_t)/∂ρ_t²] s_t s_t^T.    (3.10)
The Laplace method uses the curvature of the log posterior as an estimate of the inverse covariance matrix. The larger the curvature, the more certain we are that our estimate is close to the true parameters. The curvature, as measured by the second derivative, is the sum of two terms, equation 3.10. The first term approximates the information provided by the first t − 1 observations. The second term measures the information in our latest observation, r_t. The second term is proportional to the Fisher information. By definition, the Fisher information is the negative of the second derivative of the log likelihood (Berger, 1985). The second derivative of the log likelihood provides an intuitive metric for the informativeness of an observation because a larger second derivative means that small differences in θ produce large deviations in the responses. Hence, a large Fisher information means we can infer the parameters with more confidence.
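A small numerical check makes the Fisher information concrete. For the exponential link, the one-dimensional log likelihood and its negative curvature take a simple closed form (a sketch under that assumed link; for this particular link the observed Fisher information happens not to depend on r), which we can verify against a finite-difference second derivative.

```python
import numpy as np

def loglik_1d(r, rho, dt=0.01):
    """Poisson log likelihood as a function of the projection rho alone,
    exponential link: log p = r*log(exp(rho)*dt) - exp(rho)*dt + const."""
    return r * (rho + np.log(dt)) - np.exp(rho) * dt

def fisher_1d(rho, dt=0.01):
    """D(r, rho) = -d^2 log p / d rho^2 = exp(rho)*dt for this link."""
    return np.exp(rho) * dt

# finite-difference estimate of the curvature at an arbitrary point
r, rho, h = 3, 0.7, 1e-4
curv = (loglik_1d(r, rho + h) - 2 * loglik_1d(r, rho) + loglik_1d(r, rho - h)) / h**2
```

The finite-difference curvature agrees with −fisher_1d(rho) to numerical precision, confirming the identification of the second-derivative term in equation 3.10.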
To compute the Hessian, the matrix of partial second derivatives of the log posterior, we need to sum only two matrices: C_{t−1}^{−1} and the Hessian of the log likelihood, log p(r_t | s_t, θ). The Hessian of the log likelihood is a rank 1 matrix. We can therefore efficiently invert the Hessian of the updated log posterior in O(d²) time using the Woodbury matrix lemma (Henderson & Searle, 1981; Seeger, 2007). Evaluating the derivatives in equation 3.10 and using the Woodbury lemma yields
C_t = (C_{t−1}^{−1} + D(r_t, ρ_t) s_t s_t^T)^{−1}    (3.11)

= C_{t−1} − [D(r_t, ρ_t) / (1 + D(r_t, ρ_t) s_t^T C_{t−1} s_t)] C_{t−1} s_t s_t^T C_{t−1}    (3.12)

D(r_t, ρ_t) = −∂² log p(r_t | ρ_t)/∂ρ_t²    (3.13)

= −r_t ∂²[log f(ρ_t)]/∂ρ_t² + f″(ρ_t) dt.    (3.14)
D(r_t, ρ_t) is the one-dimensional Fisher information: the negative of the second derivative of the log likelihood with respect to ρ_t. In this equation, ρ_t depends on the unknown parameters, θ, because we would like to compute the Fisher information for the true parameters. That is, we would like to expand our approximation of the log posterior about the true θ. Since θ is unknown, we use the approximation
ρ_t ≈ μ_t^T s_t    (3.15)
to compute the new covariance matrix. Since computing the covariance matrix is just a rank-one update, computing the updated gaussian approximation requires only O(d²) computations. A slower but potentially more accurate update for small t would be to construct our gaussian by matching the first and second moments of the true posterior distribution using the expectation propagation algorithm (Minka, 2001; Seeger, Gerwinn, & Bethge, 2007).
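The rank-one covariance update via the Woodbury lemma is only a few lines; the sketch below (function name ours) can be checked against a direct matrix inversion.

```python
import numpy as np

def update_covariance(C_prev, s, D):
    """O(d^2) rank-one posterior covariance update via the Woodbury lemma:
    C_t = (C_{t-1}^{-1} + D s s^T)^{-1}
        = C_{t-1} - D/(1 + D s^T C_{t-1} s) * (C_{t-1} s)(C_{t-1} s)^T,
    where D = D(r_t, rho_t) >= 0 is the one-dimensional Fisher information
    evaluated at rho_t ~ mu_t . s_t."""
    Cs = C_prev @ s
    return C_prev - (D / (1.0 + D * float(s @ Cs))) * np.outer(Cs, Cs)
```

Only the matrix-vector product C_{t−1} s_t and an outer product are needed, which is what keeps the per-trial cost quadratic rather than cubic in d.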

Asymptotically, under suitable regularity conditions, the mean of our gaussian is guaranteed to converge to the true θ. Consistency can be established by applying theorems for the consistency of estimators based on stochastic gradient descent (Fabian, 1978; Sharia, 2007). We used numerical simulations (data not shown) to verify the predictions of these theorems. To apply these theorems to our update, we must be able to restrict θ to a closed and bounded space. Since any θ corresponding to a neural model would naturally be bounded, this constraint is satisfied for all biologically reasonable GLMs.

Our update uses the Woodbury lemma, which is unstable when C_t is close to singular. When optimizing under a power constraint (see section 5.2), we can avoid using the Woodbury lemma by computing the eigendecomposition of the covariance matrix. Since we need to compute the eigendecomposition in order to optimize the stimulus, no additional computation is required in this case. When the eigendecomposition was not needed for optimization, we usually found that the Woodbury lemma was sufficiently stable. However, a more stable solution in this case would have been to compute and maintain the Cholesky decomposition of the covariance matrix (Seeger, Steinke, & Tsuda, 2007).

4.  Computing the Mutual Information

A rigorous Bayesian approach to sequential optimal experimental design is to pick the stimulus that maximizes the expected value of a utility function (Bernardo, 1979). Common choices are the mean squared error of the model's predictions (Fedorov, 1972; Cohn, Ghahramani, & Jordan, 1996; Schein, 2005), the entropy of the responses (Bates, Buck, Riccomagno, & Wynn, 1996), and the expected information gain (Lindley, 1956; Bernardo, 1979; MacKay, 1992; Chaloner & Verdinelli, 1995). A number of different quantities can be used to measure the expected information, depending on whether the goal is prediction or inference. We are primarily interested in estimating the unknown parameters, so we measure expected information using the mutual information between θ and the data, {s_{1:t}, r_{1:t}}. The mutual information measures the expected reduction in the number of models consistent with the data. Choosing the optimal design requires maximizing the mutual information conditioned on the data already collected as a function of the design p(x_{t+1}),
p_opt(x_{t+1}) = argmax_{p(x_{t+1})} I(θ; r_{t+1} | x_{t+1}, s_{1:t}, r_{1:t}).    (4.1)
We condition the mutual information on the data already collected because we want to maximize the information given what we have already learned about θ.

Before diving into a detailed mathematical computation, we want to provide a less technical explanation of our approach. Before we conduct any trials, we have a set, Θ, of possible models. For any stimulus, each model in Θ makes a prediction of the response. To identify the best model, we should pick a stimulus that maximizes the disagreement between the predictions of the different models. In theory, we could measure the disagreement for any stimulus by computing the predicted response for each model. However, since the number of possible models is large, explicitly computing the response for each model is rarely possible.

We can compute the mutual information efficiently because once we pick a stimulus, we partition the model space, Θ, into equivalent sets with respect to the predicted response. Once we fix s_{t+1}, the likelihood of the responses varies only with the projection ρ_{t+1} = θ^T s_{t+1}. Hence, all models with the same value for ρ_{t+1} make the same prediction. Therefore, instead of computing the disagreement among all models in Θ, we have to compute the disagreement only between models in these different subspaces; that is, at most, we have to determine the response for one model in each of the subspaces defined by ρ_{t+1} = const.

Of course, the mutual information also depends on what we already know about the fitness of the different models. Since our experiment provides no information about θ in directions orthogonal to s_{t+1}, our uncertainty in these directions will be unchanged. Therefore, the mutual information will depend only on the information we have about θ in the direction s_{t+1}; that is, it depends only on the one-dimensional marginal distribution of ρ_{t+1} instead of our full posterior p(θ | s_{1:t}, r_{1:t}).

Furthermore, we have to evaluate the mutual information only for nonrandom designs because any optimal design must place all of its mass on the stimulus, x_{t+1}, which maximizes the conditional mutual information (MacKay, 1992; Paninski, 2005). This property means we can focus on the simpler problem of efficiently evaluating the mutual information as a function of the input s_{t+1}.

The mutual information is the reduction in our uncertainty about the parameters, θ, as measured by the entropy,
I(θ; r_{t+1} | x_{t+1}, s_{1:t}, r_{1:t}) = H(θ | s_{1:t}, r_{1:t}) − E_{r_{t+1}} [H(θ | s_{1:t+1}, r_{1:t+1})].    (4.2)
The first term, H(θ | s_{1:t}, r_{1:t}), measures our uncertainty at time t. Since this term is independent of x_{t+1}, we just need to minimize the second term, which measures how uncertain about θ we expect to be after the next trial. Our uncertainty at time t + 1 depends on the response to the stimulus. Since r_{t+1} is unknown, we compute the expected entropy of the posterior, H(θ | s_{1:t+1}, r_{1:t+1}), as a function of r_{t+1} and then take the average over r_{t+1}, using our GLM to compute the likelihood of each r_{t+1} (MacKay, 1992; Chaloner & Verdinelli, 1995). Since the likelihood of r_{t+1} depends on the unknown model parameters, we also need to take an expectation over θ. To evaluate the probability of the different θ, we use our current posterior, p(θ | s_{1:t}, r_{1:t}).
We compute the posterior entropy, H(θ | s_{1:t+1}, r_{1:t+1}), as a function of r_{t+1} by first approximating the posterior at time t + 1 as gaussian. The entropy of a gaussian is easy to compute (Cover & Thomas, 1991):
H(θ | s_{1:t+1}, r_{1:t+1}) ≈ (1/2) log |C_{t+1}| + (d/2) log(2πe)    (4.3)

= −(1/2) log |C_{t+1}^{−1}| + const.    (4.4)
According to our update rule,
C_{t+1}^{−1} = C_t^{−1} + D(r_{t+1}, ρ_{t+1}) s_{t+1} s_{t+1}^T    (4.5)

ρ_{t+1} = θ^T s_{t+1}.    (4.6)
As discussed in the previous section, the Fisher information depends on the unknown parameters. To compute the entropy, we treat the Fisher information,
D(r_{t+1}, ρ_{t+1}) = −∂² log p(r_{t+1} | ρ_{t+1})/∂ρ_{t+1}²,    (4.7)
as a random variable, since it is a function of θ. We then estimate our expected uncertainty as the expectation of the posterior entropy with respect to θ, using the posterior at time t. The mutual information, equation 4.2, already entails computing an average over θ, so we do not need to introduce another integration.
This Bayesian approach to estimating the expected posterior entropy differs from the approach used to update our gaussian approximation of the posterior. To update the posterior at time t, we use the point estimate ρ_t ≈ μ_t^T s_t to estimate the Fisher information of the observation at time t. We could apply the same principle to compute the expected posterior entropy by using the approximation
ρ_{t+1} ≈ μ_{t+1}^T s_{t+1},    (4.8)
where μ_{t+1} is computed using equations 3.6 and 3.7. Using this approximation of ρ_{t+1} is intractable because we would need to solve for μ_{t+1} numerically for each value of r_{t+1}. We could avoid this problem by using the point approximation ρ_{t+1} ≈ μ_t^T s_{t+1}, which we can easily compute since μ_t is known (MacKay, 1992; Chaudhuri & Mykland, 1993; Cohn, 1994). This point approximation means we estimate the Fisher information for each possible r_{t+1} using the assumption that θ = μ_t. Unless μ_t happens to be close to the true θ, there is no reason that the Fisher information computed assuming θ = μ_t should be close to the Fisher information evaluated at the true parameters. In particular, at the start of an experiment, when μ_t is highly inaccurate, we would expect this point approximation to lead to poor estimates of the Fisher information. Similarly, we would expect this point approximation to fail for time-varying systems, as the posterior covariance may no longer converge to zero asymptotically (see section 6.2). In contrast to using a point approximation, our approach of averaging the Fisher information with respect to θ should provide much better estimates when our uncertainty about θ is high or when θ is changing (Lindley, 1956; Chaloner & Verdinelli, 1995). Averaging the Fisher information with respect to our posterior leads to an objective function that takes into account all possible models. In particular, it means we favor inputs that are informative under all models with high probability, as opposed to inputs that are informative only if θ = μ_t.
To compute the mutual information, equation 4.2, we need to evaluate a high-dimensional expectation over the joint distribution on (θ, r_{t+1}). Evaluating this expectation is tractable because we approximate the posterior as a gaussian distribution and the log likelihood is one-dimensional. The one-dimensionality of the log likelihood means C_{t+1} is a rank 1 update of C_t. Hence, we can use the matrix determinant lemma to evaluate the entropy at time t + 1,
formula
4.9
formula
4.10
Consequently,
formula
4.11
formula
4.12
We can evaluate equation 4.12 without doing any high-dimensional integration because the likelihood of the responses only depends on . As a result,
formula
4.13
Since ρt+1 is a linear function of the parameters and the posterior is gaussian, ρt+1 is a one-dimensional gaussian variable with mean μρ and variance σ2ρ. The final result is a very simple, two-dimensional expression for our objective function,
formula
4.14
The right-hand side of equation 4.14 is an approximation of the mutual information because the posterior is not in fact gaussian.
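Because ρt+1 depends on the stimulus and the parameters only through a one-dimensional projection, every candidate stimulus is scored through just two scalars. A minimal sketch of that mapping, with theta_hat and C denoting the posterior mean and covariance (function name ours):

```python
import numpy as np

def project_stimulus(x, theta_hat, C):
    """Reduce a candidate stimulus to the two scalars that determine its
    informativeness: mu_rho = theta_hat . x (projection on the posterior
    mean) and sigma2_rho = x^T C x (posterior variance along x)."""
    return theta_hat @ x, x @ C @ x
```

Two stimuli with the same (μρ, σ2ρ) are equally informative, which is what makes the two-dimensional search of section 5 and the finite-set heuristics of section 5.3 possible.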

Equation 4.14 is a fairly intuitive metric for rating the informativeness of different designs. To distinguish among models, we want the response to be sensitive to . The information increases with the sensitivity because as the sensitivity increases, small differences in produce larger differences in the response, making it easier to identify the correct model. The information, however, also depends on the variability of the responses. As the variability of the responses increases, the information decreases because it is harder to determine which model is more accurate. The Fisher information, D(rt+1, ρt+1), takes into account both the sensitivity and the variability. As the sensitivity increases, the second derivative of the log likelihood increases because the peak of the log likelihood becomes sharper. Conversely, as the variability increases, the log likelihood becomes flatter, and the Fisher information decreases. Hence, D(rt+1, ρt+1) measures the informativeness of a particular response. However, information is valuable only if it tells us something we do not already know. In our objective function, σ2ρ measures our uncertainty about the model. Since our objective function depends on the product of the Fisher information and our uncertainty, our algorithm will favor experiments providing large amounts of new information.

In equation 4.14 we have reduced the mutual information to a two-dimensional integration over ρt+1 and rt+1, which depends only on (μρ, σ2ρ). While 2D numerical integration is quite tractable, it could be too slow for real-time applications. A simple solution is to precompute this function on a suitable 2D region of (μρ, σ2ρ) before the experiment begins and then use a lookup table during our experiments.
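A sketch of this precomputation, assuming the exponential-link Poisson case of section 4.1 so that the Fisher information is exp(ρ) and response-independent (for a general link function, an additional sum over rt+1 would be needed):

```python
import numpy as np

def expected_info(mu_rho, sigma2_rho, n_quad=40):
    """E_rho[log(1 + D(rho) * sigma2_rho)] with rho ~ N(mu_rho, sigma2_rho),
    assuming D(rho) = exp(rho) (exponential link, response-independent).
    Uses probabilists' Gauss-Hermite quadrature over rho."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    rho = mu_rho + np.sqrt(sigma2_rho) * nodes
    # weights sum to sqrt(2*pi); normalize to get an expectation under N(0,1)
    return (weights / np.sqrt(2.0 * np.pi)) @ np.log1p(np.exp(rho) * sigma2_rho)

# Precompute once on a grid of (mu_rho, sigma2_rho); at run time a table
# lookup (or interpolation) replaces the quadrature.
mus = np.linspace(-3.0, 3.0, 61)
sig2s = np.linspace(1e-3, 2.0, 40)
table = np.array([[expected_info(m, s) for s in sig2s] for m in mus])
```

The grid bounds here are placeholders; in practice they would be chosen to cover the values of (μρ, σ2ρ) attainable under the stimulus power constraint.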

In certain special cases, we can further simplify the expectations in equation 4.14, making numerical integration unnecessary. One simplification is to use the standard linear approximation log(1 + x) = x + o(x) when D(rt+1, ρt+1)σ2ρ is sufficiently small. Using this linear approximation, we can simplify equation 4.14 to
formula
4.15
which may be evaluated analytically in some special cases (see below). If is constant, then this approximation is always justified asymptotically because the variance in all directions asymptotically converges to zero (see section 7). Consequently, σ2ρ → 0 as t → ∞. Therefore, if D(rt+1, ρt+1) is bounded, then asymptotically D(rt+1, ρt+1)σ2ρ → 0.

4.1.  Special Case: Exponential Nonlinearity.

When the nonlinear function f() is the exponential function, we can derive an analytical approximation for the mutual information, equation 4.14, because the Fisher information is independent of the observation. This special case is worth considering because the exponential nonlinearity has proved adequate for modeling several types of neurons in the visual system (Chichilnisky, 2001; Pillow, Paninski, Uzzell, Simoncelli, & Chichilnisky, 2005; Rust, Mante, Simoncelli, & Movshon, 2006). As noted in the previous section, the Fisher information depends on the variability and sensitivity of the responses to the model parameters. In general, the Fisher information depends on the response because we can use it to estimate the variability and sensitivity of the neuron's responses. For the Poisson model with convex and increasing f(),1 a larger response indicates more variability but also more sensitivity of the response to ρt+1. For the exponential nonlinearity, the decrease in information due to increased variability and the increase in information due to increased sensitivity with the response cancel out, making the Fisher information independent of the response. Mathematically this means the second derivative of the log likelihood with respect to is independent of rt+1,
formula
4.16
By eliminating the expectation over rt+1 and using the linear approximation log(1 + x) = x + o(x), we can simplify equation 4.14:
formula
4.17
formula
4.18
formula
4.19
We can use the moment-generating function of a gaussian distribution to evaluate this expectation over ρt+1:
formula
4.20
Our objective function is increasing with μρ and σ2ρ. In section 5.2, we show that this property makes optimizing the design for an exponential nonlinearity particularly tractable.
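Assuming equation 4.20 takes the form exp(μρ + σ2ρ/2)·σ2ρ up to constants (the gaussian moment-generating function applied to the linearized objective), the claimed monotonicity in μρ and σ2ρ can be checked directly. A minimal sketch:

```python
import numpy as np

def info_exp_link(mu_rho, sigma2_rho):
    # E[exp(rho)] * sigma2_rho for rho ~ N(mu_rho, sigma2_rho), using the
    # Gaussian MGF: E[exp(rho)] = exp(mu_rho + sigma2_rho / 2).
    # Constant factors and the log are omitted, as in the linear approximation.
    return np.exp(mu_rho + 0.5 * sigma2_rho) * sigma2_rho
```

This objective is increasing in both arguments, which is the property exploited in section 5.2 to reduce the stimulus optimization to a one-dimensional search.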

4.2.  Linear Model.

The optimal design for minimizing the posterior entropy of for the standard linear model is a well-known result in the statistics and experimental design literature (MacKay, 1992; Chaloner & Verdinelli, 1995). It is enlightening to rederive these results using the methods we have introduced so far and to point out some special features of the standard linear case.

The linear model is
formula
4.21
with ϵ a zero-mean gaussian random variable with variance σ2. The linear model is a GLM with a gaussian distribution for the conditional distribution and a linear link function:
formula
4.22
formula
4.23
For the linear model, the variability, σ2, is constant. Furthermore, the sensitivity of the responses to the input and the model parameters is also constant. Consequently, the Fisher information is independent of both the response and the input (Chaudhuri & Mykland, 1993). Mathematically this means that the observed Fisher information D(rt+1, ρt+1) is a constant equal to the reciprocal of the variance:
formula
4.24
Plugging D(rt+1, ρt+1) into equation 4.14, we obtain the simple result:
formula
4.25
Since σ2 is a constant, we can increase the mutual information only by picking stimuli for which is maximized. Under the power constraint, σ2ρ is maximized when all the stimulus energy is parallel to the maximum eigenvector of Ct, the direction of maximum uncertainty. μρ does not affect the optimization at all. This property distinguishes the linear model from the exponential Poisson case described above. Furthermore, the covariance matrix Ct is independent of past responses because the true posterior is gaussian with covariance matrix:
formula
4.26
Consequently, the optimal sampling strategy can be determined a priori, without having to observe rt or to make any corresponding adjustments in our sampling strategy (MacKay, 1992).
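A numerical sketch of these two facts for the linear model, with a stand-in covariance (all names ours): the optimal power-constrained stimulus lies along the maximum eigenvector of Ct, and the covariance update never consults the response.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, noise_var = 6, 1.0, 0.5
A0 = rng.standard_normal((d, d))
C = A0 @ A0.T + np.eye(d)                  # stand-in posterior covariance C_t

evals, evecs = np.linalg.eigh(C)           # eigenvalues in ascending order
x_opt = m * evecs[:, -1]                   # all stimulus power on the max eigenvector
sigma2_rho = x_opt @ C @ x_opt             # = m**2 * largest eigenvalue

# Response-independent covariance update (cf. equation 4.26): the next
# covariance depends only on the chosen stimulus and the noise variance.
C_next = np.linalg.inv(np.linalg.inv(C) + np.outer(x_opt, x_opt) / noise_var)
```

Because C_next is computable before observing rt+1, the entire stimulus sequence can be scheduled before the experiment begins.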

Like the Poisson model with an exponential link function, the linear model's Fisher information is independent of the response. However, for the linear model, the Fisher information is also independent of the model parameters. Since the Fisher information is independent of the parameters, an adaptive design offers no benefit because we do not need to know the parameters to select the optimal input. In contrast, for the Poisson distribution with an exponential link function, the Fisher information depends on the parameters and the input, even though it is independent of the responses. As a result, we can improve our design by adapting it as our estimate of improves.

5.  Choosing the Optimal Stimulus

The simple expression for the conditional mutual information, equation 4.14, means that we can find the optimal stimulus by solving the following simple program:
formula
5.1
formula
5.2
formula
5.3
formula
5.4
is the range of the mapping corresponding to the stimulus domain, . Once we have computed , we need to solve a highly tractable 2D optimization problem numerically. The final step is to map the optimal (μρ, σ2ρ) back into the input space. In general, computing for arbitrary stimulus domains is the hardest step.

We first present a general procedure for handling arbitrary stimulus domains. This procedure selects the optimal stimulus from a set, , which is a subset of . contains a finite number of inputs; its size will be denoted . Picking the optimal input in is easy. We simply compute (μρ, σ2ρ) for each .

Picking the optimal stimulus in a finite set, , is flexible and straightforward. The informativeness of the resulting design, however, is highly dependent on how is constructed. In particular, we want to ensure that with high probability, contains inputs in that are nearly optimal. If we could compute , then we could avoid the problem of picking a good . One case in which we can compute is when is defined by a power constraint; that is, is a sphere. Since we can compute , we can optimize the input over its full domain. Unfortunately, our method for computing cannot be applied to arbitrary input domains.

5.1.  Optimizing over a Finite Set of Stimuli.

Our first method simultaneously addresses two issues: how to deal with arbitrary stimulus domains and what to do if the stimulus domain is ill defined. In general, we expect that more efficient procedures for mapping a stimulus domain into could be developed by taking into account the actual stimulus domain. However, a general procedure is needed because efficient algorithms for a particular stimulus domain may not exist, or their development may be complex and time-consuming. Furthermore, for many stimulus domains (e.g., natural images), we have many examples of the stimuli but no quantitative constraints that define the domain. An obvious solution to both problems is to simply choose the best stimulus from a subset of examples, .

The challenge with this approach is picking the set . For the optimization to be fast, needs to be sufficiently small. However, we also want to ensure that contains an optimal or nearly optimal input. In principle, this second criterion means should contain a large number of stimuli evenly dispersed over . We can in fact satisfy both requirements because the informativeness of a stimulus depends on only (μρ, σ2ρ). Consequently, we can partition into sets of equally informative experiments based on the value of (μρ, σ2ρ). When constructing , there is no reason to include more than one input for each value of (μρ, σ2ρ) because all of these inputs are equally informative. Hence, to ensure that contains a nearly optimal input, we just need its stimuli to span the two-dimensional and not the much higher-dimensional space, .

Although and Ct change with time, these quantities are known when optimizing . Hence, the mapping is known and easy to evaluate for any stimulus. We can use this knowledge to develop simple heuristics for selecting inputs that tend to be dispersed throughout . We delay until sections 5.3 and 6.1 the presentation of the heuristics that we used in our simulations so that we can first introduce the specific problems and stimulus domains for which these heuristics are suited.

5.2.  Power Constraint.

Ideally, we would like to optimize the input over its full domain as opposed to restricting ourselves to a subset of inputs. Here we present a method for computing when is defined by the power constraint .2 This is an important stimulus domain because of its connection to white noise, which is often used to study sensory systems (Eggermont, 1993; Cottaris & De Valois, 1998; Chichilnisky, 2001; Dayan & Abbott, 2001; Wu, David, & Gallant, 2006). Under an i.i.d. design, the stimuli sampled from resemble white noise. The primary difference is that we strictly enforce the power constraint, whereas for white noise, the power constraint applies only to the average power of the input. The domain is also worth considering because it defines a large space that includes many important subsets of stimuli such as random dot patterns (DiCarlo, Johnson, & Hsiao, 1998).

Our main result is a simple, efficient procedure for finding the boundary of as a function of a 1D variable. Our procedure uses the fact that is closed and connected. Furthermore, for fixed μρ, σ2ρ is continuous on the interval between its maximum and minimum values. These properties of mean we can compute the boundary of by maximizing and minimizing σ2ρ as a function of μρ. consists of all points on this boundary as well as the points enclosed by this curve (Berkes & Wiskott, 2005):
formula
5.5
formula
5.6
formula
5.7

By solving equations 5.6 and 5.7, we can walk along the curves that define the upper and lower boundaries of as a function of μρ. To move along these curves, we simply adjust the value of the linear constraint. As we walk along these curves, the quadratic constraint ensures that we do not violate the power constraint that defines the stimulus domain.
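For intuition, the feasible region can also be approximated by brute force in low dimensions by sampling the power-constrained sphere and recording (μρ, σ2ρ); the exact boundary is what equations 5.6 and 5.7 trace. A sketch with a stand-in posterior (names ours):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 3, 1.0
theta_hat = np.array([1.0, 0.5, -0.25])    # stand-in posterior mean
C = np.diag([3.0, 1.0, 0.5])               # stand-in posterior covariance

# Uniform samples on the sphere of radius m (the power constraint).
x = rng.standard_normal((50000, d))
x *= m / np.linalg.norm(x, axis=1, keepdims=True)

mu = x @ theta_hat                          # mu_rho for each sample
s2 = np.einsum('ij,jk,ik->i', x, C, x)      # sigma2_rho for each sample
```

The sampled (μρ, σ2ρ) pairs fill out the region bounded by the curves that the exact procedure computes; σ2ρ is confined between m2 times the smallest and largest eigenvalues of Ct.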

We have devised a numerically stable and fast procedure for computing the boundary of . Our procedure uses linear algebraic manipulations to eliminate the linear constraints in equations 5.6 and 5.7. To eliminate the linear constraint, we derive an alternative quadratic expression for σ2ρ in terms of ,
formula
5.8
Here we discuss only the most important points regarding equation 5.8; the derivation and definition of the terms are provided in appendix A. The linear term of this modified quadratic expression ensures that the value of this expression is independent of the projection of on . The constant term ensures that the value of this expression equals the value of σ2ρ if we forced the projection of on to μρ. Maximizing and minimizing σ2ρ subject to linear and quadratic constraints is therefore equivalent to maximizing and minimizing this modified quadratic expression with just the quadratic constraint.

To maximize and minimize equation 5.8 subject to the quadratic constraint , we use the Karush-Kuhn-Tucker (KKT) conditions. For these optimization problems, the KKT conditions can be proved to be necessary and sufficient (Fortin, 2000). To compute the boundary of as a function of μρ, we need to solve the KKT conditions for each value of μρ. This approach is computationally expensive because for each value of μρ, we need to find the value of the Lagrange multiplier by finding the root of a nonlinear function. We have devised a much faster solution based on computing μρ as a function of the Lagrange multiplier; the details are in appendix A. This approach is faster because to compute μρ as a function of the Lagrange multiplier, we need only find the root of a 1D quadratic expression.

To solve the KKT conditions, we need the eigendecomposition of A. Computing the eigendecomposition of A is the most expensive operation and, in the worst case, requires O(d3) operations. A, however, is a rank 2 perturbation of Ct, equation A.11. When these perturbations are orthogonal to some of the eigenvectors of Ct, we can reduce the number of computations needed to compute the eigendecomposition of A by using the Gu-Eisenstat algorithm (Gu & Eisenstat, 1994), as discussed in the next section. The key point is that we can on average compute the eigendecomposition in O(d2) time.

Having computed , we can perform a 2D search to find the pair (μρ, σ2ρ)*, which maximizes the mutual information, thereby completing step 1 in our program. To finish the program, we need to find an input such that and . We can easily find one solution by solving a one-dimensional quadratic equation. Let and denote the inputs corresponding to (μ*ρ, σ2ρ min) and (μ*ρ, σ2ρ max), respectively. These inputs are automatically computed when we compute the boundary of . To find a suitable , we find a linear combination of these two vectors that yields σ2*ρ:
formula
5.9
formula
5.10
All necessarily satisfy the power constraint because it defines a convex set, and is a linear combination of two stimuli in this set. Similar reasoning guarantees that has projection μ*ρ on . Although this maximizes the mutual information with respect to the full stimulus domain under the power constraint, this solution may not be unique. Finding γ completes the optimization of the input under the power constraint.
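A sketch of this final interpolation step (equations 5.9 and 5.10), assuming the target variance lies between the variances attained by the two boundary inputs; γ solves a one-dimensional quadratic (function name ours):

```python
import numpy as np

def mix_to_target_variance(x_min, x_max, C, target):
    """Find x = gamma * x_min + (1 - gamma) * x_max with x^T C x = target.
    x_min and x_max are the boundary inputs attaining sigma2_min and
    sigma2_max at mu*_rho; since both share the same projection on the
    posterior mean, any convex combination preserves mu*_rho."""
    dxy = x_min - x_max
    a = dxy @ C @ dxy                       # quadratic coefficient in gamma
    b = 2.0 * (x_max @ C @ dxy)
    c = x_max @ C @ x_max - target
    roots = np.roots([a, b, c])
    # Pick a real root in [0, 1] so x stays a convex combination.
    gamma = roots[(roots.imag == 0) & (roots.real >= 0) & (roots.real <= 1)][0].real
    return gamma * x_min + (1.0 - gamma) * x_max
```

Convexity of the power-constrained ball guarantees the returned stimulus is feasible, exactly as argued in the text.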
In certain cases, we can reduce the two-dimensional search over to an even simpler one-dimensional search. If the mutual information is monotonically increasing in σ2ρ, then we need to consider only σ2ρ,max(μρ) for each possible value of μρ. Consequently, a one-dimensional search over σ2ρ,max(μρ) is sufficient for finding the optimal input. A sufficient condition for guaranteeing that the mutual information increases with σ2ρ is convexity of in ρt+1 (see appendix B). An important example satisfying this condition is f(ρt+1) = exp(ρt+1), which satisfies the convexity condition because
formula
5.11

5.3.  Heuristics for the Power Constraint.

Although we can compute when , efficient heuristics for picking subsets of stimuli are still worth considering. If the size of the subset of stimuli is small enough, then computing (μρ, σ2ρ) for each stimulus in the subset is usually faster than computing for the entire stimulus domain. Since we can set the size of the set to any positive integer, by decreasing the size of the set we can sacrifice accuracy, in terms of finding the optimal stimulus, for speed.

We developed a simple heuristic for constructing finite subsets of by taking linear combinations of the mean and maximum eigenvector. To construct a subset, , of the closed ball, we use the following procedure:

  1. Generate a random number, ω, uniformly from the interval [−m, m], where m2 is the stimulus power.

  2. Generate a random number, ϕ, uniformly from the interval .

  3. Add the input to , where . is the maximum eigenvector of Cx,t.

This procedure tends to produce a set of stimuli that are dispersed throughout . By varying the projection of along the MAP, the heuristic tries to construct a set of stimuli for which the values of μρ are uniformly distributed on the valid interval. Similarly, by varying the projection of each stimulus along the maximum eigenvector, we can adjust the value of σ2ρ for each stimulus. Unfortunately, the subspace of the stimulus domain spanned by the mean and max eigenvector may not contain the stimuli that map to the boundaries of . Nonetheless, since this heuristic produces stimuli that tend to be dispersed throughout , we can usually find a stimulus in that is close to being optimal.
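The steps above can be sketched as follows. The text does not fully specify how φ mixes in the maximum eigenvector, so the rotation toward a random orthogonal direction below is our reading; this is the sphere variant, with every stimulus satisfying ||x|| = m (all names ours):

```python
import numpy as np

def heuristic_set(theta_hat, C, m, n_stim, rng=None):
    """Construct n_stim stimuli on the sphere of radius m by mixing a random
    amount of power along the MAP direction with the remainder along the
    maximum eigenvector of C, rotated by phi toward a random orthogonal
    direction (the phi-mixing is our assumption). Requires d >= 3."""
    rng = np.random.default_rng() if rng is None else rng
    d = theta_hat.size
    u1 = theta_hat / np.linalg.norm(theta_hat)       # MAP direction
    v = np.linalg.eigh(C)[1][:, -1]                  # maximum eigenvector
    u2 = v - (v @ u1) * u1                           # orthogonalize against u1
    u2 /= np.linalg.norm(u2)
    stims = []
    for _ in range(n_stim):
        w = rng.uniform(-m, m)                       # power along the MAP
        phi = rng.uniform(0.0, 2.0 * np.pi)
        b = rng.standard_normal(d)                   # random orthogonal direction
        b -= (b @ u1) * u1 + (b @ u2) * u2
        b /= np.linalg.norm(b)
        x = w * u1 + np.sqrt(m**2 - w**2) * (np.cos(phi) * u2 + np.sin(phi) * b)
        stims.append(x)
    return np.array(stims)
```

Varying ω disperses μρ over its valid interval, while varying the component along the maximum eigenvector disperses σ2ρ, as described in the text.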

When the mutual information is increasing with σ2ρ, we can easily improve this heuristic. In this case, the optimal stimulus always lies on the sphere . Therefore, when constructing the stimuli in a finite set, we should pick only stimuli that are on this sphere. To construct such a subset, , we use the heuristic above except we set . Since the mutual information for the exponential Poisson model is increasing with σ2ρ, our simulations for this model will always use as opposed to .

We could also have constructed subsets of the stimulus domain, , by uniformly sampling the ball or sphere. Unfortunately, this process produces sets that rarely contain highly informative stimuli, particularly in high dimensions. Since the uniform distribution on the sphere is radially symmetric, the mean of is zero and the covariance matrix of is diagonal with entries . As a result, the variance of μρ decreases as 1/d, ensuring that for high-dimensional systems, the stimuli in have μρ close to zero with high probability (see Figure 4). Uniformly sampling the ball or sphere therefore does a poor job of selecting stimuli that are dispersed throughout . As a result, is unlikely to contain stimuli close to being maximally informative.
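A quick numerical illustration of this concentration effect (the dimensions and the parameter vector are stand-ins):

```python
import numpy as np

# Uniform samples on the sphere of radius m concentrate mu_rho near zero:
# each coordinate has variance m**2 / d, so Var(mu_rho) = m**2 * ||theta||**2 / d.
rng = np.random.default_rng(2)
d, m = 400, 1.0
theta_hat = np.ones(d)                     # stand-in MAP estimate, ||theta|| = 20
x = rng.standard_normal((20000, d))
x *= m / np.linalg.norm(x, axis=1, keepdims=True)   # uniform on the sphere
mu_rho = x @ theta_hat
# mu_rho.std() is close to m * ||theta|| / sqrt(d) = 1, while the maximum
# attainable |mu_rho| under the power constraint is m * ||theta|| = 20.
```

Even with 20,000 samples, essentially none come close to the maximal projection, which is why uniform sampling fails to populate the informative portion of the region.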

Figure 4:

A plot showing , equation 5.5. The gray scale indicates the objective function, the log of equation 4.20. The dots and crosses show the points corresponding to the stimuli in and respectively. The dark gray region centered at μρ = 0 shows the region containing all stimuli in . To make the points easy to see, we kept the size of and small: , . The points on the boundary corresponding to the largest and smallest values of μρ correspond to stimuli that are parallel and antiparallel to . The posterior used to compute these quantities was the posterior after 3000 trials for the Gabor simulation described in the text. The posterior was taken from the design, which picked the optimal stimulus in (i.e., is the image shown in the first row and third column of Figure 5).


5.4.  Simulation Results.

We tested our algorithm using computer simulations that roughly emulated typical neurophysiology experiments. The main conclusion of our simulations is that using our information-maximizing (infomax) design, we can reduce by an order of magnitude the number of trials needed to estimate (Paninski, 2005). This means we can increase the complexity of neural models without having to increase the number of data points needed to estimate the parameters of these higher-dimensional models. Furthermore, our results show that we can perform the computations fast enough—between 10 ms and 1 s depending on —that our algorithm could be used online, during an experiment, without requiring expensive or custom hardware.

Our first simulation used our algorithm to learn the receptive field of a visually sensitive neuron. The simulation tested the performance of our algorithm with a high-dimensional input space. We took the neuron's receptive field to be a Gabor function as a proxy model of a V1 simple cell (Ringach, 2002). We generated synthetic responses by sampling equation 2.3 with set to a 40 × 40 Gabor patch. The nonlinearity was the exponential function.

Plots of the posterior means (recall these are equivalent to the MAP estimate of ) for several designs are shown in Figure 5. The results show that all infomax designs do better than an i.i.d. design, and an infomax design that optimizes over the full domain of the input, , does much better than choosing the best stimulus in a subset constructed by uniformly sampling .

Figure 5:

The receptive field, , of a simulated neuron estimated using different designs. The neuron's receptive field was the 40 × 40 Gabor patch shown in the last column (spike history effects were set to zero for simplicity, ). The stimulus domain was defined by a power constraint . The top three rows show the MAP if we pick the optimal stimulus in , , and respectively. , and contained 1000 stimuli. The final four rows show the results for an i.i.d. design, a design that set , a design that set the stimulus to the maximum eigenvector of Ct, and a design that used sinusoidal gratings with random spatial frequency, orientation, and phase. Selecting the optimal stimulus in or leads to much better estimates of using fewer stimuli than the other methods.


The results in Figures 5 and 6 show that if we choose the optimal stimulus from a finite set, then intelligently constructing the set is critical to achieving good performance. We compared two approaches for creating the set when . The first approach selected a set of stimuli, , by uniformly sampling . The second approach constructed a set for each trial using the heuristic presented in section 5.3. Picking the optimal stimulus in produced much better estimates of than picking the optimal stimulus in . In particular, the design using converged to nearly as fast as the design that optimized over the full stimulus domain, . These results show that using is more efficient than reusing the same set of stimuli for all trials. To achieve comparable results using , we would have to increase the number of stimuli by several orders of magnitude. Consequently, the added cost of constructing a new stimulus set after each trial is more than offset by our ability to use fewer stimuli compared to using a constant set of stimuli.

Figure 6:

The posterior entropies for the simulations shown in Figure 5. Picking the optimal input from decreases the entropy much faster than restricting ourselves to a subset of . However, if we pick a subset of stimuli using our heuristic, then we can decrease the entropy almost as fast as when we optimize over the full input domain. Note that the gray squares corresponding to the i.i.d. design are being obscured by the black triangles.


We also compared the infomax designs to the limiting cases where we put all stimulus energy along the mean or maximum eigenvector (see Figures 5 and 6). Putting all energy along the maximum eigenvector performs nearly as well as an i.i.d. design. Our update, equation 3.12, ensures that if the stimulus is an eigenvector of Ct, the updated covariance matrix is the result of shrinking the eigenvalue corresponding to that eigenvector. Consequently, setting the stimulus to the maximum eigenvector ends up scanning through the different eigenvectors on successive trials. The resulting sequence of stimuli is statistically similar to that of an i.i.d. design because the stimuli are essentially uncorrelated with each other and with . As a result, both methods generate similar marginal distributions with sharp peaks at 0. Since the Fisher information of a stimulus under the power constraint varies only with , both methods pick stimuli that are roughly equally informative. Consequently, both designs end up shrinking the posterior entropy at similar rates.

In contrast, making the stimulus on each trial parallel to the mean leads to a much slower initial decrease of the posterior entropy. Since our initial guess of the mean is highly inaccurate, is close to zero, resulting in a small value for the Fisher information. Furthermore, sequential stimuli end up being highly correlated. As a result, we converge very slowly to the true parameters.

We also evaluated a design that used sinusoidal gratings as the stimuli. In Figure 5, this design produces an estimate of that already has the basic inhibitory and excitatory pattern of the receptive field after just 1000 trials. However, on the remaining trials improves very little. Figure 6 shows that this design decreases the entropy at roughly the same rate as the i.i.d. design. The reason the coarse structure of the receptive field appears after so few trials is that the stimuli have a large amount of spatial correlation. This spatial correlation among the stimuli induces a similar correlation among the components of the MAP and explains why the coarse inhibitory and excitatory pattern of the receptive field appears after so few trials. However, it also makes it difficult to estimate the higher-resolution features of , which is why does not improve much between 1000 and 5000 trials.

Paninski (2005) reported results similar to those in Figure 5 using a brute force computation and optimization of the mutual information. That computation was possible only because was assumed to be a Gabor function specified by just three parameters (the 2D location of its center and its orientation). Similarly, the stimuli were constrained to be Gabor functions. Our simulations did not assume that or was Gabor. could have been any 40 × 40 image with power m2. Attempting to use brute force in this high-dimensional space would have been hopeless. Our results show that a sequential optimal design allows us to perform system identification in high-dimensional spaces that might otherwise be tractable only by making strong assumptions about the system.

The fact that we can pick the stimulus to increase the information about the parameters, , that determine the dependence of the firing rate on the stimulus is unsurprising. Since we are free to pick any stimulus, by choosing an appropriate stimulus we can distinguish among different values of . Our GLM, however, can also include spike history terms. Since we cannot fully control the spike history, a reasonable question is whether infomax can improve our estimates of the spike history coefficients, . Figure 7 shows the results of a simulation characterizing the receptive field of a neuron whose response depends on its past spiking. The unknown parameter vector, , consists of the stimulus coefficients , which were a 1D Gabor function, and the spike history coefficients, , which were inhibitory and followed an exponential function. The nonlinearity was the exponential function.

Figure 7:

A comparison of parameter estimates using an infomax design versus an i.i.d. design for a neuron whose conditional intensity depends on both the stimulus and the spike history. (a) Estimated stimulus coefficients , after 500 and 1000 trials for the true model (dashed gray), infomax design (solid black) and an i.i.d. design (solid gray). (b) MSE of the estimated stimulus coefficients for the infomax design (solid black line) and i.i.d. design (solid gray line). (c) Estimated spike history coefficients, , after 500 and 1000 trials. (d) MSE of the estimated spike history coefficients.


The results in Figure 7 show that an infomax design leads to better estimates of both and . Figure 7 shows the MAPs of both methods on different trials, as well as the mean squared error (MSE) on all trials. In Figure 7, the MSE increases on roughly the first 100 trials because the mean of the prior is zero. The data collected on these early trials tend to increase the magnitude of . Since the true direction of is still largely unknown, the increase in the magnitude of tends to increase the MSE.

By converging more rapidly to the stimulus coefficients, the infomax design produces a better estimate of how much of the response is due to , which leads to better estimates of . The size of this effect is measured by the correlation between and , which is given by Cx,f in equation A.3. Consider a simple example where the first entry of Cx,f is negative and the remaining entries are zero. In this example, and (the first components of and , respectively) would be anticorrelated. This value of Cx,f roughly means that the log posterior remains relatively constant if we increase but decrease . If we knew the value of , then we would know where along this line of equal probability the true parameters were located. As a result, increasing our knowledge about also reduces our uncertainty about .

5.4.1.  Running Time.

Our algorithm is suited to high-dimensional, real-time applications because it reduces the exponential complexity of choosing the optimal design to on average quadratic, and at worst cubic, running time. We verified this claim empirically by measuring the running time of each step of the algorithm as a function of the dimensionality of the stimulus (Figure 8a). These simulations used a GLM with an exponential link function. This nonlinearity leads to a special case of our algorithm because we can derive an analytical approximation of our objective function, equation 4.20, and only a one-dimensional search is required to find the optimal input. These properties facilitate implementation but do not affect the complexity of the algorithm with respect to d. Using a lookup table instead of an analytical expression to estimate the mutual information as a function of (μ_ρ, σ²_ρ) would not change the running time with respect to d because this space is always two-dimensional. Similarly, the increased complexity of a full 2D search compared to a 1D search is independent of d.

Figure 8:

(a) Running time of the four steps that must be performed on each iteration, as a function of the dimensionality of the stimulus. The total running time, as well as the running times of the eigendecomposition of the covariance matrix (eigen.), the eigendecomposition of A in equation A.11 (quadratic modification), and the posterior update, were well fit by polynomials of degree 2. The time required to optimize the stimulus as a function of λ was well fit by a line. The times are the median over many iterations. (b) The running time of the eigendecomposition of the posterior covariance grows on average quadratically because many of the eigenvectors are unchanged by the rank 1 perturbation. We verified this claim empirically for one simulation by plotting the number of modified eigenvectors as a function of the trial. The data are from a 20 × 10 Gabor simulation.

The main conclusion of Figure 8a is that the complexity of our algorithm on average grows quadratically with the dimensionality. The solid black line shows a polynomial of degree 2 fitted to the total running time. We also measured the running time of the four steps that make up our algorithm: (1) updating the posterior, (2) computing the eigendecomposition of the covariance matrix, (3) modifying the quadratic form for σ²_ρ to eliminate the linear constraint (i.e., finding the eigendecomposition of A in equation A.11), and (4) finding the optimal stimulus. The solid lines indicate fitted polynomials of degree 1 for optimizing the stimulus and degree 2 for the remaining curves. Optimizing the stimulus entails searching along the upper boundary of the feasible region for the optimal pair (μ*_ρ, σ²*_ρ) and then finding an input that maps to (μ*_ρ, σ²*_ρ). The running time of these operations scales as O(d) because computing σ²_{ρ,max} as a function of λ requires summing d terms, equation A.17. When the stimulus was 100-dimensional, the total running time was about 10 ms, which is within the range of tolerable latencies for many experiments. Consequently, these results support our conclusion that our algorithm can be used in high-dimensional, real-time applications.

When we optimize under the power constraint, the bottleneck is computing the eigendecomposition. In the worst case, the cost of computing the eigendecomposition grows as O(d³). Figure 8a, however, shows that the average running time of the eigendecomposition grows only quadratically with the dimensionality. The average running time grows as O(d²) because most of the eigenvectors remain unchanged after each trial. The covariance matrix after each trial is a rank 1 perturbation of the covariance matrix from the previous trial, and every eigenvector orthogonal to the perturbation remains unchanged. A rank 1 update can be written as
\[
M' = M + \alpha\,\mathbf{b}\mathbf{b}^\top, \tag{5.12}
\]
where M and M′ are the old and perturbed matrices, respectively, and \(\mathbf{b}\) is the direction of the perturbation. Clearly, any eigenvector, \(\mathbf{e}\), of M orthogonal to the perturbation, \(\mathbf{b}\), is also an eigenvector of M′ because
\[
M'\mathbf{e} = M\mathbf{e} + \alpha\,\mathbf{b}\big(\mathbf{b}^\top\mathbf{e}\big) = M\mathbf{e} = c\,\mathbf{e}, \tag{5.13}
\]
where c is the eigenvalue corresponding to \(\mathbf{e}\).

If the perturbation leaves most of our eigenvectors and eigenvalues unchanged, then we can use the Gu-Eisenstat algorithm to compute fewer than d eigenvalues and eigenvectors, thereby achieving on average quadratic running time (Gu & Eisenstat, 1994; Demmel, 1997; Seeger, 2007). Asymptotically, we can prove that the perturbation is correlated with at most two eigenvectors (see section 7). Consequently, asymptotically we need to compute at most two new eigenvectors on each trial. These asymptotic results, however, are not as relevant for the actual running time as empirical results. In Figure 8b, we plot, for one simulation, the number of eigenvectors that are perturbed by the rank 1 modification. On most trials, fewer than d eigenvectors are perturbed by the update. These results rely to some extent on the fact that our prior covariance matrix was white and hence had only one distinct eigenvalue. On each subsequent iteration, we can reduce the multiplicity of this eigenvalue by at most one. Our choice of prior covariance matrix therefore helps us manage the complexity of the eigendecomposition.
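This invariance is easy to check numerically. The sketch below verifies the property the Gu-Eisenstat algorithm exploits (it does not implement that algorithm itself); the matrix and perturbation are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, d))
M = A + A.T                          # symmetric matrix playing the role of C_t
evals, evecs = np.linalg.eigh(M)

# Perturb M along its first eigenvector: a rank 1 update M' = M + alpha*b*b^T.
b = evecs[:, 0]
M_new = M + 2.0 * np.outer(b, b)

# Every eigenvector orthogonal to b is still an eigenvector of M', with the
# same eigenvalue, so only one eigenpair needs to be recomputed.
max_residual = max(
    np.linalg.norm(M_new @ evecs[:, i] - evals[i] * evecs[:, i])
    for i in range(1, d)
)
print(max_residual)  # numerically zero
```

When the perturbation is a generic vector rather than an eigenvector, it overlaps only the eigenvectors whose eigenspaces it touches, which is why a white prior (one repeated eigenvalue) keeps the number of modified eigenpairs small early in the experiment.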

6.  Important Extensions

In this section we consider two extensions of the basic GLM that expand the range of neurophysiology experiments to which we can apply our algorithm: handling nonlinear transformations of the input and dealing with a time-varying parameter vector. In both cases, our method for picking the optimal stimulus from a finite set requires only slight modifications. Unfortunately, our procedure for picking the stimulus under a power constraint will not work if the input is pushed through a nonlinearity.

6.1.  Input Nonlinearities.

Neurophysiologists routinely record from neurons that are not primary sensory neurons. In these experiments, the input to a neuron is a nonlinear function of the stimulus due to the processing in earlier layers. To make our algorithm work in these experiments, we need to extend our GLM to model the processing in these earlier layers. The extended model shown in Figure 9 is a nonlinear-linear-nonlinear (NLN) cascade model (Wu et al., 2006; Ahrens, Paninski, & Sahani, 2008; Paninski et al., 2007). The only difference from the original GLM is how we define the input:
formula
6.1
Figure 9:

A GLM in which we first transform the input into some feature space defined by the nonlinear functions W_i — in this case, squaring functions.

The input now consists of nonlinear transformations of the stimulus. The nonlinear transformations are denoted by the functions W_i. These functions map the stimulus into feature space, a simple example being the case where the functions W_i represent a filter bank. n_w denotes the number of nonlinear basis functions used to transform the input. For convenience, we denote the output of these transformations as a single feature vector. As before, our objective is to pick the stimulus that maximizes the mutual information about the parameters. For simplicity, we have assumed that the response does not depend on past stimuli, but this assumption could easily be dropped.

NLN models are frequently used to explain how sensory systems process information. In vision, for example, MT cells can be modeled as a GLM whose input is the output of a population of V1 cells (Rust et al., 2006). In this model, V1 is modeled as a population of tuning curves whose output is divisively normalized. Similarly, in audition, cochlear processing is often represented as a spectral decomposition using gammatone filters (de Boer & de Jongh, 1978; Patterson et al., 1992; Lewicki, 2002; Smith & Lewicki, 2006). NLN models can be used to model this spectral decomposition of the auditory input, as well as the subsequent integration of information across frequency (Gollisch, Schütze, Benda, & Herz, 2002). One of the most important NLN models in neuroscience is the energy model. In vision, energy models are used to explain the spatial invariance of complex cells in V1 (Adelson & Bergen, 1985; Dayan & Abbott, 2001). In audition, energy models are used to explain frequency integration and phase insensitivity in auditory processing (Gollisch et al., 2002; Carlyon & Shamma, 2003).

Energy models integrate information by summing the energy of the different input signals. The expected firing rate is a nonlinear function of the integrated energy,
\[
E[r_{t+1}] = f\Big(\sum_i \big(\mathbf{k}_i^\top \mathbf{x}_{t+1}\big)^2\Big). \tag{6.2}
\]
Each linear filter, \(\mathbf{k}_i\), models the processing in an earlier layer or neuron. For simplicity, we present the energy model assuming the firing rate does not depend on past spiking. As an example of the energy model, consider a complex cell. In this model, each \(\mathbf{k}_i\) models a simple cell. The complex cell then sums the energy of the outputs of the simple cells.
Energy models are an important class of models compatible with the extended GLM shown in Figure 9. To represent an energy model in our framework, we need to express energy integration as an NLN cascade. We start by expressing the energy of each channel as a vector-matrix multiplication by introducing the matrices Q_i,
\[
\big(\mathbf{k}_i^\top \mathbf{x}_t\big)^2 = \mathbf{x}_t^\top Q_i\, \mathbf{x}_t. \tag{6.3}
\]
The right-hand side of this expression has more degrees of freedom than our original energy model unless we restrict Q_i to be a rank 1 matrix. Letting Q = ∑_i Q_i, we can write the energy model as
\[
\rho_t = \mathbf{x}_t^\top Q\, \mathbf{x}_t = \sum_{i,j} Q_{i,j}\, x_{i,t}\, x_{j,t}, \tag{6.4}
\]
where x_{i,t} denotes the ith component of the stimulus. This model is linear in the matrix coefficients Q_{i,j} and the products of the stimulus components x_{i,t} x_{j,t}. To obtain a GLM, we use the input nonlinearity to map the stimulus to the vector [x_{1,t}x_{1,t}, …, x_{i,t}x_{j,t}, …]^⊤. The parameter vector for the energy model is the matrix Q rearranged as a vector, which acts on feature space, not stimulus space.
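As a concrete sketch, the quadratic feature map makes the energy model linear in its parameters. The two filters below are made-up illustrations, not filters from the simulations.

```python
import numpy as np

def quadratic_features(x):
    """Map a stimulus x to the vector of all products x_i * x_j."""
    return np.outer(x, x).ravel()

# Hypothetical two-filter energy model (filters chosen for illustration only).
k1 = np.array([1.0, 0.0, 0.0, 0.0])
k2 = np.array([0.0, 1.0, 0.0, 0.0])
Q = np.outer(k1, k1) + np.outer(k2, k2)   # Q = sum of rank 1 matrices k_i k_i^T
theta = Q.ravel()                          # Q rearranged as a parameter vector

x = np.array([0.5, -1.0, 2.0, 0.3])
rho = theta @ quadratic_features(x)        # linear in theta, quadratic in x

# rho equals the summed energy of x along the two filters:
print(rho, (k1 @ x)**2 + (k2 @ x)**2)
```

Because ρ is linear in θ, the existing GLM posterior-update machinery applies unchanged once the stimulus is replaced by its quadratic features.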

Using the functions W_i to project the input into feature space does not affect our strategy for picking the optimal stimulus from a finite set. We simply have to compute the feature representation of each stimulus before projecting it into (μ_ρ, σ²_ρ) space and computing the mutual information. Our solution for optimizing the stimulus under a power constraint, however, no longer works, for two reasons. First, a power constraint on the stimulus does not in general translate into a power constraint on the feature values. As a result, we cannot use the algorithm of section 5.2 to find the optimal feature values. Second, even if we could find the optimal feature values, we would need to invert the feature mapping to find the actual stimulus. For many nonlinearities, the energy model being one example, this mapping is not invertible.

To estimate the parameters of an energy model, we use our existing update method to construct a gaussian approximation of the posterior in feature space. We can then use the MAP to estimate the input filters. The first step is to rearrange the terms of the posterior mean as a matrix. We then estimate the input filters by computing the singular value decomposition (SVD) of this matrix. If the estimate converges to the true value, then the subspace corresponding to its nonzero singular values should equal the subspace spanned by the true filters.
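A sketch of this recovery step, using a synthetic noisy estimate of Q (the filters, noise level, and variable names here are ours, chosen only to illustrate the SVD step):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Two hypothetical unit-norm filters standing in for the true k_i.
t = np.linspace(0.0, np.pi, d)
k1 = np.sin(t); k1 /= np.linalg.norm(k1)
k2 = np.cos(t); k2 /= np.linalg.norm(k2)
Q_true = np.outer(k1, k1) + np.outer(k2, k2)

# Pretend this is the MAP rearranged as a matrix, with estimation noise.
Q_hat = Q_true + 1e-3 * rng.standard_normal((d, d))
Q_hat = 0.5 * (Q_hat + Q_hat.T)      # enforce symmetry, as in the text

# The leading singular vectors of Q_hat span (approximately) the filter subspace.
U, s, _ = np.linalg.svd(Q_hat)
P = U[:, :2] @ U[:, :2].T            # projector onto the estimated subspace
resid = np.linalg.norm(k1 - P @ k1)
print(resid)                          # small: k1 lies in the recovered subspace
```

Note that the SVD recovers the filter subspace, not the individual filters: any rotation of (k1, k2) within that subspace produces the same Q.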

Since we can optimize the design only with respect to a finite set of stimuli, we devised a heuristic for making this set more dispersed in (μ_ρ, σ²_ρ) space. For the energy model,
\[
\rho_{t+1} = \mathbf{x}_{t+1}^\top Q\, \mathbf{x}_{t+1}, \tag{6.5}
\]
\[
\mu_\rho = E[\rho_{t+1}] = \sum_i \mu_{i,t}\, W_i(\mathbf{x}_{t+1}), \tag{6.6}
\]
\[
\mu_\rho = \mathbf{x}_{t+1}^\top \bar{Q}_t\, \mathbf{x}_{t+1} = \sum_i c_i \big(\mathbf{e}_i^\top \mathbf{x}_{t+1}\big)^2, \tag{6.7}
\]
\[
\sigma^2_\rho = \mathbf{w}(\mathbf{x}_{t+1})^\top C_t\, \mathbf{w}(\mathbf{x}_{t+1}), \tag{6.8}
\]
where μ_{i,t} is the ith component of the posterior mean, \(\mathbf{w}(\mathbf{x}_{t+1})\) is the feature-space representation of the stimulus, and c_i and \(\mathbf{e}_i\) are the eigenvalues and eigenvectors of \(\bar{Q}_t\), the MAP rearranged as a matrix. r_t in this example has no dependence on past responses; hence, we do not need to sum over the past responses to compute μ_ρ (i.e., t_a = 0). We construct each stimulus in the heuristic set as follows:
  1. We randomly pick an eigenvector of the MAP estimate rearranged as a matrix, with the probability of picking each eigenvector proportional to the relative energy of the corresponding eigenvalue.

  2. We pick a random number, ω, by uniformly sampling the interval [−m, m], where m² is the maximum allowed stimulus power.

  3. We choose a random direction orthogonal to the chosen eigenvector by uniformly sampling the d − 1 unit sphere orthogonal to it.

  4. We add the stimulus
    \[
    \mathbf{x} = \omega\,\mathbf{e} + \sqrt{m^2 - \omega^2}\;\mathbf{v},
    \]
    where \(\mathbf{e}\) is the chosen eigenvector and \(\mathbf{v}\) is the orthogonal direction, to the stimulus set.
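The four steps above can be sketched as follows. The function and variable names are ours; `Q_map` stands for the MAP rearranged as a symmetric matrix.

```python
import numpy as np

def heuristic_stimulus(Q_map, m, rng):
    """Draw one stimulus for the heuristic set described in the text."""
    evals, evecs = np.linalg.eigh(Q_map)
    # 1. Pick an eigenvector with probability proportional to the relative
    #    energy of its eigenvalue.
    p = evals**2 / np.sum(evals**2)
    i = rng.choice(len(evals), p=p)
    e = evecs[:, i]
    # 2. Random amplitude along the chosen eigenvector, uniform on [-m, m].
    omega = rng.uniform(-m, m)
    # 3. Random direction orthogonal to e: project e out of a random vector.
    v = rng.standard_normal(len(e))
    v -= (v @ e) * e
    v /= np.linalg.norm(v)
    # 4. Combine so the stimulus has power exactly m**2.
    return omega * e + np.sqrt(m**2 - omega**2) * v

rng = np.random.default_rng(4)
Q_map = np.diag([2.0, 1.0, 0.1, 0.0])   # toy matrix-shaped MAP estimate
x = heuristic_stimulus(Q_map, m=3.0, rng=rng)
print(np.linalg.norm(x))  # every draw lies on the sphere of radius m
```

By construction, ω controls the stimulus energy along the chosen feature-space axis, so sweeping ω spreads the stimuli across the range of μ_ρ.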

This heuristic works because for the energy model, ρ_{t+1} measures the energy of the stimulus in feature space. For this model, feature space is defined by the eigenvectors of Q. Naturally, if we want to increase ρ_{t+1}, we should increase the energy of the stimulus along one of the basis vectors of feature space. The eigenvectors of the matrix-shaped MAP estimate are our best estimate of these basis vectors. Hence μ_ρ, the expected value of ρ_{t+1}, varies linearly with the energy of the input along each of these eigenvectors, equation 6.7.

The effectiveness of our heuristic is illustrated in Figure 10. This figure illustrates the mapping of stimuli into (μ_ρ, σ²_ρ) space for stimulus sets constructed using our heuristic and for stimulus sets produced by uniformly sampling the sphere. Our heuristic produces a set of stimuli that is more spread out over the range of μ_ρ. As a result, the heuristic set contains more informative stimuli than the uniformly sampled set.

Figure 10:

Plot shows the mapping of different stimulus sets into (μ_ρ, σ²_ρ) space after 500 trials. The first set consists of 1000 stimuli selected using the heuristic described in the text. The second consists of 1000 stimuli randomly sampled from the sphere of maximum power. The third is a set of 1000 pure tones with random phase and frequency, and power equal to m². All mappings were computed using the same posterior, which was taken from the simulation that picked the optimal stimulus in the heuristic set on each trial. The shading of the dots is proportional to the mutual information of each input, equation 4.20. The plots show that the heuristic set contains more informative stimuli than the other two sets and that its stimuli are more dispersed in (μ_ρ, σ²_ρ) space.

6.1.1.  Auditory Simulation.

We applied these estimation and optimization procedures to a simulation of an auditory neuron. We modeled the neuron using an energy model. For simplicity, our hypothetical neuron received input from just two neurons in earlier layers. We modeled these input neurons as gammatone filters that were identical except for a 90 degree difference in phase (de Boer & de Jongh, 1978; Patterson et al., 1992). We generated spikes by sampling a conditional Poisson process whose instantaneous, conditional firing rate was set by equation 6.2, with the two linear filters being the gammatone filters and f(ρ_{t+1}) = exp(ρ_{t+1}). We estimated the parameters, Q, using an i.i.d. design and two infomax designs. The i.i.d. design uniformly sampled the stimulus from the sphere of maximum power. The two infomax designs picked the optimal stimulus from a subset of stimuli drawn from the sphere. In one case, this set was constructed using our heuristic, while in the other case, it was constructed by uniformly sampling the sphere.

The results of our simulations are shown in Figure 11. When finding the MAP, we restricted the parameters such that the corresponding matrices were symmetric but not necessarily rank 2. The rank 2 restriction is unnecessary because the number of linear filters can be recovered from the number of nonzero singular values of the estimated matrix. To show how well the true gammatone filters can be estimated from its principal components, we show in Figure 11 the reconstruction of the two filters using the first two principal components, that is, the linear combination of the projections of each filter along the first two principal components.

Figure 11:

Simulation results for the hypothetical auditory neuron described in the text. Simulated responses were generated using equation 6.2, with the two linear filters being gammatone filters. These filters were identical except for the phase, which differed by 90 degrees. The results compare an i.i.d. design, two infomax designs, and a design using pure tones. The two infomax designs picked the optimal stimulus from the heuristic and uniformly sampled sets, respectively; both sets contained 1000 inputs. The i.i.d. design picked the input by uniformly sampling the sphere of maximum power. The pure tones had random frequency and phase but power equal to m². To illustrate how well the two filters can be estimated, we plot their reconstruction using the first two principal components of the estimated Q. The infomax design using the heuristic does much better than an i.i.d. design. For that infomax design, the gammatone structure of the two filters is evident starting around 100 and 500 trials, respectively. By 1000 trials, the infomax design using the heuristic set has essentially converged to the true parameters, whereas for the i.i.d. design, the gammatone structure is starting to be revealed only after 1000 trials.

Figures 11 and 12 show that when the optimal stimulus in is picked, the MAP converges more rapidly to the true gammatone filters. In Figure 11, the design that uses pure tones as the inputs appears to produce good estimates of the filters. These results, however, are somewhat misleading. Since these inputs are restricted to tones, the inputs that cause the neuron to fire are highly correlated. As a result, the estimated receptive field is biased by the correlations in the input. Since gammatone filters are similar to sine waves, in some sense, this bias means that using pure tones will rapidly produce a coarse estimate of the gammatone filters. However, since the pure tones are highly correlated, it is difficult to remove these correlations from the estimated receptive field and resolve the finer structure of the filters. This behavior is evident in Figure 12, which shows that after 1000 trials, the MSE for the pure tones design does not decrease as fast as for the alternative designs.

Figure 12:

The MSE of the estimated filters shown in Figure 11. (a) MSE of the first filter. (b) MSE of the second filter. The solid black and dashed black lines show the results for designs that picked the optimal stimulus from the heuristic and uniformly sampled sets, respectively. The solid gray line is for pure tones. The dashed gray line is for an i.i.d. design.

Also evident in the infomax results is the exploitation-exploration trade-off (Kaelbling, Littman, & Moore, 1996). To increase the information about one of the estimated filters, we need to pick stimuli that are correlated with that filter. Since the input filters are orthogonal and the stimulus power is constrained, we can effectively probe only one filter at a time. The exploitation-exploration trade-off explains why, on trials 100 to 500, the estimate of the first filter improves much more than that of the second filter. On these trials, the algorithm exploits its knowledge of the first filter rather than searching for other filters. After roughly 500 trials, exploring becomes more rewarding than exploiting the estimate of the first filter. Hence, the infomax design picks stimuli orthogonal to the first gammatone filter, which eventually leads to finding the second filter.

6.2.  Time-Varying Parameters.

Neural responses often change slowly over the course of an experiment due to changes in the health, arousal, or attentive state of the preparation (Lesica & Stanley, 2005). If we knew the underlying dynamics of the parameters, we could try to model these changes. Unfortunately, incorporating arbitrary, nonlinear dynamical models of the parameters into our information-maximizing strategy is nontrivial because we would have to compute and maximize the expectation of the mutual information with respect to the unobserved parameter changes. Furthermore, even when we expect the parameters to vary systematically, we often have very little a priori knowledge of these dynamics. Therefore, instead of trying to model the actual changes, we simply model the fact that they will cause our uncertainty about the parameters to increase over time in the absence of additional observations. We can capture this increasing uncertainty by assuming that after each trial the parameter vector changes in some small and unknown way (Ergun, Barbieri, Eden, Wilson, & Brown, 2007),
\[
\vec{\theta}_{t+1} = \vec{\theta}_t + \mathbf{w}_t, \tag{6.10}
\]
where \(\mathbf{w}_t\) is normally distributed with a known mean and covariance matrix, Π.
where is normally distributed with a known mean and covariance matrix, Π. Using this simple model, we can factor into our optimization the loss of information about due to its unobserved dynamics. Our use of gaussian noise can be justified using a maximum entropy argument. Since the gaussian distribution maximizes the entropy for a particular mean and covariance, we are in some sense overestimating the loss of information due to changes in . As a result, our uncertainty no longer converges to zero even asymptotically. This is the key property that our model must capture to ensure that our infomax algorithm will pick optimal stimuli. If we assume is constant, then we would underestimate our uncertainty and, by extension, the amount of new information each stimulus would provide. Consequently, the infomax algorithm would do a poor job of picking the optimal stimulus.

To update the posterior and choose the optimal stimulus, we use the procedures described in sections 3 and 5. The only difference due to a time-varying parameter vector is that the posterior covariance matrix is in general no longer just a rank 1 modification of the covariance matrix from the previous trial. Therefore, we cannot use the rank 1 update to compute the eigendecomposition. However, since we may not have any a priori knowledge about the direction of the parameter changes, it is often reasonable to assume the noise has mean zero and a white covariance matrix, Π = cI. In this case, the eigenvectors of C_t + Π are those of C_t, and the eigenvalues are c_i + c, where c_i is the ith eigenvalue of C_t; our methods may then be applied without modification. In cases where we expect the parameters to vary systematically, we could try to model those dynamics more accurately by selecting an appropriate mean and covariance matrix for the noise.
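The Π = cI case is easy to verify numerically: adding white noise to the covariance leaves every eigenvector in place and shifts every eigenvalue by c. The matrix and diffusion variance below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5
A = rng.standard_normal((d, d))
C_t = A @ A.T                        # posterior covariance after trial t
evals, evecs = np.linalg.eigh(C_t)

c = 0.01                             # diffusion variance; Pi = c * I
C_next = C_t + c * np.eye(d)         # covariance after adding white noise

# Eigenvectors are unchanged and each eigenvalue is shifted by c, so no new
# eigendecomposition is needed on this step.
shifted = np.linalg.eigvalsh(C_next)
gap = np.max(np.abs(shifted - (evals + c)))
print(gap)                            # numerically zero
```

This is why a white diffusion model combines cheaply with the rank 1 posterior update: both operations preserve the cached eigenbasis.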

Figure 13 shows the results of using an infomax design to fit a GLM to a neuron whose receptive field drifts nonsystematically with time. The receptive field was a one-dimensional Gabor function whose center moved according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). Although only the center of the receptive field moved, we still modeled the changes using equation 6.10. The results demonstrate the benefits of using an infomax design to estimate a time-varying parameter vector. Although we cannot reduce our uncertainty below a level determined by Π, the infomax design still improves our estimate compared to using random stimuli.

Figure 13:

Estimating the receptive field when the parameter vector is not constant. (a) The posterior means and the true receptive field plotted after each trial. The receptive field was 100-dimensional, with its components following a Gabor function. To simulate slow drifts in eye position, the center of the Gabor function was moved according to a random walk between trials. We modeled the changes as a random walk with a white covariance matrix, Π, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the estimates obtained using stimuli chosen to maximize the information under the (mistaken) assumption that the parameters were constant. Each row of the images plots the estimate, using intensity to indicate the value of the different components. (b) Details of the posterior means on selected trials. (c) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior more effectively. The infomax design selected the optimal stimulus from the sphere of maximum power. The i.i.d. design picked stimuli by uniformly sampling this sphere.
