Adaptively optimizing experiments has the potential to significantly reduce the number of trials needed to build parametric statistical models of neural systems. However, application of adaptive methods to neurophysiology has been limited by severe computational challenges. Since most neurons are high-dimensional systems, optimizing neurophysiology experiments requires performing high-dimensional integration and optimization in real time. Here we present a fast algorithm for choosing the most informative stimulus by maximizing the mutual information between the data and the unknown parameters of a generalized linear model (GLM) that we want to fit to the neuron's activity. We rely on important log concavity and asymptotic normality properties of the posterior to facilitate the required computations. Our algorithm requires only low-rank matrix manipulations and a two-dimensional search to choose the optimal stimulus. The average running time of these operations scales quadratically with the dimensionality of the GLM, making real-time adaptive experimental design feasible even for high-dimensional stimulus and parameter spaces. For example, we require roughly 10 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Despite the approximations made for efficiency, our algorithm asymptotically decreases the uncertainty about the model parameters at a rate equal to the maximum rate predicted by an asymptotic analysis. Simulation results show that picking stimuli by maximizing the mutual information can speed up convergence to the optimal values of the parameters by an order of magnitude compared to using random (nonadaptive) stimuli.
Finally, applying our design procedure to real neurophysiology experiments requires addressing the nonstationarities that we would expect to see in neural responses; our algorithm can efficiently handle both fast adaptation due to spike history effects and slow, nonsystematic drifts in a neuron's activity.
In most neurophysiology experiments, data are collected according to a design that is finalized before the experiment begins. During the experiment, the data already collected are rarely analyzed to evaluate the quality of the design. These data, however, often contain information that could be used to redesign experiments to better test hypotheses (Fedorov, 1972; Chaloner & Verdinelli, 1995; Kontsevich & Tyler, 1999; Warmuth et al., 2003; Roy, Ghosal, & Rosenberger, in press). Adaptive experimental designs are particularly valuable in domains where data are expensive or limited. In neuroscience, experiments often require training and caring for animals, which can be time-consuming and costly. As a result of these costs, neuroscientists are often unable to conduct large numbers of trials using different subjects. The inability to collect enough data makes it difficult for them to investigate high-dimensional, complex neural systems. By using adaptive experimental designs, neuroscientists could potentially collect data more efficiently. In this article, we develop an efficient algorithm for optimally adapting the experimental design in one class of neurophysiology experiments.
We can describe a model that incorporates all of these features by specifying the conditional distribution of the responses given the input. This distribution gives the probability of observing response rt at time t given the input xt. We use a distribution, as opposed to a deterministic function, to specify the relationship between rt and xt because a neuron's response varies across repeated presentations of a stimulus. To simplify the model, we restrict our consideration to parametric distributions that lie in some space Θ. Each vector θ denotes a particular model in this space. To fit a model to a neuron, we need to find the best value of θ.
We estimate θ by observing the neuron's response to various stimuli. For these experiments, the design is a procedure for picking the stimulus on each trial. The design can be specified as a probability distribution over the stimulus space from which we sample the stimulus on each trial. Nonrandom designs can be specified by putting all the probability mass on a single stimulus. A sequential design modifies this distribution after each observation. In contrast, the standard nonsequential approach is to fix this distribution before the experiment starts and then select the stimulus on each trial by drawing independent and identically distributed (i.i.d.) samples from it. Figure 1 provides a schematic of the sequential approach we want to implement, as well as a diagram of the typical i.i.d. design.
Designing experiments by maximizing the mutual information is computationally challenging. The information we expect to gain from an experiment depends on what we have already learned from past observations. To extract the information from past observations, we need to compute the posterior distribution after each trial. Once we have updated the posterior, we need to use it to compute the expected information gain from future experiments; this requires a high-dimensional integration over the space Θ. Maximizing this integral with respect to the design requires a nonlinear search over the high-dimensional stimulus space. In sensory neurophysiology, the stimulus space is high-dimensional because the stimuli tend to be complex, spatiotemporal signals like movies and sounds. The challenge of evaluating this high-dimensional integral and solving the resulting nonlinear optimization has impeded the application of adaptive experimental design to neurophysiology. In the worst case, the complexity of these operations grows exponentially with the dimensionality of the stimulus and parameter spaces. For even moderately sized spaces, direct computation will therefore be intractable, particularly if we wish to adapt the design in a real-time application.
The main contribution of this article is to show how these computations can be performed efficiently when Θ is the space of generalized linear models (GLMs) and the posterior distribution on θ is approximated as a gaussian. Our solution depends on some important log-concavity and rank-one properties of our model. These properties justify the gaussian approximation of the posterior distribution and permit a rapid update after each trial. These properties also allow optimization of the mutual information to be approximated by a tractable two-dimensional problem that can be solved numerically. The solution to this 2D optimization problem depends on the stimulus domain. When the stimulus domain is defined by a power constraint, we can easily find a nearly optimal design. For arbitrary stimulus domains, we present a general algorithm for selecting the optimal stimulus from a finite subset of stimuli in the domain. Our analysis leads to efficient heuristics for constructing this subset to ensure that the resulting design is close to the optimal design.
Our algorithm facilitates estimation of high-dimensional systems because picking more informative designs leads to faster convergence to the best model of the neuron. In our simulations (see section 5.4), the optimal design converges more than an order of magnitude faster than an i.i.d. design. Our algorithm can be applied to high-dimensional, real-time applications because it reduces the complexity with respect to dimensionality from exponential to on average quadratic running time.
This article is organized as follows. In section 2, we present the GLM of neural systems. In section 3, we present an online method for computing a gaussian approximation of the posterior distribution on the GLM's parameters. In section 4, we show how the mutual information can be approximated by a much simpler, low-dimensional function. In section 5, we present the procedure for picking optimal stimuli and show some simulation results. In section 6, we generalize our basic methods to some important extensions of the GLM needed to handle more complicated experiments. In section 7, we show that our algorithm asymptotically decreases the uncertainty about θ at a rate nearly equal to the optimal rate predicted by a general theorem on the rate of convergence of information-maximizing designs (Paninski, 2005). We therefore conclude that this efficient (albeit approximate) implementation produces designs that are in fact asymptotically optimal. Simulations investigating the issue of model misspecification are presented in section 8. Finally, we discuss some limitations and directions for future work in section 9. To help the reader, we summarize in Table 1 the notation that we will use in the rest of the article.
2. The Parametric Model
For the model space, Θ, we choose the set of generalized linear models (GLM) (see Figure 2). The GLM is a tractable and flexible parametric family that has proven useful in neurophysiology (McCullagh & Nelder, 1989; Simoncelli, Paninski, Pillow, & Schwartz, 2004; Paninski, 2004; Truccolo et al., 2005; Paninski, Pillow, & Lewi, 2007). GLMs are fairly natural from a physiological point of view, with close connections to biophysical models such as the integrate-and-fire cell. Consequently, they have been applied in a wide variety of experimental settings (Brillinger, 1988, 1992; Chichilnisky, 2001; Theunissen et al., 2001; Paninski, Shoham, Fellows, Hatsopoulos, & Donoghue, 2004).
3. Representing and Updating the Posterior
Our first computational challenge is representing and updating the posterior distribution on the parameters, θ. We use a fast, sequential procedure for constructing a gaussian approximation of the posterior (see Figure 3). This gaussian approximation leads to an update that is both efficient and accurate enough to be used online for picking optimal stimuli.
A gaussian approximation of the posterior is justified by the fact that the posterior is the product of two smooth, log-concave terms—the GLM likelihood function and the prior (which we assume to be gaussian, for simplicity). As a result, the log posterior is concave (i.e., it always curves downward) and can be well approximated by the quadratic expression for the log of a gaussian. Furthermore, the main result of Paninski (2005) is a central limit-like theorem for optimal experiments based on maximizing the mutual information. This theorem guarantees that asymptotically, the gaussian approximation of the posterior will be accurate.
We fit the log of a gaussian to the approximation of the log posterior in equation 3.2, using the Laplace method (Berger, 1985; MacKay, 2003). This recursive approach is much faster, albeit slightly less accurate, than using the Laplace method to fit a gaussian distribution to the true posterior. The running time of this recursive update is O(d2), whereas fitting a gaussian distribution to the true posterior is O(td3). Since t and d are large, easily O(103), the computational savings of the recursive approach are well worth the slight loss of accuracy. If the dimensionality is low, d = O(10), we can measure the error by using Monte Carlo methods to compute the Kullback-Leibler distance between the true posterior and our gaussian approximation. This analysis (results not shown) reveals that the error is small and rapidly converges to zero.
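The flavor of the recursive update can be sketched as follows. This is a simplified illustration for a Poisson GLM with an exponential link: a generic Laplace-style, rank-one Newton step rather than the paper's exact equations, and all function and variable names are ours.

```python
import numpy as np

def posterior_update(mu, C, x, r, dt=1.0):
    """One recursive Laplace-style update of a gaussian posterior N(mu, C)
    for a Poisson GLM with exponential link, after observing response r
    to stimulus x. A sketch, not the paper's exact update."""
    rho = x @ mu                      # projection of the mean onto x
    lam = np.exp(rho) * dt            # conditional intensity at the mean
    g = r - lam                       # d/d(rho) of the Poisson log-likelihood
    h = lam                           # -d2/d(rho)2 (observed Fisher information)
    Cx = C @ x
    s = x @ Cx                        # sigma_rho^2, the variance along x
    # Rank-one covariance update in Woodbury form: O(d^2), no matrix inverse.
    C_new = C - (h / (1.0 + h * s)) * np.outer(Cx, Cx)
    # Newton-style mean update along the single informative direction x.
    mu_new = mu + g * (C_new @ x)
    return mu_new, C_new
```

Note that the likelihood perturbs the posterior only along the one-dimensional direction x, which is what makes an O(d2) rank-one update possible.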
Asymptotically, under suitable regularity conditions, the mean of our gaussian is guaranteed to converge to the true θ. Consistency can be established by applying theorems for the consistency of estimators based on stochastic gradient descent (Fabian, 1978; Sharia, 2007). We used numerical simulations (data not shown) to verify the predictions of these theorems. To apply these theorems to our update, we must be able to restrict θ to a closed and bounded space. Since any θ corresponding to a neural model would naturally be bounded, this constraint is satisfied for all biologically reasonable GLMs.
Our update uses the Woodbury lemma, which is unstable when Ct is close to being singular. When optimizing under a power constraint (see section 5.2), we can avoid using the Woodbury lemma by computing the eigendecomposition of the covariance matrix. Since we need to compute the eigendecomposition in order to optimize the stimulus, no additional computation is required in this case. When the eigendecomposition was not needed for optimization, we usually found that the Woodbury lemma was sufficiently stable. However, a more stable solution in this case would have been to compute and maintain the Cholesky decomposition of the covariance matrix (Seeger, Steinke, & Tsuda, 2007).
4. Computing the Mutual Information
Before diving into a detailed mathematical computation, we want to provide a less technical explanation of our approach. Before we conduct any trials, we have a set, Θ, of possible models. For any stimulus, each model in Θ makes a prediction of the response. To identify the best model, we should pick a stimulus that maximizes the disagreement between the predictions of the different models. In theory, we could measure the disagreement for any stimulus by computing the predicted response for each model. However, since the number of possible models is large, explicitly computing the response for each model is rarely possible.
We can compute the mutual information efficiently because once we pick a stimulus, it partitions the model space, Θ, into sets that are equivalent with respect to the predicted response. Once we fix the stimulus, the likelihood of the responses varies only with the projection ρt+1 of θ onto the stimulus. Hence, all models with the same value of ρt+1 make the same prediction. Therefore, instead of computing the disagreement among all models in Θ, we only have to compute the disagreement between models in these different subspaces; that is, at most, we have to determine the response for one model in each of the subspaces defined by ρt+1 = const.
Of course, the mutual information also depends on what we already know about the fitness of the different models. Since our experiment provides no information about θ in directions orthogonal to the stimulus, our uncertainty in these directions will be unchanged. Therefore, the mutual information depends only on the information we have about θ in the direction of the stimulus; that is, it depends only on the one-dimensional marginal distribution of ρt+1 instead of our full posterior on θ.
Furthermore, we have to evaluate the mutual information only for nonrandom designs because any optimal design must place all of its mass on the stimulus that maximizes the conditional mutual information (MacKay, 1992; Paninski, 2005). This property means we can focus on the simpler problem of efficiently evaluating the conditional mutual information as a function of the input.
Equation 4.14 is a fairly intuitive metric for rating the informativeness of different designs. To distinguish among models, we want the response to be sensitive to ρt+1. The information increases with the sensitivity because as the sensitivity increases, small differences in ρt+1 produce larger differences in the response, making it easier to identify the correct model. The information, however, also depends on the variability of the responses. As the variability of the responses increases, the information decreases because it is harder to determine which model is more accurate. The Fisher information, D(rt+1, ρt+1), takes into account both the sensitivity and the variability. As the sensitivity increases, the second derivative of the log likelihood increases because the peak of the log likelihood becomes sharper. Conversely, as the variability increases, the log likelihood becomes flatter, and the Fisher information decreases. Hence, D(rt+1, ρt+1) measures the informativeness of a particular response. However, information is valuable only if it tells us something we do not already know. In our objective function, σ2ρ measures our uncertainty about the model. Since our objective function depends on the product of the Fisher information and our uncertainty, our algorithm will favor experiments providing large amounts of new information.
In equation 4.14 we have reduced the mutual information to a two-dimensional integration over ρt+1 and rt+1, which depends on (μρ, σ2ρ). While 2D numerical integration is quite tractable, it could potentially be too slow for real-time applications. A simple solution is to precompute this function before training begins on a suitable 2D region of (μρ, σ2ρ) and then use a lookup table during our experiments.
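The precomputation idea can be sketched as follows. For concreteness this illustration uses the exponential-Poisson case, where the Fisher information is independent of the response, so the 2D integral collapses to a 1D gaussian expectation over ρt+1; the grid ranges, function names, and the exact form of the objective are our own simplifying choices, not the paper's equations.

```python
import numpy as np

def expected_info(mu_rho, s2_rho, f=np.exp, n_grid=201):
    """Approximate the expected informativeness of a stimulus from its 2D
    summary (mu_rho, s2_rho), for a Poisson GLM whose Fisher information
    D(r, rho) = f(rho) does not depend on r (true for the exponential
    link). The product with s2_rho mirrors the information-times-
    uncertainty structure of the objective described in the text."""
    z = np.linspace(-5.0, 5.0, n_grid)
    w = np.exp(-0.5 * z**2)
    w /= w.sum()                          # discretized N(0, 1) weights
    rho = mu_rho + np.sqrt(s2_rho) * z    # rho ~ N(mu_rho, s2_rho)
    return float(np.sum(w * f(rho)) * s2_rho)

# Precompute a lookup table once, before the experiment begins; during
# the experiment, interpolation into `table` replaces the integration.
mu_grid = np.linspace(-3.0, 3.0, 61)
s2_grid = np.linspace(1e-3, 4.0, 41)
table = np.array([[expected_info(m, s) for s in s2_grid] for m in mu_grid])
```

For the exponential link this quadrature should closely match the closed form E[exp(ρ)] = exp(μρ + σ2ρ/2).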
4.1. Special Case: Exponential Nonlinearity.
4.2. Linear Model.
The optimal design for minimizing the posterior entropy of θ for the standard linear model is a well-known result in the statistics and experimental design literature (MacKay, 1992; Chaloner & Verdinelli, 1995). It is enlightening to rederive these results using the methods we have introduced so far and to point out some special features of the standard linear case.
Like the Poisson model with an exponential link function, the linear model's Fisher information is independent of the response. However, for the linear model, the Fisher information is also independent of the model parameters. Since the Fisher information is independent of the parameters, an adaptive design offers no benefit because we do not need to know the parameters to select the optimal input. In contrast, for the Poisson distribution with an exponential link function, the Fisher information depends on the parameters and the input, even though it is independent of the responses. As a result, we can improve our design by adapting it as our estimate of θ improves.
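To make this contrast concrete, a short calculation sketches the two cases in terms of the projection ρ = θᵀx (assuming, for the linear model, gaussian noise with unit variance; these constants are our simplification):

```latex
% Linear-gaussian model: the observed Fisher information is constant.
\log p(r \mid \rho) = -\tfrac{1}{2}(r - \rho)^2 + \mathrm{const},
\qquad
-\frac{\partial^2 \log p}{\partial \rho^2} = 1 .

% Poisson model with exponential link, \lambda = e^{\rho}:
\log p(r \mid \rho) = r\rho - e^{\rho} + \mathrm{const},
\qquad
-\frac{\partial^2 \log p}{\partial \rho^2} = e^{\rho} .
```

In the first case the Fisher information depends on neither r nor ρ, so no stimulus is more informative than another once the prior is fixed; in the second it is independent of r but grows with ρ = θᵀx, so the best stimulus depends on the unknown θ.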
5. Choosing the Optimal Stimulus
We first present a general procedure for handling arbitrary stimulus domains. This procedure selects the optimal stimulus from a finite set of candidate stimuli drawn from the full stimulus domain. Picking the optimal input in this set is easy: we simply compute (μρ, σ2ρ) for each candidate stimulus.
Picking the optimal stimulus in a finite set is flexible and straightforward. The informativeness of the resulting design, however, depends strongly on how the set is constructed. In particular, we want to ensure that, with high probability, the set contains inputs that are nearly optimal over the full domain. If we could optimize over the full domain directly, we could avoid the problem of picking a good candidate set. One case in which we can do this is when the domain is defined by a power constraint; that is, the domain is a sphere. In this case, we can optimize the input over its full domain. Unfortunately, this method cannot be applied to arbitrary input domains.
5.1. Optimizing over a Finite Set of Stimuli.
Our first method simultaneously addresses two issues: how to deal with arbitrary stimulus domains and what to do if the stimulus domain is ill defined. In general, we expect that more efficient procedures for constructing a candidate set could be developed by taking into account the actual stimulus domain. However, a general procedure is needed because efficient algorithms for a particular stimulus domain may not exist, or their development may be complex and time-consuming. Furthermore, for many stimulus domains (e.g., natural images), we have many examples of the stimuli but no quantitative constraints that define the domain. An obvious solution to both problems is to simply choose the best stimulus from a finite set of examples.
The challenge with this approach is picking the candidate set. For the optimization to be fast, the set needs to be sufficiently small. However, we also want to ensure that it contains an optimal or nearly optimal input. In principle, this second criterion means the set should contain a large number of stimuli evenly dispersed over the stimulus domain. We can in fact satisfy both requirements because the informativeness of a stimulus depends only on (μρ, σ2ρ). Consequently, we can partition the stimulus domain into sets of equally informative experiments based on the value of (μρ, σ2ρ). When constructing the candidate set, there is no reason to include more than one input for each value of (μρ, σ2ρ) because all of these inputs are equally informative. Hence, to ensure that the set contains a nearly optimal input, we just need its stimuli to span the two-dimensional space of (μρ, σ2ρ) values, not the much higher-dimensional stimulus space.
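Scoring a finite candidate set therefore reduces to computing the 2D summary for each stimulus. A minimal sketch (the helper names are ours, and `info` stands in for whatever scalar objective is used, such as the precomputed lookup table):

```python
import numpy as np

def summarize(x, mu, C):
    """Map a stimulus x to (mu_rho, sigma2_rho): the posterior mean and
    variance of the projection rho = theta . x under N(mu, C)."""
    return x @ mu, x @ (C @ x)

def pick_best(candidates, mu, C, info):
    """Return the candidate stimulus maximizing info(mu_rho, sigma2_rho).
    `info` is any scalar informativeness objective (hypothetical here)."""
    scores = [info(*summarize(x, mu, C)) for x in candidates]
    return candidates[int(np.argmax(scores))]
```

Each candidate costs one matrix-vector product, so scoring a set of K stimuli is O(K d2) regardless of how the set was generated.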
Although the posterior mean and Ct change with time, these quantities are known when optimizing the stimulus. Hence, the mapping from a stimulus to (μρ, σ2ρ) is known and easy to evaluate for any stimulus. We can use this knowledge to develop simple heuristics for selecting inputs that tend to be dispersed throughout the space of achievable (μρ, σ2ρ) values. We delay until sections 5.3 and 6.1 the presentation of the heuristics that we used in our simulations so that we can first introduce the specific problems and stimulus domains for which these heuristics are suited.
5.2. Power Constraint.
Ideally, we would like to optimize the input over its full domain as opposed to restricting ourselves to a subset of inputs. Here we present a method for doing so when the stimulus domain is defined by a power constraint on the input. This is an important stimulus domain because of its connection to white noise, which is often used to study sensory systems (Eggermont, 1993; Cottaris & De Valois, 1998; Chichilnisky, 2001; Dayan & Abbott, 2001; Wu, David, & Gallant, 2006). Under an i.i.d. design, the stimuli sampled from this domain resemble white noise. The primary difference is that we strictly enforce the power constraint, whereas for white noise, the power constraint applies only to the average power of the input. This domain is also worth considering because it defines a large space that includes many important subsets of stimuli, such as random dot patterns (DiCarlo, Johnson, & Hsiao, 1998).
By solving equations 5.6 and 5.7, we can walk along the curves that define the upper and lower boundaries of the achievable region of σ2ρ as a function of μρ. To move along these curves, we simply adjust the value of the linear constraint. As we walk along these curves, the quadratic constraint ensures that we do not violate the power constraint that defines the stimulus domain.
To maximize and minimize equation 5.8 subject to the quadratic constraint, we use the Karush-Kuhn-Tucker (KKT) conditions. For these optimization problems, it can be proved that the KKT conditions are necessary and sufficient (Fortin, 2000). To compute the boundary of the achievable region as a function of μρ, we need to solve the KKT conditions for each value of μρ. This approach is computationally expensive because for each value of μρ, we need to find the value of the Lagrange multiplier by finding the root of a nonlinear function. We have devised a much faster solution based on computing μρ as a function of the Lagrange multiplier; the details are in appendix A. This approach is faster because to compute μρ as a function of the Lagrange multiplier, we need only find the root of a 1D quadratic expression.
To solve the KKT conditions, we need the eigendecomposition of A. Computing the eigendecomposition of A is the most expensive operation and, in the worst case, requires O(d3) operations. A, however, is a rank 2 perturbation of Ct (see equation A.11). When these perturbations are orthogonal to some of the eigenvectors of Ct, we can reduce the number of computations needed to compute the eigendecomposition of A by using the Gu-Eisenstat algorithm (Gu & Eisenstat, 1994), as discussed in the next section. The key point is that on average we can compute the eigendecomposition in O(d2) time.
5.3. Heuristics for the Power Constraint.
Although we can optimize over the full domain when the stimulus domain is defined by a power constraint, efficient heuristics for picking subsets of stimuli are still worth considering. If the size of the subset is small enough, then computing (μρ, σ2ρ) for each stimulus in the subset is usually faster than optimizing over the entire stimulus domain. Since we can set the size of the subset to any positive integer, by decreasing its size we can sacrifice accuracy, in terms of finding the optimal stimulus, for speed.
We developed a simple heuristic for constructing finite subsets of the stimulus domain by taking linear combinations of the posterior mean and the maximum eigenvector of the covariance matrix. To construct a subset of the closed ball, we use the following procedure:
Generate a random number, ω, uniformly from the interval [ − m, m], where m2 is the stimulus power.
Generate a second random number, ϕ, uniformly from a fixed interval.
Add to the subset the input formed from the linear combination of the normalized posterior mean, weighted by ω, and vmax, weighted by a term depending on ϕ, where vmax is the maximum eigenvector of Cx,t.
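This construction might be sketched as follows. The exact mixing rule for ω and ϕ is our assumption, standing in for the paper's precise formula, but the idea of spanning the plane defined by the posterior mean and the maximum eigenvector is the same:

```python
import numpy as np

def heuristic_candidates(mu, C, m, n, seed=0):
    """Generate n candidate stimuli inside the ball of radius m as random
    linear combinations of the normalized posterior mean and the maximum
    eigenvector of the covariance C. The mixing rule below is an
    assumption, not the paper's exact construction."""
    rng = np.random.default_rng(seed)
    u = mu / np.linalg.norm(mu)           # direction of the MAP estimate
    v = np.linalg.eigh(C)[1][:, -1]       # maximum eigenvector of C
    out = []
    for _ in range(n):
        omega = rng.uniform(-m, m)        # projection along the mean
        phi = rng.uniform(-1.0, 1.0)      # assumed mixing coefficient
        x = omega * u + phi * np.sqrt(m**2 - omega**2) * v
        nx = np.linalg.norm(x)
        if nx > m:                        # keep the stimulus in the ball
            x *= m / nx
        out.append(x)
    return out
```

Because every candidate lies in the 2D plane spanned by the mean and the maximum eigenvector, the set spreads over a wide range of (μρ, σ2ρ) values at negligible cost.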
This procedure tends to produce a set of stimuli that are dispersed throughout the space of achievable (μρ, σ2ρ) values. By varying the projection of each stimulus along the MAP, the heuristic tries to construct a set of stimuli for which the values of μρ are uniformly distributed on the valid interval. Similarly, by varying the projection of each stimulus along the maximum eigenvector, we can adjust the value of σ2ρ for each stimulus. Unfortunately, the subspace of the stimulus domain spanned by the mean and maximum eigenvector may not contain the stimuli that map to the boundaries of the achievable region. Nonetheless, since this heuristic produces stimuli that tend to be dispersed throughout that region, we can usually find a stimulus in the subset that is close to being optimal.
When the mutual information is increasing with σ2ρ, we can easily improve this heuristic. In this case, the optimal stimulus always lies on the sphere of radius m. Therefore, when constructing the stimuli in a finite set, we should pick only stimuli that are on this sphere. To construct such a subset, we use the heuristic above except that we choose the second coefficient so that each stimulus lies exactly on the sphere. Since the mutual information for the exponential Poisson model is increasing with σ2ρ, our simulations for this model always use the sphere rather than the ball.
We could also have constructed subsets of the stimulus domain by uniformly sampling the ball or sphere. Unfortunately, this process produces sets that rarely contain highly informative stimuli, particularly in high dimensions. Since the uniform distribution on the sphere is radially symmetric, the sampled stimuli have zero mean, and their covariance matrix is diagonal with entries m2/d. As a result, the variance of μρ decreases as 1/d, ensuring that for high-dimensional systems, the stimuli in the subset have μρ close to zero with high probability (see Figure 4). Uniformly sampling the ball or sphere therefore does a poor job of selecting stimuli that are dispersed throughout the achievable (μρ, σ2ρ) region. As a result, such a subset is unlikely to contain stimuli close to being maximally informative.
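This concentration effect is easy to check numerically (a small sketch; the dimension, radius, and sample count are arbitrary choices, and we take a unit-norm posterior mean for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 400, 1.0, 5000
mu = np.zeros(d)
mu[0] = 1.0                               # assumed unit-norm posterior mean
# Normalizing gaussian draws yields uniform samples on the sphere of radius m.
x = rng.standard_normal((n, d))
x *= m / np.linalg.norm(x, axis=1, keepdims=True)
proj = x @ mu                             # samples of mu_rho
print(proj.var())                         # ≈ m**2 / d = 0.0025
```

Even with thousands of samples, essentially all of the projections onto the mean sit within a few hundredths of zero once d is in the hundreds.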
5.4. Simulation Results.
We tested our algorithm using computer simulations that roughly emulated typical neurophysiology experiments. The main conclusion of our simulations is that using our information-maximizing (infomax) design, we can reduce by an order of magnitude the number of trials needed to estimate θ (Paninski, 2005). This means we can increase the complexity of neural models without having to increase the number of data points needed to estimate the parameters of these higher-dimensional models. Furthermore, our results show that we can perform the computations fast enough (between 10 ms and 1 s, depending on the dimensionality) that our algorithm could be used online, during an experiment, without requiring expensive or custom hardware.
Our first simulation used our algorithm to learn the receptive field of a visually sensitive neuron. The simulation tested the performance of our algorithm with a high-dimensional input space. We took the neuron's receptive field to be a Gabor function as a proxy model of a V1 simple cell (Ringach, 2002). We generated synthetic responses by sampling equation 2.3 with θ set to a 40 × 40 Gabor patch. The nonlinearity was the exponential function.
Plots of the posterior means (recall these are equivalent to the MAP estimates of θ) for several designs are shown in Figure 5. The results show that all infomax designs do better than an i.i.d. design, and an infomax design that optimizes over the full domain of the input does much better than choosing the best stimulus in a subset constructed by uniformly sampling that domain.
The results in Figures 5 and 6 show that if we choose the optimal stimulus from a finite set, then intelligently constructing the set is critical to achieving good performance. We compared two approaches for creating the set when the stimulus domain is a sphere. The first approach selected a fixed set of stimuli by uniformly sampling the sphere. The second approach constructed a new set for each trial using the heuristic presented in section 5.3. Picking the optimal stimulus in the heuristic set produced much better estimates of θ than picking the optimal stimulus in the uniformly sampled set. In particular, the design using the heuristic set converged to θ nearly as fast as the design that optimized over the full stimulus domain. These results show that constructing a new set on each trial is more efficient than reusing the same set of stimuli for all trials. To achieve comparable results with a fixed set, we would have to increase the number of stimuli by several orders of magnitude. Consequently, the added cost of constructing a new stimulus set after each trial is more than offset by our ability to use fewer stimuli compared to using a constant set of stimuli.
We also compared the infomax designs to the limiting cases where we put all stimulus energy along the mean or maximum eigenvector (see Figures 5 and 6). Putting all energy along the maximum eigenvector performs nearly as well as an i.i.d. design. Our update, equation 3.12, ensures that if the stimulus is an eigenvector of Ct, the updated covariance matrix is the result of shrinking the eigenvalue corresponding to that eigenvector. Consequently, setting the stimulus to the maximum eigenvector ends up scanning through the different eigenvectors on successive trials. The resulting sequence of stimuli is statistically similar to that of an i.i.d. design because the stimuli are highly uncorrelated with each other and with θ. As a result, both methods generate similar marginal distributions of ρt+1, with sharp peaks at 0. Since the Fisher information of a stimulus under the power constraint varies only with ρt+1, both methods pick stimuli that are roughly equally informative. Consequently, both designs end up shrinking the posterior entropy at similar rates.
In contrast, making the stimulus on each trial parallel to the mean leads to a much slower initial decrease of the posterior entropy. Since our initial guess of the mean is highly inaccurate, ρt+1 is close to zero, resulting in a small value for the Fisher information. Furthermore, sequential stimuli end up being highly correlated. As a result, we converge very slowly to the true parameters.
We also evaluated a design that used sinusoidal gratings as the stimuli. In Figure 5, this design produces an estimate of θ that already has the basic inhibitory and excitatory pattern of the receptive field after just 1000 trials. However, on the remaining trials, the estimate improves very little. Figure 6 shows that this design decreases the entropy at roughly the same rate as the i.i.d. design. The reason the coarse structure of the receptive field appears after so few trials is that the stimuli have a large amount of spatial correlation. This spatial correlation among the stimuli induces a similar correlation among the components of the MAP and explains why the coarse inhibitory and excitatory pattern of the receptive field appears after so few trials. However, it also makes it difficult to estimate the higher-resolution features of θ, which is why the estimate does not improve much between 1000 and 5000 trials.
Paninski (2005) obtained results similar to those in Figure 5 using a brute force computation and optimization of the mutual information. The computation in Paninski (2005) was possible only because θ was assumed to be a Gabor function specified by just three parameters (the 2D location of its center and its orientation). Similarly, the stimuli were constrained to be Gabor functions. Our simulations did not assume that θ or the stimulus was a Gabor function; θ could have been any 40 × 40 image with power m2. Attempting to use brute force in this high-dimensional space would have been hopeless. Our results show that a sequential optimal design allows us to perform system identification in high-dimensional spaces that might otherwise be tractable only by making strong assumptions about the system.
The fact that we can pick the stimulus to increase the information about the parameters, θx, that determine the dependence of the firing rate on the stimulus is unsurprising. Since we are free to pick any stimulus, by choosing an appropriate stimulus we can distinguish among different values of θx. Our GLM, however, can also include spike history terms. Since we cannot fully control the spike history, a reasonable question is whether infomax can improve our estimates of the spike history coefficients, θf. Figure 7 shows the results of a simulation characterizing the receptive field of a neuron whose response depends on its past spiking. The unknown parameter vector consists of the stimulus coefficients, θx, which formed a 1D Gabor function, and the spike history coefficients, θf, which were inhibitory and followed an exponential function. The nonlinearity was the exponential function.
The results in Figure 7 show that an infomax design leads to better estimates of both the stimulus and the spike history coefficients. Figure 7 shows the MAPs of both methods on different trials, as well as the mean squared error (MSE) on all trials. In Figure 7, the MSE increases on roughly the first 100 trials because the mean of the prior is zero. The data collected on these early trials tend to increase the magnitude of the estimate. Since the true direction of the parameter vector is still largely unknown, this increase in magnitude tends to increase the MSE.
By converging more rapidly to the stimulus coefficients, the infomax design produces a better estimate of how much of the response is due to the stimulus, which leads to better estimates of the spike history coefficients. The size of this effect is measured by the correlation between the stimulus and spike history coefficients, which is given by Cx,f in equation A.3. Consider a simple example where the first entry of Cx,f is negative and the remaining entries are zero. In this example, the first components of the stimulus and spike history coefficient vectors would be anticorrelated. This value of Cx,f roughly means that the log posterior remains relatively constant if we increase one of these components while decreasing the other. If we knew the value of the first stimulus coefficient, then we would know where along this line of equal probability the true parameters were located. As a result, increasing our knowledge about the stimulus coefficients also reduces our uncertainty about the spike history coefficients.
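The mechanism in this example is just Gaussian conditioning: a negative cross-covariance means that pinning down one coefficient shrinks the posterior variance of the other. A minimal numerical sketch, with illustrative numbers not taken from the paper:

```python
import numpy as np

# Joint Gaussian posterior over two coefficients: a stimulus coefficient k1
# and a spike history coefficient a1, with the negative cross-covariance
# described in the text. All numbers here are illustrative.
cov = np.array([[1.0, -0.6],
                [-0.6, 1.0]])

# Conditioning a Gaussian on k1 shrinks the variance of a1:
#   var(a1 | k1) = var(a1) - cov(a1, k1)^2 / var(k1)
var_a_given_k = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
print(var_a_given_k)  # approximately 0.64, smaller than the prior variance 1.0
```

Learning the stimulus coefficient thus removes 36% of the uncertainty about the spike history coefficient in this toy posterior.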
5.4.1. Running Time.
Our algorithm is suited to high-dimensional, real-time applications because it reduces the exponential complexity of choosing the optimal design to on average quadratic and at worst cubic running time. We verified this claim empirically by measuring the running time for each step of the algorithm as a function of the dimensionality of the stimulus (Figure 8a). These simulations used a GLM with an exponential link function. This nonlinearity leads to a special case of our algorithm because we can derive an analytical approximation of our objective function, equation 4.20, and only a one-dimensional search is required to find the optimal input. These properties facilitate implementation but do not affect the complexity of the algorithm with respect to d. Using a lookup table instead of an analytical expression to estimate the mutual information as a function of (μρ, σ²ρ) would not change the running time with respect to d because the search space is always 2D. Similarly, the increased complexity of a full 2D search compared to a 1D search is independent of d.
The main conclusion of Figure 8a is that the complexity of our algorithm on average grows quadratically with the dimensionality. The solid black line shows a polynomial of degree 2 fitted to the total running time. We also measured the running time of the four steps that make up our algorithm: (1) updating the posterior, (2) computing the eigendecomposition of the covariance matrix, (3) modifying the quadratic form for σ²ρ to eliminate the linear constraint (i.e., finding the eigendecomposition of A in equation A.11), and (4) finding the optimal stimulus. The solid lines indicate fitted polynomials of degree 1 for optimizing the stimulus and degree 2 for the remaining curves. Optimizing the stimulus entails searching along the upper boundary of the feasible region for the optimal pair (μ*ρ, σ²*ρ) and then finding an input that maps to (μ*ρ, σ²*ρ). The running time of these operations scales as O(d) because computing σ²ρ,max as a function of λ requires summing d terms (equation A.17). When the stimulus was 100-dimensional, the total running time was about 10 ms, which is within the range of tolerable latencies for many experiments. Consequently, these results support our conclusion that our algorithm can be used in high-dimensional, real-time applications.
If the perturbation leaves most of our eigenvectors and eigenvalues unchanged, then we can use the Gu-Eisenstat algorithm to compute fewer than d eigenvalues and eigenvectors, thereby achieving on average quadratic running time (Gu & Eisenstat, 1994; Demmel, 1997; Seeger, 2007). Asymptotically, we can prove that the perturbation is correlated with at most two eigenvectors (see section 7). Consequently, asymptotically we need to compute at most two new eigenvectors on each trial. These asymptotic results, however, are not as relevant for the actual running time as empirical results. In Figure 8b, we plot, for one simulation, the number of eigenvectors that are perturbed by the rank 1 modification. On most trials, fewer than d eigenvectors are perturbed by the update. These results rely to some extent on the fact that our prior covariance matrix was white and hence had only one distinct eigenvalue. On each subsequent iteration, we can reduce the multiplicity of this eigenvalue by at most one. Our choice of prior covariance matrix therefore helps us manage the complexity of the eigendecomposition.
6. Important Extensions
In this section we consider two extensions of the basic GLM that expand the range of neurophysiology experiments to which we can apply our algorithm: handling nonlinear transformations of the input and dealing with time-varying parameters. In both cases, our method for picking the optimal stimulus from a finite set requires only slight modifications. Unfortunately, our procedure for picking the stimulus under a power constraint will not work if the input is pushed through a nonlinearity.
6.1. Input Nonlinearities.
NLN models are frequently used to explain how sensory systems process information. In vision, for example, MT cells can be modeled as a GLM whose input is the output of a population of V1 cells (Rust et al., 2006). In this model, V1 is modeled as a population of tuning curves whose output is divisively normalized. Similarly in audition, cochlear processing is often represented as a spectral decomposition using gammatone filters (de Boer & de Jongh, 1978; Patterson et al., 1992; Lewicki, 2002; Smith & Lewicki, 2006). NLN models can be used to model this spectral decomposition of the auditory input, as well as the subsequent integration of information across frequency (Gollisch, Schutze, Benda, & Herz, 2002). One of the most important NLN models in neuroscience is the energy model. In vision, energy models are used to explain the spatial invariance of complex cells in V1 (Adelson & Bergen, 1985; Dayan & Abbott, 2001). In audition, energy models are used to explain frequency integration and phase insensitivity in auditory processing (Gollisch et al., 2002; Carlyon & Shamma, 2003).
Using the functions, Wi, to project the input into feature space does not affect our strategy for picking the optimal stimulus from a finite set. We simply have to apply the functions Wi to each stimulus before computing the mutual information. Our solution for optimizing the stimulus under a power constraint, however, no longer works for two reasons. First, a power constraint on the stimulus does not in general translate into a power constraint on its feature-space representation. As a result, we cannot use the algorithm of section 5.2 to find the optimal feature-space values. Second, even if we could find the optimal feature-space values, we would need to invert the feature mapping to find the actual stimulus. For many nonlinearities, the energy model being one example, this mapping is not invertible.
To estimate the parameters of an energy model, we use our existing update method to construct a gaussian approximation of the posterior in feature space. We can then use the MAP to estimate the input filters. The first step is to rearrange the terms of the posterior mean as a matrix. We then estimate the input filters by computing the singular value decomposition (SVD) of this matrix. If the estimate converges to the true value, then the subspace corresponding to its nonzero singular values should equal the subspace spanned by the true filters.
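The SVD-based recovery can be illustrated on an idealized quadratic form. The sketch below assumes the rearranged mean matrix has converged exactly to the sum of outer products of two orthonormal filters; in that case the singular vectors with nonzero singular values span the true filter subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
# Two orthonormal "true" input filters (hypothetical stand-ins for the real ones).
k1, k2 = np.linalg.qr(rng.standard_normal((d, 2)))[0].T

# Idealized converged estimate: the quadratic-form matrix of the energy model.
Q = np.outer(k1, k1) + np.outer(k2, k2)

# SVD of Q: the number of nonzero singular values reveals the number of filters,
# and the corresponding singular vectors span the filter subspace.
U, s, _ = np.linalg.svd(Q)
rank = int(np.sum(s > 1e-10))
print(rank)  # 2

# Projecting k1 onto the recovered subspace returns k1 itself.
proj = U[:, :rank] @ (U[:, :rank].T @ k1)
print(np.allclose(proj, k1))  # True
```

In practice the estimated matrix is noisy, so the "nonzero" singular values are identified by thresholding, but the subspace recovery works the same way.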
1. We randomly pick an eigenvector of our current estimate of the quadratic-form matrix, with the probability of picking each eigenvector being proportional to the relative energy of the corresponding eigenvalue.
2. We pick a random number, ω, by uniformly sampling the interval [−m, m], where m² is the maximum allowed stimulus power.
3. We choose a direction orthogonal to the chosen eigenvector by uniformly sampling the d − 1 unit sphere orthogonal to it.
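The steps above can be sketched in a few lines. The final combination, placing ω units of amplitude along the chosen eigenvector and the remaining power along the orthogonal direction so that the stimulus power is exactly m², is our reading of the heuristic, not a verbatim transcription of the original implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def heuristic_stimulus(eigvecs, eigvals, m):
    """One draw from the stimulus heuristic (a sketch under stated assumptions)."""
    d = eigvecs.shape[0]
    # 1. Pick an eigenvector with probability proportional to its eigenvalue energy.
    p = np.abs(eigvals) / np.abs(eigvals).sum()
    v = eigvecs[:, rng.choice(d, p=p)]
    # 2. Sample omega uniformly on [-m, m].
    omega = rng.uniform(-m, m)
    # 3. Sample a direction orthogonal to v, uniformly on the unit sphere.
    u = rng.standard_normal(d)
    u -= (u @ v) * v
    u /= np.linalg.norm(u)
    # Combine so the total stimulus power is exactly m^2 (our assumption).
    return omega * v + np.sqrt(m ** 2 - omega ** 2) * u

# Hypothetical eigendecomposition of the estimated quadratic form.
vecs = np.linalg.qr(rng.standard_normal((8, 8)))[0]
vals = np.arange(1.0, 9.0)
s = heuristic_stimulus(vecs, vals, m=2.0)
print(np.isclose(s @ s, 4.0))  # power constraint holds: True
```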
This heuristic works because for the energy model, ρt+1 measures the energy of the stimulus in feature space. For this model, feature space is defined by the eigenvectors of Q. Naturally, if we want to increase ρt+1, we should increase the energy of the stimulus along one of the basis vectors of feature space. The eigenvectors of our estimate of Q are our best estimate for the basis vectors of feature space. Hence, μρ, the expected value of ρt+1, varies linearly with the energy of the input along each eigenvector (equation 6.7).
The effectiveness of our heuristic is illustrated in Figure 10. This figure illustrates the mapping into (μρ, σ²ρ) space of stimulus sets constructed using our heuristic and of stimulus sets produced by uniformly sampling the sphere. Our heuristic produces a set of stimuli that is more spread out over the range of μρ and consequently contains more informative stimuli than the uniformly sampled set.
6.1.1. Auditory Simulation.
We applied these estimation and optimization procedures to a simulation of an auditory neuron. We modeled the neuron using an energy model. For simplicity, our hypothetical neuron received input from just two neurons in earlier layers. We modeled these input neurons as gammatone filters that were identical except for a 90 degree difference in phase (de Boer & de Jongh, 1978; Patterson et al., 1992). We generated spikes by sampling a conditional Poisson process whose instantaneous, conditional firing rate was set by equation 6.2, with the two linear filters being the gammatone filters and f(ρt+1) = exp(ρt+1). We estimated the parameters, Q, using an i.i.d. design and two infomax designs. The i.i.d. design uniformly sampled the stimulus from the sphere of radius m. The two infomax designs picked the optimal stimulus in a subset of stimuli drawn from the sphere. In one case, this set was constructed using our heuristic, while in the other case, it was constructed by uniformly sampling the sphere.
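The generative side of this simulation is straightforward to sketch. The code below uses a decaying sine/cosine quadrature pair as a simplified stand-in for the true gammatone pair (the 90 degree phase difference is the essential feature), and caps the firing rate for numerical safety, which is our addition:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
t = np.arange(d)
# Quadrature pair standing in for the two gammatone filters (90-degree phase shift).
k1 = np.cos(2 * np.pi * t / 16) * np.exp(-t / 32)
k2 = np.sin(2 * np.pi * t / 16) * np.exp(-t / 32)

def rate(x, rmax=100.0):
    # Energy model: rho is the summed squared projection onto the two filters,
    # pushed through f(rho) = exp(rho); capped at rmax for numerical safety.
    rho = (k1 @ x) ** 2 + (k2 @ x) ** 2
    return min(np.exp(rho), rmax)

def spike_count(x, dt=0.01):
    # Conditional Poisson spiking given the stimulus in a bin of width dt.
    return rng.poisson(rate(x) * dt)

x = rng.standard_normal(d)
x *= 2.0 / np.linalg.norm(x)       # enforce the power constraint ||x||^2 = m^2 = 4
n = spike_count(x)
```

Because ρ depends on the stimulus only through the two squared projections, any phase-shifted version of the stimulus within the filter subspace yields the same rate, which is exactly the phase insensitivity the energy model is meant to capture.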
The results of our simulations are shown in Figure 11. When finding the MAP estimate, we restricted the parameters such that the corresponding matrices were symmetric but not necessarily rank 2. The rank 2 restriction is unnecessary because the number of linear filters can be recovered from the number of nonzero singular values of the estimated matrix. To show how well the true gammatone filters can be estimated from the principal components of the estimated matrix, we show in Figure 11 the reconstruction of the two filters using the first two principal components; that is, the linear combination of the projections of each filter along the first two principal components.
Figures 11 and 12 show that when the optimal stimulus in the heuristically constructed set is picked, the MAP converges more rapidly to the true gammatone filters. In Figure 11, the design that uses pure tones as the inputs appears to produce good estimates of the filters. These results, however, are somewhat misleading. Since these inputs are restricted to tones, the inputs that cause the neuron to fire are highly correlated. As a result, the estimated receptive field is biased by the correlations in the input. Since gammatone filters are similar to sine waves, this bias means that using pure tones will rapidly produce a coarse estimate of the gammatone filters. However, since the pure tones are highly correlated, it is difficult to remove these correlations from the estimated receptive field and resolve the finer structure of the filters. This behavior is evident in Figure 12, which shows that after 1000 trials, the MSE for the pure tones design does not decrease as fast as for the alternative designs.
Also evident in the infomax results is the exploitation-exploration trade-off (Kaelbling, Littman, & Moore, 1996). To increase the information about one of the filters, we need to pick stimuli that are correlated with that filter. Since the input filters are orthogonal and the stimulus power is constrained, we can effectively probe only one filter at a time. The exploitation-exploration trade-off explains why on trials 100 to 500, the estimate of the first filter improves much more than that of the second filter. On these trials, the algorithm exploits its knowledge of the first filter rather than searching for other filters. After roughly 500 trials, exploring becomes more rewarding than exploiting our estimate of the first filter. Hence, the infomax design picks stimuli orthogonal to the first gammatone filter, which eventually leads to our finding the second filter.
6.2. Time-Varying Parameters.
To update the posterior and choose the optimal stimulus, we use the procedures described in sections 3 and 5. The only difference due to time-varying parameters is that the posterior covariance matrix on one trial is in general no longer just a rank 1 modification of the covariance matrix on the previous trial. Therefore, we cannot use the rank 1 update to compute the eigendecomposition. However, since we may not have any a priori knowledge about the direction of changes in the parameters, it is often reasonable to assume the perturbation has mean zero and a white covariance matrix, Π = cI. In this case, the eigenvectors of Ct + Π are those of Ct, and the eigenvalues are ci + c, where ci is the ith eigenvalue of Ct; in this case, our methods may be applied without modification. In cases where we expect that the parameters vary systematically, we could try to model those dynamics more accurately by selecting an appropriate mean and covariance matrix for the perturbation.
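The white-diffusion case is easy to verify numerically: adding cI shifts every eigenvalue by c and leaves the eigenvectors untouched, so the low-rank eigendecomposition machinery carries over unchanged. A minimal check:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 20
A = rng.standard_normal((d, d))
C_t = A @ A.T / d                  # an arbitrary posterior covariance at trial t
c = 0.1                            # diffusion variance of the random walk on the parameters

vals, vecs = np.linalg.eigh(C_t)
vals_new, vecs_new = np.linalg.eigh(C_t + c * np.eye(d))

# Every eigenvalue shifts by exactly c; the eigenvectors are unchanged.
print(np.allclose(vals_new, vals + c))  # True
```

This also makes the floor on our uncertainty explicit: no matter how informative the stimuli are, each trial adds c back to every eigenvalue, so the posterior entropy cannot shrink below the level set by Π.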
Figure 13 shows the results of using an infomax design to fit a GLM to a neuron whose receptive field drifts nonsystematically with time. The receptive field was a one-dimensional Gabor function whose center moved according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). Although only the center of the receptive field moved, we still modeled the changes using equation 6.10. The results demonstrate the benefits of using an infomax design to estimate a time-varying receptive field. Although we cannot reduce our uncertainty below a level determined by Π, the infomax design can still improve our estimate compared to using random stimuli.