## Abstract

Models of neural responses to stimuli with complex spatiotemporal correlation structure often assume that neurons are selective for only a small number of linear projections of a potentially high-dimensional input. In this review, we explore recent modeling approaches where the neural response depends on the quadratic form of the input rather than on its linear projection, that is, the neuron is sensitive to the local covariance structure of the signal preceding the spike. To infer this quadratic dependence in the presence of an arbitrary (e.g., naturalistic) stimulus distribution, we review several inference methods, focusing in particular on two information theory–based approaches (maximally informative stimulus energy and maximization of noise entropy) and two likelihood-based approaches (Bayesian spike-triggered covariance and extensions of generalized linear models). We analyze the formal relationship between the likelihood-based and information-based approaches to demonstrate how they lead to consistent inference. We demonstrate the practical feasibility of these procedures using model neurons responding to a flickering variance stimulus.

## 1. Introduction

A basic challenge in sensory neuroscience has been to develop concise descriptions of how neurons encode and transmit information about the stimulus. Models that attempt to capture the essence of this neural computation—the transformation of stimuli into spiking responses—without necessarily being derived from an underlying dynamical or biophysical model of neural function are called functional models (see Wu, David, & Gallant, 2006 for an in-depth review; also Agüera y Arcas, Fairhall, & Bialek, 2003; Agüera y Arcas & Fairhall, 2003; Hong, Agüera y Arcas, & Fairhall, 2007; Lundstrom, Hong, & Fairhall, 2008; Ostojic & Brunel, 2011, for papers that link functional to dynamical models). As a consequence, functional models are usually fully learned from data, and their success depends critically on two factors: whether typical electrophysiological recordings can provide adequate data for successful inference of the model's parameters and whether effective inference algorithms exist for these parameters. Within these limitations, functional models have dramatically influenced our view of early sensory processing by mathematically formalizing the notions of the receptive field and of linear and higher-order stimulus sensitivity (captured by filtering operations of matching order performed on the stimulus), as well as the subsequent neural computations leading to the generation of a spike (captured by the nonlinearities operating on filter outputs).

In the functional modeling framework, the responses of many sensory neurons can be well characterized by assuming that the initial transformation of the stimulus is a linear filtering operation, that is, that the response of the neuron depends on only a single projection of the (possibly high-dimensional) stimulus **s**(*t*) onto the neuron's linear filter **k** (see Figure 1a). This view has been so successful that we tend to use the terms *filter* and *receptive field* interchangeably. For some neurons, however, their description in terms of a single linear filter is insufficient. One of the best-known examples is that of a complex cell in the primary visual cortex. Complex cells are characterized by the invariance of their responses to changes in the phase of the stimulus: their response remains constant as **s**(*t*) is changed into −**s**(*t*) by flipping dark regions of the stimulus into bright ones and vice versa. The simplest way in which such an invariance could be captured mathematically would be to assume that the stimulus **s** enters the neural response squared rather than at linear order. In other words, the stimulus sensitivity of complex cells is quadratic; it depends on the quadratic form **s**^{T}**Q** **s**, where **Q** is a quadratic filter specific to each cell (see Figure 1c).

Even neurons that are adequately characterized by a single linear filter when probed by relatively simple stimuli could modulate their responses strongly when high-order features of the stimulus change. For example, while retinal ganglion cells exhibit strong center-surround filters, they also change the gain of their responses with changing contrast, a second-order stimulus feature, within their receptive fields. Similar cases, where the stimulus sensitivity is purely quadratic (e.g., non-phase-locked auditory models or some motion-sensitive neurons), or where quadratic features like the shape of the signal envelope have a strong modulatory effect, are abundant in the sensory periphery. Additionally, response phenomena beyond phase invariance in the visual cortex, grouped together as relating to the nonclassical receptive field, could also be manifestations of quadratic or high-order sensitivity (Zetzsche & Nuding, 2005). This has recently sparked a lot of interest in developing generic and tractable methods for the functional characterization of neural responses where the stimulus sensitivity could be as high as second order.

In this review, we focus on neural models with quadratic stimulus sensitivity and the corresponding inference methods, emphasizing recent ones that can be applied to any stimulus ensemble, including fully naturalistic movies, and also to cases where the quadratic filter **Q** is a full-rank matrix. We start with a brief overview of methods for inferring linear stimulus sensitivity in section 2, which we extend to the discussion of models with multiple linear features in section 3. We show how quadratic stimulus dependence could arise as a special case of neural models with multiple linear features and present biologically motivated examples of quadratic stimulus sensitivity in section 4. We then review several complementary approaches that can be used to learn quadratic stimulus dependence even when neurons are responding to rich, naturalistic stimuli: we discuss the maximally informative stimulus energy (Rajan & Bialek, 2012) and the maximization of noise entropy (Fitzgerald, Rowekamp, Sincich, & Sharpee, 2011; Fitzgerald, Sincich, & Sharpee, 2011; Globerson, Stark, Vaadia, & Tishby, 2009) in sections 5.1 and 5.2 followed by the Bayesian spike-triggered covariance method (Park & Pillow, 2011) and related extensions of generalized linear models to include quadratic stimulus dependence in section 5.3.^{1} We show the conditions under which information-theoretic and likelihood-based approaches lead to consistent inference in the appendix.

## 2. Receptive Fields and Linear Stimulus Dependence

The space of all possible stimuli and the space of all possible neural responses are vast. Consider, for instance, all possible image sequences incident on the retina or sets of output spike trains. Our progress in building functional models must therefore depend on making well-chosen simplifying assumptions. An example of extreme simplification involves varying the stimulus along a single dimension, as in the case of the orientation or wavelength of a drifting grating visual stimulus, and representing the output by a single scalar quantity, like the average firing rate in a chosen time bin. These measurements have traditionally been represented in terms of tuning curves and have provided basic insights into principles of sensory (and population) coding (Dayan & Abbott, 2001). However, the relevance of the tuning curve approach is limited by the choice of the dimension along which the stimulus is manipulated, which may drastically underestimate the complexity in the structure of the stimuli to which the neuron actually responds. Despite these limitations, such studies have helped establish the concept of a receptive field, the region of stimulus space where changes in the stimulus modulate the spiking behavior of the neuron.

Central to the receptive field concept is the notion of locality in the stimulus or feature space. For instance, a ganglion cell in the retina may be sensitive only to specific changes in light intensity that occur within a small visual angle (Hartline, 1940). A productive way of capturing this notion of locality has been to think of a receptive field as one or more filters that act on the stimulus; only stimulus variations that result in measurable changes in the filter output have the ability to affect the neural response. In this view, neurons perform dimensionality reduction by projecting the stimulus down into a small number of features. Consequently, the success of data analysis techniques built around this idea must depend on whether a small number of features or filters suffices to fully account for the neuron's sensitivity and its response properties.

Methods based in systems identification theory have provided systematic procedures to infer both the receptive fields of neurons as well as subsequent computations (see Table 1 for an overview of various functional models and related inference techniques). These techniques usually share two key features. One is that they can (and sometimes must) be used with stimuli that sample the stimulus space broadly, making no explicit assumptions about which stimulus features are important. This is in contrast to the restricted stimuli employed for measuring tuning curves. The other is that the procedures usually involve a series of approximations that can provably yield a better description of the system if more data are available. Among the earliest to be used successfully, Wiener and Volterra expansions helped identify the first- and second-order kernels mapping the stimulus to response time traces in various systems (Marmarelis & Marmarelis, 1978; Recio-Spinoso, Temchin, van Dijk, Fan, & Ruggero, 2005; Sakai, 1992; Schetzen, 1989; Victor & Knight, 1979; Wiener, 1958). However, in many cases, the strong intrinsic nonlinearities underlying spike generation require a large number of terms in Wiener-Volterra expansions, even though the underlying stimulus sensitivity might be much simpler and therefore of low order.^{2} Models in which the (possibly linear) projections of the stimulus in the receptive field were decoupled from the nonlinearities underlying spike generation, as in linear-nonlinear (LN) architectures illustrated in Figure 1, made further progress feasible.

Table 1: Overview of functional models and related inference techniques.

|   | Method | Stimulus Type | Models and Restrictions | References |
|---|---|---|---|---|
| A | Wiener/Volterra expansion | Gaussian white noise, sum of sinusoids | | Marmarelis & Marmarelis, 1978; Schetzen, 1989; Wiener, 1958; Recio-Spinoso et al., 2005; Victor & Knight, 1979 |
| B | Spike-triggered average (STA) (reverse correlation) | Spherically symmetric, binary noise, m-sequences | LN (single filter), isolated spikes | de Boer & Kuyper, 1968; Paninski, 2003; Reid, Victor, & Shapley, 1997; Schwartz, Pillow, Rust, & Simoncelli, 2006; Simoncelli, Paninski, Pillow, & Schwartz, 2004 |
|   | Debiased STA (reverse correlation) | “Gaussian-like” asym. 1-point histogram | LN (single filter), isolated spikes | Lesica, Ishii, Stanley, & Hosoya, 2008 |
|   | Differential reverse correlation (dRC) (reverse correlation) | Spike-triggering snippet | Linear feature that predicts spike timing | Tkačik & Magnasco, 2008 |
|   | (maximum likelihood) | Any | Leaky integrate-and-fire (LIF/LN-LIF) | Gerstner & Kistler, 2002; Paninski et al., 2004; Pillow, 2007 |
|   | Error function minimization (general fitting methods) | Any | Dynamical extensions of LN (Keat, LNK) | Keat et al., 2001; Ozuysal & Baccus, 2012 |
|   | Generalized linear models (GLM) (maximum likelihood) | Any | Point process (dependence on past spiking; effect of other neurons) | Paninski, 2004; Pillow et al., 2008; Truccolo, Eden, Fellows, Donoghue, & Brown, 2004; Gerwinn, Macke, & Bethge, 2010; Pillow, 2007 |
|   | Isoresponse mapping | Synthetic stimuli (parameterizable, low-D) | LNLN cascade | Bölinger & Gollisch, 2012; Gollisch & Herz, 2005 |
| C | Generalized linear models (GLM) (with multiple nonlinear stimulus features) | Any | Point process (dependence on past spiking) | Gerwinn, Macke, Seeger, & Bethge, 2008 |
|   | Maximally informative dimensions (MID) (info maximization) | Any | LN (multiple filters) | Kouh & Sharpee, 2009; Sharpee et al., 2006; Sharpee, Rust, & Bialek, 2004 |
|   | Extended projection pursuit regression (ePPR) | Any | LN (multiple filters) | Rapela, Felsen, Touryan, Mendel, & Grzywacz, 2010 |
|   | Spike-triggered covariance (STC) (reverse correlation) | Gaussian | LN (multiple filters), isolated spikes | de Ruyter van Steveninck & Bialek, 1988; Bialek & de Ruyter van Steveninck, 2005; Simoncelli et al., 2004; Fairhall et al., 2006; Maravall, Petersen, Fairhall, Arabzadeh, & Diamond, 2007; Schwartz et al., 2002 |
|   | iSTAC (parametric MID with conditional gaussian model) | Gaussian | LN (multiple filters) | Pillow & Simoncelli, 2006 |
| D | Maximization of noise entropy (convex optimization) | Any | | Fitzgerald, Sincich et al., 2011; Fitzgerald, Rowekamp et al., 2011; Globerson et al., 2009 |
|   | Bayesian STC/quadratic GLM (likelihood maximization) | Any | Additive linear and quadratic contributions | Park & Pillow, 2011 |
|   | Maximally informative stimulus energy (MISE) (info maximization) | Any | General quadratic model | Rajan & Bialek, 2012 |


Notes: The table is organized into four blocks: (A) Wiener-Volterra expansions, followed by (B) methods with linear stimulus sensitivity, (C) methods designed for inferring multiple stimulus features, and (D) methods explicitly designed to capture quadratic stimulus dependence. *r* is the firing rate or the probability of spiking; **k** are linear filters acting on stimulus clips **s**; **Q** is a matrix representing a quadratic kernel; a separate linear filter operates on the sequence of past spikes **y**; *f* and *g* denote arbitrary nonlinear functions; ∗ denotes a convolution; Θ(·) is a thresholding operation (1 when the argument crosses some threshold from below, 0 otherwise); η(*t*) is a white noise Langevin force. In this review, we use the term *gaussian* to denote stimuli whose components are jointly gaussian and possibly correlated (i.e., nonwhite), unless otherwise noted. While reverse correlation methods are formally simpler for uncorrelated (white) gaussian noise, it is possible to generalize them for use with correlated noise ensembles. For example, to compute an unbiased estimate of the linear (L) part of the model using STA and a correlated stimulus, we need to correct for stimulus correlations by multiplying the spike-triggered average with the inverse covariance matrix. For an extensive review of spike-triggered (reverse correlation) methods, see Schwartz et al. (2006).
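The correlation correction mentioned in the notes amounts to a single linear solve. Below is a sketch under invented assumptions (a made-up smooth filter, a gaussian stimulus with exponential covariance, and an LN neuron with exponential nonlinearity, for which the raw STA is proportional to **Ck** rather than **k**):

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 15, 200_000

# Invented smooth "true" filter.
k_true = np.exp(-((np.arange(N) - 4.0) ** 2) / 4.0)
k_true /= np.linalg.norm(k_true)

# Correlated gaussian stimulus with covariance C_ij = exp(-|i - j| / 3).
idx = np.arange(N)
C_true = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 3.0)
S = rng.standard_normal((T, N)) @ np.linalg.cholesky(C_true).T

# LN neuron with exponential nonlinearity; for gaussian stimuli the STA of
# such a model points along C k, not k.
p = 0.05 * np.exp(S @ k_true - 1.0)
spikes = rng.random(T) < np.clip(p, 0.0, 1.0)

C_hat = np.cov(S, rowvar=False)         # stimulus covariance estimate
sta = S[spikes].mean(axis=0)            # raw STA: biased toward C k
sta_corr = np.linalg.solve(C_hat, sta)  # multiply by inverse covariance

def align(a, b):
    """Absolute cosine similarity between two filters."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(align(sta, k_true), 2), "->", round(align(sta_corr, k_true), 2))
```

The corrected estimate aligns with the true filter noticeably better than the raw STA, at the cost of amplifying noise along poorly sampled stimulus directions (which is why regularized inverses are often used in practice).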

LN and LN-like models have been used widely and profitably to predict the firing rate traces of single sensory neurons, because their parameters can be easily inferred under suitable conditions. However, the more intriguing cases are the ones where LN models perform poorly or fail entirely. One such failure mode is the inability to account for the statistics of neural activity beyond the mean firing rate. Specifically, real sensory neurons often have variability that is smaller than that attributed to Poisson processes (de Ruyter van Steveninck, Lewen, Strong, & Bialek, 1997); phenomena like refractoriness and spike frequency adaptation are not captured by LN models (Berry & Meister, 1998), and in neural populations, uncoupled LN models fail to reproduce the basic covariance structure of neural activity (Granot-Atedgi, Tkačik, Segev, & Schneidman, 2012; Pillow et al., 2008; Schneidman, Berry, Segev, & Bialek, 2006). Some of these issues can be addressed by adding suitable dynamical complexity beyond the linear filtering stage to make the nonlinearities in spike generation more realistic (Keat, Reinagel, Reid, & Meister, 2001; Paninski, Pillow, & Simoncelli, 2004) or by including interactions between neurons in models of neural firing (Granot-Atedgi et al., 2012; Pillow et al., 2008).

A different kind of failure of LN models rests on the assumption that stimulus sensitivity occurs through a single linear projection (or a small number of them). One example is contrast adaptation, where a simple LN model derived from a white noise stimulus of a certain variance fails to predict the response to a stimulus with smaller or larger variance (Baccus & Meister, 2002; Borst & Egelhaaf, 1987; van Hateren, 1992; de Ruyter van Steveninck, Zaagman, & Mastebroek, 1986; Smirnakis, Berry, Warland, Bialek, & Meister, 1997). Other examples include the failure to account for the sensitivity of retinal ganglion cells to fine spatial detail (possibly because of nonlinear summation within the receptive field; Demb, Zaghloul, Haarsma, & Sterling, 2001; Schwartz et al., 2012), or to stimulus motion (Berry, Brivanlou, Jordan, & Meister, 1999; Chen et al., 2012; Gollisch & Meister, 2010; Schwartz, Taylor, Fisher, Harris, & Berry, 2007). Generally these difficulties emerge clearly when the stimulus statistics change or increase in complexity beyond those used to infer the model, for instance, by becoming more “naturalistic”—that is, having pairwise temporal and spatial correlation, skewed first-order histograms, or statistical structure beyond second order.

The problems with LN models can generally be addressed in two possible ways. In the first, LN models can be extended to account for a particular phenomenon on a specific stimulus, for example, by adding a contrast gain control mechanism (Schwartz & Simoncelli, 2001; Schwartz, Chichilnisky, & Simoncelli, 2002) or by an ad hoc rescaling of nonlinearities (Brenner, de Ruyter van Steveninck, & Bialek, 2000) to account for contrast adaptation in an experiment where the variance of a gaussian input is modulated. The second approach is to find the complete (or close to complete) set of features to which the neuron responds by examining the neural responses to relatively rich noise stimuli or fully naturalistic stimuli. Noise stimulation (e.g., white noise) is analytically convenient and can provide easily obtainable unbiased estimates of linear filters, but it remains highly unnatural. While ethologically more relevant, fully natural stimuli can lead to technical obstacles in model inference, mainly due to the statistical intractability of the natural ensemble (Geisler, 2008; Simoncelli & Olshausen, 2001). The choice of the appropriate stimulus deserves a lengthier discussion (e.g., Rust & Movshon, 2005), which is beyond the scope of this review. We do, however, wish to emphasize two points. First, it has been shown that under conditions of naturalistic stimulation, even basic filter responses of cells can change (Sharpee et al., 2006), and response mechanisms that are intractable via noise stimulation become engaged (e.g., Olveczky, Baccus, & Meister, 2007). As a consequence, finding the complete set of features that characterize the neural response across different stimulus ensembles is an elusive goal, and in practice we are often satisfied with results specific to one rich stimulus type. 
The second point is methodological: the applicability of some inference techniques is restricted to special stimulus types, while others permit unbiased inference with arbitrary stimuli, a distinction we make explicit in the second column of Table 1.

To characterize the sensitivity of a neuron to the selected rich stimulus ensemble more fully, one can look for multiple linear filters. A number of approaches exist for this task (see Table 1). While some (e.g. spike-triggered covariance, STC) permit us to identify several relevant stimulus dimensions, understanding how the corresponding stimulus projections influence spiking output can be difficult unless we make further simplifying assumptions. One possible anatomically motivated simplification of a multifeature LN model is a cascade LN (an LNLN) model, where the nonlinearly transformed filter outputs are linearly summed and passed through a spike-generating nonlinearity. Despite some successes (Bölinger & Gollisch, 2012; Gollisch & Herz, 2005; Schwartz et al., 2012), the general problem of inferring cascading models with multiple linear filters remains technically challenging (usually involving difficult optimizations). A somewhat simpler LNL system has proven to account for the behavior of the Y-type retinal ganglion cells very well and is tractable to infer using a sum-of-sinusoids version of the Wiener formalism (Victor & Knight, 1979; Victor & Shapley, 1979, 1980).

Models in which the stimulus sensitivity is quadratic rather than linear are of particular biological significance, as we briefly touched on in section 1. Quadratic stimulus dependence is formally a restricted case of LNLN models, which in turn are an instance of multifeature LN models. Biologically motivated quadratic stimulus dependence thus provides exactly the mathematical restriction of the very general multifeature LN model that makes the inference problem tractable. Next we briefly introduce multifeature LN models and then focus specifically on the issue of quadratic stimulus dependence.

## 3. Multiple Linear Features

In a typical experiment, a neuron can be driven by a synthetic stimulus containing any desired statistical structure. For probing the visual system, for example, this stimulus might be a random binary checkerboard, a drifting grating, or full-field light intensity flicker. If the neuron's response depends solely on the stimulus presented in the recent past of duration *T* (and possibly on its own previous spiking behavior), we can restrict our attention to stimulus clips **s** of length *T*. These clips are drawn from a distribution *P*(**s**) that characterizes the stimulus; the *N* components of the vector **s** represent successive stimulus values in time and optionally across space. Our task is then to infer the dependence of the instantaneous spiking probability (firing rate) at time *t*, *r*(*t*), on the stimulus **s**(*t*) presented just prior to *t*.

If the neuron is well described by an LN model, where the spiking rate *r* is an arbitrary positive, point-wise nonlinear function *f* of the stimulus projected onto the filter, *r*(*t*) = *f*(**k** · **s**), and the stimulus distribution is chosen to be spherically symmetric, *P*(**s**) = *P*(|**s**|), we can use the spike-triggered average (STA) to obtain an unbiased estimate of the single linear filter **k** (de Boer & Kuyper, 1968; Simoncelli et al., 2004). Spike-triggered covariance (STC) generalizes the filter inference to cases where the firing rate depends nonlinearly on *K* projections of the stimulus, *r*(*t*) = *f*(**k**_{1} · **s**, …, **k**_{K} · **s**) (de Ruyter van Steveninck & Bialek, 1988). The number of relevant linear filters *K* is equal to the number of nonzero eigenvalues of the spike-triggered covariance matrix. A successful application of STC requires *P*(**s**) to be gaussian. STC has been used successfully, for example, to understand the computations performed by motion-sensitive neurons of the blowfly (Bialek & de Ruyter van Steveninck, 2005), map out the sensitivity to full-field flickering stimuli in salamander retinal ganglion cells (Fairhall et al., 2006), explore contrast gain control (Rust, Schwartz, Movshon, & Simoncelli, 2004; Schwartz et al., 2002), and understand adaptation in the rodent barrel cortex (Maravall et al., 2007).^{3}
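As a concrete illustration (not from the original studies; the quadrature-like filter pair, sigmoid nonlinearity, and all parameters are invented for the example), the following sketch simulates a phase-invariant two-filter neuron driven by gaussian white noise: the STA vanishes, while the leading eigenvectors of the spike-triggered covariance difference recover the relevant two-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20, 200_000            # stimulus dimension, number of time bins

# Invented quadrature-like filter pair of a phase-invariant model neuron.
k1 = np.sin(2 * np.pi * np.arange(N) / N)
k2 = np.cos(2 * np.pi * np.arange(N) / N)
k1 /= np.linalg.norm(k1)
k2 /= np.linalg.norm(k2)

# Gaussian white-noise stimulus clips, one per time bin.
S = rng.standard_normal((T, N))

# Energy model: spiking depends only on (k1 . s)^2 + (k2 . s)^2.
power = (S @ k1) ** 2 + (S @ k2) ** 2
spikes = rng.random(T) < 1.0 / (1.0 + np.exp(-(power - 4.0)))

# The STA vanishes: the model is invariant under s -> -s.
sta = S[spikes].mean(axis=0)

# STC: eigenvectors of the covariance difference span the relevant subspace.
dC = np.cov(S[spikes], rowvar=False) - np.cov(S, rowvar=False)
evals, evecs = np.linalg.eigh(dC)
top2 = evecs[:, np.argsort(evals)[-2:]]

# Overlap between recovered and true 2D subspaces (singular values near 1
# mean the subspaces coincide).
overlap = np.linalg.svd(top2.T @ np.column_stack([k1, k2]), compute_uv=False)
print("|STA|:", round(np.linalg.norm(sta), 3), "overlap:", np.round(overlap, 3))
```

Note that STC identifies the relevant subspace but not the filters themselves: any rotation of {**k**_{1}, **k**_{2}} within the subspace produces the same spiking statistics for this model.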

Under this restriction,^{4} STC can reliably extract from *K* = 1 up to several relevant linear filters (Rust et al., 2004) for realistic recording durations. It is much more difficult to directly sample the *K*-dimensional nonlinear function *f* for *K* larger than 2 or 3 without making additional assumptions. In the case of quadratic stimulus dependence, however, such a simplification occurs naturally. Every real, symmetric matrix, including the putative quadratic filter **Q**, can be spectrally decomposed into

$$\mathbf{Q} = \sum_{i} w_i\, \mathbf{k}_i \mathbf{k}_i^T. \tag{3.1}$$

The response of the quadratic model is thus

$$r(t) = f\left[\sum_{i} w_i\, (\mathbf{k}_i \cdot \mathbf{s})^2\right],$$

explicitly demonstrating that quadratic models are special cases of the LNLN cascade, where the first linear stage applies the filters **k**_{i}, the first nonlinear stage squares the projections, the second linear stage is a summation with weights *w*_{i}, and the final transformation applies the nonlinearity *f*. This means that to infer quadratic dependence, we can identify the relevant stimulus filters using STC, parameterize the nonlinearity as a 1D nonlinear function of a linear combination of squared filter projections, and use maximum likelihood to infer the parameters of the nonlinearity (see, e.g., Schwartz et al., 2002).
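The equivalence between the quadratic form and the weighted sum of squared projections is just the spectral theorem; here is a minimal numerical check (the symmetric kernel **Q** is random, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16

# A random symmetric quadratic kernel Q.
A = rng.standard_normal((N, N))
Q = (A + A.T) / 2.0

# Spectral decomposition Q = sum_i w_i k_i k_i^T; eigenfilters k_i are the
# columns of K, eigenvalues w_i are the summation weights.
w, K = np.linalg.eigh(Q)

s = rng.standard_normal(N)                # one stimulus clip

quad_direct = s @ Q @ s                   # the quadratic form s^T Q s ...
quad_lnln = np.sum(w * (K.T @ s) ** 2)    # ... as an LNLN cascade output
print(quad_direct, quad_lnln)
```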

Unfortunately, the gaussian ensemble can be a serious restriction for neurons that do not respond well (or at all) to unstructured stimuli; furthermore, under exclusively gaussian stimulation, we are likely to miss several neural mechanisms that depend on naturalistic statistical structure such as correlations and intermittency. A versatile method should therefore be able to successfully infer the multiple-filter dependence of a neuron probed with any stimulus of arbitrary complexity. Maximally informative dimensions (MID) (Sharpee et al., 2004) or likelihood inference for single-filter generalized linear models (Gerwinn et al., 2010; Paninski, 2004; Pillow, 2007; Pillow et al., 2008; Truccolo et al., 2004) have been used to this end when the dependence is linear, but until recently, the attempts to incorporate quadratic stimulus dependence into procedures that can be used with arbitrary stimuli have been uncommon.

## 4. Quadratic Stimulus Dependence

The prototypical example of quadratic stimulus dependence is the complex cell of the primary visual cortex, which can be described by two linear filters **k**_{1} and **k**_{2} that together form a quadrature pair, such that the most informative variable concerning the neuron's firing is the power,

$$p(t) = (\mathbf{k}_1 \cdot \mathbf{s})^2 + (\mathbf{k}_2 \cdot \mathbf{s})^2. \tag{4.1}$$

In general, the response can also retain a linear stimulus sensitivity (parameterized by a filter **k**_{0}).

A simulated model neuron showing contrast adaptation is shown in Figure 2a, featuring first- as well as second-order stimulus sensitivity. The model neuron is probed with a flickering variance stimulus, in which the variance of white noise (with a very short correlation time) is dynamically modulated by a noise process correlated over a longer timescale (Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001). With this synthetic stimulus, the separation of timescales allows us to partition the stimulus into chunks with approximately constant luminance variability σ². This variance is directly related to the temporal contrast, *C* = σ/⟨*I*⟩, because the average mean light intensity ⟨*I*⟩ is kept constant. Within each stimulus segment, we can use STA to recover the LN model, as shown in Figures 2b and 2c. Our real goal, however, is to infer a joint model valid across the whole stimulus ensemble and ultimately to do so with naturalistic stimuli of scale-free power spectra and no clear separation between the fast fluctuations and the slow variance modulation.
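A flickering variance stimulus of this kind can be synthesized in a few lines; the envelope timescale and the exponential squashing used below are illustrative choices, not those of Fairhall et al. (2001):

```python
import numpy as np

rng = np.random.default_rng(2)
T, tau = 50_000, 500        # time bins; correlation time of the envelope

# Slow envelope: boxcar-smoothed gaussian noise, exponentiated so that the
# standard deviation sigma(t) stays positive.
slow = np.convolve(rng.standard_normal(T), np.ones(tau) / tau, mode="same")
sigma = np.exp(slow / slow.std())

# Fast carrier: white noise with a one-bin correlation time.
stim = sigma * rng.standard_normal(T)

# The mean stays fixed (zero here), so the local variance sigma(t)^2
# directly sets the slowly flickering temporal contrast.
print("std of stimulus:", round(stim.std(), 2))
```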

The most general quadratic model assumes that the firing rate is a nonlinear function of a quadratic projection of the stimulus (parameterized by the matrix **Q**) as well as a linear projection (parameterized by the filter **k**_{0}):

$$r(t) = f\left[\mathbf{k}_0 \cdot \mathbf{s} + \mathbf{s}^T \mathbf{Q}\, \mathbf{s}\right]. \tag{4.2}$$

Graphically, while a threshold LN model with a linear filter corresponds to a classifier whose separating hyperplane is perpendicular to the filter, the proposed LN model with a threshold nonlinearity and a quadratic filter **Q** is selective for all stimuli that lie outside an *N*-dimensional ellipsoid whose axes correspond to the eigenvectors of **Q**.
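The ellipsoid picture can be made concrete with a toy threshold-quadratic model (dimensionality, eigenvalues, and threshold θ below are arbitrary, and the linear term is set to zero): along each eigenvector **k**_{i}, the spiking boundary sits at |x| = √(θ/w_{i}).

```python
import numpy as np

rng = np.random.default_rng(5)
N = 3

# A positive-definite Q: the decision boundary s^T Q s = theta is an ellipsoid.
w = np.array([4.0, 1.0, 0.25])                     # eigenvalues set the axes
K = np.linalg.qr(rng.standard_normal((N, N)))[0]   # random orthonormal eigenvectors
Q = K @ np.diag(w) @ K.T
theta = 1.0

def spikes_threshold(s):
    """Threshold-quadratic LN model with k0 = 0: spike iff the stimulus
    lies outside the ellipsoid s^T Q s = theta."""
    return s @ Q @ s > theta

# Along eigenvector k_i the threshold is crossed at |x| = sqrt(theta / w_i).
for wi, ki in zip(w, K.T):
    x = np.sqrt(theta / wi)
    assert not spikes_threshold(0.99 * x * ki)   # just inside: silent
    assert spikes_threshold(1.01 * x * ki)       # just outside: spike
```

Large eigenvalues of **Q** thus correspond to short ellipsoid axes, that is, stimulus directions along which small excursions already drive the neuron to spike.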

Using the spectral decomposition of **Q**, the general quadratic model can equivalently be written as

$$r(t) = f\left[\mathbf{k}_0 \cdot \mathbf{s} + \sum_{i=1}^{M} w_i\, (\mathbf{k}_i \cdot \mathbf{s})^2\right], \tag{4.3}$$

where **Q** is of rank *M*, with eigenvalues *w*_{i} and eigenvectors **k**_{i}, *i* > 0. The complex cell example described in equation 4.1 has **k**_{0} = 0 and *w*_{1} = *w*_{2} = 1; in other words, **Q** is a rank 2 matrix. While these examples feature quadratic dependences involving matrices of low rank, it is possible to extend quadratic models to biologically relevant cases where the matrix is of high rank (Rajan & Bialek, 2012). For example, the probability of spiking could be a nonlinear function of the power *p*(*t*), *r*(*t*) = *f*[*p*(*t*)], where the power is given by

$$p(t) = \int d\tau\, f_2(\tau) \left[\int d\tau'\, f_1(\tau')\, s(t - \tau - \tau')\right]^2.$$

Here **s**(*t*) is the stimulus, and *f*_{1} and *f*_{2} are linear filters such as those used to describe non-phase-locked auditory neurons. If the smoothing time of the second filter *f*_{2} is larger than that of the first filter *f*_{1}, Rajan and Bialek (2012) have shown that the quadratic kernel **Q** for this model has a rich (full-rank) spectrum.
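To see where the high rank comes from, note that discretizing the power expression gives Q_{ab} = Σ_{τ} f_2(τ) f_1(a − τ) f_1(b − τ): a sum of rank 1 outer products of shifted copies of f_1, weighted by f_2. The sketch below (with invented filter shapes) assembles **Q** this way; because every weight f_2(τ) is positive and the shifted copies are linearly independent, the resulting kernel is positive definite, hence full rank.

```python
import numpy as np

N = 40
t = np.arange(N)

# Invented filters: a fast damped oscillator f1 and a slower smoothing f2.
f1 = np.exp(-t / 3.0) * np.cos(2 * np.pi * t / 8.0)
f2 = np.exp(-t / 12.0)

# Q_ab = sum_tau f2(tau) f1(a - tau) f1(b - tau): each lag tau contributes
# a rank-1 outer product of a shifted copy of f1, weighted by f2(tau) > 0.
Q = np.zeros((N, N))
for tau in range(N):
    shifted = np.zeros(N)
    shifted[tau:] = f1[: N - tau]
    Q += f2[tau] * np.outer(shifted, shifted)

# The shifted copies are linearly independent (f1[0] = 1), and all weights
# are positive, so Q is symmetric positive definite: full rank.
print("rank of Q:", np.linalg.matrix_rank(Q), "/", N)
```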

## 5. Inferring Quadratic Stimulus Dependence from Data

In this section we review methods that permit the inference of low- or full-rank quadratic kernels, **Q**, with arbitrary stimuli.

### 5.1. Finding Quadratic Filters Using Information Maximization.

Spike-triggered techniques yield biased filter estimates when the stimulus has nontrivial statistical structure (e.g., naturalistic stimuli with 1/*f* spectra, nongaussian histograms, or high-order correlations). This is a big challenge when studying neurons beyond the sensory periphery that are responsible for extracting high-order structure or neurons that remain unresponsive to white noise presentations (e.g., those in the auditory pathway). To address this issue and recover filters in an unbiased manner with an arbitrary stimulus distribution, maximally informative dimensions (MID) (Kouh & Sharpee, 2009; Sharpee et al., 2004, 2006) have been developed and used to recover simple cell receptive fields, among other examples. MID looks for a linear filter **k** that maximizes the information between the presence or absence of a spike and the projection *x* of the stimulus onto **k**, *x* = **k** · **s**. Information per spike is then given by the Kullback-Leibler divergence of *P*(*x*|spike), the spike-triggered distribution (the distribution of stimulus projections preceding the spike), and *P*(*x*), the prior distribution (the overall distribution of projections),

*I*_{spike} = ∫ d*x* *P*(*x*|spike) log₂ [*P*(*x*|spike) / *P*(*x*)].   (5.1)

Given the spike train and the stimulus, finding **k** becomes an information optimization problem in *I*_{spike} that can be solved using various annealing methods, although the existence of local extrema can make this a nontrivial task. *P*(spike) is directly proportional to the average firing rate during the experiment.
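
As a minimal illustration of equation 5.1, the following sketch estimates the information per spike for a synthetic linear neuron by binning the projection *x*; the stimulus, filter, nonlinearity, and bin count are all hypothetical choices made for the example.

```python
import numpy as np

# Sketch: estimate information per spike (equation 5.1) by binning the
# projection x = k . s. Synthetic stimulus, filter, and spike model.

rng = np.random.default_rng(0)
N, T = 20, 200_000
k = rng.standard_normal(N)
k /= np.linalg.norm(k)
S = rng.standard_normal((T, N))                 # white gaussian stimulus
x = S @ k
p_spike = 1 / (1 + np.exp(-3 * (x - 1)))        # illustrative nonlinearity
spikes = rng.random(T) < p_spike

bins = np.linspace(x.min(), x.max(), 31)
P_x, _ = np.histogram(x, bins=bins)             # prior distribution
P_xs, _ = np.histogram(x[spikes], bins=bins)    # spike-triggered distribution
P_x = P_x / P_x.sum()
P_xs = P_xs / P_xs.sum()

mask = (P_xs > 0) & (P_x > 0)
I_spike = np.sum(P_xs[mask] * np.log2(P_xs[mask] / P_x[mask]))
print(I_spike)                                  # bits per spike
```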

In classical MID, one finds a (set of) linear filter(s) by maximizing equation 5.1 with respect to **k**. To generalize this inference method to quadratic stimulus dependence, a naive approach would make use of the spectral decomposition in equation 3.1. One would try recovering the quadratic dependence of **Q** in equation 4.3 by multidimensional MID (see Table 1), hoping to infer all eigenvectors **k**_{i} as orthogonal informative dimensions. While formally true, this is infeasible in practice because maximizing the mutual information would involve sampling *N*-dimensional distributions from stimulus samples that are limited in number by the number of spikes (Sharpee et al., 2004). Information-theoretic STC (iSTAC) evades this problem, but at the cost of going back to gaussian stimuli and assuming a gaussian spike-triggered distribution; under those restrictions, it can be used to infer quadratic stimulus dependence (Pillow & Simoncelli, 2006).

In the maximally informative stimulus energy (MISE) approach (Rajan & Bialek, 2012), **Q** can be reconstructed from an observed spike train by maximizing the information in equation 5.1, where *x* is now given by the stimulus energy, *x* = **s**^T **Q** **s**. Taking a derivative of equation 5.1 with respect to **Q** gives us a gradient,

∇_{**Q**} *I*_{spike} = ∫ d*x* *P*_{**Q**}(*x*) [⟨**ss**^T⟩_{*x*, spike} − ⟨**ss**^T⟩_{*x*}] (d/d*x*) [*P*_{**Q**}(*x*|spike) / *P*_{**Q**}(*x*)],   (5.3)

where ⟨**ss**^T⟩_{*x*, spike} and ⟨**ss**^T⟩_{*x*} indicate averaging over the spike-triggered and prior distributions, respectively, and the subscript **Q** makes the dependence of the probability distributions explicit. Only the symmetric part of **Q** contributes to *x*, and the overall scale of the matrix is irrelevant to the information, making the number of free parameters *N*(*N*+1)/2 − 1. This approach makes the inference problem tractable even when **Q** is of high rank.

To find the maximally informative **Q**, we can ascend the gradient in successive learning steps (Rajan & Bialek, 2012),

**Q**^{(*n*+1)} = **Q**^{(*n*)} + γ ∇_{**Q**} *I*_{spike},

where γ is the learning rate. The probability distributions within the gradient are obtained by computing *x* for all stimuli, choosing an appropriate binning for the variable *x*, and sampling binned versions of the spike-triggered and prior distributions. The averages are computed separately for each bin, and the integrals in equations 5.1 and 5.3 and the derivative in equation 5.3 are approximated as a sum over bins and as a finite difference, respectively. To deal with local maxima in the objective function, we can use a large starting value of γ and gradually decrease it during learning. This basic algorithm can be extended by using kernel density estimation and stochastic gradient ascent and annealing methods, but we do not report these technical improvements here.
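
The sketch below implements one such learning step for a synthetic rank-2 kernel, assuming quantile binning of *x* and a finite-difference derivative of the distribution ratio; the bin count, sample size, and step size are illustrative, and a full run would iterate while annealing the learning rate.

```python
import numpy as np

# Sketch of a single MISE learning step on synthetic data. Ground-truth
# kernel, bin count K, sample size, and step size are illustrative choices.
rng = np.random.default_rng(1)
N, T, K = 8, 100_000, 15
k1, k2 = rng.standard_normal((2, N))
Q_true = np.outer(k1, k1) + np.outer(k2, k2)       # rank-2 ground truth
S = rng.standard_normal((T, N))
x_true = np.einsum('ti,ij,tj->t', S, Q_true, S)
prob = 1 / (1 + np.exp(-(x_true - np.median(x_true)) / x_true.std()))
spikes = rng.random(T) < prob

def info_and_grad(Q):
    """Binned estimate of I_spike (eq. 5.1) and its gradient w.r.t. Q."""
    x = np.einsum('ti,ij,tj->t', S, Q, S)
    edges = np.quantile(x, np.linspace(0, 1, K + 1))
    edges[-1] += 1e-9                              # make top edge inclusive
    b = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, K - 1)
    P_x = np.bincount(b, minlength=K) / T          # prior distribution
    P_xs = np.bincount(b[spikes], minlength=K) / spikes.sum()
    ratio = P_xs / np.maximum(P_x, 1e-12)
    centers = 0.5 * (edges[:-1] + edges[1:])
    dratio = np.gradient(ratio, centers)           # finite-difference d/dx
    info = np.sum(P_xs[P_xs > 0] * np.log2(ratio[P_xs > 0]))
    grad = np.zeros((N, N))
    for j in range(K):                             # sum over bins ~ integral
        in_bin = b == j
        sp_bin = in_bin & spikes
        avg_all = S[in_bin].T @ S[in_bin] / max(in_bin.sum(), 1)
        avg_sp = S[sp_bin].T @ S[sp_bin] / max(sp_bin.sum(), 1)
        grad += P_x[j] * (avg_sp - avg_all) * dratio[j]
    return info, grad

Q0 = rng.standard_normal((N, N))
Q0 = 0.5 * (Q0 + Q0.T)                             # random symmetric start
I0, G = info_and_grad(Q0)
I1, _ = info_and_grad(Q0 + G / np.linalg.norm(G))  # one normalized step
print(I0, I1)
```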

To reduce the number of free parameters when data are limited, we can expand **Q** in a basis of matrices **A**_{m} by writing

**Q** = Σ_{m=1}^{M} c_{m} **A**_{m}.   (5.5)

The basis can be chosen so that increasing the number of basis components *M* allows the reconstruction of progressively finer features in **Q**. We considered a family of gaussian bumps {**A**_{m}} that tile the space of the matrix **Q** and whose scale (standard deviation) shrinks as *M* grows. As *M* approaches *N*(*N*+1)/2, the matrix set becomes a complete basis, allowing every **Q** to be exactly represented by the vector of coefficients {c_{m}}. In such a matrix basis representation, the learning rule becomes

c_{m}^{(*n*+1)} = c_{m}^{(*n*)} + γ ∂*I*_{spike}/∂c_{m},

with learning rate γ, where applying the chain rule, ∂*I*_{spike}/∂c_{m} = Σ_{ij} (∇_{**Q**} *I*_{spike})_{ij} (**A**_{m})_{ij}, yields the update term for each step.
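
A possible construction of such a basis is sketched below, together with a least-squares expansion of a smooth test kernel; the bump placement, scaling, and test kernel are illustrative assumptions and may differ from the construction used in the original work.

```python
import numpy as np

# Sketch: a family of M symmetric gaussian "bump" matrices tiling an
# N x N kernel (equation 5.5). Larger M gives narrower bumps and
# resolves finer features. Placement and scaling are illustrative.

def gaussian_matrix_basis(N, M):
    """Return M symmetric N x N gaussian bumps tiling the matrix plane."""
    m = int(round(np.sqrt(M)))              # m x m grid of bump centers
    centers = np.linspace(0, N - 1, m)
    sigma = (N - 1) / m                     # scale shrinks as M grows
    ii, jj = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
    basis = []
    for ci in centers:
        for cj in centers:
            A = np.exp(-((ii - ci) ** 2 + (jj - cj) ** 2) / (2 * sigma ** 2))
            basis.append(0.5 * (A + A.T))   # keep each element symmetric
    return np.array(basis)

# Expand a smooth symmetric kernel in the basis by least squares:
N, M = 20, 225
B = gaussian_matrix_basis(N, M)             # shape (M, N, N)
Q_true = np.add.outer(np.sin(np.arange(N) / 3.0), np.sin(np.arange(N) / 3.0))
coeffs, *_ = np.linalg.lstsq(B.reshape(M, -1).T, Q_true.ravel(), rcond=None)
Q_hat = np.tensordot(coeffs, B, axes=1)
err = np.linalg.norm(Q_hat - Q_true) / np.linalg.norm(Q_true)
print(err)                                  # relative reconstruction error
```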

We illustrate this approach with two examples. In the first example, we make use of the matrix basis expansion from equation 5.5 to infer a quadratic kernel **Q** that is of high rank. For **Q**, we used a highly structured matrix, as shown in Figure 3a. While this is not an example of a receptive field from a real neuron, it illustrates the validity of the approach even when the response has an atypical and highly structured dependence on the stimulus. The stimuli were natural image clips from the Penn Natural Image Database, flattened into a high-dimensional vector representation **s** (Tkačik et al., 2011), and the spikes were generated by thresholding the stimulus energy **s**^T **Q** **s**. Gaussian basis matrices, similar to the 225 shown in Figure 3b, were used to expand the quadratic kernel, reducing the number of free parameters from approximately *N*(*N*+1)/2 to a few hundred. We start the gradient ascent with a large value of the learning rate (1) and progressively scale it down (to 0.1) near the end of the algorithm; Figure 3e shows the information plateauing in about 20 learning steps. The maximally informative quadratic filter reconstructed from 400 basis coefficients is shown in Figure 3d. Figure 3c demonstrates how the root-mean-squared reconstruction error systematically decreases as the number of basis functions *M* is increased from 4 to 400. Insets show two inferred matrices: one with *M* = 100 (the first open circle) and one with *M* = 225 (the second open circle), which shows a marked improvement. The reconstruction error drops to approximately 1% for *M* = 400.

In contrast to standard MID, where the number of spikes required grows exponentially in the number of filters extracted, the data requirement for this approach is proportional to the square of the stimulus dimension for a matrix kernel with no additional structural simplifications (these data requirement- and performance-related issues are explored in detail in Rajan & Bialek, 2012). For the examples shown in this review, expansion in a matrix basis reduces this number to the order of the stimulus dimension, making this procedure practical for experimentalists.

The second example shows the MISE analysis of the synthetic neuron presented in Figure 2, where the stimulus-response relationship is more biologically realistic. Here, a smooth nonlinear function *f* is used, and the model has both a linear and a quadratic kernel. The analysis is applied to the flickering variance stimulus without partitioning it into regions of fixed contrast. With a realistic number of spikes, the method recovers the linear filter **k**_{0} as well as the quadratic kernel, whose two dominant eigenvectors correspond to the quadrature pair of filters, **k**_{1} and **k**_{2}, used to construct **Q**, as shown in Figure 4b.

These examples show that quadratic filters can be extracted using information maximization for both low-rank and full-rank matrices, under natural stimulation and with realistic numbers of spikes. Importantly, for cases where the stimulus sensitivity is both linear and quadratic, MISE does not explicitly assume that the collective effect of the two filtering operations is necessarily additive, that is, that *r* = *f*(**k**_{0} · **s** + **s**^T **Q** **s**); rather, the dependence can be an arbitrary 2D nonlinear function, *r* = *f*(**k**_{0} · **s**, **s**^T **Q** **s**). Unlike the quadratic generalizations of GLM presented below, this allows MISE to fully recover forms of contrast gain control that have a parametric form similar to equation 4.2.

### 5.2. Finding Quadratic Filters Using Maximization of Noise Entropy.

Another information-theoretic approach for inferring single-neuron sensitivities is derived from the principle of noise entropy maximization (Fitzgerald, Rowekamp et al., 2011; Fitzgerald, Sincich et al., 2011; Globerson et al., 2009). Suppose that the spiking or silence of a chosen neuron in a time bin indexed by *t* is represented by a binary variable *y*_{t} ∈ {0, 1}. From data, we can reliably estimate certain statistics of the neural response, such as the average spiking rate ⟨*y*⟩_{t}, the spike-triggered average ⟨*y***s**⟩_{t}, or the spike-triggered covariance ⟨*y***ss**^T⟩_{t}, where the brackets denote averaging across the duration of the experiment. In general, all of these statistics are of the form ⟨*y* *O*_{μ}(**s**)⟩_{t}, where μ indexes the different operators whose expectation values we are computing.
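
In code, these constrained statistics are simple averages over the recording; the sketch below computes them for a synthetic binary spike train (the stimulus and toy neuron are assumptions made for illustration).

```python
import numpy as np

# Sketch: the constrained statistics -- mean rate <y>, spike-triggered
# average <y s>, and spike-triggered covariance <y s s^T> -- computed
# from a synthetic binary spike train.

rng = np.random.default_rng(2)
T, N = 50_000, 10
S = rng.standard_normal((T, N))              # stimulus, one row per time bin
k = np.zeros(N)
k[3] = 1.0                                   # toy linear sensitivity
y = (rng.random(T) < 1 / (1 + np.exp(-(S @ k)))).astype(float)

mean_rate = y.mean()                         # <y>_t
sta = (y[:, None] * S).mean(axis=0)          # <y s>_t
stc = np.einsum('t,ti,tj->ij', y, S, S) / T  # <y s s^T>_t
print(mean_rate, sta.round(2))
```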

The idea is then to build a model of *P*(*y*|**s**), the distribution of the (binary) neural response given the stimulus, using the maximum entropy principle. Maximum entropy distributions are as unstructured (random, therefore parsimonious) as possible, with the constraint that they exactly reproduce the measured expectation values of a chosen set of statistics (Jaynes, 1957a, 1957b). When the variable *y* is binary, it can easily be shown that these distributions have the form of the logistic function,

*P*(*y* = 1|**s**) = 1 / (1 + e^{−*F*(**s**)}),   *F*(**s**) = Σ_{μ} *g*_{μ} *O*_{μ}(**s**),

where *F* resembles the free energy in statistical physics, and the *g*_{μ} are the Lagrange multipliers that have to be set such that the set of statistics measured in the data equals the expectation values of the same operators under distribution *P*, that is, ⟨*y* *O*_{μ}(**s**)⟩_{data} = ⟨*y* *O*_{μ}(**s**)⟩_{*P*}. To apply this general framework to the inference of quadratic filters, Fitzgerald, Rowekamp et al. (2011) choose the mean firing rate, STA, and STC as constraints, which yields the following response distribution:

*P*(*y* = 1|**s**) = 1 / (1 + exp[−(*a* + **h** · **s** + **s**^T **J** **s**)]),   (5.7)

where *a*, **h**, and **J** act as the Lagrange multipliers conjugated to the operators 1, **s**, and **ss**^T. Numerically, the task is to solve for parameters that satisfy the set of constraints: matching the measured mean firing rate, the measured STA, and the measured STC to those of the model. This is a convex optimization task and can be solved by conjugate gradient descent.
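
Because the model in equation 5.7 is linear in its parameters once the quadratic features *s*_{i}*s*_{j} are formed, the moment-matching conditions can be solved by any convex optimizer. The sketch below uses plain gradient descent on the logistic log loss and then verifies that the model moments match the data moments; the sizes, synthetic neuron, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Sketch: fit the maximum entropy model P(y=1|s) = logistic(a + h.s + s'Js)
# by gradient descent on the (convex) log loss. At the optimum, the model
# moments <y O(s)> match the measured mean rate, STA, and STC.

rng = np.random.default_rng(3)
T, N = 30_000, 4
S = rng.standard_normal((T, N))
F_true = -(S[:, 0] + S[:, 1] ** 2 - 1.0)           # toy linear + quadratic
y = (rng.random(T) < 1 / (1 + np.exp(-F_true))).astype(float)

iu = np.triu_indices(N)
X = np.hstack([np.ones((T, 1)), S, S[:, iu[0]] * S[:, iu[1]]])
w = np.zeros(X.shape[1])                           # [a, h, J] as one vector
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (p - y) / T                   # gradient of log loss

p = 1 / (1 + np.exp(-(X @ w)))
model_moments = (p[:, None] * X).mean(axis=0)      # <y O(s)> under the model
data_moments = (y[:, None] * X).mean(axis=0)       # measured statistics
print(np.abs(model_moments - data_moments).max())  # small at the optimum
```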

This framework has a useful information-theoretic interpretation. One can write the mutual information between the stimulus and the spiking response, *I*(spike; **s**), as a difference between the total and the noise entropy,

*I*(*y*; **s**) = *S*[*P*(*y*)] − ⟨*S*[*P*(*y*|**s**)]⟩_{**s**},

where *S*[*P*] = −Σ_{y} *P*(*y*) log₂ *P*(*y*) is the entropy of the distribution *P*. The first term (total entropy) is fully determined by the probability of spiking, *P*(*y* = 1) = ⟨*y*⟩, because *y* is a binary variable. The mean firing rate is one of the statistics constrained in the model for *P*(*y*|**s**), ensuring consistency. Since our model for *P*(*y*|**s**) has maximum entropy given the observed constraints, we are effectively setting an upper bound on the noise entropy and therefore a lower bound on the mutual information *I*. As more and more statistics *O*(**s**) are included as constraints in the maximum entropy model of equation 5.7, the noise entropy must progressively drop and the information must increase toward the true value (which is bounded by the output entropy). At the point where this lower bound on information meets the actual information per spike (which can be empirically estimated from repeated stimulation; see, e.g., Brenner et al., 2000), we obtain the complete set of relevant stimulus statistics that characterize the sensitivity of the neuron.

Fitzgerald, Rowekamp et al. (2011) show that this framework is applicable for inferring quadratic neural filters on synthetic and real data and compare it to MID. This method is applicable to any stimulus ensemble but requires assumptions beyond those needed for MID or MISE—namely, that the nonlinear function is logistic and that the contributions of the linear and quadratic filters combine additively. The advantage of this method is that the optimization problem remains convex, does not suffer from the exponential curse of dimensionality (like multidimensional MID), and is flexible, allowing different constraints (beyond the STA and STC) to be used for constructing models of the stimulus-conditional distribution *P*(*y*|**s**).

### 5.3. Finding Quadratic Filters in a Likelihood Framework: Extensions to GLM and Bayesian STC.

In a generalized linear model, the instantaneous firing rate *r*(*t*) is a nonlinear function *f* of a sum of contributions,

*r*(*t*) = *f*(**k** · **s** + **g** · **y**(*t*_{−}) + μ),   (5.11)

where **k** is a linear filter acting on the stimulus **s**, **g** is a linear filter acting on the spiking history **y**(*t*_{−}) of the neuron, and μ is an offset, or an intrinsic bias toward spiking or silence. When the stimulus and the spike train are discretized into time bins of duration Δ*t*, the probability of observing (an integral number of) *y*_{t} spikes is Poisson, with a mean given by *r*_{t}Δ*t* (where the subscript *t* indexes the time bin). Here, we neglect the history dependence of the spikes (with no loss of generality) and focus instead on the stimulus dependence; since each time bin is conditionally independent given the stimulus (and past spiking), the log likelihood for any spike train is (Pillow, 2007)

log *L* = Σ_{t} [*y*_{t} log(*r*_{t}Δ*t*) − *r*_{t}Δ*t*] + *c*,   (5.12)

where *c* is independent of both μ and **k**. This likelihood can be maximized with respect to μ and **k** (and, optionally, with respect to **g**) given adequate spikes, providing an estimate of the filters from neural responses to complex, even natural stimuli. In contrast to maximally informative approaches, such as the stimulus energy derived in section 5.1 (Rajan & Bialek, 2012), the functional form of the nonlinearity *f* is an explicit assumption in likelihood-based methods like GLM. For specific classes of the function *f*, such as *f*(*z*) = log[1 + exp(*z*)] or *f*(*z*) = exp(*z*), the likelihood optimization problem is convex, and gradient ascent is guaranteed to find a unique global maximum.

While the tractability that follows from the convexity of the objective function is a big strength of this approach, the disadvantage is that if the chosen nonlinearity *f* differs from the true function used by the neuron, the filters inferred by maximizing the likelihood in equation 5.12 can be biased. If we relax the stringent requirement of convexity, we can choose more general nonlinear functions for the model, for example, by parameterizing the nonlinearity in a point-wise fashion and inferring it jointly with the filters. For this discussion, however, we assume that *f* has been selected from the specific class of nonlinearities guaranteed to yield a convex likelihood function.

To include quadratic stimulus dependence, the key idea is to map the stimulus **s** of dimension *N* into a larger space first, for instance, by forming the outer product **ss**^T (of dimension *N* × *N*), and then operate on this object with a matrix-valued filter: **s**^T **Q** **s** = Σ_{ij} *Q*_{ij} *s*_{i} *s*_{j}. Such a term can be added to the argument of *f* in the model exemplified in equation 5.11. Specifically, we propose a generalized quadratic model of the following form,

*r*(*t*) = *f*(μ + **k** · **s** + **s**^T **Q** **s**).

If we want to retain convexity, we cannot simply expand **Q** in its eigenbasis and infer its eigenvectors by maximizing the likelihood directly, because the eigenvectors appear quadratically. However, we can expand **Q** into a weighted sum of matrix basis functions, as in equation 5.5, making the argument of *f* a linear function of the basis coefficients *c*_{m},

*r*(*t*) = *f*(μ + **k** · **s** + Σ_{m} *c*_{m} **s**^T **A**_{m} **s**).

Existing methods for inferring GLM parameters (Pillow et al., 2008) can then be used to learn the linear filter **k** and the quadratic filter **Q** efficiently. After extracting **Q**, we can check whether a few principal components account for most of its structure (this is equivalent to checking whether **Q** is indeed a low-rank matrix). To summarize, this procedure provides a way of extracting multiple filters with a GLM that is analogous to diagonalizing the spike-triggered covariance matrix on the gaussian stimulus ensemble.
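
The sketch below carries this out for a synthetic rank-2 ("complex cell") neuron, using the simplest possible matrix basis, one elementary symmetric matrix per entry of **Q**, so that the features are the products *s*_{i}*s*_{j}; the gaussian-bump basis of equation 5.5 would enter the same way. All sizes and settings are illustrative.

```python
import numpy as np

# Sketch: generalized quadratic model fit with standard GLM machinery by
# expanding Q in a matrix basis (here: elementary symmetric matrices).
# Exponential nonlinearity keeps the problem convex; synthetic data.

rng = np.random.default_rng(5)
T, N, dt = 60_000, 6, 0.01
S = rng.standard_normal((T, N))
x_quad = S[:, 0] ** 2 + S[:, 1] ** 2                 # rank-2 stimulus energy
y = rng.poisson(np.exp(0.3 * x_quad) * dt)           # Poisson spike counts

iu = np.triu_indices(N)
X = S[:, iu[0]] * S[:, iu[1]]                        # features s_i s_j, i <= j
theta = np.zeros(X.shape[1])
for _ in range(3000):                                # convex: gradient ascent
    rate = np.exp(X @ theta) * dt
    theta += 1.0 * X.T @ (y - rate) / T

Q_hat = np.zeros((N, N))                             # reassemble Q from c_m
Q_hat[iu] = theta
Q_hat = 0.5 * (Q_hat + Q_hat.T)
eig = np.sort(np.linalg.eigvalsh(Q_hat))[::-1]
print(eig.round(2))                                  # check for low rank
```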

We have implemented such a generalized quadratic model using the flickering variance stimulus shown in Figure 2. The results are shown in Figure 4a. The recovered quadratic kernel decomposes into a quadrature pair of filters, and we recover the correct linear filter **k**_{0}. While this method is restricted to a linear combination of first- and second-order filters within the nonlinearity, the distinct advantage over MISE is that the inference problem is always convex with the appropriate choice of nonlinear functions.

A closely related approach, the Bayesian spike-triggered covariance of Park and Pillow (2011), models the spike rate as an exponentiated quadratic function of the stimulus and additionally infers the rank of the quadratic kernel. The matrix **Q** is first decomposed into its eigensystem, with eigenvectors **w**_{i} that are not forced to have an *L*_{2} norm of 1 and whose associated signs indicate whether the filter *i* is excitatory or suppressive (as in STC; Schwartz et al., 2006). Then a zero-mean gaussian prior is put on each eigenvector **w**_{i}, where a hyperparameter determines the variance of the elements of eigenvector *i* (shrinking this variance to zero corresponds to eliminating the direction *i* from the quadratic kernel and reducing its rank by 1). Next, an iterative algorithm alternates between optimizing the likelihood with respect to the model parameters and optimizing the evidence, given the parameters, with respect to the hyperparameters. This procedure efficiently and accurately identifies the rank of the quadratic kernels in synthetic examples, providing an automatic alternative for distinguishing the significant eigenvectors from the sampling-noise-induced eigenvectors in STC and quadratic kernel inference. Finally, the authors show that equation 5.15 can be further generalized from the exponentiated quadratic function to a wider class of elliptic nonlinearities at no additional computational cost.

To summarize this section, the reviewed work shows that the Bayesian generalization of STC and the generalization of GLMs to quadratic stimulus dependence yield equivalent probabilistic models of neural encoding that can be efficiently inferred for a restricted class of nonlinear functions. However, attention needs to be paid to maintaining the convexity of the optimization procedure and to dealing with the large number of free parameters in the quadratic kernel. Basis expansions as well as regularization with Bayesian priors are feasible candidates to this end.

## 6. Discussion

While powerful conceptually, the notion that neurons respond to multiple projections of the stimulus onto orthogonal filters is difficult to turn into a tractable inference procedure when the number of filters is large. To address this concern, alternative encoding models have been proposed where the neuron can be sensitive to high-order features in the stimulus. Instead of being described by multiple linear filters, the neuron's sensitivity properties are captured by a single quadratic filter (and, optionally, an additional linear filter). We have reviewed several inference methods for such quadratic stimulus dependence based on information maximization as well as maximizing likelihood. With MISE, no assumptions are made about how the projection onto the quadratic filter combines with the linear filter projection and how both map onto the spiking probability. This approach yields unbiased filter estimates under any stimulus ensemble but requires optimization in a rugged information landscape. Noise entropy maximization is a flexible, maximum-entropy-based framework for modeling the probability of spiking given a stimulus. It is computationally tractable and provides a convenient bound on the information per spike but assumes a specific form of the nonlinearity. Alternatively, with a specific choice of nonlinearity and filter basis, likelihood inference within the GLM class can be extended to quadratic stimulus dependence while retaining the convexity of the objective function. By formulating the problem as Bayesian inference and choosing sparsifying priors for the quadratic filter, the true rank of the quadratic filter can also be inferred from data.

All of these approaches for inferring quadratic stimulus dependence are complementary; as we show in the appendix, both information-maximization- and maximum-likelihood-based inference methods provide consistent filter estimates under defined conditions. A possible way to benefit from the tractability of likelihood formulations and maximization of noise entropy could be to use them to initialize a more general search using information maximization. This could potentially help avoid optimization problems in the rugged information landscape and remove the additive restrictions on the combination of linear and quadratic features.

Examples of recent work establishing the connection between the high-order structure of natural scenes and neural mechanisms beyond the sensory periphery (Karklin & Lewicki, 2009; Tkačik, Prentice, Victor, & Balasubramanian, 2010) make the development of methods for neural characterization such as the ones presented here very timely. Phenomena like phase invariance, adaptation to local contrast, or sensitivity to the signal envelope are widespread features of sensory neural responses (Baccus & Meister, 2004; Hubel & Wiesel, 1965; Touryan, Lau, & Dan, 2002). Moreover, as our abilities to record in vivo from the sensory systems of awake and behaving animals expand, so should the methods to analyze such recordings, where the relevant stimuli may no longer be perfectly controllable because of the animal's interaction with the environment (Kerr & Nimmerjahn, 2012). The methods presented here will help us systematically elucidate sensitivity to high-order statistical features from responses of sensory neurons to natural stimuli.

## Appendix: Formal Relationship Between Information-Theoretic and Likelihood-Based Inference

We now demonstrate analytically that under rather general assumptions, the linear or quadratic filters obtained by maximizing mutual information match the filters inferred by maximizing likelihood. We extend reasoning we used previously in the context of inferring sequence-specific protein-DNA interactions (Kinney, Tkačik, & Callan, 2007) to neural responses (see also Kouh & Sharpee, 2009; Williamson, Sahani, & Pillow, 2011).

In the following, *x* remains the projection of the stimulus **s** onto the linear (*x* = **k** · **s**) or quadratic (*x* = **s**^T **Q** **s**) filter, with time discretized in bins of duration Δ*t* and indexed by the subscript *t*. We collect all the parameters describing the filter into a vector Θ. Given a single *x*_{t}, *y*_{t} spikes are generated according to a conditional probability distribution *p*(*y*_{t}|*x*_{t}). This probability distribution is typically assumed to be Poisson with a mean given by *f*(*x*_{t}) in the case of the GLM, but we take a different approach.

Instead, we discretize *x* into bins and parameterize *p*(*y*|*x*), which is now a (conditional probability) matrix, by the vector of its entries. Apart from assuming a cutoff value for the number of spikes per bin, *Y*_{max} (which can always be chosen large enough to assign an arbitrarily low probability of observing more than *Y*_{max} spikes in any real data set), and a particular discretization of the projection variable *x*, we leave the probabilistic relationship between the projection and the spike count completely unconstrained. The transformation from the stimulus to the spikes is then a Markov chain, **s** → *x* → *y*, fully specified by Θ and *p*(*y*|*x*). The likelihood of the spike train {*y*_{t}} given the stimulus **s** is *L* = Π_{t=1}^{T} *p*(*y*_{t}|*x*_{t}), where *T* is the total number of time bins in the data set. With *x* discretized into *K* bins, any data set can be summarized by the count matrix *c*_{yx} = Σ_{t} δ_{y,y_{t}} δ_{x,x_{t}}, where δ is the Kronecker delta. Note that *c*_{yx} = *T* p_{emp}(*y*, *x*), where p_{emp} is simply the empirical distribution of the probability of observing *y* spikes jointly with the projection *x*. In terms of *c*, the likelihood of the observed spike train is *L* = Π_{y,x} *p*(*y*|*x*)^{*c*_{yx}}. Assuming that *x* is adequately discretized and that *p*(*y*|*x*) is Poisson with mean *f*(*x*), we recover the generalized likelihood of equation 5.12.
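
The equivalence between the product over time bins and the product over entries of the count matrix is easy to verify numerically; everything in the sketch below (bin counts, the spike cutoff, and the distributions) is synthetic.

```python
import numpy as np

# Sketch: summarize a discretized data set by the count matrix c_{yx} and
# check that the likelihood written as a product over time bins equals
# the product over matrix entries.

rng = np.random.default_rng(6)
T, K, Ymax = 5_000, 8, 3
x = rng.integers(0, K, size=T)                       # binned projection
p_true = rng.dirichlet(np.ones(Ymax + 1), size=K)    # p(y|x), rows sum to 1
y = np.array([rng.choice(Ymax + 1, p=p_true[xt]) for xt in x])

# count matrix c_{yx}: how often y spikes co-occur with projection bin x
c = np.zeros((Ymax + 1, K), dtype=int)
np.add.at(c, (y, x), 1)

model = rng.dirichlet(np.ones(Ymax + 1), size=K).T   # some model p(y|x)
logL_time = np.sum(np.log(model[y, x]))              # product over time bins
logL_counts = np.sum(c * np.log(model))              # product over (y, x)
print(np.isclose(logL_time, logL_counts))            # the two forms agree
```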

To average the likelihood over the unknown conditional distributions, we need a prior over them; having discretized *x*, thereby making *p*(*y*|*x*) into a (conditional probability) matrix, the simplest choice is the uniform prior. In this case, we set the parameters equal to the entries in the matrix *p*(*y*|*x*) and choose the prior to be uniform over all valid matrices, such that the matrix entries are positive and the normalization constraint, Σ_{y} *p*(*y*|*x*) = 1 for every *x*, is enforced.

The resulting model-averaged likelihood ⟨*L*⟩ decomposes into three contributions: the empirical mutual information between the spike count *y* and the projection *x*, the empirical spike count entropy, and a “correction” term that measures the average Kullback-Leibler divergence (*D*_{KL}) between the empirical and model conditional distributions. Importantly, only this correction term is a function of *p*(*y*|*x*) and is thus affected by the prior, which is being integrated over; the other terms can therefore be pulled outside the integral. We can write the log likelihood per time bin as

(1/*T*) log ⟨*L*⟩ = −*S*_{emp}(*y*) + *I*_{emp}(*y*; *x*) − Λ(*T*),   (A.10)

where Λ(*T*) is the correction. It is necessary to show that as the number of data points *T* grows, the correction decreases for a given choice of prior distribution, and for the choice of the uniform prior, this is analytically tractable (Kinney et al., 2007). Intuitively, it is clear that as *T* → ∞, the empirical distribution converges to the true underlying distribution *p*(*y*|*x*), and the integral becomes dominated by the extremal point *p**(*y*|*x*), such that, in the saddle point approximation, Λ reduces to the divergence between the empirical distribution and *p**. The distribution *p** is the closest distribution to *p*(*y*|*x*) in the space over which the prior is nonzero. As long as the prior assigns a nonzero probability to any (normalized) distribution, the divergence in Λ will decrease and will vanish as *T* grows. The case in which Λ does not decay occurs when the prior completely excludes certain distributions by assigning them zero probability, while the data *p*(*y*|*x*) precisely favor those excluded distributions.

The identity in equation A.10 is the sought-after connection between the inference using information maximization and the likelihood-based approach. In the limit of small time bins, maximizing the information per spike *I*_{spike} on the right-hand side of the identity (in maximally informative approaches, as in Rajan and Bialek (2012), Sharpee et al. (2004), and section 5.1 of this review) is the same as maximizing the model-averaged likelihood of equation A.5 on the left-hand side of the identity.

## Acknowledgments

We thank William Bialek and Michael J. Berry II for insightful discussions and providing critical scientific input during the course of this project. We especially thank Jonathan Victor for helpful comments on the manuscript. This work was supported in part by the Human Frontiers Science Program, the Swartz Foundation, NSF grants PHY-0957573 and CCF-0939370, the W. M. Keck Foundation, and the ANR grant OPTIMA.

## References


## Notes

^{1}

We have worked out this problem in parallel with Park and Pillow (2011).

^{2}

When we speak of the order (e.g., linear, quadratic), we refer to the order of the kernel operating on the stimulus, which can be defined unambiguously. In contrast, the order of the neural processing system as a whole depends on stimulus statistics; for example, high-order statistical structure in the stimulus can conflate first- and second-order responses of the system. Likewise, aspects of the response explained by a second-order kernel inferred through gaussian noise depend on the power spectrum of the input.

^{3}

Before moving on, it seems appropriate to return to the Wiener-Volterra formalism and contrast it with spike-triggered methods for recovering LN models. The underlying assumptions of the two approaches may appear to be substantially different: first, because of the presence of the nonlinear (N) transformation in the LN model, and second, because the output of the LN model is usually taken to predict the rate of a stochastic point process, while Wiener-Volterra series are intended for analyzing deterministic systems (Wiener, 1958). Nevertheless, it is easy to see that when uncorrelated (i.e., white) gaussian noise is used to extract the filters of the LN model using the spike-triggered average (STA) and spike-triggered covariance (STC), STA and STC also provide unbiased estimates (up to a scaling factor) of first- and second-order Wiener-Volterra kernels. The difference arises in subsequent analysis steps. In the case of LN models, STA and STC are used solely as dimensionality-reduction steps to identify the relevant subspace of the stimuli in which the nonlinear transformation acts, while in the Wiener-Volterra formalism, STA and STC are literally the first two terms in a functional expansion that provides the best least-squares fit to the observed firing rate. Victor and Johannesma (1986) have further demonstrated that the Wiener-Volterra formalism is a special case of a general probabilistic maximum entropy framework for describing the joint distributions of stimuli and the responses they evoke. In this framework, for example, the classic Wiener-Volterra formalism is recovered if the stimulus distribution is gaussian and the response variable is also gaussian with additive noise. If the output variable is binary (spike/no spike), the same maximum entropy approach reduces to identifying LN-type models with exponential nonlinearities.

^{4}

Specifically, if stimuli are nongaussian (even if spherically symmetric), there exist nonlinear functions *f* for which filter estimates given by STC will be biased. An example of such bias with the binary stimulus is given in Schwartz et al. (2006).