## Abstract

Exploratory tools that are sensitive to arbitrary statistical variations in spike train observations open up the possibility of novel neuroscientific discoveries. Developing such tools, however, is difficult due to the lack of Euclidean structure of the spike train space, and an experimenter usually prefers simpler tools that capture only limited statistical features of the spike train, such as mean spike count or mean firing rate. We explore strictly positive-definite kernels on the space of spike trains to offer both a structural representation of this space and a platform for developing statistical measures that explore features beyond count or rate. We apply these kernels to construct measures of divergence between two point processes and use them for hypothesis testing, that is, to observe if two sets of spike trains originate from the same underlying probability law. Although there exist positive-definite spike train kernels in the literature, we establish that these kernels are not strictly positive-definite and thus do not induce measures of divergence. We discuss the properties of both the existing nonstrict kernels and the novel strict kernels in terms of their computational complexity, choice of free parameters, and performance on both synthetic and real data through kernel principal component analysis and hypothesis testing.

## 1. Introduction

Detecting changes in spike train observations is fundamental to the study of neuroscience at both neuronal and systems levels. For example, when studying neural coding, it is critical to ensure that the target neuron is encoding the stimulus (or behavior) of interest, which may be reflected in arbitrary statistical changes in its spiking pattern. This requires deciding if two sets of spike trains originate from the same probability law (point process). However, this seemingly benign task is technically daunting since the space of spike trains lacks Euclidean structure, making direct application of standard tests for equality of distribution such as the chi square test, Kolmogorov-Smirnov test, or Kullback-Leibler divergence unusable (Park & Príncipe, 2010; Seth, Park et al., 2010; Naud, Gerhard, Mensi, & Gerstner, 2011). Therefore, most analyses on spike trains are done by observing changes in simpler statistics such as mean spike count (Hubel & Wiesel, 1959), mean firing rate (Perkel, Gerstein, & Moore, 1967), time to first spike (Johansson & Birznieks, 2004), interspike interval distribution (Kang & Amari, 2008; Churchland et al., 2010), or average pairwise spike train distance (Naud et al., 2011). Such partial quantification blinds the experimenter to other quantifiable statistical structures, which results in poor data utilization and even rather dire consequences in experimental science. For instance, searching for neurons in vivo using a typical statistical feature associated with a neural code results in selection bias in favor of that specific neural code. In this letter, we follow an alternate approach of designing a two-sample test on point processes through the use of strictly positive-definite kernels on spike trains, since this test does not limit itself to a particular statistical feature (Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2007; Sriperumbudur, Gretton, Fukumizu, Lanckriet, & Schölkopf, 2008).

The popularity of positive-definite kernels stems from observations that they are inner products in some feature space that is often nonlinearly related to the input space, they can be defined on rather abstract spaces such as non-Euclidean space, and they implicitly impose a structure on the space. These simple observations allow the extension of many standard linear algorithms designed on the Euclidean space to nonlinear algorithms on abstract spaces by merely substituting the linear inner product with a suitable kernel, known as the kernel trick. This approach has been widely exploited in the context of text mining, bioinformatics, and information retrieval where kernels have been defined on strings, bag of words, and graphs (Schölkopf & Smola, 2002). The usefulness of kernel methods has not gone unnoticed in the neuroscience literature, where several successful attempts have been made to design positive-definite kernels on spike trains (Shpigelman, Singer, Paz, & Vaadia, 2005; Eichhorn et al., 2004; Schrauwen & Campenhout, 2007; Paiva, Park, & Príncipe, 2009) and apply them in practical problems such as classification for brain machine interface (Shpigelman et al., 2005; Eichhorn et al., 2004) and somatosensory stimulation (Li et al., 2011). Positive-definite kernels, however, may fall short in the context of representing an arbitrary probability law—kernels with such a property have recently been explored and are called characteristic kernels (Sriperumbudur et al., 2008)—and therefore they cannot be readily applied for testing equality of distributions. As we will demonstrate in detail, the available spike train kernels in the literature are not characteristic.
It has, however, been shown that strictly positive definite (spd) kernels (a subset of positive-definite kernels) are characteristic (Sriperumbudur, Fukumizu, & Lanckriet, 2010).^{1} Given this recent development, and other interesting aspects of positive-definite kernels, we focus on exploring such kernels for spike trains.

We consider two separate approaches for designing spd kernels, where both approaches follow two simple steps: first, we injectively map the spike trains to a more structured space, and second, we define spd kernels on the latter spaces, which induces a kernel in the spike train space. The proposed approaches differ in their respective representation of the spike trains: a stratified space, which is formed by a disjoint union (direct sum) of Euclidean spaces, and a functional space, where they are represented as functions over time in *L*_{2}. In the stratified space approach, we use a novel family of kernels in the Euclidean space to form spd kernels on spike trains, while in the functional space, we exploit novel kernels in *L*_{2}. We submit that these kernels should also (1) involve as few parameters as possible, since, from a practical standpoint, a user is forced to perform an exhaustive search (evaluation) over all possible combinations of these parameters to achieve the best (statistically correct) result, and (2) be computationally efficient. Finally, we also explore what attributes of the point process these kernels are sensitive to.

The rest of the letter is organized as follows. In section 2, we discuss the notion of kernel-based divergence and dissimilarity measures and briefly explore the existing positive-definite kernels in the literature based on binning. In section 3, we introduce a family of novel strictly positive-definite kernels by representing the space of spike trains as a direct sum of Euclidean spaces. Although these kernels are efficient to compute, they suffer from estimation issues in practice. In order to alleviate this drawback, we introduce another family of strictly positive kernels in section 4 by representing a spike train as a function in an appropriate *L*_{2} space. We also examine a few existing kernels on this representation and discuss their nonstrictness. In section 5, we explore the characteristics of the existing and proposed kernels using kernel principal component analysis and compare their performance on simulated data in the context of hypothesis testing. In section 6, we apply these kernels on real neural recordings, and in section 7, we conclude with discussions on the practical aspects of these kernels and their applicability, as well as other uses of such kernels. (A Matlab implementation of the kernels and methods introduced in this letter can be found online at http://code.google.com/p/spiketrainlib.)

## 2. Background

### 2.1. Kernel-Based Divergence and Dissimilarity.

We start our discussion in an abstract topological space for technical completeness. Later we introduce a topological space for spike trains. Let *X* be a topological space and ℬ(*X*) be the Borel σ-algebra on *X*. We focus on measures on the measurable space (*X*, ℬ(*X*)).

*Let (X, ℬ(X)) be a measurable space and K: X × X → ℂ be a ℬ(X) ⊗ ℬ(X)-measurable function. The kernel K is called positive definite if and only if for any finite nonzero complex Borel measure μ, ∬ K(x, y) dμ(x) dμ\*(y) ≥ 0, where \* denotes complex conjugation. Furthermore, if the inequality is strict, then the kernel K is called strictly positive definite.*

Notice that this definition of (integrally) strict positive definiteness (Sriperumbudur, Fukumizu, Gretton, Lanckriet, & Schölkopf, 2009) is a generalization of the usual definition that involves countable summation rather than integration (Pinkus, 2004).

Given a pd kernel *K*, a dissimilarity measure between two probability laws *P* and *Q* can be defined as

D(*P*, *Q*) = ∬ *K*(*x*, *y*) d(*P* − *Q*)(*x*) d(*P* − *Q*)(*y*). (2.1)

Due to the positive-definiteness of *K*, D is nonnegative, and *P* = *Q* implies D(*P*, *Q*) = 0. A divergence measure, on the other hand, can be constructed by incorporating an spd kernel instead of a pd kernel in equation 2.1 (Gretton et al., 2007). Due to the strict positive-definiteness of *K*, D is nonnegative, and zero if and only if *P* = *Q*. In fact, one can show that the resulting divergence (dissimilarity) is a squared (pseudo-)metric induced from the distance in the reproducing kernel Hilbert space (RKHS) (Diks & Panchenko, 2007; Sriperumbudur et al., 2009).
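As an illustrative sketch (in Python, not part of the spiketrainlib package), the plug-in estimator of equation 2.1 can be written for real-valued samples; the Gaussian and linear kernels below are stand-ins chosen to contrast a divergence with a mere dissimilarity:

```python
import numpy as np

def divergence(X, Y, kernel):
    """Plug-in estimator of equation 2.1:
    D(P, Q) = E_PP[K] - 2 E_PQ[K] + E_QQ[K]."""
    Kxx = kernel(X[:, None], X[None, :]).mean()
    Kyy = kernel(Y[:, None], Y[None, :]).mean()
    Kxy = kernel(X[:, None], Y[None, :]).mean()
    return Kxx - 2 * Kxy + Kyy

# A strictly pd Gaussian kernel yields a divergence; the linear kernel
# K(x, y) = x*y yields only a dissimilarity (difference of means).
gauss = lambda x, y: np.exp(-(x - y) ** 2)
linear = lambda x, y: x * y

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, 500)  # samples from N(0, 1)
Q = rng.normal(0.0, 2.0, 500)  # N(0, 4): same mean, different variance

# The linear-kernel dissimilarity equals the squared difference of the
# sample means (typically small here), while the Gaussian-kernel
# divergence detects the variance difference.
```

The identity of the linear-kernel case can be checked directly: `divergence(P, Q, linear)` coincides with `(P.mean() - Q.mean()) ** 2`.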

A dissimilarity measure is a nonnegative statistic that assumes zero value if two probability laws are identical (Pekalska & Duin, 2005). For example, if *K*(*x*, *y*) = *xy* and *X* ~ *P*, *Y* ~ *Q*, then D(*P*, *Q*) = (E[*X*] − E[*Y*])², that is, this pd kernel compares only the means of the random variables *X* and *Y*. A dissimilarity statistic thus creates an equivalence class where two probability laws are indistinguishable even if they are not the same, and the statistical inference operates only on the quotient space. A divergence measure, on the other hand, is a nonnegative statistic that assumes zero value if and only if two probability laws are identical (Sriperumbudur et al., 2009); thus, it compares the entire probability law. Therefore, divergence is a stricter and stronger statistic than dissimilarity.

This feature of an spd kernel can be better explained in terms of the characteristic kernel (Sriperumbudur et al., 2008). Mathematically, a kernel *K* is called characteristic if the map *P* ↦ ∫ *K*(·, *x*) d*P*(*x*) is injective. This condition implies that a characteristic kernel generates a unique functional representation of the probability law in the reproducing kernel Hilbert space, which cannot be achieved by a pd kernel in general, as shown in the case of *K*(*x*, *y*) = *xy*. In terms of a characteristic kernel, therefore, a measure of divergence can be derived by measuring the distance between these unique functional representations of the corresponding probability laws. This distance is exactly given by the square root of equation 2.1. It has recently been shown that spd kernels are characteristic in nature, which explains the advantage of spd kernels over pd kernels in the context of hypothesis testing (Sriperumbudur et al., 2009).

### 2.2. Binned Kernel.

One of the main difficulties of dealing with spike trains is the lack of a natural structural representation, such as a Euclidean structure (Seth, Park et al., 2010). Among the several existing approaches to inducing structure in this space, the simplest and most popular is binning (Victor, 2002). In simple terms, binning counts the number of spikes that appear in prefixed segments of time referred to as bins, thus effectively representing a spike train on a grid in a Euclidean space whose dimensionality is governed by the number of bins. Although this approach is widely used in practical applications for its simplicity and effectiveness, it has the disadvantages that it considers only the number of spikes within a bin rather than their exact locations and that it introduces discontinuity in time. The exact locations can, in principle, be captured by reducing the size of the bins; however, doing so also increases the dimensionality of the Euclidean representation, increasing the data requirements for proper estimation.

Since the binned representation projects a spike train into a Euclidean space, a kernel on spike trains can easily be implemented using kernels on the Euclidean space, such as the *l*_{2} inner product or the gaussian kernel. However, this approach is not preferred in practice, since the dimensionality of this space (i.e., the number of bins) is usually high, making the realizations relatively sparse. In addition, the kernels cannot easily incorporate the similarity of spikes in temporally neighboring bins: spikes from different bins do not interact. The count kernel *K*_{count}, defined as the product of the total numbers of spikes, is the extreme case where there is only one bin. Note that the dissimilarity induced by the count kernel is simply the squared difference between the mean numbers of spikes.
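A minimal sketch of the binned representation and the count kernel (Python; the spike times and bin width are hypothetical choices):

```python
import numpy as np

def bin_spike_train(spikes, T, bin_width):
    """Map a spike train (spike times in [0, T)) to a vector of per-bin
    spike counts -- the binned Euclidean representation."""
    n_bins = int(np.ceil(T / bin_width))
    counts, _ = np.histogram(spikes, bins=n_bins, range=(0.0, T))
    return counts

def count_kernel(x, y):
    """K_count: product of the total spike counts (the one-bin extreme)."""
    return len(x) * len(y)

x = [0.1, 0.35, 0.7]            # hypothetical spike times (seconds)
y = [0.12, 0.4, 0.72, 0.9]
bx = bin_spike_train(x, T=1.0, bin_width=0.25)  # counts per 250 ms bin: [1, 1, 1, 0]
```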

Shpigelman et al. (2005) have considered a different approach to designing kernels on the binned representation, where the resulting nontrivial kernel has been referred to as a spikernel. The spikernel is perhaps the earliest kernel designed for spike trains that has been explored in the literature. The biologically motivated design of this kernel relies on measuring the similarity between arbitrary segments (of equal lengths) of the binned spike trains. The spikernel is positive-definite by design, since it explicitly constructs a mapping to the feature space and computes the inner product between two mapped spike trains to evaluate the kernel. However, this kernel is not strictly positive-definite. Consider two spike trains such that one is a jittered version of the other, but both have the same spike counts at a given bin resolution. Then the Gram matrix formed by these two spike trains has zero determinant. In fact, any kernel defined on the binned representation is only positive-definite but not strictly positive-definite, because binning maps distinct spike trains to the same representation.
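This non-strictness argument can be checked numerically; the jitter amount and bin width below are arbitrary choices:

```python
import numpy as np

# Two distinct spike trains whose binned representations coincide:
# a jitter smaller than the bin width is invisible to any binned kernel.
T, width = 1.0, 0.25
x = np.array([0.10, 0.60])
y = np.array([0.20, 0.70])      # jittered, but same per-bin counts

edges = np.arange(0.0, T + width, width)
bx, _ = np.histogram(x, edges)
by, _ = np.histogram(y, edges)
assert np.array_equal(bx, by)   # identical binned vectors

# Any kernel evaluated on the binned vectors therefore gives identical
# rows for x and y, so the 2x2 Gram matrix is singular (determinant 0).
gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2))
G = np.array([[gauss(bx, bx), gauss(bx, by)],
              [gauss(by, bx), gauss(by, by)]])
```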

It should be noted that the spikernel involves five free parameters, including the bin size, and it is computationally expensive. Although an appropriate value of the bin size can be estimated from the time resolution of the spike train, there are no formal methods for choosing the other parameters. Therefore, it is suggested that a user perform an exhaustive search over the parameter space, making the kernel difficult to use in practice. We observe in our simulations that this kernel performs very well even compared to the strictly positive-definite kernels. A possible explanation of this observation is its higher degrees of freedom.

A similar approach has also been explored by Eichhorn et al. (2004), who applied an edit distance (alignment score) between strings to binned spike trains to construct an explicit mapping to the feature space. This approach has very similar advantages and disadvantages.

## 3. Stratified Kernel

Because kernels on the binned representation cannot be strictly positive-definite, it is essential to investigate other representations of the spike train space. In this section, we follow the stratified representation and introduce a set of strictly positive-definite kernels on spike trains.

### 3.1. Stratified Representation.

Let Ω be the set of all finite-length spike trains; that is, each ω ∈ Ω can be represented by a finite set of (distinct) action potential timings ω = {*t*_{1}, …, *t*_{*n*}}, where *n* is the number of spikes. Assuming a finite number of action potentials within the time interval of interest *T* = [0, *T*), one can partition the non-Euclidean spike train space Ω into disjoint partitions {Ω_{*n*}} such that Ω_{*n*} contains all possible spike trains with exactly *n* action potentials within *T*. We call this method stratification (Victor, 2002). Notice that Ω = ⋃_{*n*} Ω_{*n*}. Ω_{0} has only one element representing the empty spike train (no action potential). For *n* ≥ 1, Ω_{*n*} is essentially ℝ^{*n*}, since *n* action potentials can be fully described by *n* (ordered) time instances and vice versa. Without loss of generality, let *t*_{1} < ⋯ < *t*_{*n*}, hence obtaining a Euclidean space representation of each Ω_{*n*}. Define a Borel algebra of Ω by the σ-algebra generated by the union of the Borel sets defined on each Ω_{*n*}. Note that any measurable set *A* can be partitioned into {*A*_{*n*} = *A* ∩ Ω_{*n*}}, each measurable in the corresponding measurable space. Any finite-point process can then be defined as a probability measure *P* on this measurable space (Daley & Vere-Jones, 1988).
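A minimal sketch of stratification (Python; `stratify` is a hypothetical helper, not from the paper's implementation):

```python
from collections import defaultdict

def stratify(spike_trains):
    """Partition spike trains into strata Omega_n by spike count; within a
    stratum, a train with n spikes is a point in R^n (ordered times)."""
    strata = defaultdict(list)
    for train in spike_trains:
        strata[len(train)].append(sorted(train))
    return dict(strata)

trains = [[0.2, 0.5], [0.1], [0.6, 0.9], [], [0.3]]
strata = stratify(trains)
# strata[0] holds the empty train, strata[1] two points in R^1,
# and strata[2] two points in R^2.
```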

### 3.2. Generic Kernel Design.

Although the stratification approach has been used by Victor (2002) to estimate mutual information, it has not been explicitly exploited to design kernels. However, it provides an opportunity to build kernels on Ω by combining kernels defined on the individual strata Ω_{*n*}: a kernel that agrees with a strictly positive-definite kernel on each stratum and vanishes across strata is itself strictly positive-definite on Ω.

### 3.3. Class of Kernels.

The individual kernels on each stratum can be chosen from a generic kernel as follows:

Let 𝒮 be the Schwartz space of rapidly decreasing functions. A function *g* is λ-admissible if for every φ ∈ 𝒮, the integral equation ∫ *g*(*x*, *u*) *f*(*u*) dλ(*u*) = φ(*x*) has a solution *f*.

If *g* is a λ-admissible or a strictly positive-definite function and λ is a measure, then the following kernel is strictly positive-definite on ℝ: *K*(*x*, *y*) = ∫ *g*(*x*, *u*) *g*\*(*y*, *u*) dλ(*u*).

The proof of this theorem can be found in Seth, Rao, Park, and Príncipe (2010, theorem 1).

Examples of basis functions *g*(*x*, *u*) used for composition kernels are *e*^{*ixu*}, the gaussian *e*^{−(*x*−*u*)²}, and the indicator 𝟙(*u* ≤ *x*) for *x*, *u* ∈ ℝ. In ℝ^{*n*}, we can use the tensor product kernels *K*(**x**, **y**) = ∏_{*i*} *K*_{1}(*x*_{*i*}, *y*_{*i*}).

There are a few notable special cases of this type of kernel. Notice that if *K*^{(n)}_{1} = 1 for all *n*, which induces a positive-definite (but not strictly positive-definite) kernel, the dissimilarity reduces to a Cramér–von Mises (C-M) type statistic between the count distributions of the two point processes. Intuitively, this kernel compares only whether two spike trains have the same number of spikes, and therefore it is not spd. Gaussian kernels on ℝ^{*n*} can also be used to construct spd kernels on the space of spike trains. However, this process requires choosing kernel sizes for each stratum.

Alternatively, the indicator basis yields a parameter-free spd kernel whose induced divergence can be written in terms of *F*, the cumulative distribution function on each stratum. Intuitively, this kernel not only compares whether two spike trains have the same spike counts, but also compares the positions of the spikes. Notice that it is very similar to the C-M type divergence proposed in Seth, Park et al. (2010). The advantage of this type of kernel is that it is parameter free, unlike the gaussian kernel.
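A sketch of one stratified kernel instance, assuming a Gaussian kernel within each stratum and zero interaction across strata (the kernel size `sigma` is a hypothetical choice):

```python
import numpy as np

def stratified_gauss(x, y, sigma=0.1):
    """Stratified kernel sketch: a Gaussian kernel on R^n within each
    stratum, and 0 across strata (unequal spike counts never interact)."""
    if len(x) != len(y):
        return 0.0              # different strata
    if len(x) == 0:
        return 1.0              # both empty: identical elements of Omega_0
    x = np.sort(np.asarray(x, float))
    y = np.sort(np.asarray(y, float))
    return float(np.exp(-np.sum((x - y) ** 2) / sigma ** 2))
```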

### 3.4. Computational Issues.

The kernels defined on the stratification space are efficient to compute and easy to design. However, this type of kernel suffers from estimation issues for small sample sizes, especially when the count distribution is flat (not concentrated). In such circumstances, the samples become scattered across the strata Ω_{*n*}, and the divergence becomes difficult to evaluate for any particular stratum. In addition, the stratification approach suffers from the curse of dimensionality when the number of spikes in a spike train is large. Hence, it may be necessary to reduce the time window of observation to reduce this dimensionality.

## 4. Functional Kernel

It is somewhat obvious that the estimation issues of the stratified kernel are a direct consequence of the fact that, for such kernels, two spike trains with different spike counts do not interact. To alleviate this issue, we follow a different representation.

### 4.1. Functional Representation.

Let ℱ be the space of all square-integrable functions over *T* = [0, *T*), that is, ℱ = *L*_{2}(*T*). Given a smoothing function *g*, define a mapping *G*: Ω → ℱ as *G*(ω) = Σ_{*i*} *g*(*t*_{*i*}, ·) such that *G* is injective. When *g* is a translation-invariant function that decays at infinity, *G*(ω) can be considered a smoothed spike train (van Rossum, 2001).

There are many different *g*'s that make the mapping *G* injective—for example, when *g*(*x*, *y*) is a bounded strictly positive-definite function. To see this, consider two spike trains ω and ω′ such that all spike timings are distinct, and assume that *G*(ω) = *G*(ω′). This equality implies that Σ_{*k*} *a*_{*k*} *g*(*s*_{*k*}, ·) = 0, where {*s*_{*k*}} is the union of the spike timings of ω and ω′ and *a*_{*k*} ∈ {−1, +1} indicates which train each timing belongs to. Since the Gram matrix **K** formed by *g* on {*s*_{*k*}} is full rank (by virtue of the strict positive-definiteness of *g*), this implies that *a*_{*k*} = 0 for all *k*, a contradiction; hence ω = ω′. Popular continuous-time spike train smoothing by linear filtering with a rectangular function or an alpha function (Dayan & Abbott, 2001) is also injective, given that the width of the rectangular function is not more than the length of *T*.
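A discretized sketch of the mapping *G*, assuming a causal exponential smoothing function (the time constant `tau` and the grid step are hypothetical choices):

```python
import numpy as np

def smooth(spikes, tau=0.05, T=1.0, dt=1e-3):
    """Discretized functional representation G(omega): superpose a causal
    exponential g(t) = exp(-t/tau) * 1[t >= 0] at each spike time."""
    t = np.arange(0.0, T, dt)
    f = np.zeros_like(t)
    for s in spikes:
        f += np.where(t >= s, np.exp(-(t - s) / tau), 0.0)
    return t, f

t, f = smooth([0.2, 0.6])
# f jumps by 1 at each spike and decays with time constant tau in between.
```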

Since *G* is measurable, we can define an induced probability measure *U* on ℱ such that *U*(*G*(*A*)) = *P*(*A*) for any measurable *A* ⊆ Ω. A (strictly) positive-definite kernel on Ω can then be defined using a (strictly) positive-definite kernel on the *L*_{2} space, since *K*(ω, ω′) = *K*_{*L*₂}(*G*(ω), *G*(ω′)), where *K*_{*L*₂} is the kernel on *L*_{2}.

### 4.2. Existing pd Kernels.

The memoryless cross intensity (mCI) kernel *K*_{mCI} is generated by the trivial inner product kernel in the *L*_{2} space, *K*_{mCI}(ω, ω′) = ⟨*G*(ω), *G*(ω′)⟩; thus the name. Notice that several smoothing functions can be used in this context. For example, the use of the exponential smoothing function is biologically motivated (van Rossum, 2001), and it is computationally simpler due to the characteristic of the Laplacian function (Rao et al., 2011): the inner product of two exponential functions is a Laplacian function (Paiva et al., 2009), while the inner product of two gaussian functions results in a gaussian function. Notice that *K*_{mCI} induces a spike train reproducing kernel Hilbert space where a van Rossum–like distance (van Rossum, 2001) is induced by the inner product, and this kernel involves only one free parameter (i.e., the size of the smoothing function). Similar kernels have been proposed by Schrauwen and Campenhout (2007), Houghton (2009), and Paiva, Park, and Príncipe (2010). However, it is easy to see that these kernels are not strictly positive-definite. For example, consider the spike trains ω_{1} = {*t*_{1}}, ω_{2} = {*t*_{2}}, and ω_{3} = {*t*_{1}, *t*_{2}}. Since *G*(ω_{3}) = *G*(ω_{1}) + *G*(ω_{2}), the Gram matrices generated by these kernels are singular. Therefore, the dissimilarity statistic is not a divergence.
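The closed-form evaluation can be sketched as follows, assuming causal exponential smoothing and integration over [0, ∞) (ignoring truncation at *T*); the singular Gram matrix for three trains of the form {*t*₁}, {*t*₂}, {*t*₁, *t*₂} then follows from the linearity of *G*:

```python
import numpy as np

def k_mci(x, y, tau=0.05):
    """mCI kernel <G(x), G(y)> with causal exponential smoothing
    exp(-t/tau): the inner product of two such exponentials placed at
    t_i and t_j, integrated over [0, inf), is (tau/2) exp(-|t_i - t_j|/tau),
    so the kernel reduces to a double sum of Laplacian evaluations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if len(x) == 0 or len(y) == 0:
        return 0.0
    return float(0.5 * tau *
                 np.exp(-np.abs(x[:, None] - y[None, :]) / tau).sum())

w1, w2, w3 = [0.2], [0.6], [0.2, 0.6]
G = np.array([[k_mci(a, b) for b in (w1, w2, w3)] for a in (w1, w2, w3)])
# Since G(w3) = G(w1) + G(w2) in L2, row 3 = row 1 + row 2: det(G) = 0.
```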

The nonlinear cross intensity (nCI) kernel *K*_{nCI} (Paiva et al., 2009), which applies a gaussian kernel to the smoothed spike trains pointwise in time, has been observed to perform better than *K*_{mCI}, a possible explanation being the nonlinear nature of the kernel. However, it should be noted that it involves one more free parameter, the kernel width of the outer gaussian kernel. Also, this kernel is difficult to compute, and a rectangular smoothing function is used to reduce the computational complexity (Paiva et al., 2009). However, this kernel is again not strictly positive-definite, which can again be shown by a counterexample: for suitably chosen spike trains ω_{1}, ω_{2}, ω_{3}, and ω_{4}, the resulting Gram matrix is singular.

### 4.3. Schoenberg Kernel.

In order to establish a strictly positive-definite kernel on this representation, we generalize the radial basis–type kernel on Euclidean space.

A function φ: [0, ∞) → ℝ is said to be *completely monotone on* [0, ∞) if φ is continuous at zero, infinitely differentiable on (0, ∞), and (−1)^{*l*} φ^{(*l*)}(*x*) ≥ 0 for all *x* > 0 and *l* = 0, 1, 2, …, where φ^{(*l*)} denotes the *l*-th derivative.

Examples of completely monotone functions on [0, ∞) are *e*^{−α*x*} and (*x* + α)^{−β}, where α and β are positive constants (Berg, Christensen, & Ressel, 1984).

(Schoenberg kernel on *L*_{2}). If a function φ is completely monotone on [0, ∞) but not a constant function, then *K*(*f*, *g*) = φ(‖*f* − *g*‖²) is strictly positive-definite on a separable and compact subset of *L*_{2}, where ‖·‖ denotes the *L*_{2} norm.

(sketch). Since φ is a completely monotone function, it has a Laplace transform representation φ(*x*) = ∫₀^∞ *e*^{−*tx*} d*m*(*t*), where *m* is a finite positive Borel measure on [0, ∞). Following Christmann and Steinwart (2010, theorem 2.2), *e*^{−*t*‖*f*−*g*‖²} is universal on a separable compact subset of *L*_{2} for *t* > 0, and universality implies strict positive-definiteness (Sriperumbudur et al., 2010).

Note that the proof relies on the restriction of a maximum number of spikes within a compact interval *T*, which is a biological and experimental restriction. Under this restriction, the space of spike trains is compact (and separable), and hence the image of Ω in *L*_{2} is also compact (and separable).

### 4.4. Instances of Schoenberg Kernel.

One instance takes the smoothing function *g* to be an exponential function; we refer to this kernel as Schoenberg (e), *K*^{e}_{Sch}. Notice that this kernel has been mentioned by Paiva et al. (2009), but it has not been explored further, and its strict definiteness has not been established; instead, Paiva et al. (2009) have suggested the kernel *K*_{nCI} as a substitute and simplification of *K*^{e}_{Sch}. Notice that if the smoothing process is a linear filter, the bounded norm of the smoothed representation holds when the smoothing filter is bounded and the total number of spikes is bounded. This condition virtually always holds for bounded-duration spike trains.

*K*^{e}_{Sch} also involves an extra free parameter besides the smoothing parameter. However, the proposed approach allows other alternatives, which might not involve any smoothing parameter. One such option is the Heaviside function. It is easy to see that the smoothed representation of a spike train using a Heaviside function is simply the realization of the corresponding counting process. To differentiate the use of the Heaviside function, we refer to the resulting kernel as Schoenberg (i), *K*^{i}_{Sch}. Notice that using the Heaviside function is also computationally simpler, since the inner product of two smoothed functions depends only on the locations of the spikes in an ordered set. Conceptually, however, due to the cumulative temporal counting nature of this approach, the resulting kernel depends on the initial spikes more than the later ones. We observe that this kernel nonetheless works well in practice, with the advantage of using one less free parameter.
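A sketch of Schoenberg (i), assuming integration over [0, *T*) so that the inner product of two unit steps placed at *a* and *b* is *T* − max(*a*, *b*); the outer completely monotone function is taken to be an exponential with a single size parameter:

```python
import numpy as np

def ip_heaviside(x, y, T=1.0):
    """<G(x), G(y)> on [0, T) when g is a Heaviside step: the smoothed
    train is the counting process, and the inner product of two unit
    steps placed at a and b is T - max(a, b)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if len(x) == 0 or len(y) == 0:
        return 0.0
    return float((T - np.maximum(x[:, None], y[None, :])).sum())

def k_schoenberg_i(x, y, sigma=1.0, T=1.0):
    """Schoenberg (i): phi(||G(x) - G(y)||^2) with phi(r) = exp(-r/sigma)."""
    d2 = (ip_heaviside(x, x, T) - 2 * ip_heaviside(x, y, T)
          + ip_heaviside(y, y, T))
    return float(np.exp(-max(d2, 0.0) / sigma))
```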

Notice that, compared to the stratified kernel, functional kernels do not suffer from estimation issues in extreme conditions of high counts and low sample size, since they can compare two spike trains with different numbers of spikes. But this is achieved at the cost of higher computational complexity. We provide a detailed description of these kernels in Table 1. This table also includes a recently proposed kernel *K*_{reef} (Fisher & Banerjee, 2010), specialized for spike response model 0 with reciprocal exponential exponential function (REEF) activation. This kernel integrates out the unknown parameters, hence eliminating the need to estimate any free parameters. However, the strict definiteness of this kernel has not been established.

Table 1.

| Name | Kernel | spd | Time Complexity | Number of Parameters | Reference |
| --- | --- | --- | --- | --- | --- |
| Count | | No | | 0 | |
| Spikernel^{a} | | No | | 5 | Shpigelman et al. (2005) |
| mCI | | No | | 1 | Paiva, Park, and Príncipe (2007) |
| nCI | | No | | 2 | Paiva et al. (2009) |
| Schoenberg (e) | | Yes | | 2 | Paiva et al. (2009) |
| Schoenberg (i) | | Yes | | 1 | |
| Stratified | | Yes | | 1 | |
| REEF | | Unknown | | 0 | Fisher and Banerjee (2010) |

Notes: *N* is the average number of spikes per spike train. *f*_{r} is a rectangular smoothing function, and the time interval is [0, *T*). A gaussian kernel on the Euclidean distance is used where indicated, and *d*_{mCI} is the distance induced by the mCI kernel. The Heaviside function underlies Schoenberg (i). **B** denotes a binned spike train, |·| is cardinality, *I*_{n,m} is the set of segments of length *n* from total length *m*, and **B**_{i} is a segment of **B** with first element at *i*_{1}. Unfortunately, we could not find an available implementation of the kernel from Eichhorn et al. (2004) and could not implement it ourselves due to lack of details.

^{a} The feature map takes the binned spike train to the feature space, where the kernel is evaluated as the inner product.

## 5. Features

So far we have discussed kernels and the induced dissimilarities and divergences, but we have not addressed what they are actually sensitive to when applied to data. In principle, dissimilarities are sensitive only to certain statistical differences, while divergences are sensitive to arbitrary statistical differences; in practice, however, their sensitivity depends on the choice of kernel and is limited by data. We explore the issue of sensitivity in two ways: decomposition of the divergence into orthonormal projections using kernel principal component analysis, and sensitivity analysis for explicit features using hypothesis testing.

### 5.1. Kernel PCA-Based Decomposition.

Principal component analysis (PCA) is widely used to extract features—linear projections of data in a finite-dimensional Euclidean space obtained by finding directions of greatest variance. These features, or principal components (PCs), are simply obtained by the eigendecomposition of the data covariance matrix. Likewise, in kernel PCA (KPCA), one decomposes the covariance operator of the Hilbert space to find PCs that are now feature functionals, nonlinearly related to the input space (Schölkopf, Smola, & Müller, 1998). The PCs are mutually orthogonal and depend on the data as well as the kernel. Each kernel gives a different set of features, and the set of features for the same kernel depends on the spike trains at hand. This makes the interpretation of features difficult, yet it provides useful qualitative insight for interpreting kernel-induced divergence.

#### 5.1.1. Divergence and KPCA Projections.

Let *X* = {*z*_{*i*}} be the set of spike trains pooled from the two samples, and let **K** be the Gram matrix generated by these spike trains, *K*_{*ij*} = *K*(*z*_{*i*}, *z*_{*j*}), where *K* is a predefined kernel, as discussed in the previous section. Then it can easily be seen that the estimated divergence (dissimilarity) is a quadratic form in **K**. Now let **A** be the matrix of eigenvectors obtained from the eigendecomposition related to KPCA, **K** = **AΛA**^{T}, where **Λ** is a diagonal matrix containing the eigenvalues of the respective eigenvectors; the corresponding eigenfunctions in the RKHS are simply ψ_{*k*} = Σ_{*i*} *A*_{*ik*} *K*(·, *z*_{*i*}), where the *z*_{*i*}'s are elements of *X*. Next, let **b**_{*k*} = **KA**_{*k*}, where **b**_{*k*} is simply the projection of the spike trains in *X* on the *k*th eigenfunction; then **b**_{*k*} = λ_{*k*}**A**_{*k*}. Notice that a normalizing factor is essential, since the eigenfunctions are orthogonal but not orthonormal in this formulation: ⟨ψ_{*k*}, ψ_{*l*}⟩ = λ_{*k*}δ_{*kl*}. Then the sum over *k* of the squared distances between the means of the projections, each weighted by 1/λ_{*k*}, recovers the estimated divergence, where the λ_{*k*}'s appear again since the projections exist on an orthogonal, not orthonormal, basis. Therefore, while discriminating two distributions through the KPCA projections, we plot the projections on the two eigenvectors whose divergence components are maximum, since they explain most of the divergence. Also, notice that KPCA is usually performed on the centered matrix **K**_{c} = **HKH**, where **H** = **I** − (1/*n*)**11**^{T}. However, centering does not change the relation between divergence and the projections, since **H** is simply absorbed in the eigenvector matrix **A**.
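This decomposition can be verified numerically; the scalar surrogates and Gaussian kernel below are stand-ins for spike trains and a spike train kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy Gram matrix over n + m points (first n from P, last m from Q),
# built from a Gaussian kernel on scalar surrogates of the spike trains.
n, m = 20, 20
z = np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, m)])
K = np.exp(-(z[:, None] - z[None, :]) ** 2)

# Divergence estimate as a quadratic form with signed weights c.
c = np.concatenate([np.ones(n) / n, -np.ones(m) / m])
D = c @ K @ c

# Eigendecomposition K = A Lam A^T; projections b_k = K a_k = lam_k a_k.
lam, A = np.linalg.eigh(K)
keep = lam > 1e-10                                  # drop null directions
contrib = (A[:, keep].T @ K @ c) ** 2 / lam[keep]   # (c^T b_k)^2 / lam_k
# Summing the per-PC contributions recovers the divergence estimate D.
```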

#### 5.1.2. Spike Trains with Count and Location Features.

In Figure 1, we visualize how a set of periodic spike trains with varying period and phase is embedded in the RKHS induced by each kernel. The data set consists of 16 spike trains for each of six different periods, each shifted such that the spike times uniformly span the interval (96 spike trains in total). The two-dimensional projection shows that most kernels preserve the following features: (1) count (number of spikes, or equivalently period) and (2) phase. Notice that for each kernel, we choose the free parameters from a grid of values that provides the maximum value of the divergence (see section 5.2 for details). The count kernel captures the first feature with PC1, and no other eigenvalue is significant, since the centered kernel matrix is rank 1. *K*_{spikernel}, *K*_{mCI}, *K*_{nCI}, *K*^{e}_{Sch}, and *K*_{reef} also seem to favor having the count feature as PC1; in addition, PC2 mostly represents the phase feature, that is, the positions of the projections in this direction are ordered according to their phase. However, such interpretation should be made cautiously, since count and phase are not necessarily orthogonal in the RKHS. For example, *K*^{i}_{Sch} mixes the two features along many PCs, and *K*_{strat} reveals strata as maximally separated clusters.

Next, we form two sets of spike trains by dividing the periodic spike trains of Figure 1 in two different ways to focus on each feature. Set A divides them in terms of count while maintaining uniformity over phase within each condition. Set B divides the spike trains within each stratum (of equal number of spikes) into early-phase and late-phase ones, thus holding the count feature constant. Therefore, sets A and B have the same KPCA features, but the differences within each are due to independent features (count statistic and location statistic; McFadden, 1965; Daley & Vere-Jones, 1988).

All kernels successfully reject the null for set A, as could be expected from the sensitivity to count shown in Figure 1. In Figure 2, spike trains from each condition are projected onto the PCs with the highest divergence components. Interestingly, PC1 captured the difference best, except for the stratified kernel, where the divergence was spread over several PCs. The projections show clear separation of the two clusters; however, the individual projections reveal distinct structures of the feature space.

For set B (see Figure 3), the count statistic was held constant; therefore, *K*_{count} trivially fails to discriminate them. Except for the count kernel, the separation is clearly made, and again the distinct structure of the feature space contains the phase information. For the stratified kernel, since the differences in phase are present in each of the six strata, the divergence components are also spread. Note that the sets of PCs with high divergence are often disjoint for sets A and B (*K*_{spikernel}, *K*_{nCI}, *K*^{i}_{Sch}, *K*^{e}_{Sch}, and *K*_{strat}), an indication that the count and location features are not mixed for this data set.

Although in this example the two sets of spike trains were disjoint, and hence separable in feature space, this is not crucial. We emphasize that it is not necessary for the kernel to cluster the sample points to derive a good divergence measure; only the distance between the mean elements matters. Even if the supports of the projections of the two classes completely overlap, the difference between the probability laws will, in theory, be captured by the distance between the corresponding mean elements, provided the kernel is spd.
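The role of the mean elements can be made concrete with a small sketch. Assuming a precomputed Gram matrix over the pooled spike trains (the kernel itself, e.g., *K*_{nCI} or a Schoenberg kernel, is left abstract, and the function name is ours), the biased empirical squared distance between the mean elements of two samples reduces to block averages of that matrix:

```python
import numpy as np

def mmd2_biased(G, n):
    """Biased empirical squared distance between the mean elements of
    two samples, computed from the Gram matrix G over the pooled data.

    G : (n + m, n + m) kernel matrix over the pooled spike trains,
        where the first n rows/columns come from the first sample.
    """
    Kxx = G[:n, :n]   # kernel values within the first sample
    Kyy = G[n:, n:]   # kernel values within the second sample
    Kxy = G[:n, n:]   # cross terms between the two samples
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```

If the two samples come from the same law, this quantity concentrates around zero; for an spd kernel, the population value is nonzero whenever the two probability laws differ.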

### 5.2. Hypothesis Testing.

In the previous section, we used KPCA to qualitatively assess feature sensitivity. Next, we quantify the power of each kernel with hypothesis testing. Given two sets of spike trains, we decide whether the underlying point processes that generated them are distinct. By comparing the performance in detecting different statistics, we can quantify which features of interest each kernel is sensitive to. We also compare against the Wilcoxon rank sum test, a nonparametric test widely used in neuroscience.

- •
*Surrogate test*. To generate a distribution of the statistic under the null hypothesis, we use the pooled observations as surrogates. Given two sets of realizations of sizes *n* and *m*, we randomly divide the pooled observations into segments of sizes *n* and *m* to generate the surrogate values. This procedure can be implemented efficiently by appropriately shuffling the Gram matrix generated by the pooled observations (Gretton et al., 2007; Gretton, Fukumizu, Harchaoui, & Sriperumbudur, 2009). Each bootstrap used 9999 random shufflings of the kernel matrix. For the simulation studies, we report the rejection rate of the hypothesis over 200 independent trials.
- •
*Parameter Selection*. Although strictly positive-definite kernels provide an efficient way of inducing measures of divergence in the space of spike trains, appropriate hypothesis testing requires choosing suitable values for the free parameters, since the discriminability of two distributions relies heavily on the characteristics of the underlying kernel. Unfortunately, there are no established methods for this task. A trivial solution would be to test the hypothesis over different parameter values and choose the one that provides the best discriminability, but this approach faces the problem of multiple testing. To avoid this, we follow a different approach, proposed by Sriperumbudur et al. (2009), and redefine the measure of divergence as the supremum of the divergence over a set of kernels (e.g., the same kernel over different parameter values). It has been shown that this supremum is a valid measure of divergence if at least one kernel in the set is strictly positive-definite (Sriperumbudur et al., 2009). We exploit this fact and use the supremum instead of a single-kernel divergence to avoid multiple testing. We employ a grid of parameter values to construct the kernel set as follows.
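The Gram-matrix shuffling used in the surrogate test can be sketched as a generic permutation test, with the test statistic passed in as a function (both names are ours, not from the letter):

```python
import numpy as np

def surrogate_pvalue(G, n, stat, n_shuffles=9999, seed=0):
    """Permutation p-value of a Gram-matrix statistic under the null.

    Rather than recomputing any kernels, we permute the rows and
    columns of the pooled Gram matrix G, which is equivalent to
    redividing the pooled observations into surrogate samples of
    sizes n and m.
    """
    rng = np.random.default_rng(seed)
    observed = stat(G, n)
    exceed = 0
    for _ in range(n_shuffles):
        p = rng.permutation(G.shape[0])
        if stat(G[np.ix_(p, p)], n) >= observed:
            exceed += 1
    # add-one correction keeps the p-value strictly positive
    return (exceed + 1) / (n_shuffles + 1)
```

The null hypothesis is rejected when the returned p-value falls below the chosen significance level.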

As mentioned before, spikernel uses a binned representation. To reduce the computational load, we fix the bin size such that the number of bins is 20 for the time interval of interest. Besides bin size, spikernel has four extra parameters (Shpigelman et al., 2005). We fix *p _{i}*=1 and evaluate the kernel for 12 different parameter settings obtained by combining values within the range of parameters the authors chose (Shpigelman et al., 2005). We use the C implementation provided in the Spider package (http://people.kyb.tuebingen.mpg.de/spider).

The free parameters for the other kernels are chosen to be the 0.1, 0.5, and 0.9 quantiles of the numerator obtained by randomly subsampling spike trains, in addition to half of the 0.1 quantile and twice the 0.9 quantile (for a total of five choices). This is a variation of a common practice in kernel methods to match the kernel size to the scale of the data (Schölkopf, 1997). When the parameters are nested (*K*_{nCI}, *K*^{e}_{Sch}), the overall scale parameter was chosen for each value of the inner parameter (e.g., the size of the smoothing), resulting in 25 parameter pairs in total.

For the stratified kernel *K*_{strat}, we use a gaussian kernel for each dimension. For each dimension, we scale the kernel size so that it matches the empirical mean *l*_{2} distance of a homogeneous Poisson process conditioned on the total number of spikes. The mean *l*_{2} distance of a homogeneous Poisson process in each stratum is a slowly growing function of the dimension, and its growth rate quickly diminishes. The performance of such scaling is similar to using a single kernel size for all dimensions and does not change the conclusions.
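The quantile-based parameter grid and the supremum over it can be sketched as follows (function names are ours; the `divergence` argument stands for any kernel-induced divergence estimate evaluated at a given kernel size):

```python
import numpy as np

def kernel_size_grid(pairwise_dist):
    """Five candidate kernel sizes from the empirical pairwise
    distances of subsampled spike trains: the 0.1, 0.5, and 0.9
    quantiles, plus half the 0.1 quantile and twice the 0.9 quantile."""
    q10, q50, q90 = np.quantile(pairwise_dist, [0.1, 0.5, 0.9])
    return np.array([0.5 * q10, q10, q50, q90, 2.0 * q90])

def sup_divergence(divergence, sizes):
    """Supremum of the divergence over the kernel-size grid; this is a
    valid divergence if at least one kernel in the grid is spd."""
    return max(divergence(s) for s in sizes)
```

Taking the supremum over the grid avoids testing each parameter value separately and thus sidesteps the multiple-testing problem discussed above.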

#### 5.2.1. Homogeneous Poisson Processes.

The most common feature is the mean count, or homogeneous mean firing rate. We test the discriminability of homogeneous Poisson processes with mean rates of 2 and 4 spikes within the interval; for a given mean rate, the homogeneous Poisson process maximizes the randomness (Rényi, 1964). In Figure 4A, the Wilcoxon test and the dissimilarities induced by *K*_{count}, *K*_{spikernel}, *K*_{mCI}, and *K*_{reef} perform well in this task, rejecting in more than 90% of the tests with just 24 trials. Note that the spd kernels, especially the stratified kernel, perform poorly compared to Wilcoxon in this task. However, the Wilcoxon test captures no features other than the count distribution, and the count kernel does not induce a divergence. Therefore, for the subsequent experiments, we intentionally keep the mean firing rates of the two point processes equal in order to emphasize that the spd kernels are indeed sensitive to other statistical properties of the point processes.
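The homogeneous Poisson trains used here are straightforward to sample: the count is Poisson with the given mean, and conditioned on the count, the spike times are i.i.d. uniform. A minimal sketch (the interval length `T` is a placeholder, not a value from the letter):

```python
import numpy as np

def homogeneous_poisson(mean_count, T=1.0, rng=None):
    """One homogeneous Poisson spike train on [0, T] with the given
    expected spike count: Poisson count, then i.i.d. uniform times."""
    if rng is None:
        rng = np.random.default_rng()
    n = rng.poisson(mean_count)           # number of spikes
    return np.sort(rng.uniform(0.0, T, n))  # sorted spike times
```

The two processes in this experiment would correspond to `homogeneous_poisson(2)` and `homogeneous_poisson(4)`.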

#### 5.2.2. Inhomogeneous Poisson Process.

Since firing rate modulation over time is an abundantly observed phenomenon in neural coding experiments, we test the relative performance of each measure. We use two inhomogeneous Poisson processes with the same mean rate (five spikes in the interval) but different rate profiles. The interval is divided in half, and a step change in the intensity function is introduced such that the rate is constant on each half. The rates are two and three spikes per half interval for one process and three and two for the other.
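Because the step-rate process is homogeneous on each half, a realization can be sampled by drawing each half independently. A sketch (function name ours), parameterized by the expected half-interval counts:

```python
import numpy as np

def step_rate_poisson(count_first, count_second, T=1.0, rng=None):
    """Inhomogeneous Poisson train with a step intensity: expected
    counts count_first and count_second on the two halves of [0, T].
    Each half is homogeneous, so the halves are sampled separately."""
    if rng is None:
        rng = np.random.default_rng()
    first = rng.uniform(0.0, T / 2, rng.poisson(count_first))
    second = rng.uniform(T / 2, T, rng.poisson(count_second))
    return np.sort(np.concatenate([first, second]))
```

The two processes in this experiment would then be `step_rate_poisson(2, 3)` and `step_rate_poisson(3, 2)`, which share the same total mean count of five spikes.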

Due to the high dispersion of the count distribution, which is identical for both processes, *K*_{strat} performs poorly but does not completely fail (see Figure 4B). *K*_{mCI} is designed to be sensitive to the (smoothed) mean rate and performs well, as expected. *K*_{spikernel}, *K*_{nCI}, *K*^{e}_{Sch}, and *K*^{i}_{Sch} perform on par with *K*_{mCI}.

#### 5.2.3. Two Spikes with Correlation.

Neurons can integrate and hence be history dependent in their firing, or they can have a short membrane time constant and forget past spiking. We explore a simple and instructive example to show whether the divergences can detect such differences in temporal correlations. We investigate two point processes having at most two spikes each, where the processes share the same marginal intensity function but have different correlation structures (Seth, Park et al., 2010). In the first process, the two event timings are correlated, that is, the interspike interval (ISI) has a narrow distribution, whereas in the second process, the two event timings are independent, that is, each action potential has its own precise timing distribution. Both point processes have lossy noise: each action potential is missing with probability *p*=0.1. The temporal jitter of each spike is 100 ms, and the interval is 300 ms. Since the highest dimension of the problem is 2, this problem is easier for the stratified kernel (see Figure 4C). Nevertheless, all spd kernel–induced divergences quickly catch up as the number of samples increases, while the pd kernels are unable to discriminate.
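One way to realize such a pair of processes is with a shared versus independent jitter, which leaves the marginal timing distributions identical while changing the ISI distribution. A sketch with illustrative parameters of our choosing (the letter's experiment uses a 100 ms jitter on a 300 ms interval; here a narrower jitter is used so the spike order is preserved):

```python
import numpy as np

def two_spike_train(correlated, mu=(0.1, 0.2), jitter=0.033,
                    p_miss=0.1, rng=None):
    """Two-spike point process: both variants have the same gaussian
    marginal for each spike time, but a shared jitter correlates the
    two timings (narrow ISI distribution), while independent jitters
    give each spike its own timing distribution. Each spike is then
    dropped with probability p_miss (lossy noise)."""
    if rng is None:
        rng = np.random.default_rng()
    if correlated:
        j = rng.normal(0.0, jitter)          # one jitter shared by both spikes
        times = np.array([mu[0] + j, mu[1] + j])
    else:
        times = np.array(mu) + rng.normal(0.0, jitter, 2)
    keep = rng.random(2) >= p_miss           # lossy deletion of spikes
    return np.sort(times[keep])
```

With the shared jitter, the ISI is essentially constant; with independent jitters, the ISI spreads, even though each spike's marginal distribution is unchanged.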

#### 5.2.4. Renewal Versus Poisson Process.

In neural systems, spike timing features that deviate from a Poisson process, such as a refractory period, oscillation, and bursting, are often modeled with a renewal process (Nawrot et al., 2008). In this experiment, we compare a stationary renewal process with gamma interval distribution and an equi-rate homogeneous Poisson process (see Figure 4D). We observe that *K*_{spikernel} performs the best, and *K*_{nCI} performs on par with the spd kernels. However, recall that this performance of *K*_{spikernel} comes at the price of a higher computational cost.

#### 5.2.5. Precisely Timed Spike Trains.

When the same stimulation is presented to a neuronal system, the observed spike trains sometimes show a highly repeatable spatiotemporal pattern at the millisecond timescale. Recently these precisely timed spike trains (PTST) have been abundantly reported in both in vivo and in vitro preparations (Reinagel & Reid, 2002; DeWeese, Wehr, & Zador, 2003; Johansson & Birznieks, 2004). Despite being highly reproducible, different forms of trial-to-trial variability have also been observed (Tiesinga, Fellous, & Sejnowski, 2008).

We model a precisely timed spike train in an interval by *L* pairs of probability densities and probabilities (*f*_{i}, *p*_{i}). Each *f*_{i}(*t*) corresponds to the temporal jitter, and *p*_{i} corresponds to the probability of generating (otherwise missing) the spike. Each realization of the PTST model produces at most *L* spikes, that is, *L* determines the maximum dimension for the PTST. The equi-intensity Poisson process has the rate function λ(*t*) = Σ_{i} *p*_{i} *f*_{i}(*t*). We test if the kernels can differentiate between the PTST and the equi-intensity Poisson process for *L* = 3. We choose the *f*_{i}(*t*) to be equal-variance gaussian distributions on a grid sampled from a uniform random variable and set *p*_{i}=0.9.

We observe that the spd kernels again succeed in correctly rejecting the null hypothesis (see Figure 4E). However, several pd kernels, *K*_{spikernel} and *K*_{nCI}, also perform on par, while the other pd kernels fail similarly to the renewal process experiment. The success of these pd kernels can be explained in terms of their rather extensive construction compared to simpler kernels such as *K*_{mCI}, which are linear in nature.
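A realization of the PTST model can be sketched directly from the description: each of the *L* template spikes is emitted with probability *p*_{i} and jittered according to its density *f*_{i} (taken as gaussian here; the centers and jitter width below are illustrative, not the letter's values):

```python
import numpy as np

def sample_ptst(centers, jitter, p=0.9, rng=None):
    """One realization of the precisely-timed-spike-train (PTST)
    model: each of the L template spikes is generated with
    probability p and jittered by a zero-mean gaussian with standard
    deviation `jitter`."""
    if rng is None:
        rng = np.random.default_rng()
    centers = np.asarray(centers, dtype=float)
    keep = rng.random(centers.size) < p            # lossy generation
    times = centers[keep] + rng.normal(0.0, jitter, keep.sum())
    return np.sort(times)
```

Each realization thus has at most `len(centers)` spikes, matching the maximum dimension *L* of the PTST.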

## 6. Real Data

### 6.1. Retinal Spiking Response.

To test the practical effectiveness of kernel-induced point-process divergences, we applied them to publicly available sensory response data.^{2} We used the rat retinal ganglion cell recordings from Nirenberg's lab (Jacobs, Grzywacz, & Nirenberg, 2006). There were 22 stimulus conditions, each repeated 30 times, while 3.1-second-long spike trains were recorded from 15 neurons. The task is to discriminate different stimulus conditions given the spike trains from a single neuron.

We compute the rejection rate over the hypothesis tests. If an experimenter were to select neurons that were sensitive to differences in the stimulus conditions, the rejection rate can be interpreted as the probability that he or she would correctly categorize the neuron as being sensitive (given a fixed false-positive rate of 10%). The rejection rate of the null hypothesis is shown in Figure 5. We organized the stimulus conditions in two ways: we either randomly paired the 22 stimulus conditions to form 11 hypothesis tests, or, for each neuron, we paired stimulus conditions with close median firing rate responses by sorting the conditions and pairing neighbors. The latter pairing emphasizes the detectability of features other than mean count.

For spikernel, we spanned 24 parameter settings (the same as in the simulation section, but allowing a bin size of 50 ms as well as 150 ms). Despite being the most extensively tuned, *K*_{spikernel} was the slowest kernel and yet did not perform as well as *K*_{mCI}. The difference between the spikernel and the mCI, nCI, and Schoenberg kernels was more pronounced for the neighboring-rate pair condition. Overall, *K*^{e}_{Sch} and *K*_{nCI} were the best, coming close to rejecting all responses due to different stimuli. The stratified kernel performed the worst due to the large number of spikes (on average 20 to 50 spikes per trial).

### 6.2. Directionality Detection Task in Monkey MT.

Area MT (middle temporal) of the visual system is known to have neurons tuned to directional visual flow. The classical theory of MT is that the mean firing rate of these neurons encodes the strength of directional motion (Albright, 1984). Detecting such a tuning without any underlying assumptions, and with a small number of repeats, is often crucial for neuroscientists. We apply the proposed kernel-based divergences to single MT neuron recordings from a detection task performed by monkeys (Britten, Newsome, Shadlen, Celebrini, & Movshon, 1996).

We extracted a data set from a freely available online repository to obtain responses from 180 direction-sensitive MT neurons (Britten et al., 1996).^{3} We selected 30 trials at the high-coherence levels of 12.8% and 25.6% (or the low-coherence levels of 3.2% and 6.4% for a more difficult task), each for the preferred direction and the null direction. A constant-strength motion stimulus is presented for 2 seconds, and we select a window at the end of the 2 seconds that corresponds to an average of 3 spikes over the selected set of trials (for both directions).

Consistent with the rate coding under constant stimuli, the hypothesis testing for both pairs of conditions suggests that mean spike count is the main feature for discrimination (see Figure 6). The high-coherence task is easier for the monkey as well as for the hypothesis test.

Interestingly, the kernels that performed better than the Wilcoxon rank sum test in the simulations and on the retina data set performed worse on this data set. This data set was collected with single-electrode electrophysiology in which the experimenter explored neurons in MT and selected direction-selective cells (Britten et al., 1996). We argue that this selection bias could have produced a strong discriminability feature in the mean count; therefore the simpler tests performed better, while the spd kernels were distracted by other features of the data.

## 7. Conclusion

In this letter, we have studied strictly positive-definite kernels on the space of spike trains. A strictly positive-definite kernel is more powerful than a positive-definite kernel in the sense that it offers a unique functional representation of a probability law. Therefore, the former induces a divergence—a tool that can be applied in testing equality of distribution—while the latter induces only a dissimilarity. However, the performance of a kernel is governed not only by its mathematical properties but also by its computational simplicity and choice of free parameters. We have performed an extensive set of simulations to empirically compare the performance and complexity of both the existing positive-definite kernels and the proposed strictly definite kernels qualitatively, through kernel PCA, and quantitatively, through hypothesis testing, on both synthetic and real data. These results reveal useful insights into the characteristics of these kernels and also explore their practicality.

The simulation results reveal that some pd kernels often perform equally well compared to spd kernels. This observation is counterintuitive but not surprising; although spd kernels should asymptotically perform better in theory, this does not guarantee good performance for the small-sample regime. In fact, it is natural for a pd kernel designed to capture the exact discriminating statistical feature to outperform a generic spd kernel. For example, if a neuronal system exhibits a rate-coding scheme, then it is sufficient, and perhaps best, to use a kernel that is sensitive only to the rate of the neural signal. Therefore, the choice of kernel when studying neural codes should depend on how information is encoded and what the trial-to-trial variability structure is. However, assuming a neural code beforehand can lead to bias in the analysis; therefore, when in doubt, it is advantageous to have a generalist spd kernel.

We have also observed that for the real data sets, the count kernel, which fails for all but one artificial data set by design, often performs on par with or better than the sophisticated kernels in many data sets, including one shown in section 6.2. We believe that this is because the neurons were selected under the assumption that the mean firing rate, which is the statistic that the count kernel or Wilcoxon test is sensitive to, is modulated. It is possible that neurons that modulate responses in different ways are overlooked due to the unavailability of state-of-the-art analysis tools. Therefore, the use of strictly positive-definite kernels would potentially allow new discoveries through observing these seemingly irrelevant neuronal responses. We conclude by stating two partially novel aspects of this framework that we have not explicitly addressed in this letter:

An spd (or a pd) kernel induces a (pseudo-) metric, since the kernel can be treated as an inner product in a reproducing kernel Hilbert space (Schölkopf & Smola, 2002). Therefore, in essence, this approach also provides families of distance measures. This attribute relates this letter to work on spike train metrics (Victor & Purpura, 1997; van Rossum, 2001). However, both Victor and Purpura (1997) and van Rossum (2001) restrict themselves to designing distance measures between two spike train realizations of point processes, whereas our work is more general since it introduces distance between two point processes (or collection of spike trains).

The use of strictly positive-definite kernels is not restricted to the design of divergence; the characteristic nature of such kernels allows them to embed a probability law—a point process in our case—as an element in the reproducing kernel Hilbert space (Smola, Gretton, Song, & Schölkopf, 2007). This allows probabilistic operations such as Bayesian inference to be performed in terms of functional operations without the need of estimating or approximating the probability law (Song, Boots, Siddiqi, Gordon, & Smola, 2010). Therefore, the content of this letter naturally links these advanced machine learning tools to neuroscience.

We leave exploring these directions for future work. The code for spike train kernels and divergence tests is available at the author's Web site: http://www.memming.com.

## Acknowledgments

We are grateful to the referees for their very helpful comments, which improved this letter significantly. This work was supported by NSF grant ECCS-0856441 and DARPA grant N66001-10-C-2008.

## Notes

^{1}

Strict in a measure-theoretic sense, which is stronger than the usual countable summation sense (see definition 1).

^{2}

Available online from http://neurodatabase.org (Gardner, 2004).

^{3}

Neural Signal Archive: http://www.neuralsignal.org/.