## Abstract

Exploratory tools that are sensitive to arbitrary statistical variations in spike train observations open up the possibility of novel neuroscientific discoveries. Developing such tools, however, is difficult due to the lack of Euclidean structure of the spike train space, and an experimenter usually prefers simpler tools that capture only limited statistical features of the spike train, such as mean spike count or mean firing rate. We explore strictly positive-definite kernels on the space of spike trains to offer both a structural representation of this space and a platform for developing statistical measures that explore features beyond count or rate. We apply these kernels to construct measures of divergence between two point processes and use them for hypothesis testing, that is, to observe if two sets of spike trains originate from the same underlying probability law. Although there exist positive-definite spike train kernels in the literature, we establish that these kernels are not strictly definite and thus do not induce measures of divergence. We discuss the properties of both of these existing nonstrict kernels and the novel strict kernels in terms of their computational complexity, choice of free parameters, and performance on both synthetic and real data through kernel principal component analysis and hypothesis testing.

## 1.  Introduction

Detecting changes in spike train observations is fundamental to the study of neuroscience at both neuronal and systems levels. For example, when studying neural coding, it is critical to ensure that the target neuron is encoding the stimulus (or behavior) of interest, which may be reflected in arbitrary statistical changes in its spiking pattern. This requires deciding if two sets of spike trains originate from the same probability law (point process). However, this seemingly benign task is technically daunting since the space of spike trains lacks Euclidean structure, making direct application of standard tests for equality of distribution such as the chi square test, Kolmogorov-Smirnov test, or Kullback-Leibler divergence unusable (Park & Príncipe, 2010; Seth, Park et al., 2010; Naud, Gerhard, Mensi, & Gerstner, 2011). Therefore, most analyses on spike trains are done by observing changes in simpler statistics such as mean spike count (Hubel & Wiesel, 1959), mean firing rate (Perkel, Gerstein, & Moore, 1967), time to first spike (Johansson & Birznieks, 2004), interspike interval distribution (Kang & Amari, 2008; Churchland et al., 2010), or average pairwise spike train distance (Naud et al., 2011). Such partial quantification blinds the experimenter to other quantifiable statistical structures, which results in poor data utilization and even rather dire consequences in experimental science. For instance, searching for neurons in vivo using a typical statistical feature associated with a neural code results in selection bias in favor of that specific neural code. In this letter, we follow an alternate approach of designing a two-sample test on point processes through the use of strictly positive-definite kernels on spike trains, since this test does not limit itself to a particular statistical feature (Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2007; Sriperumbudur, Gretton, Fukumizu, Lanckriet, & Schölkopf, 2008).

The popularity of positive-definite kernels stems from observations that they are inner products in some feature space that is often nonlinearly related to the input space, they can be defined on rather abstract spaces such as non-Euclidean space, and they implicitly impose a structure of the space. These simple observations allow the extension of many standard linear algorithms designed on the Euclidean space to nonlinear algorithms on abstract spaces by merely substituting the linear inner product with a suitable kernel, known as the kernel trick. This approach has been widely exploited in the context of text mining, bioinformatics, and information retrieval where kernels have been defined on strings, bag of words, and graphs (Schölkopf & Smola, 2002). The usefulness of kernel methods has not gone unnoticed in the neuroscience literature, where several successful attempts have been made to design positive-definite kernels on spike trains (Shpigelman, Singer, Paz, & Vaadia, 2005; Eichhorn et al., 2004; Schrauwen & Campenhout, 2007; Paiva, Park, & Príncipe, 2009) and apply them in practical problems such as classification for brain machine interface (Shpigelman et al., 2005; Eichhorn et al., 2004) and somatosensory stimulation (Li et al., 2011). Positive-definite kernels, however, may fall short in the context of representing an arbitrary probability law—kernels with such a property have recently been explored and are called characteristic kernels (Sriperumbudur et al., 2008)—and therefore they cannot be readily applied for testing equality of distributions. As we will demonstrate in detail, the available spike train kernels in the literature are not characteristic. It has, however, been shown that strictly positive definite (spd) kernels (a subset of positive-definite kernels) are characteristic (Sriperumbudur, Fukumizu, & Lanckriet, 2010).1 Given this recent development, and other interesting aspects of positive-definite kernels, we focus on exploring such kernels for spike trains.

We consider two separate approaches for designing spd kernels, where both approaches follow two simple steps: first, we injectively map the spike trains to a more structured space, and second, we define spd kernels on the latter spaces, which induces a kernel in the spike train space. The proposed approaches differ in their respective representation of the spike trains: a stratified space, which is formed by a disjoint union (direct sum) of Euclidean spaces, and a functional space, where they are represented as functions over time in L2. In the stratified space approach, we use a novel family of kernels in the Euclidean space to form spd kernels on spike trains, while in the functional space, we exploit novel kernels in L2. We submit that these kernels should also (1) involve as few parameters as possible, since, from a practical standpoint, a user is forced to perform an exhaustive search (evaluation) over all possible combinations of these parameters to achieve the best (statistically correct) result, and (2) be computationally efficient. Finally, we also explore what attributes of the point process these kernels are sensitive to.

The rest of the letter is organized as follows. In section 2, we discuss the notion of kernel-based divergence and dissimilarity measures and briefly explore the existing positive-definite kernels in the literature based on binning. In section 3, we introduce a family of novel strictly positive-definite kernels by representing the space of spike trains as a direct sum of Euclidean spaces. Although these kernels are efficient to compute, they suffer from estimation issues in practice. In order to alleviate this drawback, we introduce another family of strictly positive kernels in section 4 by representing a spike train as a function in an appropriate L2 space. We also examine a few existing kernels on this representation and discuss their nonstrictness. In section 5, we explore the characteristics of the existing and proposed kernels using kernel principal component analysis and compare their performance on simulated data in the context of hypothesis testing. In section 6, we apply these kernels on real neural recordings, and in section 7, we conclude with discussions on the practical aspects of these kernels and their applicability, as well as other uses of such kernels. (A Matlab implementation of the kernels and methods introduced in this letter can be found online at http://code.google.com/p/spiketrainlib.)

## 2.  Background

### 2.1.  Kernel-Based Divergence and Dissimilarity.

We start our discussion in an abstract topological space for technical completeness. Later we introduce a topological space for spike trains. Let be a topological space and be the Borel -algebra. We focus on measures on the measurable space .

Definition 1 (strictly positive-definite kernel).

Let be a measurable space and be a -measurable function. The kernel K is called positive definite if and only if for any finite nonzero complex Borel measure , , where denotes complex conjugation. Furthermore, if the inequality is strict, then the kernel K is called strictly positive definite.

Notice that this definition of (integrally) strict positive definiteness (Sriperumbudur, Fukumizu, Gretton, Lanckriet, & Schölkopf, 2009) is a generalization of the usual definition that involves countable summation rather than integration (Pinkus, 2004).

Let denote the set of probability measures on . When is the space of spike trains, is the space of point processes (a random process that generates spike trains). Given a positive-definite kernel , we define a measure of dissimilarity as
2.1
where . Due to the positive-definiteness of K, is nonnegative, and P=Q implies . A divergence measure , on the other hand, can be constructed by incorporating an spd kernel instead of a pd kernel in equation 2.1 (Gretton et al., 2007). Due to the strict positive-definiteness of K, is nonnegative, and zero if and only if P=Q. In fact, one can show that the resulting divergence (dissimilarity) is a squared (pseudo-)metric induced from the distance in the reproducing kernel Hilbert space (RKHS) (Diks & Panchenko, 2007; Sriperumbudur et al., 2009).

A dissimilarity measure is a nonnegative statistic that assumes zero value if two probability laws are identical (Pekalska & Duin, 2005). For example, if and , then , that is, the pd kernel compares only the means of the random variables and . A dissimilarity statistic thus creates an equivalence class where two probability laws are indistinguishable even if they are not the same, and the statistical inference operates on only the quotient space. A divergence measure, on the other hand, is a nonnegative statistic that assumes zero value if and only if two probability laws are identical (Sriperumbudur et al., 2009); thus, it compares the entire probability law. Therefore, divergence is a stricter and a stronger statistic than dissimilarity.

This feature of an spd kernel can be better explained in terms of the characteristic kernel (Sriperumbudur et al., 2008). Mathematically, a kernel K is called characteristic if the map is injective. This condition implies that a characteristic kernel generates a unique functional representation of the probability law in the reproducing kernel Hilbert space, which cannot be achieved by a pd kernel, as shown in the case of . In terms of a characteristic kernel, therefore, a measure of divergence can be derived by measuring the distance between these unique functional representations of the corresponding probability laws. This distance is exactly given by the square root of equation 2.1. It has been recently shown that spd kernels are characteristic in nature, and that explains the advantage of spd kernels over pd kernels in the context of hypothesis testing (Sriperumbudur et al., 2009).

### 2.2.  Binned Kernel.

One of the main difficulties of dealing with spike trains is the lack of a natural structural representation, such as a Euclidean structure (Seth, Park et al., 2010). Among several existing approaches of inducing structure in this space, the simplest and most popular is binning (Victor, 2002). In simple terms, binning counts the number of spikes that appear in prefixed segments of time referred to as bins, thus effectively representing a spike train on a grid in a Euclidean space where the dimensionality of the space is governed by the number of bins. Although this approach is widely used in practical applications for its simplicity and effectiveness, it has disadvantages that it considers only the number of spikes within a bin rather than their exact locations and introduces discontinuity in time. The exact locations can be, in principle, captured by reducing the size of these bins; however, it also increases the dimensionality of the Euclidean representation, increasing the data requirements for proper estimation.

Since the binned representation projects a spike train in a Euclidean space, a kernel on spike trains can easily be implemented using kernels on the Euclidean space such as l2 inner product or the gaussian kernel. However, this approach is not preferred in practice since the dimensionality of this space (i.e., the number of bins) is usually high, making the realizations relatively sparse. In addition, the kernels cannot easily incorporate the similarity of spikes in temporally neighboring bins: spikes from different bins do not interact. The count kernel Kcount defined as the product of total number of spikes is an extreme case where there is only one bin. Note that the dissimilarity induced by the count kernel is simply the squared difference between the mean number of spikes.

Shpigelman et al. (2005) has considered a different approach of designing kernels on binned representation, where the resulting nontrivial kernel has been referred to as a spikernel. Spikernel is perhaps the earliest kernel designed for spike trains that has been explored in the literature. The biologically motivated design of this kernel relies on measuring the similarity between arbitrary segments (of equal lengths) of the binned spike trains. Spikernel is positive-definite by design since it explicitly constructs a mapping to the feature space and computes the inner product between two mapped spike trains to evaluate the kernel. However, this kernel is not strictly positive-definite. Consider two spike trains such that one is a jittered version of the other, but both have the same spike counts given a certain bin resolution. Then the Gram matrix formed by these two spike trains has zero determinant. In fact, any kernel defined on the binned representation is only positive-definite but not strictly positive-definite due to this restriction.

It should be noted that spikernel involves five free parameters, including bin size, and it is computationally expensive. Although the appropriate value of the bin size can be estimated from the time resolution of the spike train, there are no formal methods of choosing the other parameters. Therefore, it is suggested that a user perform an exhaustive search over the parameter space, making the kernel difficult to use in practice. We observe in our simulation that this kernel performs very well compared to other strictly positive-definite kernels. A possible explanation of this observation is the higher degrees of freedom.

A similar approach has also been explored by Eichhorn et al. (2004) by using edit distance (alignment score) between two strings to binned spike trains to construct an explicit mapping to the feature space. This approach has very similar advantages and disadvantages.

## 3.  Stratified Kernel

Due to the lack of strict definiteness of the binned representation, it is essential to investigate other representations of the spike train space. In this section, we follow the stratified representation and introduce a set of strictly definite kernels on spike trains.

### 3.1.  Stratified Representation.

Let be the set of all finite-length spike trains, that is, each can be represented by a finite set of (distinct) action potential timings where n is the number of spikes. Assuming a finite number of action potentials within the time interval of interest , one can partition the non-Euclidean spike train space in disjoint partitions such that contains all possible spike trains with exactly n action potentials within . We call this method stratification (Victor, 2002). Notice that . has only one element representing the empty spike train (no action potential). For , is essentially , since n action potentials can be fully described by n (ordered) time instances and vice versa. Without loss of generality, let , hence obtaining Euclidean space representation of each . Define a Borel algebra of by the -algebra generated by the union of Borel sets defined on the interval . Note that any measurable set can be partitioned into , each measurable in corresponding measurable space . Any finite-point process can be defined as a probability measure P on the measurable space (Daley & Vere-Jones, 1988).

### 3.2.  Generic Kernel Design.

Although the stratification approach has been used by Victor (2002) to estimate the mutual information, this approach has not been explicitly exploited to design kernels. However, it provides an opportunity to build kernels on by combining kernels on using the following theorem:

Theorem 1 (stratified strictly positive-definite kernel).
Let be a family of bounded strictly positive-definite kernels for every . Define
3.1
Then Ks is a strictly positive-definite kernel on spike trains,
Proof.
For any nonzero Borel measure , we can decompose it as :
where denotes the restriction of the measure to , and the interchange of sum and integral is justified by the boundedness of spd kernels.
The corresponding divergence measure can be simplified as
3.2
where .

### 3.3.  Class of Kernels.

The individual kernels on each stratum can be chosen from a generic kernel as follows:

Let be the Schwarz space of rapidly decreasing functions. is -admissible if for every , the following integral equation has a solution f, .

Theorem 2 (composition kernels).

If is a -admissible or a strictly positive-definite function, is a measure, and , then the following kernel is strictly positive-definite on , .

Proof.

The proof of this theorem can be found in Seth, Rao, Park, and Príncipe (2010, theorem 1).

Examples of basis functions g(x,u) used for composition kernels are eixu, , and for . In , we can use the tensor product kernels: .

There are a few notable special cases of this type of kernel. Notice that if K(n)1=1 for all , which induces a positive-definite kernel, the dissimilarity becomes, , which corresponds to the Cramér–von Mises (C-M) statistic for the count distributions and . Intuitively this kernel compares only if two spike trains have the same number of spikes, and therefore, it is not spd. Gaussian kernels on can be used to construct spd kernels on the space of spike trains. However, this process requires choosing kernel sizes for each stratum.

An interesting kernel is found by using the following composite kernels on ,
where and . The corresponding divergence is given by
3.3
where F denotes the cumulative distribution function in . Intuitively, this kernel not only compares if two spike trains have the same spike counts, but it also compares the positions of the spikes. Notice that it is very similar to the C-M type divergence proposed in Seth, Park et al. (2010). The advantage of this type of kernel is that it is parameter free, unlike the gaussian kernel.

### 3.4.  Computational Issues.

The kernels defined on the stratification space are efficient to compute and easy to design. However, this type of kernel suffers from estimation issues for small sample size, especially when the count distribution is flat (not concentrated). In such circumstances, the samples become scattered in different strata , and the divergence becomes difficult to evaluate for a particular stratum. In addition, the stratification approach suffers from the curse of dimensionality when the number of spikes in a spike train is large. Hence, it may be necessary to reduce the time window of observation to reduce this dimensionality.

## 4.  Functional Kernel

It is somewhat obvious that the estimation issues of stratified kernel are a direct consequence of the fact that for such kernels, two spike trains with different spike counts do not interact. To alleviate this issue, we follow a different representation.

### 4.1.  Functional Representation.

Let be the space of all L2 integrable functions over , that is, . Given , define a mapping as such that G is injective. When g is a translation-invariant function that decays at , can be considered a smoothed spike train (van Rossum, 2001).

There are many different g’s that make the mapping G injective—for example, when g(x,y) is a bounded strictly positive-definite function. To see this, consider two spike trains, and , such that all spike timings are distinct (i.e., ), and assume that , which implies that , where , and , where and . Since K is full rank (by virtue of the strict positive-definiteness of ), for all implies that . Popular continuous-time spike train smoothing by linear filtering with rectangular function or alpha function (Dayan & Abbott, 2001) is also injective, given the width of the rectangular function is not more than the length of .

We consider the -algebra of to be the one that makes the map measurable. Since G is measurable, we can define an induced probability measure U such that and U(G(A))=P(A) for . A (strictly) positive-definite kernel on can be defined using a (strictly) positive-definite kernel on the L2 space since
where .

### 4.2.  Existing pd Kernels.

Paiva et al. (2009) proposed the following kernel using exponential smoothing function,
4.1
If the smoothing function is nonnegative and locally centered around the spike, the conditional mean can be interpreted as an estimate of the marginal firing rate (intensity function). The kernel KmCI is generated by the trivial inner product kernel in the L2 space; thus, the name memoryless cross intensity (mCI). Notice that several smoothing functions can also be used in this context. For example, the use of the exponential smoothing function is biologically motivated (van Rossum, 2001), and it is computationally simpler due to the characteristic of Laplacian function (Rao et al., 2011). The Laplacian function is found by taking the inner product of two exponential functions (Paiva et al., 2009), while the inner product of two gaussian functions results in a gaussian function. Notice that KmCI induces a spike train reproducing kernel Hilbert space where a van Rossum–like distance (van Rossum, 2001) is induced by the inner product, and this kernel involves only one free parameter (i.e., the size of the smoothing function). Similar kernels have been proposed by Schrauwen and Campenhout (2007), Houghton (2009), and Paiva, Park, and Príncipe (2010). However, it is easy to see that these kernels are not strictly positive-definite. For example, consider the spike trains , , and . Then Gram matrices generated by these kernels are singular. Therefore, the dissimilarity statistic is not a divergence.
Paiva et al. (2009) also proposed a nonlinear, and richer, extension of this kernel, referred to as the nonlinear cross intensity (nCI). The direct sum of positive-definite kernels defines a positive-definite kernel, and this has motivated the following construction (Paiva et al., 2009):
4.2
This kernel performs better than KmCI, a possible explanation being the nonlinear nature of the kernel. However, it should be noted that it involves one more free parameter, the kernel width of the outer gaussian kernel. Also, this kernel is difficult to compute, and a rectangular smoothing function is used to reduce the computational complexity (Paiva et al., 2009). However, this kernel is again not strictly positive-definite, and that can be again shown by a counterexample. Consider the following spike trains: , , , and ; then the resulting Gram matrix is singular.

### 4.3.  Schoenberg Kernel.

In order to establish a strictly positive-definite kernel on this representation, we generalize the radial basis–type kernel on Euclidean space.

Definition 3 (completely monotone functions, Schaback & Wendland, 2001).

A function is said to be completely monotone on if , continuous at zero, and where denotes the l-th derivative.

Examples of completely monotone functions on are where and are constants (Berg, Christensen, & Ressel, 1984).

Theorem 3 (strictly positive-definite Schoenberg kernels on L2).

If a function is completely monotone on but not a constant function, then is strictly positive-definite on a separable and compact subset of L2 where denotes the L2 norm.

Proof.

(sketch). Since is a completely monotone function, it has a Laplace transform representation, where m is a finite positive Borel measure on . Following Christmann and Steinwart (2010, theorem 2.2) is universal on a separable compact subset of L2 for t>0, and it implies strictly positive-definiteness (Sriperumbudur et al., 2010).

Note that the proof relies on the restriction of maximum number of spikes within a compact interval , which is a biological and experimental restriction, the space of spike trains is compact (and separable), and hence, the image of in is also compact (and separable).

### 4.4.  Instances of Schoenberg Kernel.

A typical instance of Schoenberg kernel considers both and the smoothing function g to be an exponential function:
4.3
We refer to this kernel as Schoenberg (e). Notice that this kernel has been mentioned by Paiva et al. (2009), but it has not been explored further, and its strict definiteness has not been established. However, Paiva et al. (2009) have suggested the kernel KnCI as a substitute and simplification of KeSch. Notice that if the smoothing process is a linear filter, the bounded norm of the smoothing process holds when the smoothing filter is bounded and the total number of spikes is bounded. This condition virtually holds for bounded duration spike trains.
KeSch also involves an extra free parameter besides the smoothing parameter. However, the proposed approach allows other alternatives, which might not involve any parameter. One such option is the Heaviside function. It is easy to see that the smoothed representation of a spike train using a Heaviside function is simply the realization of the corresponding counting process. To differentiate the use of Heaviside function, we refer to the resulting kernel as Schoenberg (i):
where . Notice that using the Heaviside function is also computationally simpler since the inner product of two smoothed functions simply depends on the locations of the spikes in an ordered set. Conceptually, however, due to the cumulative temporal counting nature of this approach, the resulting kernels depend on the initial spikes more than the latter ones. However, we observe that this kernel nonetheless work well in practice with the advantage of using one less free parameter.

Notice that compared to stratified kernel, functional kernels do not suffer from estimation issues in extreme conditions of high counts and low sample size, since they can compare two spike trains involving different spike number of spikes. But this is achieved at the cost of higher computational complexity. We provide detailed description of these kernels in Table 1. This table also includes a recently proposed kernel Kreef (Fisher & Banerjee, 2010), specialized for spike response model 0 with reciprocal exponential exponential function (REEF) activation. This kernel integrates out the unknown parameters, hence eliminates the need to estimate any free parameters. However, the strict definiteness of the kernel has not been pursued.

Table 1:
List of Spike Train Kernels—Existing Ones and the Ones Proposed in the Letter.
NameKernelspdComplexityParametersReference
Count  No
Spikernela  No  Shpigelman et al. (2005
mCI  No  Paiva, Park, and Príncipe (2007
nCI  No  Paiva et al. (2009
Schoenberg (e)  Yes  Paiva et al. (2009
Schoenberg (i)  Yes
Stratified  Yes  1 (
REEK  Unknown  Fisher and Banerjee (2010
NameKernelspdComplexityParametersReference
Count  No
Spikernela  No  Shpigelman et al. (2005
mCI  No  Paiva, Park, and Príncipe (2007
nCI  No  Paiva et al. (2009
Schoenberg (e)  Yes  Paiva et al. (2009
Schoenberg (i)  Yes
Stratified  Yes  1 (
REEK  Unknown  Fisher and Banerjee (2010

Notes: N is the average number of spikes per spike train. fr is a rectangular smoothing function, the time interval is [0, T). is a gaussian kernel on the Euclidean distance. dmCI is the distance induced by the mCI kernel. is a Heaviside function. denotes binned spike train, is cardinality, In,m is the set of segments of length n from total length m, and Bi is a segment of B with first element at i1. Unfortunately, we could not find available implementation and could not implement the kernel from Eichhorn et al. (2004) due to lack of details.

a is the mapping from the binned spike train to the feature space where the kernel is evaluated as the inner product.

## 5.  Features

So far we have discussed kernels and induced dissimilarities and divergences, but we have not addressed what they are actually sensitive to when applied to data. In principle, dissimilarities are sensitive only to certain statistical differences, while divergences are sensitive to arbitrary statistical differences; however, in practice, their sensitivity depends on the choice of kernel and is limited by data. We explore the issue of sensitivity in two aspects: decomposition of divergence in orthonormal projections using KPCA and sensitivity analysis for explicit features using hypothesis testing.

### 5.1.  Kernel PCA-Based Decomposition.

Principal component analysis (PCA) is widely used to extract features—linear projections of data in a finite-dimensional Euclidean space obtained by finding directions of greatest variances. These features, or principal components (PCs), are simply obtained by the eigendecomposition of the data covariance matrix. Likewise, in kernel PCA (KPCA), one decomposes the covariance operator of the Hilbert space to find PCs that are now feature functionals, nonlinearly related to the input space (Schölkopf, Smola, & Müller, 1998). Each PC is orthogonal to each other and depends on the data as well as the kernel. Each kernel gives a different set of features, and the set of features for the same kernel depends on the spike trains in hand. This makes the interpretation of features difficult, yet it provides a useful qualitative insight of interpreting kernel induced divergence.

#### 5.1.1.  Divergence and KPCA Projections.

These KPCA features can be used to decompose the kernel-induced divergences. Given a set of orthonormal basis and eigenvalues provided by KPCA,
the dissimilarity or divergence, equation 2.1, can be trivially decomposed as
5.1
Hence, the total divergence (dissimilarity) is a sum of squared distances between the means and of each PC direction weighted by the eigenvalues. This allows us to visualize the differences of two point processes in each PC direction.
In practice, KPCA and decomposition of divergence can be computed with simple matrix algebra. Let be the set of spike trains originating from two point processes, and let K be the Gram matrix generated by these spike trains, , where is a predefined kernel, as discussed in the previous section. Then it can be easily seen that the estimated divergence (dissimilarity) is,
Now let be the matrix of eigenvectors obtained from the eigendecomposition related to KPCA, , where is a diagonal matrix containing the eigenvalues of the respective eigenvectors, and corresponding eigenfunctions in the RKHS are simply where zi’s are elements of X. Next, let , where bk is simply the projection of the spike trains in X on the kth eigenfunction, ; then . Notice that the normalizing factor is essential since the eigenfunctions are simply orthogonal but not orthonormal in this formulation, and . Then the sum of squared distances between the means of the projections is
where the s appear again since the projections exist on an orthogonal, not orthonormal, basis. Therefore, while discriminating two distributions through the KPCA projections, we plot for the best two eigenvectors for which are maximum since they explain away most of the divergence. Also, notice that KPCA is usually performed on the centered matrix Kc=HKH, where . However, centering does not change the relation between divergence and the projections since H is simply absorbed in the eigenvector matrix A.

#### 5.1.2.  Spike Trains with Count and Location Features.

In Figure 1, we visualize how a set of periodic spike trains with varying period and phase is embedded in the RKHS induced by each kernel. The data set consiststs of 16 spike trains for each of six different periods, each shifted such that the spike times uniformly span the interval (total 96 spike trains). The two-dimensional projection shows most kernels preserve the following features: (1) count (number of spikes, or equivalently period) and (2) phase. Notice that for each kernel, we choose the free parameters from a grid of values that provides maximum value of the divergence (see section 5.2 for details). The count kernel captures the first feature with PC1, and no other eigenvalue is significant, since the centered kernel matrix is rank 1. Kspikernel, KmCI, KnCI, KeSch, and Kreef also seem to favor having the count feature as PC1; in addition, PC2 mostly represents the phase feature, that is, the positions of the projections in this direction are ordered according to their phase. However, such interpretation should be made cautiously, since count and phase are not necessarily orthogonal in the RKHS. For example, KiSch mixes the two features along many PCs, and Kstrat reveals strata as maximally separated clusters.

Figure 1:

Two-dimensional visualization of a set of periodic spike trains with a phase and period distribution using kernel PCA (KPCA). The horizontal axis corresponds to the projection on the eigenvector for the largest eigenvalue (PC1) and the vertical axis for the second largest (PC2). The inset bar graphs show the first 10 normalized eigenspectra.

Figure 1:

Two-dimensional visualization of a set of periodic spike trains with a phase and period distribution using kernel PCA (KPCA). The horizontal axis corresponds to the projection on the eigenvector for the largest eigenvalue (PC1) and the vertical axis for the second largest (PC2). The inset bar graphs show the first 10 normalized eigenspectra.

Next, we form two sets of spike trains by dividing the periodic spike trains of Figure 1 in two different ways to focus on each feature. Set A divides them in terms of count while maintaining the uniformity over phase within each condition. Set B divides the spike trains within each stratum (of equal number of spikes) into early-phase and late-phase ones, thus maintaining count feature as constant. Therefore, set A and B have the same KPCA features, but the differences within are due to independent features (count statistic and location statistic; McFadden, 1965; Daley & Vere-Jones, 1988).

All kernels successfully reject the null for set A, as could be expected from sensitivity to count shown in Figure 1. In Figure 2, spike trains from each condition are projected to the PCs with the highest divergence components. Interestingly, the PC1 captured the difference the most except for the stratified kernel, where the divergence was spread over several PCs. The projections show clear separation of the two clusters; however, individual projections show a distinct structure of the feature space.

Figure 2:

Visualization of maximally divergent PC for difference in the count statistic. The spike train distribution of Figure 1 is divided into two subsets with the same phase distribution but a different count distribution. The upper half of the raster has a higher rate. The spike trains from each distribution conditioned on the count are plotted with different symbols, and the conditional mean is plotted as a solid symbol. Inset bar plots show the first 10 normalized divergence (see equation 5.1) per PC in the same order as the eigenspectra in Figure 1. Hypothesis testing is rejected for all kernels with p<0.01.

Figure 2:

Visualization of maximally divergent PC for difference in the count statistic. The spike train distribution of Figure 1 is divided into two subsets with the same phase distribution but a different count distribution. The upper half of the raster has a higher rate. The spike trains from each distribution conditioned on the count are plotted with different symbols, and the conditional mean is plotted as a solid symbol. Inset bar plots show the first 10 normalized divergence (see equation 5.1) per PC in the same order as the eigenspectra in Figure 1. Hypothesis testing is rejected for all kernels with p<0.01.

For set B (see Figure 3), the count statistic was held constant; therefore, Kcount trivially fails to discriminate them. Except for count kernel, the separation is clearly made, and again the distinct structure of the feature space contains the phase information. For stratified kernel, since the differences in phase are present in each of six strata, the divergence components are also spread. Note that the set of PCs with high divergence are often disjoint for sets A and B—Kspikernel, KnCI, KiSch and KeSch, and Kstrat—an indication that the count feature and location features are not mixed for this data set.

Figure 3:

Visualization of maximally divergent PC for difference in the location statistic. The spike train distribution of Figure 1 (and Figure 2) is divided into two subsets with the same count distribution but a different phase distribution. Hypothesis testing (as described in section 5.2) rejected the null for all kernels (p<0.05) except for Kcount (p>0.9). Note that the scatter plot for count kernel is merely showing a numerical error.

Figure 3:

Visualization of maximally divergent PC for difference in the location statistic. The spike train distribution of Figure 1 (and Figure 2) is divided into two subsets with the same count distribution but a different phase distribution. Hypothesis testing (as described in section 5.2) rejected the null for all kernels (p<0.05) except for Kcount (p>0.9). Note that the scatter plot for count kernel is merely showing a numerical error.

Although in this example the two sets of spike trains were disjoint, and hence separable in feature space, this, is not crucial. We emphasize that it is not necessary for the kernel to cluster the sample points to derive a good divergence measure, but only the distance between the mean element matters. Even if the support of the projection of each class completely overlays each other, theoretically the difference between the probability laws will be captured in the distance between the corresponding mean elements if the kernel is spd.

### 5.2.  Hypothesis Testing.

In the previous section, we used KPCA analysis to qualitatively assess feature sensitivity. Next, we quantify the power of each kernel with hypothesis testing. Given two sets of spike trains, we decide if the underlying point processes that generated each of them are distinct. By comparing the performance for the detection of different statistics, we can quantify which features of interest the kernels are sensitive to. We also compare the Wilcoxon rank sum test, a nonparametric test widely used in neuroscience.

• •

Surrogate test. To generate a distribution of the statistic under the null hypothesis, we use the pooled observations as surrogate. Given realizations and , we randomly divide the pooled observations in segments of size n and m to generate the surrogate values. This procedure can be implemented efficiently by shuffling the Gram matrix generated by the pooled observations appropriately (Gretton et al., 2007; Gretton, Fukumizu, Harchaoui, & Sriperumbudur, 2009). Each bootstrapping had 9999 random shufflings of the kernel matrix. For the simulation studies, we report the rejection rate of the hypothesis from 200 independent trials.

• •

Parameter Selection. Although strictly positive-definite kernels provide an efficient way of inducing measures of divergence in the space of spike trains, appropriate hypothesis testing requires choosing a suitable value for the free parameters, since the discriminability of two distributions heavily relies on the characteristics of the underlying kernel. Unfortunately, there are no established methods for this task. A trivial solution to this problem could be used to test the hypothesis over different values of the parameters and choose the one that provides the best discriminability. But this approach faces the problem of multiple testing. To avoid such situations, we consider a different approach, proposed by Sriperumbudur et al. (2009), by redefining the measure of divergence as , where is a set of kernels (e.g., kernels over different parameter values). It has been shown that is a measure of divergence if at least one in is strictly positive-definite (Sriperumbudur et al., 2009). We exploit this fact and use instead of to avoid multiple testing. We employ a grid of parameter values to construct from a certain as follows.

As mentioned before, spikernel uses binned representation. To reduce the computational load, we fix the bin size such that the number of bins would be 20 for the time interval of interest. Besides bin size, spikernel has four extra parameters (Shpigelman et al., 2005). We fix pi=1 and evaluate the kernel for 12 different parameter settings by combining , which are part of the range of parameters the authors chose (Shpigelman et al., 2005). We use the implementation in C provided in the Spider package (http://people.kyb.tuebingen.mpg.de/spider).

The free parameters for the other kernels are chosen to be 0.1, 0.5, and 0.9 quantiles of the numerator by random subsampling of spike trains, in addition to half of the 0.1 quantile and twice the 0.9 quantile (for a total of five choices). This is a variation of a common practice in kernel methods to match the kernel size to the scale of the data (Schölkopf, 1997). When the parameters are nested (KnCI, KeSch), the overall scale parameter was chosen for each value of the parameter inside (e.g., size of the smoothing), thus resulting in 25 choices of parameter pairs in total.

For the stratified kernel Kstrat, we use a gaussian kernel for each dimension. For each dimension, we scale the kernel size so that it matches the empirical mean l2 distance of a homogeneous Poisson process conditioned on the total number of spikes. The mean l2 distance of a homogeneous Poisson process in each stratum is a slowly growing function of dimension, where the rate of growth quickly reduces. The performance of such scaling is similar to using a single kernel size for all dimensions and does not change the conclusions.

#### 5.2.1.  Homogeneous Poisson Processes.

The most common feature is the mean count, or homogeneous mean firing rate. We test the discriminability on homogeneous Poisson processes with rate 2 and 4 spikes within the interval where the randomness is maximized given the mean rate (Rényi, 1964). In Figure 4A, the Wilcoxon test, and dissimilarities induced by Kcount, Kspikernel, KmCI, and Kreef, perform well in this task, rejecting more than 90% with just 24 trials. Note that the spd kernels, especially stratified kernels, perform poorly compared to Wilcoxon for this task. However, features other than count distribution are not captured by the Wilcoxon test and divergence is not induced by the count kernel. Therefore, for the subsequent experiments, we intentionally keep the mean firing rates of the two point processes constant in order to emphasize that the spd kernels are indeed sensitive to other statistical properties of the point processes.

Figure 4:

Performance of each kernel quantified by the rejection rate of hypothesis testing. There are five artificially generated point processes, A–E (see section 5.2 for details). Raster plots on the left show 20 realizations from each point process. For each pair of point processes, the bar plots show the rejection rates as a function of number of samples for each kernel. n is the number of spike trains in each class.

Figure 4:

Performance of each kernel quantified by the rejection rate of hypothesis testing. There are five artificially generated point processes, A–E (see section 5.2 for details). Raster plots on the left show 20 realizations from each point process. For each pair of point processes, the bar plots show the rejection rates as a function of number of samples for each kernel. n is the number of spike trains in each class.

#### 5.2.2.  Inhomogeneous Poisson Process.

As firing rate modulation over time is an abundantly observed phenomenon in neural code experiments, we test the relative performance of each measure. We use two inhomogeneous Poisson processes with same mean rate (five spikes in the interval) but with a different rate profile. The interval is divided in half, and a step change in the intensity function was introduced such that the first half and second half are constant rates. The rates were chosen to be two and three spikes per half interval for the two halves, respectively, for one process and three and two spikes for the other process.

Due to the high dispersion of count distribution that is identical for both processes, Kstrat poorly performs but does not completely fail (see Figure 4B). KmCI is designed to be sensitive to the (smoothed) mean rate and performs well, as expected. Kspikernel, KnCI, KeSch, KiSch performs on par with KmCI.

#### 5.2.3.  Two Spikes with Correlation.

Neurons can integrate and hence be history dependent on firing or can have a short membrane time constant and forget the past spiking. We explore a simple and instructive example to show if the divergences can detect such differences in temporal correlations. We investigate two point processes having at most two spikes each, where the processes share the same marginal intensity function but have different correlation structures (Seth, Park et al., 2010). In the first process, the two event timings are correlated, that is, the interspike interval (ISI) has a narrow distribution, whereas in the second process, the two event timings are independent, that is, each action potential has its own precise timing distribution. Both point processes have a lossy noise; each action potential has a probability of missing with probability p=0.1. The temporal jitter of each spike is 100 ms, and the interval is 300 ms. Since the highest dimension of the problem is 2, this problem is easier for the stratified kernel (see Figure 4C). Nevertheless, all spd kernel–induced divergences quickly catch up as the number of sample increases, while pd kernels are unable to discriminate.

#### 5.2.4.  Renewal Versus Poisson Process.

In neural systems, spike timing features that deviate from a Poisson process, such as a refractory period, oscillation, and bursting, are often modeled with a renewal process (Nawrot et al., 2008). In this experiment, we compare a stationary renewal process with gamma interval distribution and an equi-rate homogeneous Poisson process (see Figure 4D). We observe that Kspikernel performs the best, and KnCI performs on par with the spd kernels. However, recall that this performance of Kspikernel comes at the price of a higher computational cost.

#### 5.2.5.  Precisely Timed Spike Trains.

When the same stimulation is presented to a neuronal system, the observed spike trains sometimes show a highly repeatable spatiotemporal pattern at the millisecond timescale. Recently these precisely timed spike trains (PTST) have been abundantly reported in both in vivo and in vitro preparations (Reinagel & Reid, 2002; DeWeese, Wehr, & Zador, 2003; Johansson & Birznieks, 2004). Despite being highly reproducible, different forms of trial-to-trial variability have also been observed (Tiesinga, Fellous, & Sejnowski, 2008).

We model a precisely timed spike train in an interval by L pairs of probability density and probability pairs . Each fi(t) corresponds to the temporal jitter, and pi corresponds to the probability of generating (otherwise missing) the spike. Each realization of the PTST model produces at most L spikes, that is, L determines the maximum dimension for the PTST. The equi-intensity Poisson process has the rate function . We test if the kernels can differentiate between the PTST and the equi-intensity Poisson process for L = 3. We choose fi(t) to be equal-variance gaussian distributions on a grid sampled from a uniform random variable and set pi=0.9.

We observe that the spd kernels again succeed in correctly rejecting the null hypothesis (see Figure 4E). However, several pd kernels, Kspikernel and KnCI, also perform on par, while the other pd kernels fail similarily the renewal process experiment. The success of these pd kernels can be explained in terms of their rather extensive construction compared to the the simpler kernels such as KmCI, which are linear in nature.

## 6.  Real Data

### 6.1.  Retinal Spiking Response.

To test the practical effectiveness of kernel-induced point-process divergences, we applied them to publicly available sensory response data.2 We used the rat retinal ganglion cell recordings from Nirenberg's lab (Jacobs, Grzywacz, & Nirenberg, 2006). There were 22 stimulus conditions repeated 30 times while recording 3.1 seconds long spike trains from 15 neurons. The task is to discriminate different stimulus conditions given the spike trains from a single neuron.

We compute the rejection rate of the result of hypothesis testing. If an experimenter were to select neurons that were sensitive to differences in the stimulus conditions, the rejection rate can be interpreted as the probability that he or she would correctly categorize the neuron as being sensitive (given a fixed false-positive rate of 10%). The rejection rate of the null hypothesis is shown in Figure 5. We organized the stimulus conditions in two ways: we randomly paired the 22 stimulus conditions to form 11 hypothesis tests, or, for each neuron, we paired the stimulus conditions to have a close median firing rate response by sorting the conditions and pairing neighbors for the test. The latter pairing is to emphasize the detectability for features other than mean count.

Figure 5:

Performance on spike trains recorded from retinal ganglion neurons. (A) Rejection rate of each divergence measure for neighboring rate conditions. The dashed line is the test size (p=0.1). The errors bars show standard error over 165 tests. (B) Rejection rate for randomized stimulus conditions. (C) Thirty spike trains from two neighboring stimulus conditions for cell 113. Mean number of spikes per trial was 30. Only Schoenberg (e) and nCI rejected (p<0.05). mCI was the next closest to rejecting (p=0.11).

Figure 5:

Performance on spike trains recorded from retinal ganglion neurons. (A) Rejection rate of each divergence measure for neighboring rate conditions. The dashed line is the test size (p=0.1). The errors bars show standard error over 165 tests. (B) Rejection rate for randomized stimulus conditions. (C) Thirty spike trains from two neighboring stimulus conditions for cell 113. Mean number of spikes per trial was 30. Only Schoenberg (e) and nCI rejected (p<0.05). mCI was the next closest to rejecting (p=0.11).

For spikernel, we spanned 24 parameters (the same as the simulation section, but allowing a bin size of 50 ms as well as 150 ms). Despite being most optimized, Kspikernel was the slowest kernel and yet did not perform as well as KmCI. The difference between spikernel and mCI, nCI, Schoenberg kernels was more pronounced for the neighboring rate pair condition. Overall KeSch and KnCI were the best, getting close to rejecting all responses due to different stimuli. Stratified kernel performed the worst due to the large number of spikes (on average 20 to 50 spikes per trial).

### 6.2.  Directionality Detection Task in Monkey MT.

Area MT (middle temporal) of the visual system is known to have neurons tuned to directional visual flow. The classical theory of MT is that the mean firing rate of these neurons encodes the strength of directional motion (Albright, 1984). Detecting such a tuning without any underlying assumptions, and with a small number of repeats, is often crucial for neuroscientists. We apply the proposed kernel-based divergences to single MT neuron recordings from a detection task performed by monkeys (Britten, Newsome, Shadlen, Celebrini, & Movshon, 1996).

We extracted a data set from a freely available online repository to obtain responses from 180 direction-sensitive MT neurons (Britten et al., 1996).3 We selected 30 trials for high-coherence levels of 12.8% and 25.6% (or low-coherence levels 3.2% and 6.4% for a more difficult task), each for the preferred direction and null direction. Constant strength motion stimulus is presented for 2 seconds, and we select a window at the end of 2 seconds that corresponds to average 3 spikes over the selected set of trials (for both directions).

Consistent with the rate coding under constant stimuli, the hypothesis testing for both pairs of conditions suggests that mean spike count is the main feature for discrimination (see Figure 6). The high-coherence task is easier for the monkey as well as for the hypothesis test.

Figure 6:

Distinguishing the preferred direction from the null direction stimulus from single-neuron responses in MT. (A) High-coherence trials. (B) Low-coherence trials. The gray area represents the fraction of rejections that agrees with Wilcoxon rank sum test.

Figure 6:

Distinguishing the preferred direction from the null direction stimulus from single-neuron responses in MT. (A) High-coherence trials. (B) Low-coherence trials. The gray area represents the fraction of rejections that agrees with Wilcoxon rank sum test.

Interestingly the kernels that showed better performance than the Wilcoxon rank sum task for simulations and results from the retina data set performed worse in this data set. This data set is collected from single-electrode electrophysiology where the experimenter explored the neurons in MT and selected direction-selective cells (Britten et al., 1996). We argue that this selection bias could have resulted in a strong discriminability feature in the mean count, and therefore simpler tests performed better, while spd kernels are distracted by other features of the data.

## 7.  Conclusion

In this letter, we have studied strictly positive-definite kernels on the space of spike trains. A strictly positive-definite kernel is more powerful than a positive-definite kernel in the sense that it offers a unique functional representation of a probability law. Therefore, the former induces a divergence—a tool that can be applied in testing equality of distribution—while the latter induces only a dissimilarity. However, the performance of a kernel is governed not only by its mathematical properties but also by its computational simplicity and choice of free parameters. We have performed an extensive set of simulations to empirically compare the performance and complexity of both the existing positive-definite kernels and the proposed strictly definite kernels qualitatively, through kernel PCA, and quantitatively, through hypothesis testing, on both synthetic and real data. These results reveal useful insights into the characteristics of these kernels and also explore their practicality.

The simulation results reveal that some pd kernels often perform equally well compared to spd kernels. This observation is counterintuitive but not surprising; although spd kernels should asymptotically perform better in theory, this does not guarantee good performance for the small-sample regime. In fact, it is natural for a pd kernel designed to capture the exact discriminating statistical feature to outperform a generic spd kernel. For example, if a neuronal system exhibits a rate-coding scheme, then it is sufficient, and perhaps best, to use a kernel that is sensitive only to the rate of the neural signal. Therefore, the choice of kernel when studying neural codes should depend on how information is encoded and what the trial-to-trial variability structure is. However, assuming a neural code beforehand can lead to bias in the analysis; therefore, when in doubt, it is advantageous to have a generalist spd kernel.

We have also observed that for the real data sets, the count kernel, which fails for all but one artificial data set by design, often performs on par with or better than the sophisticated kernels in many data sets, including one shown in section 6.2. We believe that this is because the neurons were selected under the assumption that the mean firing rate, which is the statistic that the count kernel or Wilcoxon test is sensitive to, is modulated. It is possible that neurons that modulate responses in different ways are overlooked due to the unavailability of state-of-the-art analysis tools. Therefore, the use of strictly positive-definite kernels would potentially allow new discoveries through observing these seemingly irrelevant neuronal responses. We conclude by stating two partially novel aspects of this framework that we have not explicitly addressed in this letter:

1. An spd (or a pd) kernel induces a (pseudo-) metric, since the kernel can be treated as an inner product in a reproducing kernel Hilbert space (Schölkopf & Smola, 2002). Therefore, in essence, this approach also provides families of distance measures. This attribute relates this letter to work on spike train metrics (Victor & Purpura, 1997; van Rossum, 2001). However, both Victor and Purpura (1997) and van Rossum (2001) restrict themselves to designing distance measures between two spike train realizations of point processes, whereas our work is more general since it introduces distance between two point processes (or collection of spike trains).

2. The use of strictly positive-definite kernels is not restricted to the design of divergence; the characteristic nature of such kernels allows them to embed a probability law—a point process in our case—as an element in the reproducing kernel Hilbert space (Smola, Gretton, Song, & Schölkopf, 2007). This allows probabilistic operations such as Bayesian inference to be performed in terms of functional operations without the need of estimating or approximating the probability law (Song, Boots, Siddiqi, Gordon, & Smola, 2010). Therefore, the content of this letter naturally links these advanced machine learning tools to neuroscience.

We leave exploring these directions for future work. The code for spike train kernels and divergence tests is available at the author's Web site: http://www.memming.com.

## Acknowledgments

We are grateful to the referees for their very helpful comments, which improved this letter significantly. This work was supported by NSF grant ECCS-0856441 and DARPA grant N66001-10-C-2008.

## Notes

1

Strict in a measure-theoretic sense, which is stronger than the usual countable summation sense (see definition 1).

2

Available online from http://neurodatabase.org (Gardner, 2004).

3

Neural Signal Archive: http://www.neuralsignal.org/.

## References

References
Albright
,
T. D.
(
1984
).
Direction and orientation selectivity of neurons in visual area MT of the macaque
.
Journal of Neurophysiology
,
52
(
6
),
1106
1130
.
Berg
,
C.
,
Christensen
,
J. P. R.
, &
Ressel
,
P.
(
1984
).
Harmonic analysis on semigroups: Theory of positive definite and related functions
.
New York
:
Springer-Verlag
.
Britten
,
K. H.
,
Newsome
,
W. T.
,
,
M. N.
,
Celebrini
,
S.
, &
Movshon
,
J. A.
(
1996
).
A relationship between behavioral choice and the visual responses of neurons in macaque MT
.
Visual Neuroscience
,
13
(
1
),
87
100
.
Christmann
,
A.
, &
Steinwart
,
I.
(
2010
).
Universal kernels on non-standard input spaces
. In
J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta
(Eds.),
Advances in neural information processing systems, 23
(pp.
406
414
).
Red Hook, NY
:
Curran Associates
.
Churchland
,
M. M.
,
Yu
,
B. M.
,
Cunningham
,
J. P.
,
Sugrue
,
L. P.
,
Cohen
,
M. R.
,
,
G. S.
, et al
(
2010
).
Stimulus onset quenches neural variability: A widespread cortical phenomenon
.
Nature Neuroscience
,
13
(
3
),
369
378
.
Daley
,
D. J.
, &
Vere-Jones
,
D.
(
1988
).
An introduction to the theory of point processes
.
New York
:
Springer
.
Dayan
,
P.
, &
Abbott
,
L. F.
(
2001
).
Theoretical neuroscience: Computational and mathematical modeling of neural systems
.
Cambridge, MA
:
MIT Press
.
DeWeese
,
M. R.
,
Wehr
,
M.
, &
,
A. M.
(
2003
).
Binary spiking in auditory cortex
.
Journal of Neuroscience
,
23
(
21
),
7940
7949
.
Diks
,
C.
, &
Panchenko
,
V.
(
2007
).
Nonparametric tests for serial independence based on quadratic forms
.
Statistica Sinica
,
17
,
81
98
.
Eichhorn
,
J.
,
Tolias
,
A.
,
Zien
,
A.
,
Kuss
,
M.
,
Rasmussen
,
C. E.
,
Weston
,
J.
, et al
et al. (
2004
).
Prediction on spike data using kernel algorithms
. In
S. Thrün, L. Soul, & B. Schölkopf
(Eds.),
Advances in neural information processing systems, 16
(pp.
1367
1374
).
Cambridge, MA
:
MIT Press.
Fisher
,
N. K.
, &
Banerjee
,
A.
(
2010
).
A novel kernel for learning a neuron model from spike train data
. In
J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta
(Eds.),
Advances in neural information processing systems, 23
(pp.
595
603
).
Red Hook, NY
:
Curran Associates
.
Gardner
,
D.
(
2004
).
Neurodatabase.org: networking the microelectrode
.
Nature Neuroscience
,
7
(
5
),
486
487
.
Gretton
,
A.
,
Borgwardt
,
K.
,
Rasch
,
M.
,
Schölkopf
,
B.
, &
Smola
,
A.
(
2007
).
A kernel method for the two-sample-problem
. In
B. Schölkopf, J. Platt, & T. Hoffman
(Eds.),
Advances in neural information processing systems, 19
(pp.
513
520
).
Cambridge, MA
:
MIT Press
.
Gretton
,
A.
,
Fukumizu
,
K.
,
Harchaoui
,
Z.
, &
Sriperumbudur
,
B.
(
2009
).
A fast, consistent kernel two-sample test
. In
Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, & A. Culotta
(Eds.),
Advances in neural information processing systems
,
22
.
Red Hook, NY
:
Curran Associates
.
Houghton
,
C.
(
2009
).
Studying spike trains using a van Rossum metric with a synapse-like filter
.
Journal of Computational Neuroscience
,
26
(
1
),
149
155
.
Hubel
,
D. H.
, &
Wiesel
,
T. N.
(
1959
).
Receptive fields of single neurones in the cat's striate cortex
.
Journal of Physiology
,
148
,
574
591
.
Jacobs
,
A.
,
Grzywacz
,
N.
, &
Nirenberg
,
S.
(
2006
).
Decoding the parallel pathways of the retina
. In
Society for Neuroscience
.
Program No. 47.10/G1
.
Johansson
,
R. S.
, &
Birznieks
,
I.
(
2004
).
First spikes in ensembles of human tactile afferents code complex spatial fingertip events
.
Nature Neuroscience
,
7
(
2
),
170
177
.
Kang
,
K.
, &
Amari
,
S.
(
2008
).
Discrimination with spike times and ISI distributions
.
Neural Computation
,
20
(
6
),
1411
1426
.
Li
,
L.
,
Park
,
I. M.
,
Seth
,
S.
,
Choi
,
J.
,
Francis
,
J. T.
,
Sanchez
,
J. C.
, et al. (
2011
).
An adaptive decoder from spike trains to micro-stimulationusing kernel least-mean-square (KLMS) algorithm
. In
Proceedings, IEEE Machine Learning for Signal Processing
,
Piscataway, NJ
:
IEEE
.
,
J. A.
(
1965
).
The entropy of a point process
.
Journal of the Society for Industrial and Applied Mathematics
,
13
(
4
),
988
994
.
Naud
,
R.
,
Gerhard
,
F.
,
Mensi
,
S.
, &
Gerstner
,
W.
(
2011
).
Improved similarity measures for small sets of spike trains
.
Neural Computation
,
23
(
12
),
3016
3069
.
Nawrot
,
M. P. P.
,
Boucsein
,
C.
,
Molina
,
V. R.
,
Riehle
,
A.
,
Aertsen
,
A.
, &
Rotter
,
S.
(
2008
).
Measurement of variability dynamics in cortical spike trains
.
Journal of Neuroscience Methods
,
169
(
2
),
374
390
.
Paiva
,
A. R. C.
,
Park
,
I.
, &
Príncipe
,
J. C.
(
2007
).
Innovating signal processing for spike train data
. In
Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society
.
Piscataway, NJ
:
IEEE
.
Paiva
,
A. R. C.
,
Park
,
I.
, &
Príncipe
,
J. C.
(
2009
).
A reproducing kernel Hilbert space framework for spike trains
.
Neural Computation
,
21
(
2
),
424
449
.
Paiva
,
A. R. C.
,
Park
,
I.
, &
Príncipe
,
J. C.
(
2010
).
A comparison of binless spike train measures
.
Neural Computing and Applications
,
19
,
405
419
.
Park
,
I.
, &
Príncipe
,
J. C.
(
2010
).
Quantification of inter-trial non-stationarity in spike trains from periodically stimulated neural cultures
. In
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
(pp.
5442
5445
).
Piscataway, NJ
:
IEEE
.
Pekalska
,
E.
, &
Duin
,
R. P. W.
(
2005
).
The dissimilarity representation for pattern recognition
.
River Edge, NJ
:
World Scientific Publishing
.
Perkel
,
D. H.
,
Gerstein
,
G. L.
, &
Moore
,
G. P.
(
1967
).
Neuronal spike trains and stochastic point processes. I. The single spike train
.
Biophysical Journal
,
7
(
4
),
391
418
.
Pinkus
,
A.
(
2004
).
Strictly Hermitian positive definite functions
.
Journal d'Analyse Mathématique
,
94
(
1
),
293
318
.
Rao
,
M.
,
Seth
,
S.
,
Xu
,
J.
,
Chen
,
Y.
,
Tagare
,
H.
, &
Príncipe
,
J. C.
(
2011
).
A test of independence based on a generalized correlation function
.
Signal Processing
,
91
(
1
),
15
27
.
Reinagel
,
P.
, &
Reid
,
R. C.
(
2002
).
Precise firing events are conserved across neurons
.
Journal of Neuroscience
,
22
(
16
),
6837
6841
.
Rényi
,
A.
(
1964
).
On an extremal property of the Poisson process
.
Annals of the Institute of Statistical Mathematics
,
16
(
1
),
129
133
.
Schaback
,
R.
, &
Wendland
,
H.
(
2001
).
Multivariate approximation and applications
.
Cambridge
:
Cambridge University Press
.
Schölkopf
,
B.
(
1997
).
Support vector learning
.
Unpublished doctoral dissertation, Berlin Technical University
.
Schölkopf
,
B.
, &
Smola
,
A. J.
(
2002
).
Learning with kernels: Support vector machines, regularization, optimization, and beyond
.
Cambridge, MA
:
MIT Press
.
Schölkopf
,
B.
,
Smola
,
A.
, &
Müller
,
K. R.
(
1998
).
Nonlinear component analysis as a kernel eigenvalue problem
.
Neural Computation
,
10
(
5
),
1299
1319
.
Schrauwen
,
B.
, &
Campenhout
,
J.
(
2007
).
Linking non-binned spike train kernels to several existing spike train metrics
.
Neurocomputing
,
70
(
7–9
),
1247
1253
.
Seth
,
S.
,
Park
,
I.
,
Brockmeier
,
A.
,
Semework
,
M.
,
Choi
,
J.
,
Francis
,
J.
, et al
et al. (
2010
).
A novel family of non-parametric cumulative based divergences for point processes
. In
J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta
(Eds.),
Advances in neural information processing systems, 23
(pp.
2119
2127
).
Red Hook, NY
:
Curran Associates.
Seth
,
S.
,
Rao
,
M.
,
Park
,
I. M.
, &
Príncipe
,
J. C.
(
2010
).
A unified framework for quadratic measures of independence
.
IEEE Transactions on Signal Processing
,
59
,
3624
3635
.
Shpigelman
,
L.
,
Singer
,
Y.
,
Paz
,
R.
, &
,
E.
(
2005
).
Spikernels: Predicting arm movements by embedding population spike rate patterns in inner-product spaces
.
Neural Computation
,
17
(
3
),
671
690
.
Smola
,
A.
,
Gretton
,
A.
,
Song
,
L.
, &
Schölkopf
,
B.
(
2007
).
A Hilbert space embedding for distributions
. In
ALT ’07: Proceedings of the 18th International Conference on Algorithmic Learning Theory
(pp.
13
31
).
Berlin
:
Springer-Verlag
.
Song
,
L.
,
Boots
,
B.
,
Siddiqi
,
S.
,
Gordon
,
G.
, &
Smola
,
A.
(
2010
).
Hilbert space embeddings of Hidden Markov models
. In
Proceedings of the International Conference on Machine Learning
.
:
Omnipress
.
Sriperumbudur
,
B.
,
Fukumizu
,
K.
,
Gretton
,
A.
,
Lanckriet
,
G.
, &
Schölkopf
,
B.
(
2009
).
Kernel choice and classifiability for RKHS embeddings of probability distributions
. In
Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta
(Eds.),
Advances in neural information processing systems, 22
(pp.
1750
1758
).
Red Hook, NY
:
Curran Associates
.
Sriperumbudur
,
B.
,
Fukumizu
,
K.
, &
Lanckriet
,
G.
(
2010
).
On the relation between universality, characteristic kernels and RKHS embedding of measures
.
Journal of Machine Learning Research
,
9
,
773
780
.
Sriperumbudur
,
B. K.
,
Gretton
,
A.
,
Fukumizu
,
K.
,
Lanckriet
,
G.
, &
Schölkopf
,
B.
(
2008
).
Injective Hilbert space embeddings of probability measures
. In
Proc. of the 21st Annual Conference on Computational Learning Theory
.
:
Curran Associates
.
Tiesinga
,
P.
,
Fellous
,
J.-M.
, &
Sejnowski
,
T. J.
(
2008
).
Regulation of spike timing in visual cortical circuits
.
Nature Reviews Neuroscience
,
9
,
97
107
.
van Rossum
,
M. C. W.
(
2001
).
A novel spike distance
.
Neural Computation
,
13
,
751
763
.
Victor
,
J. D.
(
2002
).
Binless strategies for estimation of information from neural data
.
Physical Review E
,
66
(
5
),
051903
.
Victor
,
J. D.
, &
Purpura
,
K. P.
(
1997
).
Metric-space analysis of spike trains: Theory, algorithms and application
.
Network: Computation in Neural Systems
,
8
(
2
),
127
164
.