## Abstract

In studies of the nervous system, the choice of metric for the neural responses is a pivotal assumption. For instance, a well-suited distance metric enables us to gauge the similarity of neural responses to various stimuli and assess the variability of responses to a repeated stimulus—exploratory steps in understanding how the stimuli are encoded neurally. Here we introduce an approach where the metric is tuned for a particular neural decoding task. Neural spike train metrics have been used to quantify the information content carried by the timing of action potentials. While a number of metrics for individual neurons exist, a method to optimally combine single-neuron metrics into multineuron, or population-based, metrics is lacking. We pose the problem of optimizing multineuron metrics and other metrics using centered alignment, a kernel-based dependence measure. The approach is demonstrated on invasively recorded neural data consisting of both spike trains and local field potentials. The experimental paradigm consists of decoding the location of tactile stimulation on the forepaws of anesthetized rats. We show that the optimized metrics highlight the distinguishing dimensions of the neural response, significantly increase the decoding accuracy, and improve nonlinear dimensionality reduction methods for exploratory neural analysis.

## 1. Introduction

Systematic analysis of neural activity is used to investigate the brain's representation of different conditions such as stimuli, actions, intentions, cognitive states, or affective states. The fundamental questions of the analysis are, “How much information does the neural response carry about the condition?” and, “How is this information represented in the neural response?” Information theory can be used to assess the first question, but in practice its application is nontrivial without answering the second question first due to the challenge in estimating information-theoretic quantities for neural data.

As neural signals vary across time and space, many electrodes and high-sampling rates are necessary to accurately record the spatiotemporal activity. The neural responses may be recorded in different forms invasively or noninvasively. Electrode arrays allow invasive recordings to capture both the timing of action potentials (spike trains) across many neurons and local field potentials (LFPs) across many electrodes.

Estimating decoding models, let alone an information-theoretic quantity such as mutual information, is difficult on diverse, high-dimensional data. Consequently, it is necessary to explicitly optimize an alternative, possibly low-dimensional, representation of the neural signals where it is easier to gauge their information content relevant to the experimental condition. At this point, it is necessary to make a distinction between the objective here—learning the representation of the neural response that is most relevant for the different conditions—and the neural coding problem—learning the representation of the stimulus space that is most relevant for the neural response.

An optimized representation of the neural response is useful for visualizing the single-trial variability (Churchland et al., 2010) and the similarity between neural responses to similar conditions (Brockmeier et al., 2013). The neural response representation may be optimized in a supervised or unsupervised manner.

A number of unsupervised methods have been used for the exploratory analysis of neural data (Stopfer, Jayaraman, & Laurent, 2003; Broome, Jayaraman, & Laurent, 2006; Brockmeier, Park, Mahmoudi, Sanchez, & Principe, 2010; Brockmeier, Kriminger, Sanchez, & Principe, 2011; Park, Seth, Rao, & Príncipe, 2012). Principal component analysis (PCA) and clustering are among the simplest, but the results from PCA may be unsatisfactory on neural data (Cowley et al., 2012). Other nonlinear approaches include distance embeddings (Sammon, 1969), kernel-based extension to PCA (Schölkopf, Smola, & Müller, 1998), and manifold learning algorithms that try to preserve the similarity structure between samples in a low-dimensional representation. Methods tend to concentrate on preserving either local (Roweis & Saul, 2000) or structural-based (Tenenbaum, De Silva, & Langford, 2000) similarities; novel samples can be mapped to the representation space using explicit mappings (Bunte, Biehl, & Hammer, 2012).

State-space models provide an unsupervised approach to explore the temporal evolution and variability of neural responses during single trials, and those with low-dimensional or discrete state variables (hidden Markov models) are useful to visualize the dynamics of the neural response (Radons, Becker, Dülfer, & Krüger, 1994; Seidemann, Meilijson, Abeles, Bergman, & Vaadia, 1996; Kemere et al., 2008; Yu et al., 2009a, 2009b). Petreska et al. (2011) have shown how a combination of temporal dynamics and a discrete state can efficiently capture the dynamics in a neural response. Henceforth, we consider only static representations of responses to a discrete number of conditions, or classes, collected over multiple trials.

In the supervised case, Fisher discriminant analysis (FDA) and extensions (Fukunaga, 1990; Baudat & Anouar, 2000; Sugiyama, 2007) use sample covariances from each class to form discriminative projections. The optimal projection is a solution to a generalized eigenvalue problem that maximizes the spread between samples in different classes while minimizing the spread of samples within the same class.

The dimensionality of the neural response can be reduced by feature selection (Kira & Rendell, 1992). The simplest approach is to find how informative each feature is for a given task and then select a set of informative, but not redundant, features (Guyon & Elisseeff, 2003; Peng, Long, & Ding, 2005; Yamada, Jitkrittum, Sigal, Xing, & Sugiyama, 2013), or a set of features may be obtained by backward- or forward-selection algorithms (Song, Bedo, Borgwardt, Gretton, & Smola, 2007; Song, Smola, Gretton, Bedo, & Borgwardt, 2012).

In a broad sense, linear projections and dimensionality reduction are just special cases of metric learning algorithms, which change the metric in the original space to achieve some specific objective, usually exploiting supervisory information (Lowe, 1995; Xing, Ng, Jordan, & Russell, 2003; Fukumizu, Bach, & Jordan, 2004). Often metric learning is used as an intelligent preprocessing for classification methods that depend on a measure of similarity or dissimilarity to determine if two samples are of the same class. For instance, kernel machines or nearest-neighbor-based approaches compare novel samples relative to already observed samples but rely on a predefined similarity measure. These classifiers are highly dependent on the preprocessing and offer little insight into the importance of individual feature dimensions. Metric learning can improve the classification performance by adjusting the importance of individual features for the task (Lowe, 1995; Takeuchi & Sugiyama, 2011), and these weights can be used to highlight the features or dimensions relevant for the objective. Furthermore, metric learning approaches can also improve kernel regression (Fukumizu et al., 2004; Navot, Shpigelman, Tishby, & Vaadia, 2006; Weinberger & Tesauro, 2007).

We investigate classes of metrics that are parameterized along spatiotemporal dimensions of neural responses. For instance, it is natural to consider which channels or time points in the neural response are most useful for distinguishing among conditions. Unlike previous metric learning approaches that have concentrated on learning projections and weightings for scalar-valued variables, here we also explore using metric learning where the weights correspond to different neurons in multiunit spike train metrics or vectors of spatial amplitudes from multichannel LFPs.

With vector-valued data, each individual metric is defined over a vector space. Using a weighted combination of these metrics, we can form strictly spatial or temporal weightings for multivariate time series. In addition, we propose to optimize multineuron spike train metrics (Aronov, 2003; Houghton & Sen, 2008) formed as combinations of spike train metrics defined for individual neurons (Victor & Purpura, 1996; Van Rossum, 2001; Paiva, Park, & Príncipe, 2009). To our knowledge, ours is the first attempt to explicitly optimize the parameters of multineuron spike train metrics instead of using predefined weightings.

Given the form of the projections or weightings, one must consider their optimization. A number of metric learning cost functions have been posed in the literature, but we propose using a kernel-based measure of dependence known as centered alignment (Cortes, Mohri, & Rostamizadeh, 2012). Centered alignment was shown to be a useful measure for kernel learning (Cortes et al., 2012), and a similar but unnormalized kernel dependence measure, the Hilbert-Schmidt information criterion (HSIC), has been used for feature selection (Song et al., 2012). Another kernel-based dependence measure formulated based on conditional entropy (Sanchez Giraldo & Principe, 2013) has also been shown to be useful for learning a Mahalanobis distance (Sanchez Giraldo & Principe, 2013; Brockmeier et al., 2013). The objective and optimization techniques used here are most similar to those proposed by Fukumizu et al. (2004), but by replacing the kernel-based canonical correlation measure (Bach & Jordan, 2003) with centered kernel alignment we avoid both matrix inversion and regularization.

Using the kernel-based objective, we highlight the connection between optimizing weighted tensor product kernels and metric learning. Optimizing a metric in a kernel-based framework has the added benefit that it naturally optimizes the kernel itself for use in support vector machines. This eliminates the need for the user to choose a kernel size through cross-validation or trial and error. Kernels also provide a straightforward way to form metrics corresponding to nonlinear projections. This is done by retrieving a metric from an unweighted sum of optimized kernels, an approach distinct from optimizing a convex sum of kernels (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004). Ultimately this leads us to propose optimized multineuron spike train kernels formed as the product and the sum of product of single-unit spike train kernels (Paiva et al., 2009; Park et al., 2012; Park, Seth, Paiva, Li, & Principe, 2013).

The rest of the letter is organized as follows. We first introduce the mathematical representation of the neural response and different metrics that use linear projections or weightings; from the metrics, we form kernel-based similarity measures; from the kernels, we introduce the dependence measure (Cortes et al., 2012); and from these, we form metric learning optimization problems. We verify the classification performance of the proposed approach on benchmark data sets, and show results on experimental data sets consisting of both local field potentials (LFPs) and spike trains. We conclude with a discussion of the results, the connection of metric learning to neural encoding, and future applications.

## 2. Metrics and Similarity Functions

### 2.1. Neural Data Representation and Metrics.

For the trial-wise classification of different conditions, a sample from each trial is the concatenation of the neural response across all selected time samples, electrode channels, or neural spiking units. Let denote the *P*-dimensional neural response to a given trial, where parenthetical subscripts denote the response dimension: *x*_{(i)} may be a scalar, vector, or set (in the case of spike trains). Let *x _{j}* denote the neural response for the

*j*th trial, , and let denote the discrete class label corresponding to a certain class condition for the

*j*th trial. The set represents a joint sample of the neural responses and labels.

Consider the distances between pairs of neural responses along each dimension: the distance between samples on the *i*th dimension is denoted . For instance, this may be the Euclidean metric for scalars or vectors or a metric on spike trains (Van Rossum, 2001; Paiva et al., 2009; Dubbs, Seiler, & Magnasco, 2010).

*w*denote a nonnegative

*P*-dimensional weighting vector, such that . The metric using this weighting is formed as where the exponent controls how relatively large distances on individual dimensions contribute to the total distance. This is a weighted Euclidean metric if and the metric for each dimension is the Euclidean or metric.

If *w _{i}*=0, then the metric is actually a pseudo-metric since it does not satisfy the property that if and only if . However, this invariance to certain dimensions is a goal of metric learning, and we will no longer make the distinction between pseudo-metrics and metrics. For vector-valued data, the weighting is a special case of a linear transformation that is equivalent to globally scaling the input dimensions: .

*Q*-dimensional feature space, is equivalent to a Mahalanobis distance in the original space, with the inverse covariance matrix replaced by the symmetric positive-definite matrix

*AA*

^{T}: As

*A*has coefficients, there are more degrees of freedom to distort the space according to this metric. This matrix is strictly positive definite if

*P*=

*Q*and

*AA*

^{T}is full rank; otherwise the metric is actually a pseudo-metric.

*A*is diagonal and square, with

*P*diagonal entries . More generally, a Mahalanobis distance can be seen as a weighting over a squared distance matrix between all dimensions. Using properties of the trace, where . Unless

*A*is diagonal, this metric exploits the inner product between different dimensions. Written in this form, it is clear how a Mahalanobis-type metric can be formed whenever all the dimensions of the neural response correspond or can be mapped to the same Hilbert space. Specifically, let define a mapping from any element to an element in the Hilbert space . The Mahalanobis-type metric in this space is defined by equation 2.3, where . As long as the inner product can be defined between the dimensions, for instance, by using the spike train kernels discussed in section 2.3, one can form metrics that use the distance between different spiking units. This would replicate the interaction between spikes on different units intrinsic to some multiunit metrics (Aronov, 2003). However, evaluating the inner product between each pair of dimensions for every pair of samples is computationally demanding and is not investigated here.

### 2.2. Kernels for Neural Responses.

Kernel functions are bivariate measures of similarity based on the inner product between samples embedded in a Hilbert space. Let the domain of the neural response be denoted and consider a kernel . If is positive definite, there is an implicit mapping that maps any element to an element in the Hilbert space such that .

Because we want to explore the similarity across the individual dimensions of the data, we compose a joint similarity measure from the marginal similarity on each dimension. Let denote the neural response domain of the *i*th dimension and consider a positive-definite kernel and corresponding mapping for this dimension. The similarity between a pair of samples *x* and on the *i*th dimension is .

The joint similarity over both dimensions *i* and *j* is computed by taking the product between the kernel evaluations . The new kernel is called a tensor product kernel since it corresponds to using a mapping function that is the tensor product between the individual mapping functions where . The product of positive-definite kernels is positive definite, and taking the product over all dimensions returns a positive-definite kernel over the joint space: .

*i*th dimension to the power . As , the influence of the

*i*th dimension decreases, and , thereby removing its effect altogether. Taking the product over all dimensions results in a weighted product kernel over the joint space: where denotes the nonnegative parameter vector. However, not all positive-definite kernels can be taken to an arbitrary power and still be positive definite. Only the class of positive-definite kernels that are infinitely divisible (Horn, 1967) can be taken to arbitrary powers such that the resulting kernel is positive definite.

There are many infinitely divisible kernels, but our interest in metric learning leads us to the special case of kernels that are functions of distances . Here we rely on the work of Schoenberg (1938), who explored the connection between distance metrics and positive-definite kernel functions: a kernel that is a function of distance metric is positive definite only if the metric space can be isometrically embedded in Hilbert space. From Schoenberg (1938), theorem 4, the most general function *f*(*u*), which is bound away from zero and whose positive powers are positive definite, is of the form , where is positive definite and *c* is a constant. For kernels of this form, positive powers scale the constant and function .

Thus, a class of kernels whose positive powers are all positive definite are of the form , where is positive definite. Given a metric, , is positive definite for only certain choices of . In particular, if corresponds to a *p*-norm or an metric, then is positive definite for (Schoenberg, 1938). Furthermore, the kernel is positive definite for and (Schoenberg, 1938). For *p*<1, *d _{p}* is not actually a metric since it violates the triangle inequality; nonetheless, is embeddable in a vector space for .

*Q*gaussian kernels: where .

*Q*product kernels: Let denote the implicit mapping defined by the sum kernel. This mapping defines a metric between

*x*and that corresponds to the distance in the Hilbert space,

In terms of a group of samples, the power of the distance matrix for the *i*th dimension is denoted where . The notation denotes that each element is squared where denotes the entry-wise (Hadamard) product as opposed to the matrix product *D*^{2}=*DD*.

The kernel matrix for the *i*th dimension is . The kernel matrix for the product and sum kernels are computed as and . The labels of the trials can also be represented by a kernel matrix *L*, where each entry uses the 0-1 kernel— if and if .

### 2.3. Neural Metrics.

Out of the many possible neural response metrics, we consider the following metrics:

*Temporal metrics for multivariate time series.*Each individual distance is the Euclidean distance between the vectors of instantaneous amplitudes across the channels. Each weight corresponds to a particular time lag. The weight adjusts the importance of the distance between the spatial patterns of the two samples at that particular time.*Spatiotemporal projections.*A linear projection matrix is used to form a Mahalanobis distance.*Spike train metrics.*Each individual distance is between spike trains on the same unit at different temporal precisions. The weight adjusts the importance of each unit at a particular temporal precision value. There are a number of spike train metrics; we consider two:

- •
*Spike train alignment metric.*The metric is the or version of the Victor-Purpura (VP) spike train distance (Dubbs et al., 2010; Victor & Purpura, 1996). - •
*Kernel-based spike metric.*The metric is defined by the mapping induced by a spike train kernel (Park et al., 2012, 2013). We use the memoryless cross-intensity (mCI) spike train kernel (Paiva et al., 2009). Let and be two sets of spike times; the kernel is defined as Then is an metric (Paiva et al., 2009).

Based on these metrics, we use kernel functions as measures of similarity. On each individual dimension, we use either the gaussian kernel for the Euclidean and distances or the Laplacian for metrics such as the original Victor-Purpura metric. The kernels for individual dimensions are combined using the tensor product kernel, equation 2.5. The sum of product kernels, equation 2.7, consists of an unweighted sum of the weighted product kernels with different weightings. For the Mahalanobis metric, equation 2.2, a multivariate gaussian kernel, equation 2.6, is used.

## 3. Kernel-Based Metric Learing

We introduce a kernel-based measure to quantify the joint information between neural responses and labels corresponding to the stimulus or condition. The measures can be used as an objective function to optimize the metric used to evaluate the similarity among neural responses.

### 3.1. Kernel-Based Measures of Dependence.

Kernel target alignment measures the similarity between two kernel functions using their normalized inner product (Cristianini, Shawe-Taylor, Elisseeff, & Kandola, 2002). For jointly sampled data, the inner product of kernel functions defines a measure of dependence between random variables (Gretton, Bousquet, Smola, & Schölkopf, 2005). Unlike Pearson's correlation coefficient, which uses the values of the random variables, kernel-based dependence assesses the degree to which the similarity of example pairs, as defined by each kernel function, matches or aligns. In terms of distance-based kernel functions, the dependence could be posed as, “Do nearby examples, as defined by the first random variable, correspond to nearby examples in the second random variable?”

*z*=(

*x*,

*y*) represents a joint sample of and and depend on only

*x*and

*y*, respectively: and . The marginal behavior of the kernels can be expressed in terms of their mapping functions: Then: is a measure of statistical dependence between

*x*and

*y*, since it is higher when similar pairs of one variable correspond to similar pairs in the other variable. However, the measure performs poorly in practice without centering the kernels first (Cortes et al., 2012).

*x*and

*y*.

*K*and

*L*where and : where and are the centered kernel matrices. The centered kernel is computed as Using matrix multiplication, , where is the empirical centering matrix,

*I*is the identity matrix, and

**1**is a vector of ones. The computational complexity of centered alignment between two kernel matrices is .

### 3.2. Metric Learning Optimization.

Our objective for metric learning is to maximize the dependence between the neural data representation and the class label. Centered alignment is used to evaluate the dependence in terms of the kernel representations. The 0-1 kernel on the labels is fixed, and the parameters of a metric-based kernel defined in section 2.2 are optimized in order to maximize the centered alignment.

For convenience, we use the logarithm of the centered alignment as the objective. With or without the logarithm, the kernel-based objective is a nonlinear function of the parameters, and we propose using approximate inverse Hessian and stochastic gradient methods for optimization. We detail the gradients below.

*L*is the 0-1 kernel matrix of the labels,

*H*is the centering matrix, and

*k*is a constant that does not depend on . The gradient of the objective function with respect to the kernel matrix is The gradient kernel with respect to the kernel parameter is . Then the gradient of the objective is The nonnegativity constraint on can be removed by performing the optimization in terms of

*u*where . The gradients can be made in terms of unconstrained optimization variables

*u*by . This yields an unconstrained optimization.

_{i}*A*is where

*X*is a matrix of the data;

*G*is the gradient of the objective function with respect to the kernel matrix, equation 3.13; and

**1**is a vector of ones.

For the approximate inverse Hessian optimization, we use (Schmidt, 2012), using the default limited memory BFGS update. For the sum and product kernels, prior to optimization the individual matrices are all normalized such that the average across all elements is 1. For the product kernel, all the weights are initialized to be 10^{−3}, and for the sum of product kernels, they are uniformly distributed in . For the Mahalanobis distance, the optimization of *A* yields varying results depending on the initial value of *A*, but using the projection from Fisher discriminant analysis for initialization performs well in practice.

As an alternative optimization that can handle large sample sizes, we use a stochastic gradient over small batches. Specifically, we use a paradigm commonly used in feature selection: at each iteration, one example is sampled and then a prespecified number of examples of the same class and from differing classes are sampled to form the batch (Kira & Rendell, 1992). For each batch, the weights are updated based on the gradient of the objective. Very small batches—even just four examples—are sufficient for learning the parameters of the product kernel, but to learn a Mahalanobis distance, we found that large batches, in the hundreds, are necessary.

## 4. Benchmark Comparison

Using publicly available data sets, we compare centered alignment metric learning to optimize a weighted metric to a feature-weighting method (Takeuchi & Sugiyama, 2011). The feature weighting is explicitly optimized to improve the *k*-nearest-neighbor classification; this serves as a benchmark for centered alignment metric learning, which is not tuned to any particular classifier. The method was shown to consistently outperform other feature-weighting methods. For a valid comparison, the specifics of the benchmark comparison by Takeuchi and Sugiyama (2011) are replicated; we use the same UCI machine learning data sets (Bache & Lichman, 2013; Cortez, Cerdeira, Almeida, Matos, & Reis, 2009; Little et al., 2007) and classification scenario (one-third for training, one-third to choose *k*, number of nearest neighbors, through cross-validation; and one-third for testing). However, we increase the Monte Carlo divisions to 200 for statistical comparison. As a sanity check, Euclidean distance after normalizing the variance of the features is also used. We tested both the L-BFGS optimization and the minibatch with 4 sample batches, 10,000 batches, and a step size of 0.01. We did not rerun the sequential quadratic program-based feature weighting (Takeuchi & Sugiyama, 2011), but instead list the value they report for the mean error rate across 10 Monte Carlos divisions.

The results are displayed in Table 1. On these small-scale problems—maximum dimension is 57—none of the compared methods consistently outperforms the best. Considering the best of the two proposed optimization methods, centered alignment metric learning performs best on half of the data sets.

Data Set . | |C|
. | P
. | n
. | FW . | Euclidean . | CAML . | CAML . |
---|---|---|---|---|---|---|---|

Pen-Based Recognition | 10 | 16 | 10992 | 1.1 | 1.00.2 | N/A | 1.20.2 |

Breast Wisc. (Diag.) | 2 | 30 | 569 | 4.0 | 4.81.7 | 4.41.5 | 4.31.4 |

Page Blocks | 5 | 10 | 5473 | 4.6 | 4.10.5 | 4.60.5 | 4.30.5 |

Image Segmentation | 7 | 18 | 2310 | 5.2 | 6.21.0 | 3.30.8 | 4.60.8 |

Ionosphere | 2 | 33 | 351 | 12.2 | 16.33.9 | 10.74.9 | 13.73.5 |

Parkinsons | 2 | 22 | 195 | 10.2 | 12.14.3 | 13.64.7 | 11.53.9 |

Spambase | 2 | 57 | 4601 | 10.4 | 11.00.9 | 14.64.7 | 10.40.8 |

Waveform (ver. 1) | 3 | 21 | 5000 | 18.4 | 19.00.9 | 17.90.9 | 18.50.8 |

Connectionist (Sonar) | 2 | 60 | 208 | 22.1 | 20.85.4 | 27.54.9 | 22.45.4 |

Wine Quality | 7 | 11 | 6497 | 46.3 | 46.31.0 | 48.91.2 | 46.01.0 |

Data Set . | |C|
. | P
. | n
. | FW . | Euclidean . | CAML . | CAML . |
---|---|---|---|---|---|---|---|

Pen-Based Recognition | 10 | 16 | 10992 | 1.1 | 1.00.2 | N/A | 1.20.2 |

Breast Wisc. (Diag.) | 2 | 30 | 569 | 4.0 | 4.81.7 | 4.41.5 | 4.31.4 |

Page Blocks | 5 | 10 | 5473 | 4.6 | 4.10.5 | 4.60.5 | 4.30.5 |

Image Segmentation | 7 | 18 | 2310 | 5.2 | 6.21.0 | 3.30.8 | 4.60.8 |

Ionosphere | 2 | 33 | 351 | 12.2 | 16.33.9 | 10.74.9 | 13.73.5 |

Parkinsons | 2 | 22 | 195 | 10.2 | 12.14.3 | 13.64.7 | 11.53.9 |

Spambase | 2 | 57 | 4601 | 10.4 | 11.00.9 | 14.64.7 | 10.40.8 |

Waveform (ver. 1) | 3 | 21 | 5000 | 18.4 | 19.00.9 | 17.90.9 | 18.50.8 |

Connectionist (Sonar) | 2 | 60 | 208 | 22.1 | 20.85.4 | 27.54.9 | 22.45.4 |

Wine Quality | 7 | 11 | 6497 | 46.3 | 46.31.0 | 48.91.2 | 46.01.0 |

Notes: Columns denoted |*C*|, *P*, and *n* indicate the number of classes, features, and samples, respectively. The sequential quadratic program-based feature weighting (FW) (Takeuchi & Sugiyama, 2011) is compared to an unweighted normalized Euclidean distance (Euclid.) and centered alignment metric learning (CAML) to optimize a product kernel. The mini-batch approximation is indicated as (CAML). Numbers in bold indicate (the best-performing methods, which were not significantly different *p*>0.05 for either a one-sample *t*-test versus FW or a two-sample *t*-test.)

## 5. Data Collection

All animal procedures were approved by the SUNY Downstate Medical Center IACUC and conformed to National Institutes of Health guidelines. Cortical local field potentials and action potentials were recorded during natural tactile stimulation of forepaw digits and palm of four female Long-Evans rats under anesthesia.^{1} After induction using isoflurane, urethane was used to maintain anesthetic depth. A 32-channel microelectrode array (Blackrock Microsystems) was inserted into the hand region of the primary somatosensory cortex (S1). The array was arranged in a 6×6 grid (excluding the four corners) with 400 m spacing between neighboring electrodes. Another array was inserted into the VPL region of the thalamus, but the signals are not used here.

The right forepaw was touched using a motorized probe 225 times at up to 9 sites—4 digits and 5 sites on the palm. For each touch site, the probe was positioned 4 mm above the surface of the skin and momentarily pressed down for 150 ms (see Figure 1). This was repeated 25 times at random intervals. The 4 data sets had 3, 8, 9, and 9 touch sites resulting in 75, 200, 225, and 225 samples, respectively.

The LFPs were bandpass filtered with cutoffs (5 Hz, 300 Hz) and sampled at a rate of 1220.7 Hz. Then the LFPs were digitally filtered using a third-order Butterworth high-pass filter with cutoff of 4 Hz and notch filters at 60 Hz and harmonics. For analysis, the neural response in a 270 ms window following each touch onset was used, which corresponds to 330 discrete time samples. For 32 channels, this results in dimensions.

Across the four data sets, automatic spike sorting selected 95, 64, 64, and 38 multineuron units from the 32 channels. Of these, only 68, 62, 36, and 24 units were used, whose average firing rate was below 30 Hz in the 270 ms window following touch onset.

## 6. Results

We explored centered alignment metric learning (CAML) for both spike trains and local field potentials (LFPs) using the cases listed in section 2.3. For LFPs and binned spike trains, we compared with multiclass Fisher discriminant analysis (FDA) (Fukunaga, 1990) and large-margin nearest neighbor (LMNN) (Weinberger, Blitzer, & Saul, 2006; Weinberger & Saul, 2009). For the linear projections, PCA was used to reduce the dimensionality to half of the number of samples in the data set. The FDA solution is the set of eigenvectors corresponding to a generalized eigenvalue problem. The dimensions can be chosen as the maximum number of nonzero eigenvalues, which is one fewer than the number of classes (Fukunaga, 1990). The FDA solution was used as the initial projection for LMNN and CAML Mahalanobis metric. An efficient Matlab implementation of LMNN is publicly available, and besides the initialization (which greatly increased the performance), default parameters were used.

To compare classification performance, 20 Monte Carlo divisions of the data sets into training and testing sets were made. For training, two-thirds of the samples in each class were used; the remainder of the samples were used in testing. On each Monte Carlo run, the metrics were optimized on the training set. Testing set samples were labeled by either a one-nearest-neighbor (1-NN) or a support vector machine (SVM) classifier. SVM training and testing was performed using the (ver. 3.17) (Chang & Lin, 2011) implementation with the user-provided kernel matrix. The regularization parameter was chosen through five-fold cross-validation. For CAML, the kernel is directly optimized as part of the metric learning, but for FDA, LMNN, and the unweighted metrics, a gaussian kernel was used with the kernel size chosen from a discrete set using five-fold cross-validation.

The set of the highest-performing methods for each data set was found by selecting the best-performing method and finding those that were not significantly different using a two-sample Welch test with significance of 0.05.

### 6.1. Decoding Touch Location from Spike Trains.

Multiunit spike train metrics using the single-unit Victor-Purpura (VP) and kernel-based (mCI) metrics were optimized for touch location classification. For each unit, the distance is computed with different values for the temporal precision value *q* (higher values of *q* require more precise alignment): for the Victor-Purpura distance, the set (0.01, 0.1, 1.0) s^{−1} was used, and for the spike train kernel-based metrics (mCI), the set (10^{−9}, 0.01, 0.1, 1, 10, 100) s^{−1} was used. For the Victor-Purpura distance, the version (Dubbs et al., 2010) was used.^{2} The classification rates for the weighted spike train metrics are in Table 2. With the CAML-optimized product kernel, the average classification rate increased by at least 8 percentage points for both metrics and both classifiers. For the sum kernel with *Q*=5, the accuracy was further increased.

VP . | mCI . | CAML VP . | CAML mCI . | ||||||
---|---|---|---|---|---|---|---|---|---|

Unweighted . | Unweighted . | . | . | . | . | . | . | ||

1-NN . | SVM . | 1-NN . | SVM . | 1-NN . | SVM . | SVM . | 1-NN . | SVM . | SVM . |

539 | 699 | 609 | 598 | 866 | 875 | 859 | 854 | 906 | 925 |

354 | 805 | 705 | 785 | 775 | 875 | 914 | 784 | 875 | 893 |

285 | 504 | 434 | 506 | 446 | 534 | 595 | 485 | 586 | 614 |

224 | 286 | 254 | 275 | 225 | 385 | 385 | 224 | 299 | 344 |

34.6 | 56.9 | 49.2 | 53.7 | 57.2 | 66.1 | 68.3 | 58.5 | 66.0 | 69.0 |

VP . | mCI . | CAML VP . | CAML mCI . | ||||||
---|---|---|---|---|---|---|---|---|---|

Unweighted . | Unweighted . | . | . | . | . | . | . | ||

1-NN . | SVM . | 1-NN . | SVM . | 1-NN . | SVM . | SVM . | 1-NN . | SVM . | SVM . |

539 | 699 | 609 | 598 | 866 | 875 | 859 | 854 | 906 | 925 |

354 | 805 | 705 | 785 | 775 | 875 | 914 | 784 | 875 | 893 |

285 | 504 | 434 | 506 | 446 | 534 | 595 | 485 | 586 | 614 |

224 | 286 | 254 | 275 | 225 | 385 | 385 | 224 | 299 | 344 |

34.6 | 56.9 | 49.2 | 53.7 | 57.2 | 66.1 | 68.3 | 58.5 | 66.0 | 69.0 |

Notes: Victor-Purpura (VP) or kernel-based (mCI) metrics were used in unweighted combinations versus using centered alignment metric learning (CAML) to optimize a product kernel () or sum of weighted product kernels () with *Q*=5. Nearest-neighbor (1-NN) and support vector machine (SVM) were used as classifiers. Numbers in bold indicate the methods with the highest accuracy with or without binning see Table 3.

For binned spike trains, a Mahalanobis metric was optimized using FDA, CAML, and LMNN. The results across a range of different bin sizes are shown in Figure 2. On three data sets, the best binned metrics performed worse than the optimized spike train metrics. For each data set and method, the performance using the bin size with the highest average accuracy is shown in Table 3. On three of the data sets, the Mahalanobis metric optimized with CAML tied or outperformed the FDA solution, and on all data sets, using LMNN decreased performance.

Euclidean . | A-FDA
. | A-CA
. | A-LMNN
. | ||||
---|---|---|---|---|---|---|---|

1-NN . | SVM . | 1-NN . | SVM . | 1-NN . | SVM . | 1-NN . | SVM . |

5810 | 739 | 918 | 9214 | 929 | 9210 | 9112 | 8914 |

666 | 767 | 808 | 809 | 826 | 827 | 778 | 7810 |

425 | 468 | 476 | 516 | 516 | 517 | 467 | 4810 |

245 | 286 | 306 | 347 | 296 | 296 | 296 | 316 |

47.4 | 55.6 | 61.9 | 64.2 | 63.5 | 63.5 | 60.7 | 61.4 |

Euclidean . | A-FDA
. | A-CA
. | A-LMNN
. | ||||
---|---|---|---|---|---|---|---|

1-NN . | SVM . | 1-NN . | SVM . | 1-NN . | SVM . | 1-NN . | SVM . |

5810 | 739 | 918 | 9214 | 929 | 9210 | 9112 | 8914 |

666 | 767 | 808 | 809 | 826 | 827 | 778 | 7810 |

425 | 468 | 476 | 516 | 516 | 517 | 467 | 4810 |

245 | 286 | 306 | 347 | 296 | 296 | 296 | 316 |

47.4 | 55.6 | 61.9 | 64.2 | 63.5 | 63.5 | 60.7 | 61.4 |

Notes: Mahalanobis-based metrics (parameterized by matrix *A*) were optimized using Fisher discriminant analysis (FDA), centered alignment (CA), and large margin nearest neighbor (LMNN). For each data set and method, the bin size with the maximum performance was selected. Numbers in bold indicate the highest-performing methods for each data set with or without binning. See Table 2.

We used classical multidimensional scaling (MDS) (Torgerson, 1952) and *t*-distributed stochastic neighborhood embedding (t-SNE) (van der Maaten & Hinton, 2008) to find a two-dimensional embedding of the distance matrices before and after training the metric. The embedding is formed without knowledge of the class labels. From Figure 3, it is clear that metric learning with the product kernel increases distances among the different classes while decreasing the distances among samples within the same class. In Figure 4A, we show how the optimized spike train metric can be used to identify both the temporal precision and spiking units most useful for the decoding task.

### 6.2. Decoding Touch Location from LFPs.

The classification rates for learning spatiotemporal metrics on LFP are tabulated in Table 4. Using CAML to optimize just a single temporal weighting improves the accuracy by 21 and 14.7 percentage points for 1-NN and SVM, respectively. Using a sum kernel composed of *Q*=5 product kernels further increased performance by 2.2 and 4 percentage points. The optimized weights for a single product kernel are shown in Figure 4B. Overall, using FDA to optimize a linear projection was able to achieve the highest classification rates with average improvement over Euclidean distance by 33.5 and 23.4 percentage points for 1-NN and SVM.

Euclidean . | -CAML . | -CAML . | A-CA
. | A-FDA
. | A-LMNN
. |
---|---|---|---|---|---|

1-NN | |||||

854.6 | 923.6 | 944.0 | 982.4 | 981.9 | 971.7 |

604.5 | 784.4 | 824.4 | 952.6 | 971.7 | 944.4 |

453.5 | 745.0 | 784.4 | 915.0 | 913.5 | 884.4 |

494.1 | 784.3 | 785.2 | 853.4 | 864.0 | 805.9 |

59.6 | 80.6 | 82.8 | 92.4 | 93.1 | 89.7 |

SVM | |||||

894.4 | 943.7 | 972.2 | 982.0 | 982.0 | 972.4 |

755.0 | 843.0 | 892.4 | 952.6 | 971.9 | 945.0 |

615.4 | 794.9 | 834.0 | 914.7 | 923.5 | 872.9 |

545.1 | 814.2 | 853.8 | 863.8 | 864.5 | 805.9 |

69.8 | 84.5 | 88.5 | 92.4 | 93.2 | 89.3 |

Euclidean . | -CAML . | -CAML . | A-CA
. | A-FDA
. | A-LMNN
. |
---|---|---|---|---|---|

1-NN | |||||

854.6 | 923.6 | 944.0 | 982.4 | 981.9 | 971.7 |

604.5 | 784.4 | 824.4 | 952.6 | 971.7 | 944.4 |

453.5 | 745.0 | 784.4 | 915.0 | 913.5 | 884.4 |

494.1 | 784.3 | 785.2 | 853.4 | 864.0 | 805.9 |

59.6 | 80.6 | 82.8 | 92.4 | 93.1 | 89.7 |

SVM | |||||

894.4 | 943.7 | 972.2 | 982.0 | 982.0 | 972.4 |

755.0 | 843.0 | 892.4 | 952.6 | 971.9 | 945.0 |

615.4 | 794.9 | 834.0 | 914.7 | 923.5 | 872.9 |

545.1 | 814.2 | 853.8 | 863.8 | 864.5 | 805.9 |

69.8 | 84.5 | 88.5 | 92.4 | 93.2 | 89.3 |

Notes: Centered alignment metric learning (CAML) was used to optimize a single temporal weight vector or a sum kernel with multiple temporal weightings versus Mahalanobis-based metrics (parameterized by a matrix *A*) optimized using Fisher discriminant analysis (FDA), centered alignment (CA), and large margin nearest neighbor (LMNN). Numbers in bold indicate the highest-performing methods for each data set.

Finally, a multiscale metric was learned as the weighted combination of the optimized spike distance and optimized LFP distance. On these data sets, the combination of spiking and LFPs did not increase the classification rate versus using only LFPs, and the weight assigned to the spiking metric was insignificant compared to the weight assigned to the LFP metric. The average classification accuracy across the data sets was slightly lower than using just the LFP metric.

## 7. Discussion

### 7.1. Metric Learning for Neural Decoding.

From the results, it is clear that metric learning achieves three goals: increases the decoding accuracy, identifies important dimensions of the neural response, and improves the ability of manifold learning techniques to visualize the data in a low-dimensional space.

For spike trains, the average performance of the optimized multiunit spike train metrics exceeded those based on binning. To our knowledge, this is the first work on optimizing a multineuron metric that is nonparametric and does not require binning. In the framework of the kernel-based dependence measure, this optimization explicitly optimizes the contribution of each dimension using tensor product kernels for multineuron spike trains.

On all data sets, the performance from using the unweighted multineuron spike train metrics was lower than using the optimized Mahalanobis metrics on the binned representation. In essence, a simple linear projection of binned spike trains performs better than binless metrics that are not optimized. The precision offered by binless methods is realized only after optimization. This highlights the importance of metric learning versus naively combining the single-unit metrics.

FDA achieved the best performance for this static decoding task using the time-locked evoked LFPs. FDA is well suited for this setting since the class-conditional LFP responses are approximately normally distributed, an underlying assumption for FDA. In addition, the FDA solution is also the fastest solution; consequently, FDA should always be the baseline for discrete decoding of sensory-evoked LFPs. Alternatively, for binned spikes, using CAML to further optimize the FDA projection marginally increased the classification performance. Overall, FDA and CAML outperformed LMNN in optimizing a Mahalanobis metric.

One drawback of the Mahalanobis metric is the ability to analyze the projection matrices themselves; that is, it is difficult to match and compare linear projections learned across multiple subjects or tasks, especially for high-rank projections. In this case, using a weighted metric, which has lower accuracy but far fewer parameters, is more easily interpretable. From Figure 4B, it is clear that the weighted metrics can be used to identify dimensions—in this case, time lags—that are useful for discrimination. In addition, it appears that the optimization leads to a very sparse set of weights.

In terms of neural decoding, we compared the classification rate, as a proxy for the information content, of the neural responses. We have also highlighted how changing the underlying metric of the neural response space can improve the visualization results from unsupervised manifold learning algorithms. Indeed from Figure 3, a user can immediately judge which classes are more similar or indistinguishable. The nonlinear embeddings preserve some features of the stimulus space's topology, for example, the separation between digit responses and palm responses in data set 3 and the preservation of the relative arrangement of the three touch sites in data set 2.

### 7.2. Metric Learning for Neural Encoding.

We have concentrated on the problem of neural decoding, but the proposed algorithms are also applicable to the neural encoding problem, wherein the roles of the stimulus and neural response are reversed. More specifically, for neural encoding, the metric on the neural response is fixed, the neural activity (e.g., the spiking of a single neuron) is treated as the target or label variable, and a metric on the stimulus is adjusted. For instance, if the neuron is assumed to be a simple cell with a linear receptive field, then learning the receptive field is equivalent to learning a Mahalanobis distance on the stimulus space.

The ideas that have been developed for metric-learning and supervised dimensionality reduction in the machine learning community are fundamentally similar to the algorithms for inferring the linear receptive fields of neurons in the computational neuroscience community, but the nomenclature and domain has differentiated them. Recently researchers have begun to bridge this gap using kernel-based measures of dependence (Sinz, Stockl, Grewe, & Benda, 2013). To highlight this connection, we replicated the experiments by Sharpee, Rust, and Bialek (2004), but instead of using the maximally informative directions algorithm, we used centered alignment metric learning, with the minibatch optimization. To save space, the experimental detail and results are posted online.^{3}

Interestingly, most neural encoding models have concentrated on linear projections corresponding to Mahalanobis-based distances, whereas recent work has shown that the stimulus metric corresponding to a neural population can be highly non-Euclidean (Tkačik, Granot-Atedgi, Segev, & Schneidman, 2013). Thus, future work can investigate how non-Euclidean metrics can be learned. Additionally, the joint optimization of the metrics on both the neural response and the stimulus is worth investigating in future work.

### 7.3. Kernel-Based Metric Learning.

In the kernel-learning framework, we have proposed using a weighted product kernel, where adjusting the weights changes the underlying kernel and associated Hilbert space. This leads to nonconvex optimization, which we solve using first-order methods. In addition, the sum of weighted product kernels uses a different set of weights for each product kernel and achieves increased performance. This formulation is distinct from the multiple kernel learning (Lanckriet et al., 2004; Cortes et al., 2012; Yamada et al., 2013), where there is an explicit weight associated with each kernel in the sum and each summand kernel is chosen a priori (i.e., in the case of gaussian kernel, the kernel size is not optimized and must be preselected). The main benefit of the multiple kernel learning is that a convex optimization problem can be posed to optimize the weights. Alternatively, the weighted product kernels and sum of weighted product kernels constitute a much richer family of kernel functions than the weighted sum of kernels. We need only to select the number of summand kernels; fully exploring how to choose this number is left for future work.

Linear projections and weighted metrics are two special cases of metric learning that have received the most study. Indeed, the weighted combination of metrics was used in some of the earliest work on metric learning (Lowe, 1995). We have gone beyond this by using a sum of weighted product kernels, which computes distances in the Hilbert space that correspond to nonlinear transformations of the data samples. The sum of weighed product kernels still has interpretable parameters, quite unlike kernelized projections (Baudat & Anouar, 2000), where the transformations are defined only in terms of the samples instead of the original dimensions.

## 8. Conclusion

We have covered a class of kernels applicable to metric learning and have proposed using a kernel-based dependence measure to train linear projections, weighted combinations of metrics in product kernels, and nonlinear combinations of metrics using the sum of weighted product kernels. These metrics were optimized on a neural data set consisting of both spike trains and local field potentials, for which metric learning improves both nearest-neighbor and SVM classification accuracy over unweighted alternatives. Within the proposed framework, the optimized multiunit spike train metrics, which avoid binning, outperform both unweighted multiunit metrics and metrics optimized for the binned spike trains. In addition, metric learning improves the quality of the visualizations obtained by nonlinear dimensionality reduction; this is useful for analyzing the relationship between high-dimensional neural data and target variables. The optimized weights themselves indicate the relative relevance of different dimensions of the neural response. This can be used to explore how the weights of specific channels or neurons change for different tasks. Overall, optimizing metrics is a worthwhile approach for investigating neural representations.

## Acknowledgments

This work was supported in part by DARPA contract N66001-10-C-2008. We thank Memming Park, Luis Giraldo Sanchez, and Sohan Seth for their helpful discussions and insight on topics that lead to this work. We thank the reviewers for suggestions that have improved this work.

## References

## Notes

^{1}

A subset of these data has been analyzed previously using other methods (Brockmeier et al., 2013).

^{2}

Experiments with the original version (Victor & Purpura, 1996) with a Laplacian kernel were also performed, but there was no significant difference.