Abstract
Cluster analysis faces two problems in high dimensions: the “curse of dimensionality,” which can lead to overfitting and poor generalization performance, and the sheer time taken for conventional algorithms to process large amounts of high-dimensional data. We describe a solution to these problems, designed for the application of spike sorting for next-generation, high-channel-count neural probes. In this problem, only a small subset of features provides information about the cluster membership of any one data vector, but this informative feature subset is not the same for all data points, rendering classical feature selection ineffective. We introduce a “masked EM” algorithm that allows accurate and time-efficient clustering of up to millions of points in thousands of dimensions. We demonstrate its applicability to synthetic data and to real-world high-channel-count spike sorting data.
1 Introduction
Cluster analysis is a widely used technique for unsupervised classification of data. A
popular method for clustering is fitting a mixture of gaussians, often achieved using the
expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) and variants thereof (Fraley & Raftery, 2002). In high dimensions, however, this method faces two
challenges (Bouveyron & Brunet-Saumard, 2012).
First is the “curse of dimensionality,” which leads to poor classification, particularly in
the presence of a large number of uninformative features; second, for large and
high-dimensional data sets, the computational cost of many algorithms can be impractical.
This is particularly the case where covariance matrices must be estimated, leading to
computations of order $O(p^2)$ or higher, where p is the number of
features; furthermore, even a cost of order $O(p)$ can render a
clustering method impractical for applications in which large high-dimensional data sets
must be analyzed daily. In many cases, however, the dimensionality problem is solvable, at
least in principle, as the features sufficient for classification of any particular data
point are a small subset of the total available.
A number of approaches have been suggested for the problem of high-dimensional cluster
analysis. One approach consists of modifying the generative model underlying the mixture of
gaussians fit to enforce low-dimensional models. For example, the mixture of factor analyzers
(Ghahramani & Hinton, 1996; McLachlan, Peel,
& Bean, 2003) models the covariance matrix of
each cluster as a low-rank matrix added to a fixed diagonal matrix forming an approximate
model of observation noise. This approach can reduce the number of parameters for each
cluster from $O(p^2)$ to $O(p)$ and may thus
provide a substantial improvement in both computational cost and performance. An alternative
approach, which offers the opportunity to reduce both the number of parameters and
computational cost substantially below $O(p)$, is feature
selection, whereby a small subset of informative features is selected and other
noninformative features are discarded (Raftery & Dean, 2006). A limitation of global feature selection methods, however, is that they
cannot deal with the case where different data points are defined by different sets of
features. One proposed solution to this consists of assigning each cluster a unique
distribution of weights over all features, which has been applied to the case of
hierarchical clustering (Friedman & Meulman, 2004).
The algorithm described below was developed to solve the problems of high-dimensional cluster analysis for a particular application: spike sorting of neurophysiological recordings using newly developed high-count silicon microelectrodes (Einevoll, Franke, Hagen, Pouzat, & Harris, 2012). Spike sorting is the problem of identifying the firing times of neurons from electric field signatures recorded using multisite microfabricated neural electrodes (Lewicki, 1998). In low-noise systems, such as retinal explants ex vivo, it has been possible to decompose the raw recorded signal into a sum of waveforms, each corresponding to a single action potential (Pillow, Shlens, Chichilnisky, & Simoncelli, 2013; Marre et al., 2012; Prentice et al., 2011). For recordings in the living brain, noise levels are considerably higher, and an approach based on cluster analysis is more often taken. In a typical experiment, this will involve clustering millions of data points, each of which reflects a single action potential waveform that could have been produced by one of many neurons. Historically, neural probes used in vivo have had only a small number of channels (usually four), typically resulting in feature vectors of 12 dimensions, which required sorting into 10 to 15 clusters. Analysis of “ground truth” shows that the data are quite well approximated by a mixture of gaussians with different covariance matrices between clusters (Harris, Henze, Csicsvari, Hirase, & Buzsáki, 2000). Accordingly, in this low-dimensional case, traditional EM-derived algorithms work close to optimally, although specialized rapid implementation software is required to cluster the millions of spikes recorded on a daily basis (Harris, Kadir, & Goodman, 2000–2013). More recent neural probes, however, contain tens to hundreds of channels, spread over large spatial volumes, and probes with thousands of channels are under development. Because different neurons have different spatial locations relative to the electrode array, each action potential is detected on only a small fraction of the total number of channels, but the subset differs between neurons, ruling out a simple global feature selection approach. Furthermore, because spikes produced by simultaneous firing of neurons at different locations must be clustered independently, most features for any one data point are not simply noise, but must be regarded as missing data. Finally, due to the large volumes of data produced by these methods, we require a solution that is capable of clustering millions of data points in reasonably short running time. Although numerous extensions and alternatives to the simple cluster sorting method have been proposed (Takahashi, Anzai, & Sakurai, 2003; Quian Quiroga, Nadasdy, & Ben-Shaul, 2004; Wood & Black, 2008; Sahani, 1999; Lewicki, 1998; Gasthaus, Wood, Gorur, & Teh, 2008; Calabrese & Paninski, 2011; Ekanadham, Tranchina, & Simoncelli, 2013; Shoham, Fellows, & Normann, 2003; Franke, Natora, Boucsein, Munk, & Obermayer, 2010; Carin et al., 2013), none has been shown to solve the problems created by high-count electrode arrays.
Here we introduce a “masked EM” algorithm to solve the problem of high-dimensional cluster
analysis, as found in the spike-sorting context. The algorithm works in two stages. In the
first stage, a “mask vector” is computed for each data point via a heuristic algorithm,
encoding a weighting of each feature for every data point. This stage may take advantage of
domain-specific knowledge, such as the topological constraint that action potentials occupy
a spatially contiguous set of recording channels. In the case that the majority of masks can
be set to zero, both the number of parameters per cluster and running time can be
substantially below $O(p^2)$. We note that the masks are assigned to data
points rather than clusters and need be computed only once at the start of the algorithm.
The second stage consists of cluster analysis. This is implemented using a
mixture-of-gaussians EM algorithm, but with every data point replaced by a virtual mixture
of the original feature value and the fixed subthreshold noise distribution weighted by the
masks. The use of this virtual mixture distribution avoids the splitting of clusters due to
arbitrary threshold crossings. At no point is it required to generate samples from the
virtual distribution, as expectations over it can be computed analytically.
2 The Masked EM Algorithm
The mathematical notation used in this article can be found in Table 1.
Quantity | Notation
Dimensions (number of features) | $p$
Data | $\mathbf{x}_n = (x_{n,1}, \ldots, x_{n,p})$, $n = 1, \ldots, N$
Masks | $\mathbf{m}_n = (m_{n,1}, \ldots, m_{n,p})$, $m_{n,i} \in [0, 1]$
Cluster label | $k$
Total number of clusters | $K$
Mixture weight, cluster mean, covariance | $w_k$, $\boldsymbol{\mu}_k$, $\Sigma_k$
Probability density function of the multivariate gaussian distribution | $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)$
Total number of data points | $N$
Number of points for which feature i is masked | $N_i$
Noise mean for feature i | $\nu_i$
Noise variance for feature i | $\sigma_i^2$
Virtual features (random variable) | $\tilde{x}_{n,i}$
Mean of virtual feature | $y_{n,i} = \mathbb{E}[\tilde{x}_{n,i}]$
Second moment of virtual feature | $z_{n,i} = \mathbb{E}[\tilde{x}_{n,i}^2]$
Variance of virtual feature | $\eta_{n,i} = z_{n,i} - y_{n,i}^2$
Log likelihood of point n under cluster k | $\pi_{n,k}$
Set of data points assigned to cluster k | $C_k$
Subset of $C_k$ for which feature i is fully masked | $M_{k,i}$
2.1 Stage 1: Mask Generation
The first stage of the algorithm consists of computing a set of mask vectors indicating
which features should be used to classify which data points. Specifically, the outcome of
this algorithm is a set of vectors $\mathbf{m}_n$, one per data point, each with p components $m_{n,i} \in [0, 1]$. A value of 1 indicates that feature i is to be used in classifying data point n, a value of 0 indicating it is to be ignored,
and intermediate values corresponding to partial weighting. We refer to features being
used for classification as unmasked and features being ignored as masked (i.e.,
concealed). Masked features are not simply set to zero, but are ignored by replacing them
with a virtual ensemble of points, drawn from a distribution of subthreshold noise.
$$m_{n,i} = \begin{cases} 0 & x_{n,i} < \theta_{\mathrm{low}}, \\ \dfrac{x_{n,i} - \theta_{\mathrm{low}}}{\theta_{\mathrm{high}} - \theta_{\mathrm{low}}} & \theta_{\mathrm{low}} \le x_{n,i} \le \theta_{\mathrm{high}}, \\ 1 & x_{n,i} > \theta_{\mathrm{high}}, \end{cases} \qquad (2.1)$$

where $\theta_{\mathrm{low}}$ and $\theta_{\mathrm{high}}$ are a lower and an upper threshold applied to the feature values.
This strategy smoothly interpolates between a mask of zero for features below a lower threshold and a mask of 1 for features above a higher threshold; such smooth interpolation avoids the artificial creation of discrete clusters when variables cross a fixed boundary. In the case of spike sorting, a slightly more complex procedure is used to derive the masks, which takes advantage of the topological constraint that spikes must be distributed across continuous groups of recording channels (see section 3.2). In practice, we have found that good performance can be obtained for a range of masking parameters, provided the majority of noninformative features have a mask of 0 and that features that are clearly suprathreshold are given a mask of 1 (see section 3.1).
The noise distribution used to replace masked features is estimated empirically from the subthreshold data. For each feature i, the noise mean and variance are computed from the data points for which that feature is masked:

$$\nu_i = \frac{1}{N_i} \sum_{n : m_{n,i} = 0} x_{n,i}, \qquad \sigma_i^2 = \frac{1}{N_i} \sum_{n : m_{n,i} = 0} \left( x_{n,i} - \nu_i \right)^2,$$

where $N_i$ is the number of data points for which feature i is masked.
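To make the two-threshold weighting concrete, the sketch below computes masks of the piecewise-linear form shown in equation 2.1 together with per-feature noise statistics estimated from the masked entries. Function and parameter names are illustrative choices of ours; the released KlustaKwik implementation may differ.

```python
import numpy as np

def compute_masks(x, theta_low, theta_high):
    """Double-threshold mask in the spirit of equation 2.1.

    x : (N, p) array of feature values.
    Returns masks in [0, 1]: 0 below theta_low, 1 above theta_high,
    and a linear interpolation in between.
    """
    m = (x - theta_low) / (theta_high - theta_low)
    return np.clip(m, 0.0, 1.0)

def noise_statistics(x, m):
    """Per-feature mean and variance of the subthreshold (masked) data.

    Features that are never masked fall back to the overall mean and
    variance, an arbitrary choice made only for this sketch.
    """
    nu = np.empty(x.shape[1])
    sigma2 = np.empty(x.shape[1])
    for i in range(x.shape[1]):
        sub = x[m[:, i] == 0, i]
        if sub.size == 0:
            sub = x[:, i]
        nu[i] = sub.mean()
        sigma2[i] = sub.var()
    return nu, sigma2
```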
2.2 Stage 2: Clustering
To cluster the data, we fit a mixture of gaussians, but with each masked feature of each data point replaced by a virtual distribution rather than a single value. Specifically, for each data point n and feature i we define a virtual feature $\tilde{x}_{n,i}$ that takes the observed value $x_{n,i}$ with probability $m_{n,i}$ and is otherwise drawn from the subthreshold noise distribution for that feature, with mean $\nu_i$ and variance $\sigma_i^2$. Expectations over this virtual distribution have a simple closed form:

$$y_{n,i} = \mathbb{E}[\tilde{x}_{n,i}] = m_{n,i} x_{n,i} + (1 - m_{n,i}) \nu_i,$$
$$z_{n,i} = \mathbb{E}[\tilde{x}_{n,i}^2] = m_{n,i} x_{n,i}^2 + (1 - m_{n,i}) (\nu_i^2 + \sigma_i^2),$$
$$\eta_{n,i} = \mathrm{Var}[\tilde{x}_{n,i}] = z_{n,i} - y_{n,i}^2.$$

The quantities required by the EM updates below are obtained by taking expectations over these virtual distributions, treated as independent across features.
The masked EM algorithm therefore acts as if it were passed an ensemble of points, with each data point replaced by an infinite sample, corresponding to different possibilities for the noisy masked variables. This solves the curse of dimensionality by “disenfranchising” each data point’s masked features, disregarding the value that was actually measured and replacing it by a virtual ensemble that is the same in all cases and thus does not contribute to cluster assignment.
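The virtual-feature moments above reduce to a few array operations. A minimal sketch, continuing the helpers introduced earlier (variable names are ours):

```python
import numpy as np

def virtual_moments(x, m, nu, sigma2):
    """Mean and variance of the virtual features.

    Each virtual feature equals the observed value with probability m and
    is otherwise drawn from the noise distribution with mean nu and
    variance sigma2.
    """
    y = m * x + (1.0 - m) * nu                     # E[x~]
    z = m * x**2 + (1.0 - m) * (nu**2 + sigma2)    # E[x~^2]
    eta = z - y**2                                 # Var[x~]
    return y, eta
```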
2.3 M-Step
We use a hard EM (classification EM) procedure. Let $C_k$ denote the set of data points currently assigned to cluster k. In the M-step, the mixture weight, mean, and covariance of each cluster are re-estimated from the virtual-feature moments:

$$w_k = \frac{|C_k|}{N},$$
$$\mu_{k,i} = \frac{1}{|C_k|} \sum_{n \in C_k} y_{n,i},$$
$$\Sigma_{k,ij} = \frac{1}{|C_k|} \sum_{n \in C_k} \left[ (y_{n,i} - \mu_{k,i})(y_{n,j} - \mu_{k,j}) + \delta_{ij} \, \eta_{n,i} \right],$$

where $\delta_{ij}$ is the Kronecker delta. Because $y_{n,i} = \nu_i$ and $\eta_{n,i} = \sigma_i^2$ whenever $m_{n,i} = 0$, the sums over $M_{k,i}$ reduce to closed-form expressions involving only $|M_{k,i}|$, $\nu_i$, and $\sigma_i^2$,
where $M_{k,i}$ denotes the set of points within cluster k for which feature i is fully masked. Note that if all data points in a cluster have feature i masked, then $\mu_{k,i} = \nu_i$, the noise mean, and $\Sigma_{k,ii} = \sigma_i^2$, the noise variance.
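A direct, unoptimized sketch of this M-step, operating on the virtual means and variances and ignoring the bookkeeping over fully masked subsets (it assumes no cluster becomes empty):

```python
import numpy as np

def m_step(y, eta, labels, K):
    """Re-estimate weights, means, and covariances from virtual moments."""
    N, p = y.shape
    w = np.zeros(K)
    mu = np.zeros((K, p))
    Sigma = np.zeros((K, p, p))
    for k in range(K):
        idx = np.flatnonzero(labels == k)   # points currently in cluster k
        w[k] = len(idx) / N
        mu[k] = y[idx].mean(axis=0)
        d = y[idx] - mu[k]
        # Covariance of the virtual means plus a diagonal term from the
        # virtual variances: masked features contribute the noise variance.
        Sigma[k] = d.T @ d / len(idx) + np.diag(eta[idx].mean(axis=0))
    return w, mu, Sigma
```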
2.4 E-Step
In the E-step, we compute for each data point n and each cluster k the expected log likelihood under that point's virtual distribution,

$$\pi_{n,k} = \log w_k - \frac{1}{2} \log \left| 2 \pi \Sigma_k \right| - \frac{1}{2} \left[ (\mathbf{y}_n - \boldsymbol{\mu}_k)^\top \Sigma_k^{-1} (\mathbf{y}_n - \boldsymbol{\mu}_k) + \sum_{i=1}^{p} \eta_{n,i} \left( \Sigma_k^{-1} \right)_{ii} \right],$$

where the final sum is the correction arising from the variance of the virtual features. Each point is then assigned to the cluster for which $\pi_{n,k}$ is largest.
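A corresponding E-step sketch, assigning each point to the cluster with the largest expected log likelihood; the diagonal correction term carries the virtual variances. This is again a sketch of the idea, not the optimized C++ implementation:

```python
import numpy as np

def e_step(y, eta, w, mu, Sigma):
    """Expected log likelihood of each point under each cluster."""
    N, p = y.shape
    K = len(w)
    loglik = np.empty((N, K))
    for k in range(K):
        Sinv = np.linalg.inv(Sigma[k])
        _, logdet = np.linalg.slogdet(Sigma[k])
        d = y - mu[k]
        maha = np.einsum('ni,ij,nj->n', d, Sinv, d)   # quadratic form in virtual means
        corr = eta @ np.diag(Sinv)                    # correction from virtual variances
        loglik[:, k] = (np.log(w[k])
                        - 0.5 * (p * np.log(2 * np.pi) + logdet + maha + corr))
    return loglik.argmax(axis=1), loglik
```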
2.5 Penalties
To allow the number of clusters to be determined from the data, the algorithm maximizes a penalized likelihood, in which a penalty that depends on the number of free parameters (as in the AIC or BIC criteria) is subtracted from the log likelihood.
For the classical mixture-of-gaussians fit, the number of parameters is equal to $K \left( \frac{p(p+3)}{2} + 1 \right) - 1$, where K is the number of clusters and p is the number of features. The terms inside the parentheses correspond to the number of free parameters in a $p \times p$ symmetric covariance matrix ($p(p+1)/2$), a p-dimensional mean vector ($p$), and a single weight for each cluster ($1$). Finally, 1 is subtracted from the total because of the constraint that the weights must sum to 1 for a mixture model.
For masked data, the estimation of the number of parameters in the model is more subtle.
Because masked features are replaced by a fixed distribution that does not vary between
data points, the effective degrees of freedom per cluster are much smaller than $\frac{p(p+3)}{2} + 1$. We therefore define a cluster penalty for each cluster that depends only on the average number of unmasked features corresponding to that cluster. Specifically, let $r_n = \sum_{i=1}^{p} m_{n,i}$ be the effective number of unmasked features for data point n (i.e., the sum of the mask weights over the features). Define the effective number of parameters for a cluster as

$$\frac{r(r+1)}{2} + r + 1,$$

where r is the mean of $r_n$ over the data points assigned to that cluster, and where the three terms correspond to the number of free parameters in an $r \times r$ covariance matrix, a mean vector of length r, and a mixture weight, respectively.
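A sketch of how such a penalty could be accumulated over clusters. The scaling between AIC- and BIC-like behavior is controlled by penalty_per_param (a name of our choosing), and the returned value is intended to be subtracted from the total log likelihood:

```python
import numpy as np

def masked_penalty(masks, labels, K, penalty_per_param=None):
    """Cluster penalty based on the average unmasked dimensionality."""
    N = len(labels)
    if penalty_per_param is None:
        penalty_per_param = 0.5 * np.log(N)   # BIC-style; use 1.0 for AIC-style
    r_n = masks.sum(axis=1)                   # effective unmasked features per point
    total = 0.0
    for k in range(K):
        # Assumes every cluster contains at least one point.
        r = r_n[labels == k].mean()
        n_params = r * (r + 1) / 2 + r + 1    # covariance, mean, mixture weight
        total += penalty_per_param * n_params
    return total
```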
2.6 Implementation
The algorithm was implemented in custom C++ code, based on previously released open-source software for fast mixture-of-gaussians fitting termed KlustaKwik (Harris et al., 2000–2013). Because we require the algorithm to run in reasonable time on large numbers of high-dimensional data points, several approximations are made to give faster running times without significantly affecting performance. These include not only hard classification but also a heuristic that eliminates the great majority of E-step iterations, a split-and-merge feature that changes the number of clusters dynamically if this increases the penalized likelihood, and an additional uniformly distributed mixture component to catch outliers. The software can be downloaded from https://github.com/klusta-team/klustakwik (Harris, Kadir, & Goodman, 2013).
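For orientation, the sketches above can be combined into a minimal hard-EM loop. It deliberately omits the speed heuristics, split-and-merge moves, outlier component, and empty-cluster handling described in this section:

```python
import numpy as np

def masked_em(x, masks, K, n_iter=100, seed=0):
    """Minimal hard masked-EM loop built from the sketches above."""
    rng = np.random.default_rng(seed)
    nu, sigma2 = noise_statistics(x, masks)
    y, eta = virtual_moments(x, masks, nu, sigma2)
    labels = rng.integers(0, K, size=len(x))      # random initial assignment
    for _ in range(n_iter):
        w, mu, Sigma = m_step(y, eta, labels, K)
        new_labels, _ = e_step(y, eta, w, mu, Sigma)
        if np.array_equal(new_labels, labels):    # assignments stable: converged
            break
        labels = new_labels
    return labels
```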
3 Evaluation
3.1 Mixture of Gaussians
We first demonstrate the efficacy of the masked EM algorithm using a simple data set synthesized from a high-dimensional mixture of gaussians. The data set consisted of 20,000 points in 1000 dimensions, drawn from seven separate clusters. The means were chosen by centering probability density functions of gamma distributions on certain chosen features. All covariance matrices were identical: a Toeplitz matrix with all the diagonal entries 1 and off-diagonal entries that decayed exponentially with distance from the diagonal. Figure 1A shows this data set in pseudocolor format.
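A toy generator in the spirit of this data set is sketched below; the gamma shape, scale, amplitude, and decay values are illustrative choices of ours, not the parameters used for the results reported here.

```python
import numpy as np
from scipy.stats import gamma

def make_synthetic(n_points=20000, p=1000, n_clusters=7, decay=0.5, seed=0):
    """Cluster means shaped like gamma pdfs centred on chosen features,
    shared Toeplitz covariance with exponentially decaying off-diagonals."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = decay ** np.abs(np.subtract.outer(idx, idx))   # Toeplitz, ones on diagonal
    L = np.linalg.cholesky(cov)
    centres = rng.choice(np.arange(50, p - 50), size=n_clusters, replace=False)
    means = np.stack([50.0 * gamma.pdf(idx, a=3, loc=c - 10, scale=5) for c in centres])
    labels = rng.integers(0, n_clusters, size=n_points)
    x = means[labels] + rng.standard_normal((n_points, p)) @ L.T
    return x, labels
```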
Figure 1: Simulated data. (A) Subsampled raw data. (B) Confusion matrix in percentages for masked EM with the masking thresholds of equation 2.1 and a BIC penalty (equivalent to 10 × AIC for 20,000 points). (C) Confusion matrix in percentages for classical EM with an AIC penalty. (D) Confusion matrix in percentages for classical EM with a penalty chosen to give more clusters than the ground truth. (E) VI metric measure of performance of both algorithms for various values of the penalty, where the black vertical line indicates BIC. (F) The number of clusters obtained for various values of the penalty, where the black vertical line indicates BIC.
Figure 1B shows a confusion matrix generated by the
masked EM algorithm on this data, with the modified BIC penalty and masks defined by
equation 2.1, indicating perfect
performance. By contrast, Figure 1C shows the result
of classical EM, in which many clusters have been erroneously merged; the results for AIC
penalty are shown since using a BIC penalty yielded only a single cluster. To verify that
this is not simply due to an inappropriate choice of penalty, we reran with the penalty
term linearly scaled by various factors. Figure 1D
shows the results of a penalty that gave
more clusters than the ground-truth data. Even in this case, however, the clusters
produced by classical EM contained points from a mixture of the original clusters and
could not be corrected even by manual post hoc cluster merging. To systematically evaluate
the effectiveness of both algorithms, we measured performance using the variation of
information (VI) metric (Meilă, 2003), for which a
value of 0 indicates a perfect clustering. Both algorithms were tested for a variety of
different penalties measured in multiples of AIC (see Figures 1E and 1F). Whereas the
masked EM algorithm was able to achieve a perfect clustering for a large range of
penalties around BIC, the classical EM algorithm produced a markedly poorer VI value, corresponding to the poor result of merging all the points into a single cluster.
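For reference, the VI metric can be computed directly from the contingency table of two labelings; the helper below (ours, using natural logarithms) returns 0 for identical clusterings up to relabelling.

```python
import numpy as np

def variation_of_information(labels_a, labels_b):
    """Meila's VI distance: VI = H(A) + H(B) - 2 I(A; B)."""
    n = len(labels_a)
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)                 # contingency counts
    joint /= n
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    return h(pa) + h(pb) - 2.0 * mi
```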
Figure 2 shows how clustering performance depends on the precise choice of the two masking thresholds defined in equation 2.1, using a BIC penalty. Good performance did not require a single precise parameter setting but could be obtained over a broad range of threshold values. The results illustrate the benefits of using a double-threshold approach in preference to a single threshold: performance when the two thresholds coincide is markedly worse than when they are well separated.
Figure 2: The effect of varying the two masking thresholds of equation 2.1 for the simulated 1000-dimensional data set with seven clusters. The plot shows a pseudocolor representation of performance, measured using the Meilă VI metric, for various values of the two thresholds using a BIC penalty.
Finally, in order to explore in more detail how the classical and masked EM algorithms
deal with increasing dimensionality, we varied the number of features input to the
algorithms. First, we sorted the features in rough order of relevance, according to the
mean value of that feature over all input patterns. Both algorithms were then run on
subsets of the most relevant p features for varying values of p. Performance was quantified with the VI metric (see Figure 3); in order to ensure differences between algorithms
were not simply due to differences in penalty strategy, we also permitted post hoc manual
merging of clusters that were oversplit. With only a few features, both algorithms performed badly. For intermediate numbers of features, both algorithms performed perfectly;
however, as more uninformative features were added, the performance of the classical, but
not masked, EM algorithm started deteriorating. The performance of the masked algorithm
remained good for all dimensionalities tested.
Figure 3: The effect of increasing the number of dimensions on the quality of cluster analysis. The dimensions are added in order of relevance relative to the seven ground-truth clusters of the simulated mixture-of-gaussians data set.
3.2 Spike Sorting
To test the performance of the masked EM algorithm on our target application of
high-channel-count spike sorting requires a ground-truth data set. Previous work
established the performance of the classical EM algorithm for low-channel spike sorting
with ground truth obtained by simultaneous recordings of a neuron using not only the
extracellular array, but also an intracellular method using a glass pipette that
unequivocally determined firing times (Harris et al., 2000). Because such dual recordings are not yet available for high-count
electrodes, we created a simulated ground truth we term “hybrid data sets.” In this
approach, the mean spike waveform on all channels of a single neuron taken from one
recording (the donor) is digitally added onto a second recording (the acceptor) made with
the same electrode in a different brain. Because the hybrid spikes are linearly added to
the acceptor traces, this simulates the linear addition of neuronal electric fields and
recreates many of the challenges of spike sorting, such as the variability of amplitudes
and waveforms of the hybrid spike between channels, and overlap between the digitally
added hybrid with spikes of other neurons in the acceptor data set (Harris et al., 2000). Furthermore, to simulate the common difficulty
caused by amplitude variability in bursting neurons, the amplitude of the hybrid spike
was varied randomly between 50% and 100% of its original value. The hybrid data sets we
consider were constructed from recordings in rat cortex kindly provided by Mariano
Belluscio and György Buzsáki, using a 32-channel probe with a zig-zag arrangement of
electrodes and minimum 20 μm spacing
between neighboring contacts. Three principal components were taken from each channel,
resulting in 96-dimensional feature vectors.
For the spike data, masks were generated using a generalization of equation 2.1 that took into account the topology of the
electrode array. Data were first high-pass-filtered (500 Hz cutoff); then spikes were
detected and masks were generated using a two-threshold flood-fill algorithm. The
flood-fill algorithm finds spatiotemporally connected sets S of samples (t, c) (where t is time and c is channel number) for which the filtered signal exceeds a lower threshold $\theta_{\mathrm{low}}$ at every point in the set, and each set contains at least one sample where the filtered signal exceeds an upper threshold $\theta_{\mathrm{high}}$. The values of $\theta_{\mathrm{low}}$ and $\theta_{\mathrm{high}}$ were set as 2 and 4.5 times the standard deviation of the filtered signal, which was estimated robustly as a scaled median absolute deviation. For each spike, a mask for channel c was computed from the filtered signal on that channel within S, using the two thresholds in the same manner as equation 2.1. Spikes were resampled and aligned to an estimated spike time. Finally, feature vectors were extracted from the resampled filtered spike waveforms by principal component analysis channel by channel. For each channel, the first three principal components were used to create the feature vector; hence for a C-channel data set, each spike was given a 3C-dimensional feature vector. The component of the mask vector corresponding to each feature was obtained by taking the mask computed on the channel from which the feature was derived. The data set contained 138,572 points of 96 dimensions; running 1500
iterations of the clustering algorithm on it took 10 hours on a single core of a 2.4 GHz
Intel Xeon L5620 computer running Scientific Linux 5.5. (The data we analyzed are publicly
available at https://github.com/klusta-team/hybrid_analysis.)
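A much-simplified sketch of the two-threshold detection step is given below. It uses scipy's default orthogonal connectivity on the time-by-channel array, which ignores the probe's actual zig-zag adjacency, and works on absolute amplitudes; it is not the detection code used for the results above.

```python
import numpy as np
from scipy.ndimage import label

def detect_spikes(filtered, low_mult=2.0, high_mult=4.5):
    """Two-threshold flood fill on a (time, channels) filtered signal."""
    med = np.median(filtered)
    sd = np.median(np.abs(filtered - med)) / 0.6745   # robust SD via scaled MAD
    theta_low, theta_high = low_mult * sd, high_mult * sd
    amp = np.abs(filtered)
    components, n = label(amp > theta_low)            # connected supra-threshold sets
    spikes = []
    for c in range(1, n + 1):
        inside = components == c
        if amp[inside].max() > theta_high:            # must cross the upper threshold
            spikes.append(np.argwhere(inside))        # (t, channel) samples of one spike
    return spikes, theta_low, theta_high
```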
To evaluate the performance of the masked EM algorithm on this data set, we first
identified the cluster with the highest number of true positive spikes and used it to
compute a false discovery rate, $\mathrm{FP}/(\mathrm{FP}+\mathrm{TP})$, and a true positive rate, $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, where FP denotes the number of false-positive,
TP the number of true-positive, and FN the number of false-negative spikes. This
performance was compared against a theoretical upper bound obtained by supervised
learning. The upper bound was obtained by using a quadratic support vector machine (SVM)
(Cortes & Vapnik, 1995) trained using the
ground-truth data, with performance evaluated by 20-fold cross-validation. In order to
ensure we estimated maximal performance, the SVM was run over a large range of parameters
such as margin and class weights, as well as including runs in which only features
relevant to hybrid cells were included. The theoretical optimum performance was estimated
as a receiver operating characteristic (ROC) curve by computing the convex hull of false
discovery and true positive rates for all SVM runs.
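Once a candidate cluster has been matched to the ground-truth hybrid spikes, the two rates reduce to simple counts (boolean indicator arrays in this hypothetical helper):

```python
import numpy as np

def fdr_tpr(in_cluster, is_hybrid):
    """False discovery rate FP/(FP+TP) and true positive rate TP/(TP+FN)."""
    tp = np.sum(in_cluster & is_hybrid)
    fp = np.sum(in_cluster & ~is_hybrid)
    fn = np.sum(~in_cluster & is_hybrid)
    return fp / (fp + tp), tp / (tp + fn)
```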
Figure 4 shows the performance of the masked EM
algorithm and classical EM algorithm on the hybrid data set, set against the theoretical
optimum estimated by the SVM. While the masked EM algorithm performs at close to the
estimated upper bound, the classical EM algorithm is much poorer. To verify that this
poorer performance indeed resulted from a curse of dimensionality, we reran the classical
EM algorithm on only the subset of features that were unmasked for the hybrid spike (9 out
of 96 features). As expected, the upper-bound
performance was poorer in this case, but the classical EM algorithm performed close to the
theoretical upper bound. This indicates that the classical algorithm fails in
high-dimensional settings, whereas the masked EM algorithm performs well.
Figure 4: Performance of the masked and classical EM algorithms on spike sorting data. Blue and red points indicate performance of multiple runs of the masked and classical EM algorithms with different penalty parameter settings. The cyan curve indicates the optimal possible performance, estimated as the convex hull of supervised learning results obtained from a support vector machine with quadratic kernel.
4 Discussion and Conclusion
We have introduced a method for high-dimensional cluster analysis, applicable to the case where a small subset of the features is informative for any data point. Unlike global feature selection methods, both the number and the precise set of unmasked features can vary between different data points. Both the number of free parameters and computational cost scale with the number of unmasked features per data point, rather than the total number of features. This approach was found to give good performance on simulated high-dimensional data and in our target application of neurophysiological spike sorting for large electrode arrays.
A potential caveat of allowing different features to define different clusters is the danger of artificial cluster splitting. If simply a hard threshold were used to decide whether a particular feature should be used for a particular cluster or data point, this could lead to a single cluster being erroneously split in two, according to whether the threshold was exceeded by noisy data. The masked EM algorithm overcomes this problem with two approaches. First, because the masks are not binary but real valued, crossing a threshold such as that in equation 2.1 leads to smooth rather than discontinuous changes in responsibilities; second, because masked features are replaced by a virtual distribution of empirically measured subthreshold data, the assignment of points with masked features is close to that expected for the original subthreshold features. With these approaches in place, we found that erroneous cluster splitting was not a problem in simulation or in our target application.
In this study, we have applied the masking strategy to a single application of unsupervised classification using a hard EM algorithm for a mixture-of-gaussians fitting. However, the same approach may apply more generally whenever only a subset of features is informative for any data point and when the expectation of required quantities over the modeled subthreshold distribution can be analytically computed. Other domains in which this approach may work therefore include not only cluster analysis with soft EM algorithms or different probabilistic models but also model-based supervised classification.
References
Author notes
D.F.M.G. is now with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K.