## Abstract

In recent years, the development of algorithms to detect neuronal spiking activity from two-photon calcium imaging data has received much attention, yet few researchers have examined the metrics used to assess the similarity of detected spike trains with the ground truth. We highlight the limitations of the two most commonly used metrics, the spike train correlation and success rate, and propose an alternative, which we refer to as CosMIC. Rather than operating on the true and estimated spike trains directly, the proposed metric assesses the similarity of the pulse trains obtained from convolution of the spike trains with a smoothing pulse. The pulse width, which is derived from the statistics of the imaging data, reflects the temporal tolerance of the metric. The final metric score is the size of the commonalities of the pulse trains as a fraction of their average size. Viewed through the lens of set theory, CosMIC resembles a continuous Sørensen-Dice coefficient—an index commonly used to assess the similarity of discrete, presence/absence data. We demonstrate the ability of the proposed metric to discriminate the precision and recall of spike train estimates. Unlike the spike train correlation, which appears to reward overestimation, the proposed metric score is maximized when the correct number of spikes have been detected. Furthermore, we show that CosMIC is more sensitive to the temporal precision of estimates than the success rate.

## 1 Introduction

Two-photon calcium imaging has enabled neuronal population activity to be monitored in vivo in behaving animals (Dombeck, Harvey, Tian, Looger, & Tank, 2010; Peron, Freeman, Iyer, Guo, & Svoboda, 2015). Modern microscope design allows neurons to be imaged at subcellular resolution in volumes spanning multiple brain areas (Sofroniew, Flickinger, King, & Svoboda, 2016). Coupled with the current generation of fluorescent indicators (Chen et al., 2013), which have sufficient sensitivity to read out single spikes, this imaging technology has great potential to further our understanding of information processing in the brain.

The fluorescent probe, however, does not directly report spiking activity. Rather, it reads out a relatively reliable indicator of spiking activity—a cell's intracellular calcium concentration—from which spike times must be inferred. A diverse array of techniques have been proposed for this task, including deconvolution approaches (Vogelstein et al., 2010; Friedrich, Zhou, & Paninski, 2017; Pachitariu, Stringer, & Harris, 2017), methods that identify the most likely spike train given a signal model (Vogelstein et al., 2009; Deneux et al., 2016), and approaches that exploit the sparsity of the underlying spike train (Oñativia, Schultz, & Dragotti, 2013). To enable the investigation of neural coding hypotheses, reconstructed spike trains must have sufficient temporal precision for analysis of synchrony between neurons and behavioral variables (Huber et al., 2012) while accurately inferring the rate of spiking activity.

Although the development of spike detection algorithms has received a lot of recent attention, few researchers have examined the metrics used to assess an algorithm's performance. At present, there is no consensus on the best choice of metric. In fact, from our survey, 44% of papers presenting a new method assess its performance using a metric unique to that paper. This inconsistency impedes progress in the field: algorithms are not directly comparable, and, consequently, data collectors cannot easily select the optimal algorithm for a new data set.

The two most commonly used metrics, the spike train correlation (STC) and the success rate, are not well suited to the task. The STC, which is invariant under linear transformations of the inputs, is not able to discriminate the similarity of the rates of two spike trains (Paiva, Park, & Príncipe, 2010). Moreover, the temporal binning that occurs prior to spike train comparison impairs the STC's ability to compare spike train synchrony (Paiva et al., 2010). These limitations suggest that although the STC is a quick and intuitive method, it is not appropriate for assessing an algorithm's spike detection performance. The success rate, which accurately compares spike rates, does not reward increasing temporal precision above a given threshold. Consequently, it is not an appropriate metric for evaluating an algorithm's performance when the end goal is, for example, to investigate the synchrony of activations within a network.

In this letter, we present a metric that can discriminate both the temporal and rate precision of an estimated spike train with respect to the ground-truth spike train. Unlike the STC, we do not bin the spike trains. Rather, spike trains are convolved with a smoothing pulse that allows comparison of spike timing with an implicit tolerance. The similarity between the resulting pulse trains is subsequently assessed. This type of continuous approach is also preferred by metrics assessing the relationship between spike trains from different neurons (van Rossum, 2001; Schreiber, Fellous, Whitmer, Tiesinga, & Sejnowski, 2003). We set the pulse width to reflect the temporal precision that an estimate is able to achieve given the statistics of the data set. As such, the metric is straightforward to implement since there are no parameters to tune. For convenience, we refer to the proposed metric as CosMIC (consistent metric for spike inference from calcium imaging). In the following, we demonstrate CosMIC's ability to discriminate spike train similarity on real and simulated data. We include comparisons against the two most commonly used metrics, the spike train correlation and the success rate, and against two metrics designed to assess similarity between spike trains from different neurons (Victor & Purpura, 1997; van Rossum, 2001).

## 2 Constructing the Metric

In this letter, we present a metric for comparing the similarity of two sets of spikes: a ground-truth set, $S$ = ${tk}k=1K$, and a set of estimates, $S^$ = ${t^k}k=1K^$. Due to limiting factors, such as noise and model mismatch, it is improbable that an estimate will match a true spike with infinite temporal precision. As such, we do not expect that $t^j=tk$ for any $j$ or $k$. Rather, we wish to reward estimates within a reasonable range of accuracy given the limitations of the data. We achieve this by leveraging results from fuzzy set theory (Zimmermann, 2010).

### 2.1 Ancestor Metrics

## 3 Temporal Error Tolerance

The width of the triangular pulse with which the spike trains are convolved reflects the accepted tolerance of an estimated spike's position with respect to the ground truth. To set this width, we calculate a lower bound on the temporal precision of the estimate of one spike, the Cramér-Rao bound (CRB), from the statistics of the data. The CRB reports the lower bound on the mean square error of any unbiased estimator (Kay, 1993). It is therefore useful as a benchmark; an estimator that achieves the CRB should be awarded a relatively high metric score. In section 3.1, we detail the calculation of the CRB. In section 3.2, we outline how we use this bound to determine the pulse width. Then, in section 3.3, we provide practical advice on the calculation of the bound.

### 3.1 Cramér-Rao Bound for Spike Detection

### 3.2 Pulse Width

The CRB can be used as a benchmark for temporal precision of any unbiased estimator. As such, we set the pulse width to ensure that on average, an estimate at the precision of the CRB achieves a relatively high score. We set the benchmark metric score at 0.8, as this represents a relatively high value in the range of the metric, which is between 0 and 1. The importance of this score is not the particular benchmark value—a range of values give similar performance—but rather that it is a reproducible number with a clear interpretation. In this letter, we characterize the discrimination performance of CosMIC with a benchmark value of 0.8, so that its scores can be interpreted when applied to spike inference algorithms on real data. The benchmark value was set lower than the metric's maximum value, 1, so that the score does not saturate when the model assumptions are not ideally satisfied. On real data, the noise may not be stationary ($\sigma $ may vary in time), and so algorithms may appear to outperform the CRB. A benchmark score of 0.8 means that the metric score does not saturate in this scenario.

### 3.3 Implementation

Code to implement the metric can be found at github.com/stephanierey/metric along with a demonstration. In order to use the metric, one must have estimates of the fluorescence signal parameters, ${\alpha ,\gamma ,A,\sigma}$ (see equation 3.1). In the following, we provide some guidance on the estimation of these parameters. Alternative strategies have been suggested by numerous model-based algorithms, whose spike detection procedures use a subset of the above parameters (Vogelstein et al., 2009; Pnevmatikakis, Merel, Pakman, & Paninski, 2013; Pnevmatikakis et al., 2016; Deneux et al., 2016).

The standard deviation of the noise, $\sigma $, can be computed as the sample standard deviation of a portion of the data in which there were no calcium transients. The parameters that determine the speed of the rise and decay of the pulse, $\alpha $ and $\gamma $, are predominantly defined by characteristics of the fluorescent indicator that was used to generate the imaging data. In Table 1, we provide documented values of $\alpha $ and $\gamma $ for four commonly used fluorescent indicators, extracted from the corresponding references: Cal-520 AM (Tada et al., 2014), OGB-1 AM (Lütcke, Gerhard, Zenke, Gerstner, & Helmchen, 2013), and GCaMP6f and GCaMP6s (Chen et al., 2013). These values can be used as a guideline; in practice, they will vary with the indicator expression level, as well as the cell type. We note that the time taken for a calcium transient to rise to its peak and the decay time are functions of both $\alpha $ and $\gamma $; the values presented in Table 1 are thus not easily interpretable in terms of the shape of a calcium transient pulse.

Fluorescent Indicator . | $\alpha (s-1)$ . | $\gamma (s-1)$ . |
---|---|---|

GCaMP6f | 4.88 | 60.97 |

GCaMP6s | 1.26 | 15.16 |

OGB-1 AM | 1.5 | 101.5 |

Cal-520 AM | 3.18 | 34.39 |

Fluorescent Indicator . | $\alpha (s-1)$ . | $\gamma (s-1)$ . |
---|---|---|

GCaMP6f | 4.88 | 60.97 |

GCaMP6s | 1.26 | 15.16 |

OGB-1 AM | 1.5 | 101.5 |

Cal-520 AM | 3.18 | 34.39 |

Notes: To calculate CosMIC's pulse width, the parameters that define the speed of rise and decay of the calcium transient, $\alpha $ and $\gamma $, are required. Here, we provide documented values of these parameters for four commonly used fluorescent indicators.

## 4 Numerical Experiments

To assess the discriminative ability of CosMIC, we simulate true and estimated spike trains in various informative scenarios. We compare CosMIC with the two most commonly used metrics in the spike inference literature, which we define in sections 4.1 and 4.2 for completeness. We also compare against two metrics designed to assess the similarity of spike trains from different neurons. We define the metrics of Victor and Purpura (1997) and van Rossum (2001) in sections 4.3 and 4.4, respectively.

### 4.1 Success Rate

The success rate, which is defined as a function of the true- and false-positive rates or, alternatively, as a function of precision and recall, appears in various forms in the literature. Spike inference performance has been assessed using true- and false-positive rates (Rahmati, Kirmse, Marković, Holthoff, & Kiebel, 2016), precision and recall analysis (Reynolds et al., 2017), and using the complement of the success rate, the error rate (Deneux et al., 2016). We study this class of metrics under the umbrella of the success rate, which we define here.

### 4.2 Spike Train Correlation

The STC takes values in the range [$-$1, 1]. In practice, however, it is rare for a spike detection algorithm to produce an estimate that is negatively correlated with the ground truth (Berens et al., 2017). Moreover, an estimate with maximal negative correlation is equally as informative as one with maximal positive correlation. In this letter, we use the normalized spike train correlation, the absolute value of the STC. This ensures that the range of each metric that we analyze is equivalent (and equal to [0,1]) and that, as a consequence, the distribution of metric values are comparable.

### 4.3 Victor-Purpura Dissimilarity

### 4.4 van Rossum Dissimilarity

## 5 Results

To investigate metric properties, we simulated estimated and ground-truth spike trains and analyzed the metric scores. To mimic the temporal error in spike time estimation, unless otherwise stated, estimates were normally distributed about the true spike times. In the following, we refer to the standard deviation of the normal distribution as the jitter of the estimates.

### 5.1 CosMIC Rewards High Temporal Precision

On simulated data, we investigated the effect of these properties when spike train estimates, rather than single spikes, were evaluated. In particular, we analyzed the metric scores when spike train estimates contained the correct number of spikes but their temporal precision varied. We simulated the ground-truth spike train as a Poisson process with rate 1 Hz over 200 s. The corresponding calcium transient signal was generated assuming a Cal-520 pulse shape (see Table 1) and a sampling rate of 30 Hz. White gaussian noise was added to the calcium transient signal to generate two fluorescence signals, one with low and the other with relatively high noise (see Figure 5B). The corresponding metric pulse widths, as calculated from the CRB, were 33 ms and 78 ms, or 1 and 2.3 sample widths, respectively. Spike train estimates were normally distributed about the true spikes with varying jitter. The metric scores were then calculated for 100 realizations of spike train estimates at each jitter level in both the low- and high-noise settings (see Figures 5C and 5D, respectively).

As the correct number of spikes was always estimated, the level of jitter represented the quality of a spike train estimate in this setting. Ideally, a metric would reliably reward spike train estimates of the same quality with the same score. The STC, however, took a relatively large range of values for estimates of the same jitter (see Figures 5C and 5D), despite having the same range as CosMIC and the success rate. This inconsistency is a consequence of the edge effects introduced by binning. Here, we use the term *consistency* in line with its semantic rather than mathematical definition.

We observed a roughly linear trend in the scores of CosMIC and the success rate (see Figure 5E). As expected, CosMIC was boosted with respect to the success rate when the root mean square error (RMSE) of detected spikes was relatively low when measured as a fraction of the pulse width. In each case, the RMSE was computed empirically from the estimated spikes within the precision of CosMIC and the success rate's pulse width. Conversely, CosMIC was relatively low with respect to the success rate when the RMSE was relatively high. This trend is visible in the Bland-Altman plot (Altman & Bland, 1983; Giavarina, 2015), in which the mean of the two methods is plotted against the difference. We conclude that CosMIC is more sensitive to the temporal precision of detected spikes, as, unlike the success rate, it discriminates precision above the bin width.

### 5.2 CosMIC Penalizes Overestimation

It is the type of normalization used by the STC that caused it to be insensitive to overestimation. Scaling factors present in the spike count vectors cancel out in the numerator and denominator (see equation 4.2), rendering the STC invariant under scalar transformations of the inputs. When the STC was adapted to the continuous-time assessment of spike train similarity, by first convolving spike trains with a smoothing pulse, this flaw persisted (Paiva et al., 2010).

When the spike train estimates have jitter $\sigma CRB$ and their rate increases from perfect rate estimation to an overestimation ratio of 3, the success rate and CosMIC scores are reduced by 49% and 40%, respectively. Both metrics are thus penalizing overestimation, with the former metric doing so more harshly. When the jitter is larger than the CRB, the reduction in CosMIC from perfect rate estimation to overestimation is relatively smaller, as CosMIC is already substantially penalizing the temporal discrepancy.

### 5.3 Application to Real Imaging Data

As detailed in section 3, the metric's pulse width was set with respect to the CRB. On this data set, the pulse widths were concentrated between 1 and 3 sample widths; this range encompassed 92% of the data (see Figure 7F). As the noise level of the data increases, so does the pulse width. Consequently, the tolerance of the metric with regard to the temporal precision of estimates also increases. As a result, estimates on noisier data (see Figure 7B) were scored with more lenience than those on less noisy data (see Figure 7A).

As was found on simulated data in section 5.1, there was a linear trend between the scores of CosMIC and the success rate (see Figure 7C). CosMIC was relatively high with respect to the success rate when the temporal precision, represented by RMSE as a fraction of the pulse width, was relatively high. Conversely, CosMIC was low with respect to the success rate when the temporal precision was relatively low. This pattern was conserved when CosMIC's ancestor metrics, $PCosMIC$ and $RCosMIC$ (see section 2.1), were compared to the precision and recall (see Figures 7D and 7E). The average RMSE over all traces was 27 ms, or 0.37 sample widths. As CosMIC is able to discriminate precision above the pulse width, it is better able to reward this superresolution performance than the success rate or STC.

### 5.4 CosMIC Discriminates Precision and Recall of Spike Trains

By construction, CosMIC bears a strong resemblance to the Sørensen-Dice coefficient, which, in the context of spike detection, is referred to as the success rate. The success rate is the harmonic mean of the precision and recall, two intuitive metrics that represent the proportion of estimates that detect a ground-truth spike and the proportion of true spikes detected, respectively. In this section, we demonstrate that CosMIC can accurately discriminate both the precision and recall of spike train estimates.

### 5.5 Comparison with Victor-Purpura and van Rossum Distances

The Victor-Purpura (VP) and van Rossum (vR) spike distances were originally designed to quantify the dissimilarity between spike trains from different neurons (Victor & Purpura, 1997; van Rossum, 2001). Due to the obvious parallels between that scenario and ours, we investigated the applicability of the VP and vR metrics to scoring spike inference.

Although it is already clear that when the width is set correctly, VP and vR can discriminate the rate and temporal precision of spike trains with respect to one another (Paiva et al., 2010), it is not clear whether they are suitable for scoring spike train estimates. In Figures 9B to 9D, we plot the scores of VP, vR, and CosMIC, respectively, as the precision and recall of spike train estimates vary. We observed that vR and VP were less sensitive to the recall than the precision of spike train estimates; relatively low distances were obtained when only 50% of true spikes were detected. In contrast, CosMIC attained a relatively high score only when both the precision and recall were high (see Figure 9D). As it is crucial that a spike inference metric penalizes both undetected and falsely detected spikes, this result suggests that, without modification, VP and vR are not ideal for scoring spike train estimates.

The results correspond to a ground-truth spike train consisting of 200 spikes generated from a Poisson process with rate 1 Hz. False positives were uniformly distributed about the temporal interval, whereas true positives were normally distributed about true spikes with jitter 20 ms. The pulse width was set assuming a CRB of 20 ms. At each level of precision and recall, results were averaged over 100 realizations of spike train estimates.

## 6 Discussion

Much recent attention has been focused on the development of algorithms to detect spikes from calcium imaging data, while the suitability of the metrics that assess those algorithms has been predominantly overlooked. In this letter, we presented a novel metric, CosMIC, to assess the similarity of spike train estimates compared to the ground truth. Our results demonstrate that CosMIC accurately discriminates both the temporal and rate precision of estimates with respect to the ground truth.

Using two-photon calcium imaging, the activity of neuronal populations can be monitored in vivo in behaving animals. Inferred spike trains can be used to investigate neural coding hypotheses by analyzing the rate and synchrony of neuronal activity with respect to behavioral variables. To justify such analysis, the ability of spike detection algorithms to generate accurate spike train estimates must be verified. When spike frequency is to be investigated, it is crucial that an estimate accurately matches the rate of the ground-truth spike train. We have shown that the STC is not fit for this purpose; rather than penalizing overestimation of the number of spikes, it is rewarded (see Figure 6). In contrast, CosMIC and the success rate are maximized when the correct number of spikes is detected. When the ultimate goal is to analyze spike timing with respect to other variables, it is critical that spikes can be detected with high temporal precision. We have shown that CosMIC has superior discriminative ability in this regard, compared to the success rate and STC (see Figure 5).

The current inconsistency in the metrics used to assess spike detection algorithms hinders both experimentalists, aiming to select an algorithm for data analysis, and developers. In light of this problem, a recent benchmarking study tested a range of algorithms on a wide array of imaging data (Berens et al., 2017). Although informative, the study, which relied heavily on the STC to assess algorithm performance, may not provide the full picture. By introducing a new metric, we hope to complement such efforts in the pursuit of a thorough, quantitative evaluation of spike inference algorithms.

By construction, CosMIC bears a resemblance to the Sørensen-Dice coefficient, which is commonly used to compare discrete, presence/absence data (Dice, 1945; Sørensen, 1948). This metric, which is also known as the F1-score, is widely used in many fields, including ecology (Bray & Curtis, 1957) and image segmentation (Zou et al., 2004). When applied to spike inference, this coefficient is referred to as the success rate and is one of the two most commonly used metrics. We have demonstrated that this construction confers some of the advantages of the success rate to CosMIC. In particular, CosMIC is able to accurately discriminate the precision and recall of estimated spike trains (see Figure 8). We have also shown the advantages of CosMIC over the success rate; most important, it is more sensitive to a spike train estimate's temporal precision than the success rate (see Figure 5). Furthermore, CosMIC's parameter is defined with respect to the statistics of the data set and, unlike the success rate's bin size, it does not need to be selected by a user.

We demonstrated that CosMIC is boosted with respect to the success rate when temporal precision is relatively high. In particular, as temporal precision approaches the CRB, CosMIC increases to a maximum. It is not clear how close existing algorithms are to this theoretical bound. Nevertheless, it is important to discriminate between the temporal precision of algorithms even if the performance is not yet optimal. For example, if all algorithms produce estimates with error on the order of a sample width, it is still of interest to know which algorithm produces the lowest error. With its graded pulse shape, CosMIC is able to penalize decreasing error in this way.

The width of the pulse is computed from a lower bound on temporal precision (see section 3), which in turn is derived from the statistics of the data set. As a result, the metric will be more lenient for spike inference algorithms on noisier or lower sampling rate data. This is due to our assumption that a metric score should reflect the difficulty of the spike inference problem. To calculate the bound, knowledge of the calcium transient pulse parameters and the standard deviation of the noise is required. These parameters are typically used by algorithms in the spike detection process (Vogelstein et al., 2010; Deneux et al., 2016). Using only one pulse amplitude parameter, which relates to the amplitude of a single spike, is a simplification. Depending on the fluorescent indicator, amplitudes do in fact decrease (Lütcke, Gerhard, Zenke, Gerstner, & Helmchen, 2013) or increase (Chen et al., 2013) at high firing rates. Consequently, CosMIC may be slightly more punitive in the former case than the latter.

The problem of comparing a ground-truth and estimated spike train is analogous to that of comparing spike trains from different neurons. In the spike train metric literature, binless measures have been found to outperform their discrete counterparts (Paiva et al., 2010). It is also common to convolve spike trains with a smoothing pulse prior to analysis (van Rossum, 2001; Schreiber et al., 2003). In that context, the width and shape of the pulse reflect hypotheses about the relationship between neuronal spike trains. A width that is large with respect to the average interspike interval results in a metric tuned to the comparison of neuronal firing rates. Conversely, a relatively small width produces a metric that acts as a coincidence detector. To apply CosMIC to the problem of spike train comparison, one could similarly vary the pulse width to tailor its performance to the neural coding scheme. In the context of spike detection, which we view as a parameter estimation problem, the pulse width is fixed with respect to a lower bound on the precision with which a spike time can be estimated. Setting the width via this bound, which is tailored to calcium imaging data, results in a metric that assesses how accurately parameters have been estimated given the constraints of the data. This approach would need to be altered to extend CosMIC to other applications. We note that in the absence of this pulse width, CosMIC is sufficiently universal to be applied to the comparison of any point processes.

Finally, we note that the developed metric is able to accurately assess an estimate's temporal and rate precision. This information is unified in a single score that summarizes the overall performance of an algorithm. We consider a single summary score to be practical for users who do not have the time or desire to analyze multidimensional trade-offs. Alternatively, CosMIC's ancestor metrics, $RCosMIC$ and $PCosMIC$, can be used to determine the extent to which errors stem from undetected or falsely-detected spikes.

## Appendix: Further Analytical Results

### A.1 Alternative Metric Form

### A.2 Score for Estimate of One Spike

We now derive an expression for the metric score of the estimate of the location of one spike in terms of the temporal error of the estimate, $|u|$. We see that as the temporal precision increases above the threshold precision ($2\epsilon $), the metric score increases monotonically.

### A.3 Metric Score at Precision of CRB

The CRB is commonly used as a benchmark for algorithm performance in parameter estimation problems. In the context of calcium imaging, it has been previously used to evaluate detectability of spikes under different imaging modalities (Reynolds, Oñativia, Copeland, Schultz, & Dragotti, 2015; Schuck et al., 2018). In this case, the CRB reports the minimum uncertainty achievable by any unbiased estimator when estimating the location of one spike. We thus set the width of the pulse to ensure that, on average, an estimate of the location of one spike at the precision of the CRB achieves a metric score of 0.8. This benchmark score is relatively high in the range of the metric, which is between 0 and 1, while allowing leeway to be exceeded.

### A.4 Exact Detection of Subset of True Spikes

### A.5 Exact Detection of All True Spikes with Overestimation

## Acknowledgments

This work was supported by European Research Council starting investigator award, grant 277800, to P. L. D.; Biotechnology and Biological Sciences Research Council, grant BB/K001817/1, to S. R. S.; EU Marie Curie FP7 Initial Training Network, grant 289146, to S. R. S.; CIHR New Investigator Award, grant 288936, to P. J. S.; CFI Leaders Opportunity Fund, grant 28331, to P. J. S.; CIHR Operating Grant, grant 126137, to P. J. S.; and NSERC Discovery Grant, grant 418546-2, to P. J. S.