## Abstract

A central theme in computational neuroscience is determining the neural correlates of efficient and accurate coding of sensory signals. Diversity, or heterogeneity, of intrinsic neural attributes is known to exist in many brain areas and is thought to significantly affect neural coding. Recent theoretical and experimental work has argued that in uncoupled networks, coding is most accurate at intermediate levels of heterogeneity. Here we consider this question with data from in vivo recordings of neurons in the electrosensory system of weakly electric fish subject to the same realization of noisy stimuli; we use a generalized linear model (GLM) to assess the accuracy of (Bayesian) decoding of the stimulus given a population spiking response. The long recordings enable us to consider many uncoupled networks and a relatively wide range of heterogeneity, as well as many instances of the stimuli, thus enabling us to address this question with statistical power. The GLM decoding is performed on a single long time series of data to mimic realistic conditions rather than using trial-averaged data for better model fits. For a variety of fixed network sizes, we generally find that the optimal levels of heterogeneity are at intermediate values, and this holds in all core components of the GLM. These results are robust to several measures of decoding performance, including the absolute value of the error, error weighted by the uncertainty of the estimated stimulus, and the correlation between the actual and estimated stimulus. Although a quadratic fit to decoding performance as a function of heterogeneity is statistically significant, the result is highly variable with low $R^2$ values. Taken together, intermediate levels of neural heterogeneity are indeed a prominent attribute for efficient coding even within a single time series, but the performance is highly variable.

## 1 Introduction

Animals are able to robustly detect important sensory stimuli, using neurons as the central processing units. Thus, sensory neurons, even at initial stages of processing, must accurately encode noisy stimuli to faithfully propagate sensory information to subsequent brain regions not directly exposed to the stimuli. The coordinated activity of populations of neurons can successfully encode stimuli despite the fact that individual neurons respond noisily and nonlinearly to inputs (Dayan & Abbott, 2001). Intrinsic cellular diversity has been proposed as a crucial attribute for efficient coding in many sensory systems, such as olfactory (Padmanabhan & Urban, 2010), auditory (Ahn et al., 2014), electrosensory (Marsat & Maler, 2010), and visual (Chelaru & Dragoi, 2008). Going beyond whether intrinsic diversity is important, Tripathy, Padmanabhan, Gerkin, and Urban (2013) showed that an intermediate level of diversity, not merely its presence, could enhance population coding in uncoupled mitral cells in the olfactory bulb. Moreover, Mejias and Longtin (2012) theoretically analyzed how this optimal level of heterogeneity enhances neural coding in uncoupled populations of spiking neurons. However, despite these influential works, we still do not know how well networks can optimally decode stimuli at intermediate heterogeneity without using trial averaging in experimental data and with stochastic models. Also, a more detailed statistical analysis of how neural heterogeneity is related to accurate or optimal decoding of noisy input signals would provide deeper insights into this important question.

Here we address the relevance of diversity in populations of sensory neurons for decoding noisy input stimuli. We fit generalized linear models (GLMs) (Pillow, 2007; Pillow et al., 2008; Pillow, Ahmadian, & Paninski, 2011) to *in vivo* electrophysiological recordings of the weakly electric fish *Apteronotus leptorhynchus*, all from the same animal. GLMs are commonly used in many systems and have many desirable features, including the existence of an optimal fit to the encoding model and Bayesian decoding via the maximum a posteriori (MAP) estimate of the stimulus given the population spiking response (Pillow, Ahmadian, & Paninski, 2011). The rich data set we use consists of long time series of spiking responses for each of the seven recorded cells, all subject to the same realization of random stimulus input. A network consists of a random selection of neurons; in addition, these data enable a given neuron to have a variety of encoding models fit to different segments of time, resulting in a wide range of heterogeneous networks. In a given time interval, the task of decoding the same random input with a network model fit to a random prior time mimics realistic conditions, as opposed to decoding the trial- or time-averaged statistics of inputs (Tripathy et al., 2013) or using trial- or time-averaged experimental data to fit a statistical model well. In considering decoding of stimuli for physiologically relevant time periods of several hundreds of milliseconds, we thus have many time intervals of the noisy input with which to assess decoding. We are therefore able to estimate statistical measures of model fits with sufficient statistical power to assess whether the diversity of the GLM components is significant for accurate decoding.
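As a concrete illustration, the discrete-time Poisson GLM of the type used here can be sketched as follows: the conditional intensity is an exponential function of the stimulus filtered by $k$, the recent spike history filtered by $h$, and a constant bias $b$, and spike counts are drawn from a Poisson distribution in each bin. The filter values and bin counts below are hypothetical toy numbers, not fits to the fish data.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 2.5e-3                      # 2.5 ms bins, as in the text

# toy filters (hypothetical values, not fit to the recordings)
k = np.array([0.8, 0.4, 0.1])    # stimulus filter (newest bin first)
h = np.array([-2.0, -0.5])       # postspike filter (refractory suppression)
b = 2.0                          # constant bias

T = 400
stim = rng.standard_normal(T)    # one realization of a noisy stimulus
spikes = np.zeros(T, dtype=int)
rate = np.zeros(T)

for t in range(T):
    ks = stim[max(0, t - len(k) + 1):t + 1][::-1]   # recent stimulus, newest first
    hs = spikes[max(0, t - len(h)):t][::-1]         # recent spikes, newest first
    drive = k[:len(ks)] @ ks + h[:len(hs)] @ hs + b
    rate[t] = np.exp(drive)                         # conditional intensity
    spikes[t] = rng.poisson(rate[t] * dt)           # spike count in this bin
```

Encoding fits the filters to data by maximizing the Poisson likelihood of the observed spikes; MAP decoding inverts the model to estimate the stimulus from the spikes.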

We find that intermediate levels of heterogeneity are indeed a signature of accurate decoding of stimuli, and this result holds for multiple metrics of decoding performance. Specifically, we show that decoding performance measured by the $l_1$-norm of error^{1} has a minimum value when performing quadratic regression of error versus intrinsic heterogeneity. When we incorporate Bayesian uncertainty by weighting each time bin inversely with the standard deviation of the MAP estimate, we see similar results, but the quality of fit of the quadratic regression is worse compared to the $l_1$-norm. A notable addition in our analysis is the use of $R^2$ as a quality-of-fit statistic to assess the strength of the quadratic regression beyond its statistical significance via $p$-values.

In addition to these two error metrics, we consider the Pearson's correlation between the actual and decoded stimulus to assess how well the two entities covary (see Ostojic & Brunel, 2011, and René, Longtin, & Macke, 2020, who used a similar measure of error, among other measures). We find again that correlations are maximal at intermediate levels of heterogeneity. In almost all cases with these three metrics for decoding performance, we find that the quadratic regression fit is statistically significant but noisy, with small $R^2$ values. We also observe that decoding performance is enhanced as the network size increases, in terms of both reduced error and increased correlation strength, consistent with theories of population coding (Shamir & Sompolinsky, 2006; Yarrow, Challis, & Seriès, 2012). Although decoding performance is variable for a particular level of heterogeneity, our detailed study shows that intermediate levels of heterogeneity in all of the core attributes of the GLM can result in optimal decoding of stimuli and that this trend holds for all of the network sizes we considered.

## 2 Results

The GLMs were fit to each of the seven cells recorded in the same animal,^{2} so they all have the same instance of stimulus input. Note how the two filters $(k,h)$ and the shift $(b)$ vary across the seven neurons under the same conditions, indicative of intrinsic heterogeneity in the electrosensory lateral line lobe (ELL) cells. A GLM for each cell will also vary depending on the time period used for fitting, even within the same neuron, because of its dynamic response.

Figure 2C shows comparisons of the decoded stimulus (black) and the actual stimulus (cyan) for different network sizes ($N=1,4,7$). Here and for all subsequent figures and results, we use 1397.5 ms (559 bins of 2.5 ms) to fit the GLM encoding model and decode for 417.5 ms (167 bins of 2.5 ms), approximately 30% of the in-sample training period. As the cell population increases, we see that the decoded stimulus falls more in line with the actual stimulus in these examples. However, with a fixed population size, the results can vary dramatically from poor (top row) to good (bottom row) decoding. Given this variability, we aim to statistically quantify how intrinsic heterogeneity is related to the quality of decoding performance by generating many random networks and considering many random time segments to decode.

### 2.1 Heterogeneity versus Decoding Error

Networks were constructed by randomly selecting cells from among all seven neurons to obtain a prescribed network size, and GLMs for each cell were fit on random segments of time prior to the decoding time segment. Note that a given network of cells decodes stimuli in the same time segment. A given network has three measures of heterogeneity based on the encoding models' components (stimulus filters, postspike filters, and constant bias); we define heterogeneity as the variability across these respective components. Following the convention in Tripathy et al. (2013), the stimulus filter heterogeneity and postspike filter heterogeneity are both the mean of the Euclidean distances between all distinct pairs of filters (treating each filter's coefficients as a vector), while the bias heterogeneity is the standard deviation of the $b$ values.
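The heterogeneity measures just described can be sketched as follows; the network of filters and biases below is a hypothetical toy example, not taken from the recordings.

```python
import numpy as np
from itertools import combinations

def filter_heterogeneity(filters):
    """Mean Euclidean distance between all distinct pairs of filters.
    `filters` is an (n_cells, n_coeffs) array, one row per cell."""
    dists = [np.linalg.norm(fi - fj) for fi, fj in combinations(filters, 2)]
    return np.mean(dists)

def bias_heterogeneity(biases):
    """Standard deviation of the constant bias terms."""
    return np.std(biases)

# toy network of 4 cells with 3-coefficient stimulus filters (hypothetical values)
K = np.array([[0.8, 0.4, 0.1],
              [0.7, 0.5, 0.0],
              [1.0, 0.2, 0.2],
              [0.6, 0.6, 0.1]])
b = np.array([-1.0, -0.8, -1.3, -0.9])

k_het = filter_heterogeneity(K)   # stimulus filter heterogeneity
b_het = bias_heterogeneity(b)     # bias heterogeneity
```

The same `filter_heterogeneity` function applies to the postspike filters $h$; a perfectly homogeneous network has heterogeneity zero under all three measures.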

For the unweighted error, the $R^2$ values are somewhat notable, especially for larger networks (see Table 1), while for the Hessian-weighted error, the $R^2$ values are generally much smaller (see Table 2), but an intermediate minimum is still found. So when the uncertainty of the decoder is accounted for, heterogeneity of the network has less influence on decoding.

Besides the differences between the $l_1$-norm of error and the Hessian-weighted error, note that all of these quadratic models are statistically significant with minuscule $p$-values due to the large sample size, but the actual fits can be of questionable merit (e.g., $R^2=0.0019$ for stimulus filter heterogeneity versus the Hessian-weighted error at network size $N=3$). This contrast demonstrates the usefulness of quality-of-fit metrics when employing regression to reveal a fuller story. With the Hessian-weighted error, the intermediate levels of heterogeneity for optimal decoding are not as robust compared to the $l_1$-error. In particular, for postspike filter heterogeneity at network sizes $N=2$, 3, 4, 5, the optimal level of heterogeneity is not in the domain (see Figures A4 and A5 and Table 2). However, a similar limitation in explanatory power occurs for postspike filter heterogeneity against the $l_1$-norm; although the optimal heterogeneity value is in the interior for $N=3$, 4, the heterogeneity values are relatively small (see Figure A2). In general, this suggests possible limitations of postspike filter heterogeneity as an overall characterization of network heterogeneity.
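The contrast between statistical significance and quality of fit is easy to reproduce with a small sketch: fitting a quadratic to noisy synthetic "error versus heterogeneity" data yields a tiny $p$-value from the $F$-test against a constant model even when $R^2$ is small. The synthetic data below are illustrative only; the function names are our own.

```python
import numpy as np
from scipy import stats

def quadratic_fit_stats(het, err):
    """Fit err ~ c0 + c1*het + c2*het^2 by least squares; return R^2,
    the p-value of the F-test against a constant model, and the
    heterogeneity value at the vertex of the fitted parabola."""
    X = np.column_stack([np.ones_like(het), het, het**2])
    coef, *_ = np.linalg.lstsq(X, err, rcond=None)
    resid = err - X @ coef
    ss_res = np.sum(resid**2)
    ss_tot = np.sum((err - err.mean())**2)
    r2 = 1.0 - ss_res / ss_tot
    n, df1 = len(err), 2                  # 2 extra parameters vs. constant model
    df2 = n - 3
    F = ((ss_tot - ss_res) / df1) / (ss_res / df2)
    pval = stats.f.sf(F, df1, df2)
    opt = -coef[1] / (2.0 * coef[2])      # vertex of the parabola
    return r2, pval, opt

rng = np.random.default_rng(1)
het = rng.uniform(0.0, 1.0, 5000)
# synthetic decoding errors: shallow quadratic minimum at 0.5 plus large noise
err = 1.0 + 0.5 * (het - 0.5)**2 + 0.3 * rng.standard_normal(5000)
r2, pval, opt = quadratic_fit_stats(het, err)
```

With 5000 samples, the quadratic term is detected with overwhelming significance, yet $R^2$ stays well under 0.1, mirroring the pattern in the tables.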

### 2.2 Heterogeneity versus Correlation

Besides evaluating heterogeneity levels that minimize error, we determine decoding effectiveness by measuring the Pearson's correlation between the actual and decoded stimuli, where larger correlation values coincide with better decoding. While error measures how closely the decoded stimulus matches the actual stimulus in precision, correlation measures how well the decoded stimulus covaries with the actual stimulus, with less emphasis on the precise values of the stimuli. It is conceivable that downstream neurons do not need exact copies of the incoming stimuli to propagate the necessary information, so the error metrics may be too demanding. Ostojic and Brunel (2011) used Pearson's correlation to measure the agreement between the predicted instantaneous firing rate of a GLM-like model and simulations of a spiking model; René et al. (2020) also used Pearson's correlation to measure the accuracy of a mesoscopic, mean-field-like reduction of a large spiking model. One issue with correlation is that two time series can have high correlation while their basic statistics (i.e., mean, variance) are very different (see the Methods section in Ostojic & Brunel, 2011). Thus, the correlation coefficient is not intended to replace error as a measure in our study but rather to provide more thorough information. Although both correlation and error as measures of decoding accuracy have potential flaws, they complement each other well (Ostojic & Brunel, 2011; René et al., 2020).
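The caveat that two series can correlate strongly while differing in their basic statistics is simple to demonstrate: Pearson's correlation is invariant to scaling and shifting, so a decoded stimulus with the wrong mean and amplitude can still correlate nearly perfectly with the actual stimulus while producing a large $l_1$ error. A toy illustration with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.standard_normal(500)                               # "actual" stimulus
s_hat = 3.0 * s + 10.0 + 0.1 * rng.standard_normal(500)    # "decoded": wrong scale and offset

corr = np.corrcoef(s, s_hat)[0, 1]     # near 1 despite the mismatch
l1_err = np.mean(np.abs(s - s_hat))    # large, reflecting the mismatch
```

This is why we report correlation alongside, not instead of, the two error metrics.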

Table 1: Quadratic regression of the $l_1$-norm of error against heterogeneity ($k$: stimulus filter; $h$: postspike filter).

| Heterogeneity | Statistic | $N=2$ | $N=3$ | $N=4$ | $N=5$ | $N=6$ | $N=7$ |
|---|---|---|---|---|---|---|---|
| $k$ | $R^2$ | 0.0293 | 0.0140 | 0.0347 | 0.0534 | 0.0725 | 0.0974 |
| $k$ | $p$-value ($F$-statistic vs. constant) | $3.9\times 10^{-42}$ | $1.8\times 10^{-20}$ | $7.1\times 10^{-50}$ | $4.9\times 10^{-77}$ | $2.2\times 10^{-105}$ | $3.9\times 10^{-143}$ |
| $h$ | $R^2$ | 0.0103 | 0.01794 | 0.0606 | 0.0382 | 0.0530 | 0.0888 |
| $h$ | $p$-value ($F$-statistic vs. constant) | $3.6\times 10^{-15}$ | $6.7\times 10^{-26}$ | $1.4\times 10^{-87}$ | $6.7\times 10^{-55}$ | $2.3\times 10^{-76}$ | $5.2\times 10^{-130}$ |

Note: Numbers in bold correspond to quadratic fits where the optimal level of heterogeneity is not in the interior.

Table 2: Quadratic regression of the Hessian-weighted error against heterogeneity ($k$: stimulus filter; $h$: postspike filter).

| Heterogeneity | Statistic | $N=2$ | $N=3$ | $N=4$ | $N=5$ | $N=6$ | $N=7$ |
|---|---|---|---|---|---|---|---|
| $k$ | $R^2$ | 0.0068 | 0.0019 | 0.0060 | 0.0191 | 0.0315 | 0.0401 |
| $k$ | $p$-value ($F$-statistic vs. constant) | $3.6\times 10^{-10}$ | 0.0022 | $4.5\times 10^{-9}$ | $1.5\times 10^{-27}$ | $3.0\times 10^{-45}$ | $1.6\times 10^{-57}$ |
| $h$ | $R^2$ | 0.0014 | 0.0020 | 0.0101 | 0.0171 | 0.0181 | 0.0356 |
| $h$ | $p$-value ($F$-statistic vs. constant) | 0.0032 | $3.8\times 10^{-4}$ | $7.2\times 10^{-15}$ | $1.2\times 10^{-24}$ | $4.3\times 10^{-26}$ | $4.5\times 10^{-51}$ |

Note: Numbers in bold correspond to quadratic fits where the optimal level of heterogeneity is not in the domain of heterogeneity.

Computing Pearson's correlation of the two time series directly is inappropriate because the assumption of independent observations is violated when the time series have nontrivial autocorrelations. The Pearson's correlation (also the cross-correlation at lag 0) is affected by the cross-correlations and autocorrelations of the two time series at other lags. As a result, these Pearson's correlation coefficient estimates are mathematically biased relative to the true correlation strength, so we use a method known as *prewhitening* (Brockwell, Davis, & Calder, 2002) to be statistically rigorous. Essentially, if we fit an effective model to a time series, the residuals should be a white noise process. That is, for the two time series $Y_t$ and $X_t$ and their corresponding best-fit models $\hat{Y}_t$ and $\hat{X}_t$, the following should be two white noise processes: $e_t = Y_t - \hat{Y}_t$ and $w_t = X_t - \hat{X}_t$. The Pearson's correlation between $e_t$ and $w_t$ (also their cross-correlation function at lag 0) is then an unbiased estimate of the correlation strength between the two original time series. (For more details and the method of model fitting used on all samples, see section 4.)
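A minimal sketch of the prewhitening procedure, assuming a simple least-squares AR($p$) fit for each series (the order $p=2$ and the synthetic series below are illustrative only; section 4 describes the actual model fitting used on the data):

```python
import numpy as np

def ar_residuals(y, p=2):
    """Least-squares AR(p) fit; returns the residuals, an approximate
    whitening of the series."""
    X = np.column_stack([np.ones(len(y) - p)] +
                        [y[p - i - 1:len(y) - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return y[p:] - X @ coef

def prewhitened_corr(x, y, p=2):
    """Pearson correlation between the AR residuals of two series."""
    return np.corrcoef(ar_residuals(x, p), ar_residuals(y, p))[0, 1]

# two strongly autocorrelated series driven by a shared innovation
rng = np.random.default_rng(2)
n = 2000
shared = rng.standard_normal(n)
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + shared[t] + 0.5 * rng.standard_normal()
    y[t] = 0.9 * y[t - 1] + shared[t] + 0.5 * rng.standard_normal()

raw_corr = np.corrcoef(x, y)[0, 1]   # biased by the autocorrelations
adj_corr = prewhitened_corr(x, y)    # correlation of the whitened residuals
```

Here the adjusted correlation estimates the correlation of the underlying innovations rather than of the smoothed, autocorrelated series.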

After prewhitening, we fit a quadratic model between heterogeneity and adjusted (prewhitened) correlation, which is found to be statistically significant with all network sizes and for all heterogeneity types, albeit using a higher significance level ($\alpha = 0.05$) (see Table 3 for stimulus filter and postspike filter heterogeneity; see Table A2 for all). As with the two error types, we see fluctuating $R^2$ values as population size increases, with the best fits found with a population of size 6 or 7. However, $R^2$ values for the different heterogeneity types and for different network sizes are all relatively small, sometimes even less than $R^2=0.01$ and never more than $R^2=0.04$; the plots in Figure 4 confirm these results. For networks with two cells ($N=2$), adjusted correlation appears to be fairly constant across the domain of stimulus heterogeneity, confirmed by the relatively flat quadratic curve and the weak quality of fit ($R^2=0.0018$), despite the quadratic model being “statistically significant.” The curve is more pronounced, though, for $N=4$ and $N=6$, in line with larger (yet still small) $R^2$ values ($R^2=0.0076$ and $R^2=0.0212$, respectively).

Note that as the network size increases,^{4} we see fewer homogeneous networks, with minimal data for stimulus filter heterogeneity less than 0.15 at $N=4$ and $N=6$. We also see a diminishing number of extremely heterogeneous networks, as the distribution of heterogeneity tightens toward intermediate values (albeit still positively skewed). Also, for the other heterogeneity measures, the standard deviation generally decreases as population size increases (see Figures A6 and A7). The statistical significance of these models suggests that there is still some underlying dependence on heterogeneity, but the effect is not strong, and network size also has an interacting effect. We next explore the notable direct effects of population size on the two error types and on adjusted correlation.

Table 3: Quadratic regression of the prewhitened (adjusted) correlation against heterogeneity ($k$: stimulus filter; $h$: postspike filter).

| Heterogeneity | Statistic | $N=2$ | $N=3$ | $N=4$ | $N=5$ | $N=6$ | $N=7$ |
|---|---|---|---|---|---|---|---|
| $k$ | $R^2$ | 0.0018 | 0.0123 | 0.0076 | 0.0149 | 0.0212 | 0.0357 |
| $k$ | $p$-value ($F$-statistic vs. constant) | 0.003 | $6.0\times 10^{-18}$ | $2.2\times 10^{-11}$ | $1.4\times 10^{-21}$ | $1.6\times 10^{-30}$ | $2.9\times 10^{-51}$ |
| $h$ | $R^2$ | 0.0199 | 0.0159 | 0.0166 | 0.0156 | 0.0232 | 0.0356 |
| $h$ | $p$-value ($F$-statistic vs. constant) | $1.1\times 10^{-28}$ | $4.6\times 10^{-23}$ | $5.3\times 10^{-24}$ | $1.4\times 10^{-22}$ | $2.4\times 10^{-33}$ | $4.4\times 10^{-51}$ |

Notes: Correlations were prewhitened for the two time series to ensure unbiased estimates. Bold text corresponds to quadratic fits where the optimal level of heterogeneity is not in the interior.

### 2.3 Effect of Population Size on Decoding

We have already observed that the impact of neural heterogeneity on decoding error changes as the network size increases. As more neurons are added to the network, the result is generally better decoding of the stimulus. From Figure 5A, we can see that the $l_1$-norm of error, averaged over all samples (black), decreases with increased population size, though the decrease is reduced and saturates around $N=6$. Also, the minimum of the quadratic fit decreases for all three forms of heterogeneity (colored curves) except from $N=5$ to 6 for postspike filter heterogeneity (green). With a one-way analysis of variance (ANOVA), the mean difference in error between population sizes is statistically significant ($F=450.18$, $p \approx 0$). Notably, this is not simply an artifact of massive sample sizes, as the effect size is moderate ($\eta^2=0.059$). We observe a significant difference in pairwise comparisons for population sizes 2 through 6, measured via the Tukey honest significant difference (HSD) test (Montgomery, 2017), but there is not a significant difference between population sizes 6 and 7 as the decrease in error saturates. The results are similar with the Hessian-weighted error (see Figure 5B, black curve); an increased number of cells leads to more effective decoding, but the marginal improvement diminishes with increased population size. As before with the $l_1$-norm of error, the impact of increased population size is still statistically significant ($F=108.58$, $p = 2.887\times 10^{-114}$), but the practical impact is diminished, with a small effect size of $\eta^2=0.014$. In addition, based on the Tukey HSD test, population sizes 4 and 5 do not appear to be significantly different from one another, and population sizes 5, 6, and 7 do not appear to be significantly different, as the marginal improvement in error declines faster for Hessian-weighted error than for the $l_1$-norm of error.
Once again, the comparison between the types of error reveals how uncertainty affects these measures of average error. Average decreases in error still occur when incorporating uncertainty about the MAP estimates, but some of network size's explanatory power on error is lost when adjusting for uncertainty.
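The ANOVA-plus-effect-size analysis can be sketched as follows: `scipy.stats.f_oneway` supplies the $F$-statistic and $p$-value, and $\eta^2$ is the between-group sum of squares divided by the total sum of squares. The per-network errors below are synthetic, constructed only to mimic error shrinking with population size; they are not the data values reported above.

```python
import numpy as np
from scipy import stats

def one_way_anova_eta2(groups):
    """One-way ANOVA F and p-value plus the eta-squared effect size
    (between-group sum of squares over total sum of squares)."""
    F, p = stats.f_oneway(*groups)
    allv = np.concatenate(groups)
    grand = allv.mean()
    ss_between = sum(len(g) * (g.mean() - grand)**2 for g in groups)
    ss_total = np.sum((allv - grand)**2)
    return F, p, ss_between / ss_total

# synthetic per-network errors that shrink with population size N = 2..7
rng = np.random.default_rng(3)
groups = [1.0 / np.sqrt(N) + 0.3 * rng.standard_normal(1000) for N in range(2, 8)]
F, p, eta2 = one_way_anova_eta2(groups)
```

The $\eta^2$ statistic is what distinguishes a practically meaningful group difference from one that is merely significant due to large samples.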

With prewhitened correlation values, we assess the impact of population size with another ANOVA and set of Tukey HSD tests. All networks are independently and randomly sampled from the time series, but a Pearson's correlation will have a skewed distribution whenever it is not centered around $\rho = 0$. Thus, a Fisher $z$ transformation is necessary to create approximately normal distributions for each group; the transformed value for Pearson's correlation coefficient $r$ is simply $z = \operatorname{arctanh}(r)$ (Vrbik, 2005). In Figure 5C, we see a similar improvement in decoding as the network size increases, which we find to be statistically significant with ANOVA on the transformed values ($F=1411.64$, $p \approx 0$) with a notably large effect size ($\eta^2=0.181$), much larger than the effect sizes previously observed with the $l_1$-norm and Hessian-weighted errors. The prewhitened correlations are significantly different among all population sizes (albeit using very large sample sizes), and the correlation strength improves at each step up in network size, with a diminishing marginal improvement similar to the two error types. The exceptions to these results are bias heterogeneity for $N=2$ and postspike filter heterogeneity for $N=5$ (see Figure 5C, red and green, respectively), where the optimal level of heterogeneity occurs outside the domain and not at intermediate values (see Figures A6 and A7 and blue entries in Table A2). Overall, the prewhitened correlation values for the networks prove to be useful metrics when assessing the impact of population size on decoding efficiency.
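The Fisher $z$ transformation is a one-liner; it stretches correlations near $\pm 1$ so that their sampling distribution becomes approximately normal. The correlation values below are illustrative.

```python
import numpy as np

r = np.array([0.30, 0.55, 0.80, 0.95])   # sample correlations (illustrative)
z = np.arctanh(r)                        # Fisher z transform: z = arctanh(r)
r_back = np.tanh(z)                      # inverse transform recovers r
```

Near zero the transform is nearly the identity, while correlations close to 1 are pushed far out, undoing the compression against the $r=1$ boundary before the ANOVA is run.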

### 2.4 Remarks about Time Intervals for Encoding and Decoding

We have chosen fixed encoding (1397.5 ms, used to fit the GLM) and decoding (417.5 ms) time periods for all figures. To the best of our knowledge, there does not appear to be a systematic or algorithmic way of determining ideal lengths for these periods, or for the time bins (we have chosen 2.5 ms). We performed many simulations to assess how the statistics of the errors ($l_1$-error and Hessian-weighted error) change with the encoding and decoding time intervals, while neglecting the effects of heterogeneity, which should be qualitatively similar (see Figure 6). For these simulations, we generate 6500 random networks as before, but fix $N=4$ because it is an intermediate network size with the most possibilities for choosing different cells.

Figures 6A and 6B show error distributions via box plots for five different encoding time periods (each color corresponding to a fixed time), as a function of the decoding time. The median error and spread of errors (whiskers) generally decrease as times increase, perhaps an indication of convergence. Figures 6C and 6D present a similar plot but group the decoding time as a percentage of the five different encoding times; again we see for these values that increasing encoding times leads to lower median error and less spread of errors (whiskers). The values chosen throughout this letter yield box plots very similar to those for larger encoding and decoding times.

## 3 Discussion

Addressing how neurons in various sensory systems achieve coding efficiency has been an active area of investigation, both theoretically and experimentally. Using *in vivo* voltage recordings from the weakly electric fish electrosensory system and a frequently used Bayesian statistical model (GLM), we analyzed how intrinsic heterogeneity of uncoupled networks of distinct neurons affects decoding by downstream neurons. We leveraged the long electrophysiological recordings to obtain large data sets for each recorded cell, which enabled us to address this question with thorough and proper statistical analyses, considering three important metrics of decoding accuracy. We found that intermediate levels of heterogeneity robustly lead to the best decoding performance, consistent with the results of Tripathy et al. (2013). However, an important caveat is that the relative decoding performances were quite noisy and that a quality-of-fit metric such as $R^2$ should always accompany conclusions about statistical significance.

This letter demonstrates a careful statistical analysis of data from the electrosensory system, provided to us by Marsat and Maler (2010), that may be instructive for other neural systems, but we did not address some details of heterogeneity and coding specific to this system. In the electrosensory system, prior work has shown that heterogeneities in the neural attributes of P-units, which are upstream of the ELL, strongly affect some response properties: with fast-frequency stimuli and a slower envelope frequency, only the slower envelope responses were affected by heterogeneities, while the fast-frequency P-unit responses were invariant to neural heterogeneity (Metzen & Chacron, 2015). Also, the heterogeneity of ON and OFF cell types in the ELL generally led to improved population coding of envelope signals (Hofmann & Chacron, 2020); here, the data provided consisted only of ON cells. The stimuli we considered consist of random amplitude modulations with gaussian statistics rather than more natural stimuli with faster and slower envelope frequencies. These more natural stimuli are known to elicit heterogeneous responses of ELL pyramidal cells, where some inputs are “cancelled,” perhaps by parallel fiber inputs to the ELL (Bol, Marsat, Harvey-Girard, Longtin, & Maler, 2011; Bol, Marsat, Mejías, Maler, & Longtin, 2013; Mejias, Marsat, Bol, Maler, & Longtin, 2013). These cancellation mechanisms, which are likely a large factor in the resulting ELL cell heterogeneity, were not addressed here but would be interesting to explore further with a more anatomically realistic Bayesian model. The GLM assumes all statistics are stationary, when in fact the electrosensory system is known to have longer timescales (plasticity) that can violate these assumptions.
However, nonstationarities from envelope frequencies or up-down dynamics that are slow (Melanson, Mejias, Jun, Maler, & Longtin, 2017) may not appreciably alter the GLM decoding performance because the statistics are close to stationary.

Although we have recordings from only 7 distinct neurons, and it is likely that more distinct sensory neurons are required for real coding of signals, note that the exact number of neurons in the lateral segment of the ELL relevant to coding stimuli (and indeed in many other brain regions) is unknown. Our work is a theoretical exploration of the signatures of decoding in a subset of the population over relevant lengths of time, where individual neurons could have learned how to encode in prior time periods. The use of these 7 cells and encoding models fit to different time periods results in appreciable ranges of heterogeneity. Our use of 7 ELL pyramidal cells is of a similar order to other published works on this system: 12 cells in Mejias et al. (2013) across different animals, 15 cells in Ly and Marsat (2018), 5 cells in Bastian (1986), 12 cells (and 9 pairs) in Simmonds and Chacron (2015), and 14 cells in Huang, Zhang, and Chacron (2016), though much smaller than some others: 50 cells in Hofmann and Chacron (2020) and 77 cells in Chacron, Maler, and Bastian (2005). With the limited number of distinct neurons and all the aforementioned limitations, we still observe effective neural coding at intermediate levels of heterogeneity.

Like Tripathy et al. (2013), we developed encoding and decoding models for uncoupled networks and used the same measures for heterogeneity. We observed a similar quadratic relationship between heterogeneity and the $l_1$-norm of error and confirmed these results as statistically significant, but we did not see the same diminishing impact of heterogeneity as the population size increased. In fact, the best quadratic fits occurred at the relatively larger network sizes of 6 and 7. Tripathy et al. (2013) did not address the practical significance of the quadratic relationship as thoroughly as we did here; they relied on statistical significance in a case of large sample sizes where small $p$-values are more easily attained. Though statistical significance is an important baseline, indicative of a relationship, quality-of-fit metrics such as $R^2$ are also important, especially with the noisy data naturally produced in neural networks. Tripathy et al. (2013) used trial-averaged data, which helps to mitigate the noisiness of neurons in order to discern underlying dynamics in the encoding model. Our focus was on the arguably more biologically realistic case of single trials in a long time series. Of course, there are merits to trial averaging, especially for encoding models, and one might observe larger $R^2$ values in this framework. Tripathy et al. (2013) also employed greedy search methods to produce a more uniform spread of heterogeneity, while we chose a simpler uniform random selection that did not result in the same uniformity of heterogeneity but did reveal natural clustering at intermediate levels of heterogeneity, particularly as network size increased.

We also incorporated additional metrics to analyze decoding efficiency. Since Bayesian methods also quantify uncertainty, we implemented a Hessian-weighted absolute error. By comparing this error metric to the $l_1$-norm of error, we could observe the weakening of the quadratic regression models and the ANOVA models when weighting error by uncertainty. Such a phenomenon implies that heterogeneity and population size explain some of the same neural dynamics as the relative level of uncertainty in the decoded stimuli, though properly exploring this is beyond the scope of our study. Finally, since decoding of noisy spiking outputs will naturally produce some error that can only be partially accounted for by heterogeneity or network size, we considered Pearson's correlation between the actual and decoded stimulus time series for thoroughness. Perhaps downstream neurons do not need exact copies of the incoming stimuli to decode, in which case the prior two error measures could be misleading. The correlation metric is problematic in isolation because two series can be highly correlated yet have drastically different values. Following Ostojic and Brunel (2011) and René et al. (2020), we used several different metrics to provide a fuller picture of our results. Despite these different measures of decoding, we generally reach the same conclusions.

This letter has focused on the intrinsic heterogeneity between neurons in terms of how they code stimuli and how their own spiking dynamics vary. The data set we use contains separate recordings of an awake animal, not simultaneous recordings, which necessitated our assumption of uncoupled networks. Coupled networks are more challenging and would certainly require considering other forms of heterogeneity, for example, network heterogeneity that can interact nonlinearly with intrinsic heterogeneity (Ly, 2015; Ly & Marsat, 2018; Wendling & Ly, 2019), and also a more complicated GLM with coupling (Pillow et al., 2008). Addressing this research question with simultaneous recordings in coupled networks is an important future direction but beyond the scope of this study.

The GLM employed here allows efficient and robust computation of both the encoding model and the decoding of input signals, in contrast to biophysical models or integrate-and-fire-type spiking models, where a maximum a posteriori estimate of the input stimulus is computationally expensive to calculate, especially over time intervals like those we considered, with hundreds of dimensions for *each* network (Ladenbauer, McKenzie, English, Hagens, & Ostojic, 2019). There has been a plethora of theoretical advances to capture how noisy stimuli nonlinearly alter the statistics of spiking in biophysical models (incomplete list: Dayan & Abbott, 2001; René et al., 2020), but such models are less suitable for the framework here because the GLM enables robust fits to data and stimulus estimates for all model networks we considered. However, Ladenbauer et al. (2019) recently used integrate-and-fire-type models, a Bayesian framework, and maximum likelihood to successfully estimate network inputs, as well as various intrinsic and network connectivity attributes, with validations from *in vivo* and *in vitro* data. These authors developed methods to infer precise statistical quantities of neural networks that did not rely on particular realizations of the data or models, whereas we focused on estimating the particular realizations of noisy stimuli from particular spike trains in the data, repeating this procedure for each network.

Although we considered a single time series for each neuron without trial-averaged data, we generally observe an increase in performance as population size increases. This observation is consistent with theories of population coding, where larger network sizes are known to theoretically increase the Fisher information and accuracy of an optimal linear estimator (Shamir & Sompolinsky, 2006), among other metrics. Others have studied how Fisher (Yarrow et al., 2012; Sompolinsky, Yoon, Kang, & Shamir, 2001) and mutual information (Yarrow et al., 2012) change (increase) with population size, results that are at least consistent with what we observe here even though we only consider networks up to size 7. These and other studies (Sompolinsky et al., 2001; Shamir & Sompolinsky, 2006; Ecker, Berens, Tolias, & Bethge, 2011; Yarrow et al., 2012) theoretically accounted for the effects of noise correlations along with heterogeneity and population size, an issue we do not address here because of separate cellular *in vivo* recordings. A recent study by Gjorgjieva, Meister, and Sompolinsky (2019) found with uncoupled GLM-like models with a threshold nonlinearity that in addition to homogeneous networks performing poorly, intermediate proportions of ON/OFF cells performed optimally via maximizing mutual information and minimizing error with the optimal linear estimator. This study is consistent with the results here.

## 4 Methods

Matlab code implementing these methods can be found here: https://github.com/wendlingk/NeuroGLM. The code is based on previously written code by Pillow et al. (2008), Park, Meister, Huk, and Pillow (2014), and Tripathy et al. (2013).

The experimental data were provided by the Marsat Lab from a prior paper: Marsat and Maler (2010). All seven cells are from the same animal and the same area, termed the lateral segment, and all are ON cells subject to spatially global random amplitude modulation (RAM) stimuli.

### 4.1 Generalized Linear Model

We adopt the GLM construction detailed in Pillow et al. (2011) without coupling, where neural spike trains are modeled by an inhomogeneous Poisson process, with 1 representing a spike and 0 representing no spike. A key function that represents the expectation of spiking is $\lambda(t)$, the rate intensity at time $t$, also known as the *instantaneous firing rate*. The function $\lambda(t)$ contains components that relate nicely to the underlying physiology: stimulus filtering, dependence on spike history, and a constant (bias) shift $b$. The stimulus vector $x_t$ is transformed by a temporal filter called the *stimulus filter* $k$, which is a series of coefficient estimates for different lags in time. Dependence on spike history is modeled by another temporal filter, the postspike filter $h$, which captures the refractory period and other dynamics (e.g., adaptation, depression). Since $h$ also affects the spike train $r_t$, one can think of the instantaneous firing rate as $\lambda(t \mid x_t, r_t)$, where $x_t$ and $r_t$ are vectors over the time window immediately prior to time $t$.

This construction is known more generally in statistics as a *Poisson regression* model (Myers et al., 2012) and as a *linear-nonlinear Poisson (LNP) cascade* model in the context of neural encoding models. In the neural coding literature, the exponential inverse of the log link is referred to as the *static* or *point nonlinearity* (Paninski, 2004; Simoncelli, Paninski, & Pillow, 2004; Pillow, 2007).
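The rate construction above can be sketched in a few lines. This is an illustrative Python sketch (not the paper's Matlab code); the filter values, window lengths, and function name are invented for the example, and an exponential point nonlinearity is assumed:

```python
import math

def conditional_intensity(x_window, r_window, k, h, b):
    """lambda(t | x_t, r_t): stimulus window filtered by k, spike-history
    window filtered by h, plus a constant bias b, passed through the
    exponential (log-link) point nonlinearity."""
    drive = sum(ki * xi for ki, xi in zip(k, x_window))
    drive += sum(hi * ri for hi, ri in zip(h, r_window))
    return math.exp(drive + b)

# Toy 4-lag filters and windows (most recent bin last); values are illustrative.
k = [0.5, 0.3, 0.1, 0.0]       # stimulus filter
h = [-0.1, -0.5, -1.0, -2.0]   # postspike filter (suppression after a spike)
x = [0.2, 0.1, 0.0, 0.4]       # recent stimulus samples
r = [0, 0, 1, 0]               # recent spike train (1 = spike)
rate = conditional_intensity(x, r, k, h, b=-1.0)
```

Note how the recent spike in `r` meets the negative postspike filter and suppresses the rate, a minimal caricature of refractoriness.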

#### 4.1.1 Maximum Likelihood Estimators for the GLM

We use a quasi-Newton method^{3} as our optimization routine, via the function fminunc in Matlab. The method searches along the direction of steepest descent of the negative log-likelihood $f(\beta) = -\ln\left[p(r_t \mid x_t, \beta)\right]$, determined by the negation of its gradient $\nabla f(\beta)$, and it also constructs the Hessian matrix, the matrix of second derivatives of $f$. Since the neurons are uncoupled, the GLM parameter estimates can be calculated independently for each cell; the gradient and the Hessian follow by differentiating with respect to $\beta_i$ for cell $i$.
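As a sketch (not verbatim from the paper), for the exponential log-link nonlinearity, writing $X_t$ for the concatenated design vector (stimulus window, spike-history window, and the constant) and $\Delta$ for the bin width, the standard Poisson GLM gradient and Hessian of the negative log-likelihood $f$ are

$$
\nabla f(\beta_i) \;=\; -\sum_t \bigl(r_t - \lambda(t)\,\Delta\bigr)\, X_t,
\qquad
H(\beta_i) \;=\; \sum_t \lambda(t)\,\Delta\; X_t X_t^{\top}.
$$

Because $H$ is positive semidefinite, $f$ is convex, so the quasi-Newton iterations converge to the global maximum likelihood estimate.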

#### 4.1.2 Optimal Filter Lengths

Both the stimulus filter and the postspike filter are implemented as vectors; here we describe the method used to set the optimal vector lengths. If the vectors are too short, we risk excluding relevant information and biasing the predictions; if they are too long, we include variables without strong predictive power and risk overfitting the training data. We minimize the negative log likelihood (*NLL*) over a large random sample. We split the time series into 10 training sets, and in each set, we systematically consider combinations of stimulus filter and postspike filter lengths. Note that 2 × *NLL* follows a chi-squared distribution, and independent chi-squared variables can be summed. Since the 10 training sets are independent of one another, we add the *NLL* values across the training sets to get a summed negative log-likelihood value for every filter length combination. From the heat map in Figure A1C, we can see that the optimal combinations occur at stimulus filter lengths of $80ms$ and $90ms$, with a postspike filter length of $180ms$.^{4} We choose the shorter stimulus filter (fewer coefficients) for computational efficiency.
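The selection procedure can be sketched as a grid search over filter-length combinations. In this illustrative Python sketch, `toy_nll` is a hypothetical stand-in for refitting the GLM on one training set at a given pair of lengths (the function names and the toy NLL surface are invented for the example):

```python
def best_filter_lengths(nll_fn, train_sets, stim_lens, post_lens):
    """Grid-search filter-length combinations, summing the per-set NLL;
    the sums are comparable because 2*NLL terms from independent
    training sets add (chi-squared additivity)."""
    best_total, best_combo = float("inf"), None
    for s in stim_lens:
        for p in post_lens:
            total = sum(nll_fn(ts, s, p) for ts in train_sets)
            if total < best_total:
                best_total, best_combo = total, (s, p)
    return best_combo

# Toy NLL surface with its minimum at (80, 180); a real run would refit
# the GLM on each of the 10 training sets instead.
def toy_nll(train_set, stim_len, post_len):
    return (stim_len - 80) ** 2 + (post_len - 180) ** 2 + 0.01 * train_set

combo = best_filter_lengths(toy_nll, range(10), [60, 70, 80, 90], [150, 180, 210])
# combo == (80, 180)
```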

### 4.2 Decoding of Stimulus

For each cell in the uncoupled network, the GLM is constructed from a random segment in time of length $200ms$ (80 time bins of $2.5ms$ each), where each cell could be fit to a different time period. The decoding of the stimulus, or MAP estimate, is performed on a random $417.5ms$ time interval after the latest time segment used to fit the model. Note that a given network of cells decodes stimuli in the same time segment.

### 4.3 Prewhitening of Time Series for Unbiased Correlations

Prewhitening the actual and decoded time series is accomplished by transforming each into a white noise process. Given a time series $Y_t$ and a model $\hat{Y}_t$, we can transform the time series into white noise $e_t = Y_t - \hat{Y}_t$, assuming that we have a well-fit model. The best-fit model likely involves a similar number of time series parameters for both the actual and decoded stimulus, given their similarities.

Before we can fit an ARIMA time series model, we must confirm the stationarity of the time series; we select the augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (adftest and kpsstest in Matlab) for this purpose. The null hypothesis of the ADF test is that a unit root is present in the time series, while the alternative hypothesis indicates a type of stationarity. The KPSS test is reversed: its alternative hypothesis indicates a unit root, while its null hypothesis is that the time series is trend-stationary. Essentially, for each time series, we want to reject the null hypothesis of the ADF test, using significance level $\alpha = 0.001$, and fail to reject the null hypothesis of the KPSS test, using $\alpha = 0.1$. These contrasting significance levels make it difficult to confirm the alternative hypothesis of stationarity with the ADF test and easy to confirm the alternative hypothesis of a unit root with the KPSS test, so we conservatively ensure stationarity. When we run the ADF test on all of the actual and decoded stimuli, we reject the null hypothesis of a unit root in almost all cases. When conducting the KPSS test on the same sample of time series, though, we reject the null hypothesis of trend-stationarity a notable number of times in favor of the alternative, a unit root. Since we have some evidence for nonstationarity, we take the first difference of every time series ($Y_t - Y_{t-1}$ for $t = 2, \ldots, T$) to resolve this issue while maintaining consistency across the entire sample of networks. Once we have differenced all time series, we find the desired results across the whole sample: all reject the null hypothesis of the ADF test, and all fail to reject the null hypothesis of the KPSS test.
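The differencing step itself is one line; a minimal Python sketch (illustrative, not the paper's Matlab):

```python
def first_difference(y):
    """Return Y_t - Y_{t-1} for t = 2..T, removing a unit root so the
    series can be modeled as ARIMA(p, d=1, q)."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]

# A random walk (unit root) differences to its stationary increments.
walk = [0, 1, 3, 6, 10, 15]
increments = first_difference(walk)  # [1, 2, 3, 4, 5]
```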

With consistent stationarity confirmed on once-differenced ($d = 1$) time series, we fit an $ARIMA(p, d=1, q)$ model to $Y_t - Y_{t-1}$ for each of them. From a representative example in Figure A1B, we can see significant lags in the autocorrelation function (ACF) before a rapid drop-off to insignificance beyond $5ms$, while the partial autocorrelation function (PACF) shows a dampened oscillatory pattern of significant lags before insignificance past $8ms$.^{5} Together, these dynamics suggest employing an even number of autoregressive lags $p$ to capture the oscillatory pattern, with a complementing number of moving average lags $q$. Here, the best model constructions tend to have half as many moving average lags as autoregressive lags, likely because of this oscillatory pattern. Sampling and analyzing a subset of time series, we find the best fit is $ARIMA(6,1,3)$ or $ARIMA(8,1,4)$, followed by $ARIMA(4,1,2)$. Our goal is to efficiently fit thousands of time series with reliable models, so we use a try/catch routine within our iterations, replacing one ARIMA structure with another if matrix invertibility or stability issues occur. Balancing quality of fit with parsimony, we select $ARIMA(6,1,3)$ as the initial attempt for each time series, using the estimate function in Matlab. If a model fails with this attempt, we next attempt $ARIMA(8,1,4)$ and then, finally, $ARIMA(4,1,2)$.
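The fallback logic can be sketched as follows; this is an illustrative Python sketch, where `fit_arima` is a hypothetical stand-in for Matlab's estimate that raises when a candidate order fails:

```python
CANDIDATE_ORDERS = [(6, 1, 3), (8, 1, 4), (4, 1, 2)]

def fit_with_fallback(series, fit_arima, orders=CANDIDATE_ORDERS):
    """Attempt each ARIMA(p, d, q) order in turn, moving to the next
    candidate when the fit raises (e.g., invertibility or stability
    problems), mirroring the try/catch routine described in the text."""
    for order in orders:
        try:
            return order, fit_arima(series, order)
        except (ValueError, ArithmeticError):
            continue
    raise RuntimeError("no candidate ARIMA order could be fit")

# Toy fitter that fails on the first candidate, to exercise the fallback.
def toy_fit(series, order):
    if order == (6, 1, 3):
        raise ValueError("unstable")
    return {"order": order, "n": len(series)}

order, model = fit_with_fallback([0.1, 0.2, 0.3], toy_fit)
# order == (8, 1, 4)
```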

Every time series is effectively fit to one of these three ARIMA models. Using the infer function, we then obtain the residual vector, the model predictions subtracted from the actual values. This prewhitening method results in residual vectors that are white noise processes. Pearson's correlation (the cross-correlation function at lag 0) between these pairs of residual vectors is an unbiased estimate of the linear correlation strength (Brockwell et al., 2002).
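Given two whitened residual vectors, the lag-0 cross-correlation is the plain Pearson coefficient; a minimal, self-contained sketch:

```python
from statistics import mean

def pearson(u, v):
    """Pearson's correlation (cross-correlation at lag 0) between two
    equal-length residual vectors."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

# Perfectly anti-correlated toy residuals.
r = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # -1.0
```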

### 4.4 Remarks for Heterogeneity Analysis

For a given network size ($N = 2, 3, 4, 5, 6, 7$), 6500 random networks were generated, but our completely random selection produced extreme values of heterogeneity that we treated as outliers. Thus, we estimated the CDF of the heterogeneity values using a kernel smoothing function (ksdensity in Matlab with bandwidth set to 0.1) and removed the top 1.5 percentile of points. At least $98.5\%$ of the points remained (see Table 4 for the total number of points removed out of 6500).
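The trimming step amounts to an upper-quantile cutoff. This simplified Python sketch uses the plain empirical quantile in place of the paper's smoothed (ksdensity) CDF estimate; the function name is illustrative:

```python
def trim_top(values, pct=1.5):
    """Drop roughly the top pct percent of heterogeneity values, keeping
    at least (100 - pct) percent of the points."""
    srt = sorted(values)
    cutoff = srt[int(len(srt) * (1 - pct / 100.0))]
    return [v for v in values if v < cutoff]

kept = trim_top(list(range(1000)))  # keeps the 985 smallest values
```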

Table 4: Number of points removed out of 6500.

| Heterogeneity \ Network Size | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| $k$ | 87 | 73 | 94 | 97 | 95 | 98 |
| $h$ | 96 | 96 | 94 | 91 | 95 | 93 |
| $b$ | 91 | 83 | 97 | 94 | 96 | 92 |


## Appendix

| Heterogeneity | Network Size | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| $k$ | $R^2$ | 0.0293 | 0.0140 | 0.0347 | 0.0534 | 0.0725 | 0.0974 |
|  | $p$-value for F-statistic versus constant | $3.9\times10^{-42}$ | $1.8\times10^{-20}$ | $7.1\times10^{-50}$ | $4.9\times10^{-77}$ | $2.2\times10^{-105}$ | $3.9\times10^{-143}$ |
| $h$ | $R^2$ | 0.0103 | 0.01794 | 0.0606 | 0.0382 | 0.0530 | 0.0888 |
|  | $p$-value for F-statistic versus constant | $3.6\times10^{-15}$ | $6.7\times10^{-26}$ | $1.4\times10^{-87}$ | $6.7\times10^{-55}$ | $2.3\times10^{-76}$ | $5.2\times10^{-130}$ |
| $b$ | $R^2$ | 0.0141 | 0.0164 | 0.0357 | 0.0577 | 0.0620 | 0.0918 |
|  | $p$-value for F-statistic versus constant | $1.6\times10^{-20}$ | $9.1\times10^{-24}$ | $3.3\times10^{-51}$ | $2.1\times10^{-83}$ | $1.2\times10^{-89}$ | $1.3\times10^{-134}$ |

| Heterogeneity | Network Size | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| $k$ | $R^2$ | 0.0068 | 0.0019 | 0.0060 | 0.0191 | 0.0315 | 0.0401 |
|  | $p$-value for F-statistic versus constant | $3.6\times10^{-10}$ | 0.0022 | $4.5\times10^{-9}$ | $1.5\times10^{-27}$ | $3.0\times10^{-45}$ | $1.6\times10^{-57}$ |
| $h$ | $R^2$ | 0.0014 | 0.0020 | 0.0101 | 0.0171 | 0.0181 | 0.0356 |
|  | $p$-value for F-statistic versus constant | 0.0032 | $3.8\times10^{-4}$ | $7.2\times10^{-15}$ | $1.2\times10^{-24}$ | $4.3\times10^{-26}$ | $4.5\times10^{-51}$ |
| $b$ | $R^2$ | 0.0029 | 0.0047 | 0.0062 | 0.0222 | 0.0290 | 0.0329 |
|  | $p$-value for F-statistic versus constant | $8.3\times10^{-5}$ | $2.5\times10^{-7}$ | $2.4\times10^{-9}$ | $6.5\times10^{-32}$ | $1.4\times10^{-41}$ | $3.2\times10^{-47}$ |


Note: The numbers in bold correspond to quadratic fits where the optimal level of heterogeneity is not in the interior.

| Heterogeneity | Network Size | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| $k$ | $R^2$ | 0.0018 | 0.0123 | 0.0076 | 0.0149 | 0.0212 | 0.0357 |
|  | $p$-value for F-statistic versus constant | 0.003 | $6.0\times10^{-18}$ | $2.2\times10^{-11}$ | $1.4\times10^{-21}$ | $1.6\times10^{-30}$ | $2.9\times10^{-51}$ |
| $h$ | $R^2$ | 0.0199 | 0.0159 | 0.0166 | 0.0156 | 0.0232 | 0.0356 |
|  | $p$-value for F-statistic versus constant | $1.1\times10^{-28}$ | $4.6\times10^{-23}$ | $5.3\times10^{-24}$ | $1.4\times10^{-22}$ | $2.4\times10^{-33}$ | $4.4\times10^{-51}$ |
| $b$ | $R^2$ | 0.0008 | 0.0027 | 0.0032 | 0.0048 | 0.0160 | 0.0251 |
|  | $p$-value for F-statistic versus constant | 0.0237 | $1.8\times10^{-4}$ | $3.9\times10^{-5}$ | $1.9\times10^{-7}$ | $4.3\times10^{-23}$ | $4.0\times10^{-36}$ |


Note: The numbers in bold correspond to quadratic fits where the optimal level of heterogeneity is not in the interior.

## Notes

1. The $l_1$-norm is the mean of the unweighted absolute value of the difference between the decoded stimuli and the actual stimuli.

2. GLM filter lengths and associated parameters were determined systematically: parameters were varied, and we chose those with the largest maximum likelihood. See Figure A1C and accompanying text.

3. The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is the quasi-Newton algorithm used in Matlab.

4. The $180ms$ is the time of the last basis vector used for $h$, resulting in a total postspike filter length of $240ms$.

5. The ACF and PACF are obtained via the autocorr and parcorr functions in Matlab's Econometrics Toolbox.

## Acknowledgments

C.L. is supported by a grant from the Simons Foundation (355173). We thank Gary Marsat (WVU) for generously providing the experimental data.