## Abstract

Understanding how rich dynamics emerge in neural populations requires models exhibiting a wide range of behaviors while remaining interpretable in terms of connectivity and single-neuron dynamics. However, it has been challenging to fit such mechanistic spiking networks at the single-neuron scale to empirical population data. To close this gap, we propose to fit such data at a mesoscale, using a mechanistic but low-dimensional and, hence, statistically tractable model. The mesoscopic representation is obtained by approximating a population of neurons as multiple homogeneous pools of neurons and modeling the dynamics of the aggregate population activity within each pool. We derive the likelihood of both single-neuron and connectivity parameters given this activity, which can then be used to optimize parameters by gradient ascent on the log likelihood or perform Bayesian inference using Markov chain Monte Carlo (MCMC) sampling. We illustrate this approach using a model of generalized integrate-and-fire neurons for which mesoscopic dynamics have been previously derived and show that both single-neuron and connectivity parameters can be recovered from simulated data. In particular, our inference method extracts posterior correlations between model parameters, which define parameter subsets able to reproduce the data. We compute the Bayesian posterior for combinations of parameters using MCMC sampling and investigate how the approximations inherent in a mesoscopic population model affect the accuracy of the inferred single-neuron parameters.

## 1 Introduction

Neuron populations produce a wide array of complex collective dynamics. Explaining how these emerge requires a mathematical model that not only embodies the network interactions but is also parameterized in terms of interpretable neuron properties. Just as crucial, in order to draw data-supported conclusions, we also need to be able to infer those parameters from empirical observations. These requirements tend to involve a trade-off between model expressiveness and tractability. Low-dimensional state-space models (Macke et al., 2011; Pandarinath et al., 2018; Pillow et al., 2008; Zhao & Park, 2016) are simple enough to allow for inference but achieve that simplicity by focusing on phenomenology: any mechanistic link to the individual neurons is ignored. Conversely, microscopic mechanistic models with thousands of simulated neurons do provide that link between parameters and output (Hawrylycz et al., 2016; Potjans & Diesmann, 2014); however, this complexity makes the analysis difficult and is limited to networks with highly simplified architectures (Doiron, Litwin-Kumar, Rosenbaum, Ocker, & Josic, 2016; Martí, Brunel, & Ostojic, 2018). Since methods to fit these models to experimental data are limited to single neurons (Mensi et al., 2012), it is also unclear how to set their parameters such that they capture the dynamics of large, heterogeneous neural populations.

To reduce the problem to a manageable size and scale, one can consider models that provide a mesoscopic dynamical description founded on microscopic single-neuron dynamics (Dumont, Payeur, & Longtin, 2017; Nykamp & Tranchina, 2000; Wallace, Benayoun, van Drongelen, & Cowan, 2011). Specifically, we will focus on the model described in Schwalger, Deger, and Gerstner (2017), where neurons are grouped into putative excitatory (E) and inhibitory (I) populations in a cortical column. The key approximation is to replace each population with another of equal size but composed of identical neurons, resulting in an effective mesoscopic model of homogeneous populations. In contrast with previous work on population rate dynamics (Gerstner, 2000; Nykamp & Tranchina, 2000; Wilson & Cowan, 1972), Schwalger et al. (2017) correct their mean-field approximations for the finite size of populations. They are thus able to provide stochastic equations for the firing rate of each population with explicit dependence on the population sizes, neuron parameters, and connectivities between populations (see Figure 1A, top). We use these equations to fit the model to traces of population activity.

Directly inferring mesoscopic model parameters has a number of advantages compared to extrapolating from those obtained by fitting a microscopic model. For one, it allows the use of data that do not have single-neuron resolution. In addition, since neuron parameters in a mesoscopic model represent a whole population, there may not be a clear way to relate micro- and mesoscopic parameters if the former are heterogeneous. By inferring population parameters from population recordings, we target the values that best compensate for the mismatch between the data and the idealized mesoscopic model (see Figure 1B).

The method we present assumes that the model to be inferred can be expressed as a set of stochastic equations and that we have access to time series for both the observed (and possibly aggregated) neural activities and external input. It is thus not limited to mesoscale models and could also be applied to, say, Hodgkin Huxley-type neurons in isolation or networks. Nevertheless, in this article, the underlying microscopic model does make the inferred parameters more readily interpretable and provides a good idea of what values an inference algorithm should find for the parameters.

Methods have recently been developed for inferring models where stochastic equations are treated as a black box simulator (Greenberg, Nonnenmacher, & Macke, 2019; Lueckmann et al., 2017; Papamakarios & Murray, 2016; Papamakarios, Sterratt, & Murray, 2018). In such a case, one does not have access to the internal variables of the model and thus cannot compute the likelihood of its parameters; instead, these methods make use of repeated simulations to find suitable parameters. While this makes them applicable to a wide range of models, the repeated simulations can make them computationally expensive and best suited to optimizing a set of statistical features rather than full-time traces. Moreover, for the models of interest here, the likelihood can be derived from the stochastic evolution equations.

We show in this work that the likelihood can indeed be used to infer model parameters using non-convex optimization. The resulting optimization problem shares many similarities with training recurrent neural networks (RNNs) popular in machine learning (Goodfellow, Bengio, & Courville, 2016; Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989) and allows us to leverage optimization tools from that field. However, RNNs in machine learning are typically based on generic, nonmechanistic models, which implies that interpretation of the resulting network can be challenging (but see work on RNN visualization by Barak et al. (Barak, 2017; Haviv, Rivkind, & Barak, 2019; Sussillo & Barak, 2012). Thus, our approach can be regarded as complementary to RNN approaches, as we directly fit a mechanistically interpretable model.

This article is organized as follows. In sections 2.1 and 2.2, we establish that maximum likelihood inference for our chosen mesoscopic model is sound, and in section 2.3 we provide empirical estimates for the amount of data this procedure requires. Using the example of heterogeneous populations, section 2.4 then shows how inference can find effective parameters, which compensate for the mismatch between data and model. In section 2.5 we identify codependence between multiple model parameters by recovering the full Bayesian posterior. Finally, section 2.6 demonstrates that the approach scales well by considering a more challenging four-population model with 36 free parameters. Section 3 discusses our results, with an emphasis on circumscribing the class of models amenable to our approach. Method details are provided in section 4, along with technical insights gained as we adapted likelihood inference to a detailed dynamical model. Additional details, including a full specification of parameter values used throughout the article, are given in appendixes A to I.

## 2 Results

### 2.1 Model Summary

We studied the pair of microscopic and mesoscopic models presented in Schwalger et al. (2017), which is designed to represent excitatory (E) and inhibitory (I) populations of a putative cortical column of four neural layers (Potjans & Diesmann, 2014). For this study, we considered only layers 2/3 and 4 and made minor parameter adjustments to maintain realistic firing rates (see appendix A). We also reduced all population sizes by a factor of 50 to ease the simulation of the microscopic model. This increases the variance of population activities and so does not artificially simplify the task of inferring mesoscopic parameters.

*escape noise*. If all parameters are shared across all neurons within each population, we call this a

*homogeneous*microscopic model. Conversely, we call a model

*heterogeneous*if at least one parameter is unique to each neuron. We denote $I\alpha $ the set of indices for neurons belonging to a population $\alpha $.

*expected activity*$a\alpha $ of a population $\alpha $ is the normalized expected number of spikes,

*activity*$A\alpha $ of that population is a random variable corresponding to the number of spikes actually observed:

Variable . | Definition . |
---|---|

$N\alpha $ | Number of neurons in population $\alpha $ |

$M$ | Number of populations, $\alpha =1,\u2026,M$ |

$L$ | Number of time steps used to compute the likelihood |

$\Delta t$ | Time step |

$I\alpha $ | Set of indices of neurons belonging to population $\alpha $ |

$si(t)$ | 1 if neuron $i$ spiked within time window $[t,t+\Delta t)$, 0 otherwise |

$A\alpha (t)$ | Activity in population $\alpha $ averaged over time window $[t,t+\Delta t)$ |

$a\alpha (t)$ | Expectation of $A(t)$ conditioned on ${A(t')}t'<t$ |

Variable . | Definition . |
---|---|

$N\alpha $ | Number of neurons in population $\alpha $ |

$M$ | Number of populations, $\alpha =1,\u2026,M$ |

$L$ | Number of time steps used to compute the likelihood |

$\Delta t$ | Time step |

$I\alpha $ | Set of indices of neurons belonging to population $\alpha $ |

$si(t)$ | 1 if neuron $i$ spiked within time window $[t,t+\Delta t)$, 0 otherwise |

$A\alpha (t)$ | Activity in population $\alpha $ averaged over time window $[t,t+\Delta t)$ |

$a\alpha (t)$ | Expectation of $A(t)$ conditioned on ${A(t')}t'<t$ |

*trial-averaged activity*,

In addition to homogeneity of populations and independence of spikes within a time step, the mesoscopic model depends on one more key approximation: that neuron populations can be treated as *quasi-renewal* (Naud & Gerstner, 2012; Schwalger et al., 2017). If neurons are viewed as having both refractory and adaptation dynamics, this is roughly equivalent to requiring that the latter be either slow or weak with respect to the former. (A typical example where this approximation does not hold is bursting neurons; see Naud & Gerstner, 2012.) Under these approximations, the unbounded history $Htk$ can be replaced by a finite state vector $S(k)$, which is updated along with the expected activity $a(t)$ (see section 4.2). Since the update equations depend only on $S(k-1)$, they are then Markovian in $S$. This in turn allows the probability of observations $PA(L),A(L-1),\u2026,A(1)$ to be factorized as $PA(L)|S(L)\xb7PA(L-1)|S(L-1)\cdots PA(1)|S(1)$, which is key to making the inference problem tractable.

### 2.2 Recovering Population Model Parameters

We first consider a two-population model composed of E and I neurons. We use the homogeneous microscopic model to generate activity traces (see Figure 2A), with a frozen noise input, which is shared within populations; this input is sine-modulated to provide longer-term fluctuations (see equation 4.28). A maximum a posteriori (MAP) estimate $\eta ^MAP$ of 14 model parameters is then obtained by performing stochastic gradient descent on the posterior (see section 4). Because the likelihood is nonconvex, we perform multiple fits, initializing each one by sampling from the prior (see Figure 2B). We then keep the one that achieves the highest likelihood, which in practice is often sufficient to find a near-global optimum (Meyer, Williamson, Linden, & Sahani, 2017).

. | Number of Components . | . | |
---|---|---|---|

Parameter . | Two-Population Model . | Four-Population Model . | Description . |

$p$ | 4 | 16 | Connection probability |

$w$ | 4 | 16 | Connection weight |

$\Delta $ | 4 | 16 | Transmission delay |

$N$ | 2 | 4 | Number of Neurons in population |

$R$ | 2 | 4 | Membrane resistance |

$urest$ | 2 | 4 | Membrane resting potential |

$\tau m$ | 2 | 4 | Membrane time constant |

$tref$ | 2 | 4 | Absolute refractory period |

$uth$ | 2 | 4 | Nonadapting threshold |

$ur$ | 2 | 4 | Reset potential |

$c$ | 2 | 4 | Escape rate at threshold |

$\Delta u$ | 2 | 4 | Noise level |

$\tau s$ | 2 | 4 | Synaptic time constant |

$J\theta $ | 1 | 2 | Adaptation strength |

$\tau \theta $ | 1 | 2 | Adaptation time constant |

. | Number of Components . | . | |
---|---|---|---|

Parameter . | Two-Population Model . | Four-Population Model . | Description . |

$p$ | 4 | 16 | Connection probability |

$w$ | 4 | 16 | Connection weight |

$\Delta $ | 4 | 16 | Transmission delay |

$N$ | 2 | 4 | Number of Neurons in population |

$R$ | 2 | 4 | Membrane resistance |

$urest$ | 2 | 4 | Membrane resting potential |

$\tau m$ | 2 | 4 | Membrane time constant |

$tref$ | 2 | 4 | Absolute refractory period |

$uth$ | 2 | 4 | Nonadapting threshold |

$ur$ | 2 | 4 | Reset potential |

$c$ | 2 | 4 | Escape rate at threshold |

$\Delta u$ | 2 | 4 | Noise level |

$\tau s$ | 2 | 4 | Synaptic time constant |

$J\theta $ | 1 | 2 | Adaptation strength |

$\tau \theta $ | 1 | 2 | Adaptation time constant |

Notes: For the mesoscopic populations, the ensemble of neuron parameters is replaced by a single effective value for that population. For each parameter, we indicate the number of components in the two- and four-population models; adaptation parameters have fewer components because the model assumes no adaptation for inhibitory neurons. Boldface is used to indicate inferred parameters; the remainder are fixed to the known ground truth values listed in Table 7. This results in, respectively, 14 and 36 free parameters for the two- and four-population models. A brief discussion of how we chose which parameters to infer is given at the end of section 4.5.

We tested the inferred model on frozen low-pass-filtered white noise of the same form as in Augustin, Ladenbauer, Baumann, and Obermayer (2017); see Figure 2C, top), ensuring that a range of relevant time scales are tested. Despite the frozen input, variability between realizations does remain: for the GIF model this is due to sampling the escape noise (equation 2.1), while for the mesoGIF model, it is due to sampling the binomial in equation 2.4. We thus compare models based on the statistics of their response rather than single realizations: each model is simulated 100 times with different internal noise sequences (for each neuron in the case of the GIF model, and for each population in the case of the mesoGIF model) to produce an ensemble of realizations, from which we estimate the time-dependent mean and standard deviation of $A(t)$. Mean and standard deviation are then averaged over disjoint 10 ms windows to reduce variability due to the finite number of realizations. The results are reported as respectively, lines and shading in Figure 2C, and show agreement between true and inferred models; we also find good agreement in the power spectrum of the response to constant input (see Figure 3). Parameterizations for the training and test inputs are given in section 4.8 and the full set of fits is shown in Figure 12.

### 2.3 Quantifying Data Requirements

While simulated data can be relatively cheap and easy to obtain, this is rarely the case of experimental data. An important question therefore is the amount required to infer the parameters of a model. To this end, we quantify in Figure 4 the accuracy of the inferred dynamics as a function of the amount of data.

In order to be certain our ground-truth parameters were exact, for this section we used the mesoGIF for both data generation and inference. This allows us to quantify the error on the inferred parameters, rather than just on the inferred dynamics. In a more realistic setting, data and model are not perfectly matched, and this will likely affect data requirements. Testing and training were done with different external inputs to avoid overfitting; as in section 2.2, we used a sinusoidal frozen white noise for training and a low-pass-filtered frozen white noise for testing. During training, E and I neurons had respective average firing rates of 5.9 and 8.4 Hz, which translates to approximately 3500 spikes per second for the whole population.

We measured the accuracy of inferred dynamics by simulating the model with both the ground truth and inferred parameters, generating 20 different realizations for each model. These were used to calculate both the per trial and trial-averaged Pearson correlation ($\rho $, $\rho \xaf$) and root-mean-square error ($RMSE$, $RMSE\xaf$) between models. An additional 20 simulations of the ground-truth model were used to estimate the best achievable performance for each measure. For per trial measures, the reported standard deviation provides an estimate of the variability between realizations; for trial-averaged measures, the standard deviation is obtained by bootstrapping, and is purely an uncertainty on the statistic (it vanishes in the limit of large number of realizations). The calculations for these measures are fully described in section 4.7. In subsequent section, we report only the values of $\rho \xaf,RMSE$ to avoid redundant information.

Consistent with the observations of Augustin et al. (2017), we found that $\rho $ (in contrast to $\rho \xaf$) does not allow differentiating between models close to ground truth. The $RMSE$ and $RMSE\xaf$, on the other hand, showed similar sensitivity but may be unreliable far from ground truth (as evidenced by the data point at $L=1.25$ s in Figure 4C). Since the per trial $RMSE$ additionally quantifies the variability between realizations (through its standard deviation), we preferred it over its trial-averaged analog.

As we would expect, the inferred model better reproduces the dynamics of the true model when the amount of data is increased (see Figure 4); when fitting all 14 parameters of the mesoGIF model, the inferred model no longer improves when more than 5 s to 7 s of data are provided (see Figures 4B and 4C) – corresponding to a total of about 17,500 to 24,500 spikes. In appendix D, we repeat the test described here with smaller parameter sets (achieved by clamping certain parameters to their known ground-truth values). We find that this has only a modest effect on the achieved performance, but does significantly improve the consistency of fits (compare Figures 9B and 9C). Inferring larger parameter sets is thus expected to require more fits (and consequent computation time) before a few of them find the MAP. Certain parameters are also more difficult to infer: for the case shown in Figure 4, relative errors on the inferred parameters range from 5% to 22% (see appendix D, Table 12). Parameters describing the inhibitory population ($\tau m,I$, $wIE$, $wII$) show the highest relative error, as well as the escape rates ($cE$, $cI$) and the adaptation time constant ($\tau \theta ,E$).

Fitting Parameter . | Value . | Comment . |
---|---|---|

Learning rate | 0.01 | Adam parameter |

$\beta 1$ | 0.1 | Adam parameter |

$\beta 2$ | 0.001 | Adam parameter |

$gclip$ | 100 | Clipping threshold |

$Lburnin$ | 10 s | Data burn-in |

$Bburnin$ | 0.3 s | Minibatch burn-in |

$\gamma L$ | 1 | $Lburnin$ noise factor |

$\gamma B$ | 0.1 | $Bburnin$ noise factor |

Fitting Parameter . | Value . | Comment . |
---|---|---|

Learning rate | 0.01 | Adam parameter |

$\beta 1$ | 0.1 | Adam parameter |

$\beta 2$ | 0.001 | Adam parameter |

$gclip$ | 100 | Clipping threshold |

$Lburnin$ | 10 s | Data burn-in |

$Bburnin$ | 0.3 s | Minibatch burn-in |

$\gamma L$ | 1 | $Lburnin$ noise factor |

$\gamma B$ | 0.1 | $Bburnin$ noise factor |

Note: Learning rate, $\beta 1$ and $\beta 2$ are as defined in Kingma and Ba (2014).

Algorithm . | HamiltonianMC (PyMC3; Salvatier et al., 2016) . |
---|---|

Step scale | 0.0025 |

Path length | 0.1 |

Tuning steps | 20 |

Initialization | jitter+adapt_diag |

Start | $\eta MAP$ estimate |

Number of samples | 2000 |

Total run time | 201 h |

Algorithm . | HamiltonianMC (PyMC3; Salvatier et al., 2016) . |
---|---|

Step scale | 0.0025 |

Path length | 0.1 |

Tuning steps | 20 |

Initialization | jitter+adapt_diag |

Start | $\eta MAP$ estimate |

Number of samples | 2000 |

Total run time | 201 h |

. | Two-Population Model . | Four-Population Model . | . | ||||
---|---|---|---|---|---|---|---|

. | E . | I . | L2/3e . | L2/3i . | L4e . | L4i . | Unit . |

$B$ | 0.25 | 0.1 | 0.0 | 0.0 | 0.25 | 0.1 | mA |

$\omega $ | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | – |

$q$ | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | mA |

. | Two-Population Model . | Four-Population Model . | . | ||||
---|---|---|---|---|---|---|---|

. | E . | I . | L2/3e . | L2/3i . | L4e . | L4i . | Unit . |

$B$ | 0.25 | 0.1 | 0.0 | 0.0 | 0.25 | 0.1 | mA |

$\omega $ | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | – |

$q$ | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | mA |

. | Two-Population Model . | Four-Population Model . | Unit . | ||||
---|---|---|---|---|---|---|---|

. | E . | I . | L2/3e . | L2/3i . | L4e . | L4i . | . |

$\mu OU$ | 0.1 | 0.05 | 1 | 1 | 0 | 1 | mA |

$\tau OU$ | 1 | 1 | 2 | 2 | – | 2 | s |

$q$ | 0.125 | 0.125 | 0.5 | 0.5 | 0 | 0.5 | mA |

$Itest(0)$ | 0.1 | 0.05 | 0.5 | 0.5 | 0 | 0.5 | mA |

. | Two-Population Model . | Four-Population Model . | Unit . | ||||
---|---|---|---|---|---|---|---|

. | E . | I . | L2/3e . | L2/3i . | L4e . | L4i . | . |

$\mu OU$ | 0.1 | 0.05 | 1 | 1 | 0 | 1 | mA |

$\tau OU$ | 1 | 1 | 2 | 2 | – | 2 | s |

$q$ | 0.125 | 0.125 | 0.5 | 0.5 | 0 | 0.5 | mA |

$Itest(0)$ | 0.1 | 0.05 | 0.5 | 0.5 | 0 | 0.5 | mA |

. | Value . | . | . | |
---|---|---|---|---|

Parameter . | Two-Population Model . | Four-Population Model . | Unit . | Prior Distribution . |

Population labels | E, I | L2/3e, L2/3i, L4e, L4i | ||

$N$ | $(438,109)$ | $(413,116,438,109)$ | ||

$R$ | $(19,11.964)$ | $(0,0,19,11.964)$ | $\Omega $ | |

$urest$ | $(20,19.5)$ | $(18,18,25,20)$ | mV | |

$p$ | $0.04970.13500.07940.1597$ | $0.10090.16890.04370.08180.13460.13710.03160.05150.00770.00590.04970.13500.06910.00290.07940.1597$ | ||

$w$ | $2.482-4.9641.245-4.964$ | $1.245-4.9641.245-4.9641.245-4.9641.245-4.9641.245-4.9642.482-4.9641.245-4.9641.245-4.964$ | mV | $w$$\u223cN0,42$ |

$\tau m$ | (0.01, 0.01) | (0.01, 0.01, 0.01, 0.01) | s | $log10\tau m\u223cN(-2,22)$ |

$tref$ | (0.002, 0.002) | (0.002, 0.002, 0.002, 0.002) | s | |

$uth$ | (15, 15) | (15, 15, 15, 15) | mV | $uth$$\u223cN(15,102)$ |

$ur$ | (0, 0) | (0, 0, 0, 0) | mV | $ur$$\u223cN(0,102)$ |

$c$ | (10, 10) | (10, 10, 10, 10) | Hz | $c$$\u223c\Gamma (2,5)$ |

$\Delta u$ | (5, 5) | (5, 5, 5, 5) | mV | $\Delta u$$\u223c\Gamma (3,1.5)$ |

$\Delta $ | 0.001 | 0.001 | s | |

$\tau s$ | (0.003, 0.006) | (0.003, 0.006, 0.003, 0.006) | s | $log10\tau s$$\u223cN(-3,32)$ |

$J\theta $ | (1.0, 0) | (1.0, 0, 1.0, 0) | mV | $J\theta $$\u223c\Gamma (2,0.5)$ |

$\tau \theta $ | (1.0, –) | (1.0, –, 1.0, –) | s | $log10\tau \theta $$\u223cN(-1,52)$ |

. | Value . | . | . | |
---|---|---|---|---|

Parameter . | Two-Population Model . | Four-Population Model . | Unit . | Prior Distribution . |

Population labels | E, I | L2/3e, L2/3i, L4e, L4i | ||

$N$ | $(438,109)$ | $(413,116,438,109)$ | ||

$R$ | $(19,11.964)$ | $(0,0,19,11.964)$ | $\Omega $ | |

$urest$ | $(20,19.5)$ | $(18,18,25,20)$ | mV | |

$p$ | $0.04970.13500.07940.1597$ | $0.10090.16890.04370.08180.13460.13710.03160.05150.00770.00590.04970.13500.06910.00290.07940.1597$ | ||

$w$ | $2.482-4.9641.245-4.964$ | $1.245-4.9641.245-4.9641.245-4.9641.245-4.9641.245-4.9642.482-4.9641.245-4.9641.245-4.964$ | mV | $w$$\u223cN0,42$ |

$\tau m$ | (0.01, 0.01) | (0.01, 0.01, 0.01, 0.01) | s | $log10\tau m\u223cN(-2,22)$ |

$tref$ | (0.002, 0.002) | (0.002, 0.002, 0.002, 0.002) | s | |

$uth$ | (15, 15) | (15, 15, 15, 15) | mV | $uth$$\u223cN(15,102)$ |

$ur$ | (0, 0) | (0, 0, 0, 0) | mV | $ur$$\u223cN(0,102)$ |

$c$ | (10, 10) | (10, 10, 10, 10) | Hz | $c$$\u223c\Gamma (2,5)$ |

$\Delta u$ | (5, 5) | (5, 5, 5, 5) | mV | $\Delta u$$\u223c\Gamma (3,1.5)$ |

$\Delta $ | 0.001 | 0.001 | s | |

$\tau s$ | (0.003, 0.006) | (0.003, 0.006, 0.003, 0.006) | s | $log10\tau s$$\u223cN(-3,32)$ |

$J\theta $ | (1.0, 0) | (1.0, 0, 1.0, 0) | mV | $J\theta $$\u223c\Gamma (2,0.5)$ |

$\tau \theta $ | (1.0, –) | (1.0, –, 1.0, –) | s | $log10\tau \theta $$\u223cN(-1,52)$ |

Notes: Symbols are the same as in Table 2. Most values are those given in Tables 1 and 2 of Schwalger et al. (2017). Priors are given as scalar distributions because they are the same for all components. The pdf of $\Gamma (\alpha ,\theta )$ is $x\alpha -1e-x/\theta \theta \alpha \Gamma (\alpha )$.

. | Distribution Parameter . | |
---|---|---|

Heterogeneous Model Parameter . | $\mu $ . | $\sigma $ . |

$log10\tau m,E$ | $-$1.6 | 0.5 |

$log10\tau m,I$ | $-$1.8 | 0.5 |

$log10\tau \theta ,E$ | $-$0.7 | 0.5 |

. | Distribution Parameter . | |
---|---|---|

Heterogeneous Model Parameter . | $\mu $ . | $\sigma $ . |

$log10\tau m,E$ | $-$1.6 | 0.5 |

$log10\tau m,I$ | $-$1.8 | 0.5 |

$log10\tau \theta ,E$ | $-$0.7 | 0.5 |

Notes: Each parameter was sampled from a log-normal distribution $log10N(\mu ,\sigma 2)$ with mean $\mu $ and variance $\sigma 2$. No adaptation was modeled in the inhibitory population, so $\tau \theta ,I$ was not sampled.

Parameter . | Inferred Value . | Average Heterogeneous Value . | Unit . |
---|---|---|---|

$w$ | $1.59-5.050.73-3.43$ | $2.482-4.9641.245-4.964$ | mV |

$\tau m$ | (0.011, 0.008) | (0.056, 0.046) | s |

$c$ | (5.05, 5.22) | (10, 10) | Hz |

$\Delta u$ | (5.09, 4.09) | (5, 5) | mV |

$\tau s$ | (0.0046, 0.0109) | (0.003, 0.006) | s |

$J\theta $ | (0.538, 0) | (1.0, 0) | mV |

$\tau \theta $ | (0.131, –) | (0.380, –) | s |

Parameter . | Inferred Value . | Average Heterogeneous Value . | Unit . |
---|---|---|---|

$w$ | $1.59-5.050.73-3.43$ | $2.482-4.9641.245-4.964$ | mV |

$\tau m$ | (0.011, 0.008) | (0.056, 0.046) | s |

$c$ | (5.05, 5.22) | (10, 10) | Hz |

$\Delta u$ | (5.09, 4.09) | (5, 5) | mV |

$\tau s$ | (0.0046, 0.0109) | (0.003, 0.006) | s |

$J\theta $ | (0.538, 0) | (1.0, 0) | mV |

$\tau \theta $ | (0.131, –) | (0.380, –) | s |

Notes: Values are given in vector format, as $(\eta E,\eta I)$. Corresponding average values for the heterogeneous microscopic model are given for comparison. (The heterogeneous model was homogeneous in all parameters except $\tau m$ and $\tau \theta $.)

. | MAP . | Theory . | ||||||
---|---|---|---|---|---|---|---|---|

. | $L2/3e$ . | $L2/3i$ . | $L4e$ . | $L4i$ . | $L2/3e$ . | $L2/3i$ . | $L4e$ . | $L4i$ . |

$wL2/3e\u2190\xb7$ | 0.734 | $-$5.629 | 1.546 | $-$5.292 | 1.245 | $-$4.964 | 1.245 | $-$4.964 |

$wL2/3i\u2190\xb7$ | 1.181 | $-$5.406 | 1.419 | $-$4.294 | 1.245 | $-$4.964 | 1.245 | $-$4.964 |

$wL4e\u2190\xb7$ | 1.528 | $-$0.637 | 2.058 | $-$4.213 | 1.245 | $-$4.964 | 2.482 | $-$4.964 |

$wL4i\u2190\xb7$ | 0.174 | 1.112 | 1.046 | $-$3.994 | 1.245 | $-$4.964 | 1.245 | $-$4.964 |

$\tau m$ | 0.016 | 0.015 | 0.008 | 0.009 | 0.010 | 0.010 | 0.010 | 0.010 |

$c$ | 16.717 | 18.170 | 9.020 | 9.680 | 10.000 | 10.000 | 10.000 | 10.000 |

$\Delta u$ | 7.435 | 6.453 | 4.750 | 4.420 | 5.000 | 5.000 | 5.000 | 5.000 |

$\tau s$ | 0.001 | 0.006 | 0.002 | 0.009 | 0.003 | 0.006 | 0.003 | 0.006 |

$J\theta $ | 0.232 | — | 0.967 | — | 1.000 | — | 1.000 | — |

$\tau \theta $ | 0.425 | — | 1.596 | — | 1.000 | — | 1.000 | — |

. | MAP . | Theory . | ||||||
---|---|---|---|---|---|---|---|---|

. | $L2/3e$ . | $L2/3i$ . | $L4e$ . | $L4i$ . | $L2/3e$ . | $L2/3i$ . | $L4e$ . | $L4i$ . |

$wL2/3e\u2190\xb7$ | 0.734 | $-$5.629 | 1.546 | $-$5.292 | 1.245 | $-$4.964 | 1.245 | $-$4.964 |

$wL2/3i\u2190\xb7$ | 1.181 | $-$5.406 | 1.419 | $-$4.294 | 1.245 | $-$4.964 | 1.245 | $-$4.964 |

$wL4e\u2190\xb7$ | 1.528 | $-$0.637 | 2.058 | $-$4.213 | 1.245 | $-$4.964 | 2.482 | $-$4.964 |

$wL4i\u2190\xb7$ | 0.174 | 1.112 | 1.046 | $-$3.994 | 1.245 | $-$4.964 | 1.245 | $-$4.964 |

$\tau m$ | 0.016 | 0.015 | 0.008 | 0.009 | 0.010 | 0.010 | 0.010 | 0.010 |

$c$ | 16.717 | 18.170 | 9.020 | 9.680 | 10.000 | 10.000 | 10.000 | 10.000 |

$\Delta u$ | 7.435 | 6.453 | 4.750 | 4.420 | 5.000 | 5.000 | 5.000 | 5.000 |

$\tau s$ | 0.001 | 0.006 | 0.002 | 0.009 | 0.003 | 0.006 | 0.003 | 0.006 |

$J\theta $ | 0.232 | — | 0.967 | — | 1.000 | — | 1.000 | — |

$\tau \theta $ | 0.425 | — | 1.596 | — | 1.000 | — | 1.000 | — |

Notes: The values for the homogeneous microscopic model used in Figures 7 and 8 are listed on the right. Theory predicts these to be the best parameterizations for the mesoscopic model and thus should be recovered by maximizing the posterior (MAP values). Since L2/3 receives no external input in the training data, the inferred parameters for those populations are understandably further from theory.

Subset Label . | Included Parameters . |
---|---|

$\eta 1$ | ${wEE,wEI,wIE,wII}$ |

$\eta 2$ | ${\tau \theta ,E,J\theta ,E}$ |

$\eta 3$ | $\eta 1\u222a\eta 3\u222a{cE,cI,\Delta u,E,\Delta u,I,\tau m,E,\tau m,I,\tau s,E,\tau s,I}$ |

Subset Label . | Included Parameters . |
---|---|

$\eta 1$ | ${wEE,wEI,wIE,wII}$ |

$\eta 2$ | ${\tau \theta ,E,J\theta ,E}$ |

$\eta 3$ | $\eta 1\u222a\eta 3\u222a{cE,cI,\Delta u,E,\Delta u,I,\tau m,E,\tau m,I,\tau s,E,\tau s,I}$ |

Note: There are only two adaptation parameters because inhibitory populations have no adaptation in this model.

. | . | L . | |||||
---|---|---|---|---|---|---|---|

Subset . | Parameter . | 1.25 . | 2.00 . | 3.00 . | 5.00 . | 7.00 . | 9.00 . |

$\eta 1$ | $wEE$ | 0.047 | 0.023 | 0.045 | 0.034 | 0.029 | 0.027 |

$wEI$ | 0.040 | 0.018 | 0.046 | 0.033 | 0.035 | 0.032 | |

$wIE$ | 0.013 | 0.038 | 0.000 | 0.001 | 0.005 | 0.024 | |

$wII$ | 0.018 | 0.005 | 0.022 | 0.018 | 0.012 | 0.005 | |

$\eta 2$ | $J\theta ,E$ | 0.002 | 0.000 | 0.011 | 0.004 | 0.010 | 0.009 |

$\tau \theta ,E$ | 0.283 | 0.370 | 0.009 | 0.108 | 0.030 | 0.045 | |

$\eta 3$ | $wEE$ | 0.345 | 0.348 | 0.151 | 0.001 | 0.084 | 0.067 |

$wEI$ | 0.238 | 0.043 | 0.079 | 0.132 | 0.067 | 0.072 | |

$wIE$ | 0.017 | 0.556 | 0.630 | 0.427 | 0.244 | 0.178 | |

$wII$ | 0.070 | 0.495 | 0.503 | 0.326 | 0.180 | 0.136 | |

$J\theta ,E$ | 0.016 | 0.094 | 0.369 | 0.385 | 0.092 | 0.016 | |

$\tau \theta ,E$ | 0.267 | 0.054 | 0.450 | 0.586 | 0.248 | 0.213 | |

$cE$ | 0.420 | 0.376 | 0.469 | 0.382 | 0.239 | 0.160 | |

$cI$ | 2.825 | 0.006 | 0.133 | 0.142 | 0.128 | 0.161 | |

$\Delta uE$ | 0.215 | 0.182 | 0.090 | 0.086 | 0.092 | 0.052 | |

$\Delta uI$ | 0.520 | 0.338 | 0.466 | 0.302 | 0.129 | 0.058 | |

$\tau m,E$ | 0.190 | 0.117 | 0.057 | 0.177 | 0.052 | 0.037 | |

$\tau m,I$ | 2.590 | 0.619 | 0.430 | 0.393 | 0.235 | 0.219 | |

$\tau s,E$ | 0.744 | 0.101 | 0.038 | 0.101 | 0.039 | 0.142 | |

$\tau s,I$ | 0.271 | 0.119 | 0.138 | 0.132 | 0.081 | 0.081 |

. | . | L . | |||||
---|---|---|---|---|---|---|---|

Subset . | Parameter . | 1.25 . | 2.00 . | 3.00 . | 5.00 . | 7.00 . | 9.00 . |

$\eta 1$ | $wEE$ | 0.047 | 0.023 | 0.045 | 0.034 | 0.029 | 0.027 |

$wEI$ | 0.040 | 0.018 | 0.046 | 0.033 | 0.035 | 0.032 | |

$wIE$ | 0.013 | 0.038 | 0.000 | 0.001 | 0.005 | 0.024 | |

$wII$ | 0.018 | 0.005 | 0.022 | 0.018 | 0.012 | 0.005 | |

$\eta 2$ | $J\theta ,E$ | 0.002 | 0.000 | 0.011 | 0.004 | 0.010 | 0.009 |

$\tau \theta ,E$ | 0.283 | 0.370 | 0.009 | 0.108 | 0.030 | 0.045 | |

$\eta 3$ | $wEE$ | 0.345 | 0.348 | 0.151 | 0.001 | 0.084 | 0.067 |

$wEI$ | 0.238 | 0.043 | 0.079 | 0.132 | 0.067 | 0.072 | |

$wIE$ | 0.017 | 0.556 | 0.630 | 0.427 | 0.244 | 0.178 | |

$wII$ | 0.070 | 0.495 | 0.503 | 0.326 | 0.180 | 0.136 | |

$J\theta ,E$ | 0.016 | 0.094 | 0.369 | 0.385 | 0.092 | 0.016 | |

$\tau \theta ,E$ | 0.267 | 0.054 | 0.450 | 0.586 | 0.248 | 0.213 | |

$cE$ | 0.420 | 0.376 | 0.469 | 0.382 | 0.239 | 0.160 | |

$cI$ | 2.825 | 0.006 | 0.133 | 0.142 | 0.128 | 0.161 | |

$\Delta uE$ | 0.215 | 0.182 | 0.090 | 0.086 | 0.092 | 0.052 | |

$\Delta uI$ | 0.520 | 0.338 | 0.466 | 0.302 | 0.129 | 0.058 | |

$\tau m,E$ | 0.190 | 0.117 | 0.057 | 0.177 | 0.052 | 0.037 | |

$\tau m,I$ | 2.590 | 0.619 | 0.430 | 0.393 | 0.235 | 0.219 | |

$\tau s,E$ | 0.744 | 0.101 | 0.038 | 0.101 | 0.039 | 0.142 | |

$\tau s,I$ | 0.271 | 0.119 | 0.138 | 0.132 | 0.081 | 0.081 |

### 2.4 Modeling High-Dimensional Heterogeneous Populations with an Effective Low-Dimensional Homogeneous Model

A frequently understated challenge of meso- and macroscale population models is that of choosing their parameters such that the dynamics of the modeled neuron populations are consistent with the high-dimensional dynamics of networks of individual neurons. A typical approach, when measurements of microscopic single-neuron parameters are available, is to assign each parameter its mean across the population (Gerstner, Paninski, Naud, & Kistler, 2014, sec. 12). However, as alluded to in section 1, mean parameters do not always make good predictors for nonlinear systems; this is evidenced by Figure 5, which expands on Figure 1B.

An alternative approach would be to fit the population model to observed population activities, such as to ensure maximum consistency with data—for example, by finding the maximum a posteriori (MAP) parameters. In this way, we obtain effective parameters that compensate for the mismatch between data and population model.

To show that this can work, we made the microscopic model heterogeneous in three parameters: $\tau m,E$, $\tau m,I$, and $\tau \theta ,E$. These parameters were set individually for each neuron by sampling from a log-normal distribution (see Figure 5A and Table 8). As in previous sections, output from the microscopic model under sine-modulated frozen white noise input was then used to train the mesoscopic one. For testing, we used a single step input (see Figure 5B); this allowed us to test the performance of the inferred model in both the transient and steady-state regimes. The per trial $RMSE$ and trial-averaged correlation $\rho \xaf$ were computed on ensembles of realizations, as described in section 4.7.

We considered three sets of parameters for the mesoscopic model. For the first, we set $\tau m,E$, $\tau m,I$, and $\tau \theta ,E$ to their sample averages. This produced rather poor results (see Figure 5C, left); in particular, the transient response to the step is much more dramatic and long-lived than that of the ground-truth model. As the neural model is highly nonlinear in its parameters, linearly averaging parameters is not guaranteed to produce optimal results.

The test results are improved when the heterogeneous parameters are inferred (see Figure 5C, middle). However, fitting only the heterogeneous parameters gives the mesoscopic model only three degrees of freedom to compensate for approximating a heterogeneous model by a homogeneous one, and it still produces traces with a variance that is too high. Indeed, giving the model full freedom over the parameters provides another step improvement (see Figure 5C, right), with output from the mesoscopic model differing from the target output only by a higher transient peak and slightly different mean activities (obtained parameter values are listed in Table 9). Thus while fitting more parameters may incur additional computational cost (see appendix D), it also provides more opportunities to accommodate model mismatch.

The results of this section show the necessity of inferring population parameters rather than simply averaging single-neuron values. It also demonstrates the ability of population models to reproduce realistic activities when we provide them with good effective parameters; in order to compensate for modeling assumptions, those parameters will in general differ from those of a more detailed microscopic model.

### 2.5 Full Posterior Estimation over Parameters

It can often be desirable to know which parameters, or combinations of parameters, are constrained by the data. Bayesian inference, that is, estimation of the posterior distribution over parameters given the data, can be used to not only identify the best-fitting parameters but also to characterize the uncertainty about these estimates. Notably, these uncertainties may be highly correlated across parameters. For instance, one expects an increase in E connectivity to cancel a decrease in (negative) I connectivity to the same population, and this is confirmed by the correlation in the marginals shown in Figure 6A. Interestingly, this correlation is in fact stronger for connectivities sharing the same *target* than those sharing the same *source*. More novel structure can be learned from Figure 6B, such as the strong correlation between the adaptation parameters or the complete absence of correlation between them and the synaptic parameters. In particular, the tight relationship between $J\theta ,E$, $\tau \theta ,E$, and $cE$ suggests that for determining model dynamics, the ratios $J\theta ,E/\tau \theta ,E$ and $J\theta ,E/cE$ may be more important than any of those three quantities individually.

Since there are 14 unknown parameters, the posterior is also 14-dimensional; we represent it by displaying the joint distributions between pairs, obtained by marginalizing out the other 12 parameters (see section 4.6). Training data here were generated in the same way as in section 2.2, from a homogeneous microscopic model with the parameters listed in Table 2. To provide a sense of scale, we have drawn ellipses in Figure 6 to indicate the volume corresponding to two standard deviations from the mean under a gaussian model. In a number of cases, it highlights how the true distribution is nongaussian—for example, the distributions of $cE$, $J\theta ,E$, and $\tau \theta ,E$ are noticeably skewed.

A naive way to compute these 2D marginals would be to numerically integrate the likelihood; however, given that that leaves 12 dimensions to integrate, such an approach would be computationally unfeasible. Instead we used Hamiltonian Monte Carlo (HMC) sampling (Betancourt & Girolami, 2013; Neal, 2012). Monte Carlo methods are guaranteed to asymptotically converge to the true posterior, a valuable feature when one wishes to deduce interactions between parameters from its structure. Nevertheless, due to the complexity of mesoGIF's likelihood, memory and computational cost still required special consideration (see section 4.6).

We note that the 2$\sigma $ ellipses in Figure 6, while informative, are imperfect indicators of the probability mass distribution. If the posterior is gaussian, then each projection to a 2D marginal places 86.5% of the probability mass within the ellipse; however, for nongaussian posteriors, this number can vary substantially. Moreover, the markers for ground-truth parameters shown in Figure 6 may differ from the effective parameters found by the model (see section 2.4).

### 2.6 Pushing the Limits of Generalization

The previous sections have shown that we can recover 14 parameters of the two population mesoGIF model. A natural question is whether this approach scales well to larger models. We investigated this by considering four neuron populations representing the L2/3 and L4 layers of the Potjans-Diesmann microcircuit (Potjans & Diesmann, 2014). The associated higher-dimensional set of mesoscopic equations follows the same form as in previous sections (Schwalger et al., 2017). There are 36 free parameters in this model, of which 16 are connectivities; they are listed in Table 2. Similar to previous sections, we trained mesoGIF on output from the microscopic model with sinusoidal drive (see Figure 7A).

The L4 populations tend to drive the activity in this model, and we found that we do not need to provide any input to the L2/3 neurons to get parameter estimates that accurately predict population activity (see Figure 8, left): the small fluctuations in L2/3 (see Figure 7B) suffice to provide constraints on those population parameters. Those constraints, of course, are somewhat looser, and in particular connection strengths onto L4 are not as well estimated when compared to ground truth (see Table 10).

Pushing the mesoscopic approximation beyond its validity limits using inputs with abrupt transitions understandably increases the discrepancy between ground truth and model (see Figure 8, right). Indeed, such a strong input may cause neurons to fire in bursts, thereby breaking the quasi-renewal approximation (see section 2.1). During an input spike, the true model shows small oscillations; the theoretical mesoGIF reproduces these oscillations but with an exaggerated amplitude and higher variance between realizations, and in contrast to section 2.4, the inferred model does no better. This larger discrepancy with the true model is reflected in the performance measures (see Tables 14 and 15), and is consistent with the observation that the mesoGIF has higher variance during bursts (Schwalger et al., 2017, p. 15). Slower timescale dynamics are still accurately captured by both the theoretical and inferred models.

. | . | L . | |||||
---|---|---|---|---|---|---|---|

Subset . | Parameter . | 1.25 . | 2.00 . | 3.00 . | 5.00 . | 7.00 . | 9.00 . |

$\eta 1$ | $wEE$ | 1.33 | 1.05 | 0.95 | 0.72 | 0.61 | 0.82 |

$wEI$ | 0.69 | 0.63 | 0.43 | 0.62 | 0.40 | 0.48 | |

$wIE$ | 1.38 | 1.48 | 1.53 | 1.75 | 1.90 | 1.36 | |

$wII$ | 0.40 | 0.46 | 0.56 | 0.39 | 0.61 | 0.65 | |

$\eta 2$ | $J\theta ,E$ | 2.56 | 1.49 | 1.50 | 1.13 | 1.13 | 1.42 |

$\tau \theta ,E$ | 7.70 | 10.33 | 8.75 | 11.11 | 5.64 | 6.82 | |

$\eta 3$ | $wEE$ | 31.63 | 31.73 | 29.26 | 37.84 | 38.29 | 31.49 |

$wEI$ | 183.77 | 127.48 | 30.84 | 88.90 | 44.35 | 85.80 | |

$wIE$ | 64.08 | 100.98 | 1833.28 | 1143.47 | 1489.30 | 2405.89 | |

$wII$ | 53.90 | 65.92 | 36.25 | 78.14 | 37.70 | 55.10 | |

$J\theta ,E$ | 94.03 | 82.51 | 30.99 | 59.25 | 40.58 | 53.40 | |

$\tau \theta ,E$ | 70.53 | 329.84 | 255.21 | 209.10 | 282.68 | 160.36 | |

$cE$ | 83.08 | 65.81 | 37.16 | 39.42 | 48.50 | 47.80 | |

$cI$ | 29.31 | 28.84 | 34.10 | 52.49 | 49.66 | 52.50 | |

$\Delta uE$ | 14.98 | 19.16 | 6.11 | 11.37 | 7.58 | 8.84 | |

$\Delta uI$ | 32.84 | 30.33 | 13.28 | 41.22 | 21.86 | 34.30 | |

$\tau m,E$ | 429.43 | 420.71 | 428.10 | 431.68 | 441.05 | 444.64 | |

$\tau m,I$ | 256.36 | 269.36 | 346.42 | 337.09 | 314.64 | 288.72 | |

$\tau s,E$ | 427.66 | 441.61 | 447.11 | 446.85 | 446.84 | 447.20 | |

$\tau s,I$ | 238.26 | 240.74 | 65.92 | 262.20 | 71.13 | 295.76 |

. | . | L . | |||||
---|---|---|---|---|---|---|---|

Subset . | Parameter . | 1.25 . | 2.00 . | 3.00 . | 5.00 . | 7.00 . | 9.00 . |

$\eta 1$ | $wEE$ | 1.33 | 1.05 | 0.95 | 0.72 | 0.61 | 0.82 |

$wEI$ | 0.69 | 0.63 | 0.43 | 0.62 | 0.40 | 0.48 | |

$wIE$ | 1.38 | 1.48 | 1.53 | 1.75 | 1.90 | 1.36 | |

$wII$ | 0.40 | 0.46 | 0.56 | 0.39 | 0.61 | 0.65 | |

$\eta 2$ | $J\theta ,E$ | 2.56 | 1.49 | 1.50 | 1.13 | 1.13 | 1.42 |

$\tau \theta ,E$ | 7.70 | 10.33 | 8.75 | 11.11 | 5.64 | 6.82 | |

$\eta 3$ | $wEE$ | 31.63 | 31.73 | 29.26 | 37.84 | 38.29 | 31.49 |

$wEI$ | 183.77 | 127.48 | 30.84 | 88.90 | 44.35 | 85.80 | |

$wIE$ | 64.08 | 100.98 | 1833.28 | 1143.47 | 1489.30 | 2405.89 | |

$wII$ | 53.90 | 65.92 | 36.25 | 78.14 | 37.70 | 55.10 | |

$J\theta ,E$ | 94.03 | 82.51 | 30.99 | 59.25 | 40.58 | 53.40 | |

$\tau \theta ,E$ | 70.53 | 329.84 | 255.21 | 209.10 | 282.68 | 160.36 | |

$cE$ | 83.08 | 65.81 | 37.16 | 39.42 | 48.50 | 47.80 | |

$cI$ | 29.31 | 28.84 | 34.10 | 52.49 | 49.66 | 52.50 | |

$\Delta uE$ | 14.98 | 19.16 | 6.11 | 11.37 | 7.58 | 8.84 | |

$\Delta uI$ | 32.84 | 30.33 | 13.28 | 41.22 | 21.86 | 34.30 | |

$\tau m,E$ | 429.43 | 420.71 | 428.10 | 431.68 | 441.05 | 444.64 | |

$\tau m,I$ | 256.36 | 269.36 | 346.42 | 337.09 | 314.64 | 288.72 | |

$\tau s,E$ | 427.66 | 441.61 | 447.11 | 446.85 | 446.84 | 447.20 | |

$\tau s,I$ | 238.26 | 240.74 | 65.92 | 262.20 | 71.13 | 295.76 |

. | . | $RMSE$ . | |||
---|---|---|---|---|---|

Input . | Model . | L2/3e . | L2/3i . | L4e . | L4i . |

Sine | True – micro | 1.39 $\xb1$ 0.03 | 3.46 $\xb1$ 0.10 | 3.47 $\xb1$ 0.13 | 4.59 $\xb1$ 0.13 |

Theory – meso | 1.39 $\xb1$ 0.04 | 3.37 $\xb1$ 0.10 | 3.76 $\xb1$ 0.15 | 4.51 $\xb1$ 0.13 | |

MAP – meso | 1.39 $\xb1$ 0.03 | 3.49 $\xb1$ 0.09 | 3.46 $\xb1$ 0.13 | 4.53 $\xb1$ 0.13 | |

OU | True – micro | 1.22 $\xb1$ 0.03 | 3.14 $\xb1$ 0.08 | 2.26 $\xb1$ 0.07 | 5.13 $\xb1$ 0.15 |

Theory – meso | 1.21 $\xb1$ 0.03 | 3.06 $\xb1$ 0.09 | 2.26 $\xb1$ 0.07 | 4.95 $\xb1$ 0.15 | |

MAP – meso | 1.22 $\xb1$ 0.03 | 3.11 $\xb1$ 0.08 | 2.25 $\xb1$ 0.06 | 5.30 $\xb1$ 0.14 | |

Impulse | True – micro | 1.54 $\xb1$ 0.05 | 3.64 $\xb1$ 0.11 | 5.32 $\xb1$ 0.46 | 5.11 $\xb1$ 0.23 |

Theory – meso | 1.59 $\xb1$ 0.06 | 3.63 $\xb1$ 0.11 | 7.99 $\xb1$ 0.66 | 5.88 $\xb1$ 0.36 | |

MAP – meso | 1.70 $\xb1$ 0.07 | 3.96 $\xb1$ 0.13 | 7.74 $\xb1$ 0.59 | 6.00 $\xb1$ 0.36 |

. | . | $RMSE$ . | |||
---|---|---|---|---|---|

Input . | Model . | L2/3e . | L2/3i . | L4e . | L4i . |

Sine | True – micro | 1.39 $\xb1$ 0.03 | 3.46 $\xb1$ 0.10 | 3.47 $\xb1$ 0.13 | 4.59 $\xb1$ 0.13 |

Theory – meso | 1.39 $\xb1$ 0.04 | 3.37 $\xb1$ 0.10 | 3.76 $\xb1$ 0.15 | 4.51 $\xb1$ 0.13 | |

MAP – meso | 1.39 $\xb1$ 0.03 | 3.49 $\xb1$ 0.09 | 3.46 $\xb1$ 0.13 | 4.53 $\xb1$ 0.13 | |

OU | True – micro | 1.22 $\xb1$ 0.03 | 3.14 $\xb1$ 0.08 | 2.26 $\xb1$ 0.07 | 5.13 $\xb1$ 0.15 |

Theory – meso | 1.21 $\xb1$ 0.03 | 3.06 $\xb1$ 0.09 | 2.26 $\xb1$ 0.07 | 4.95 $\xb1$ 0.15 | |

MAP – meso | 1.22 $\xb1$ 0.03 | 3.11 $\xb1$ 0.08 | 2.25 $\xb1$ 0.06 | 5.30 $\xb1$ 0.14 | |

Impulse | True – micro | 1.54 $\xb1$ 0.05 | 3.64 $\xb1$ 0.11 | 5.32 $\xb1$ 0.46 | 5.11 $\xb1$ 0.23 |

Theory – meso | 1.59 $\xb1$ 0.06 | 3.63 $\xb1$ 0.11 | 7.99 $\xb1$ 0.66 | 5.88 $\xb1$ 0.36 | |

MAP – meso | 1.70 $\xb1$ 0.07 | 3.96 $\xb1$ 0.13 | 7.74 $\xb1$ 0.59 | 6.00 $\xb1$ 0.36 |

Note: Measures computed from 60 realizations of each model.

. | . | $\rho \xaf$ . | |||
---|---|---|---|---|---|

Input . | Model . | L2/3e . | L2/3i . | L4e . | L4i . |

Sine | True – micro | 0.418 | 0.386 | 0.994 | 0.948 |

Theory – meso | 0.354 | 0.348 | 0.991 | 0.945 | |

MAP – meso | 0.352 | 0.455 | 0.994 | 0.951 | |

OU | True – micro | 0.829 | 0.694 | 0.977 | 0.905 |

Theory – meso | 0.815 | 0.717 | 0.978 | 0.914 | |

MAP – meso | 0.855 | 0.756 | 0.977 | 0.916 | |

Impulse | True – micro | 0.914 | 0.879 | 0.996 | 0.927 |

Theory – meso | 0.880 | 0.858 | 0.979 | 0.870 | |

MAP – meso | 0.912 | 0.896 | 0.988 | 0.887 |

. | . | $\rho \xaf$ . | |||
---|---|---|---|---|---|

Input . | Model . | L2/3e . | L2/3i . | L4e . | L4i . |

Sine | True – micro | 0.418 | 0.386 | 0.994 | 0.948 |

Theory – meso | 0.354 | 0.348 | 0.991 | 0.945 | |

MAP – meso | 0.352 | 0.455 | 0.994 | 0.951 | |

OU | True – micro | 0.829 | 0.694 | 0.977 | 0.905 |

Theory – meso | 0.815 | 0.717 | 0.978 | 0.914 | |

MAP – meso | 0.855 | 0.756 | 0.977 | 0.916 | |

Impulse | True – micro | 0.914 | 0.879 | 0.996 | 0.927 |

Theory – meso | 0.880 | 0.858 | 0.979 | 0.870 | |

MAP – meso | 0.912 | 0.896 | 0.988 | 0.887 |

Note: Measures computed from 60 realizations of each model.

The capacity of the inferred model to generalize to unseen inputs is thus quite robust, with discrepancies between inferred and ground-truth models occurring only when the test and training input were very different. Of course this is in part due to mesoGIF being a good representation of the activity of homogeneous GIF neurons: while inference may compensate for some discrepancies between the model and the data, it still can work only within the freedom afforded by the model.

## 3 Discussion

Population models play a key role in neuroscience: they may describe experimental data at the scale they are recorded and serve to simplify the dynamics of large numbers of neurons into a human-understandable form. These dynamics may occur on a range of scales, from the mesoscopic, limited to a single cortical column, to the macroscopic, describing interactions between regions across the entire brain. Mechanistic models allow us to bridge those scales, relating microscale interactions to meso- or macroscale dynamics; of these, the model chosen for this study allows for rich dynamics at the single level by including synaptic, refractory, and adaptation dynamics.

We have demonstrated that it is possible to fit a mechanistic population model to simulated data by maximizing the likelihood of its parameters, in much the same way as is already done with phenomenological models (Macke et al., 2011; Pillow et al., 2008; Zhao & Park, 2016). Since mechanistic models describe concrete, albeit idealized, biophysical processes, they have the additional benefit that their parameters can be understood in terms of those processes. Moreover, those parameters are typically not dependent on the applied input, and thus we can expect the inferred model to generalize to novel stimulus conditions.

We also found that after making a few parameters heterogeneous, averaging did not recover the most representative parameters. In general, when there is discrepancy between model and data, the effective parameters are difficult to recover analytically; data-driven methods then provide a valuable supplement to theoretical analysis in order to ensure that a model actually represents the intended biological process. Nevertheless, since the inference procedure is agnostic to the model, it is up to the modeler to choose one for which the effective parameters remain interpretable.

The approach we have presented requires only that a differentiable likelihood function be available and thus is not limited to neuron population models. Stochastic models of neuron membrane potentials (Goldwyn & Shea-Brown, 2011), animal populations (Wood, 2010), and transition phenomena in physics and chemistry (Horsthemke & Lefever, 2006) are examples for which parameters could be inferred using this approach.

In practice we expect some models to be more challenging than others. For instance, evaluating the likelihood of a spiking model typically involves integrating over all time courses of the subthreshold membrane potential compatible with the observed spike train (Paninski, Pillow, & Simoncelli, 2004). This integral can be difficult to evaluate accurately, especially for models incorporating adaptation and refractoriness (Mena & Paninski, 2014; Ramirez & Paninski, 2014). If evaluation of the likelihood is prohibitively expensive, likelihood-free approaches might be more appropriate (Lueckmann et al., 2017; Papamakarios & Murray, 2016).

Also of note is that we required the dynamics to be formulated as a Markov process to express the likelihood (see section 4.3). We achieved this by constructing a state vector, but the size of this vector adds substantial computational cost, and in practice there is a trade-off between the length of the integration time window and the number of units (here, neuron populations) we can infer. Since neural field models are also computationally represented by long state vectors, inference on these models would be subject to a similar trade-off. Finally, our current implementation assumes that the state $S$ (see section 4.2) can be fully reconstructed from observations. If only a partial reconstruction of $S$ is possible, undetermined components of $S$ form a latent state that must be inferred along with the parameters. This type of problem has already been studied in the context of dimensionality reduction (Cunningham & Yu, 2014; Macke et al., 2011; Rule, Schnoerr, Hennig, & Sanguinetti, 2019), and it is conceivable that such methods could be adapted to our framework. Such an approach would allow one to perform dimensionality reduction with mechanistic models of temporal dynamics.

The work of Rule et al. (2019) presents an interesting complement to ours. The authors consider a neural field model where activities are observed only indirectly via a point process, thus addressing the problem of inferring latent states. They infer both these states and the point-process parameters but assume known parameters and neglect finite-size effects for the mesoscopic model; in contrast, here we inferred the mesoscopic model parameters while assuming that population states are observed. Inferring both mesoscopic model parameters and latent states remains a challenge for both of these approaches.

To obtain posteriors, we employed a Hamiltonian Monte Carlo algorithm with minimal automatic tuning. We found this to work better than a more automatically tuned variant (see section 4.6), but it is beyond the scope of this work to provide a complete survey of sampling methods. The applicability of more recently developed algorithms such as Riemann manifold Monte Carlo (Girolami & Calderhead, 2011), sequential Monte Carlo (Moral, Doucet, & Jasra, 2006), and nested sampling (Skilling, 2006) would be worth exploring in future work. Variational methods such as that described by Kucukelbir, Tran, Ranganath, Gelman, and Blei (2017) are another alternative to estimating posteriors that do not require sampling at all. They generally scale to large parameter spaces but do not provide the asymptotic guarantees of MCMC and may artificially smooth the resulting posterior.

Important obstacles to using inference on complex models are the implementation and computational costs. Software tools developed for this work have helped limit the former, but the latter remain a challenge, with many of the figures shown requiring multiple days of computation on a personal workstation. While manageable for studying fixed networks, this would become an impediment for scaling to larger models or tracking the evolution of parameter values by inferring them on successive time windows. For such tasks, further work would be required to reduce the inference time, for example, by investigating how large the integration time step for the mesoGIF model can be made or by optimizing the current implementation. One might also attempt to derive a computationally simpler model or make better use of parallelization and/or graphical processing units.

As Rule et al. (2019) noted, an additional complication to inferring mechanistic model parameters is that they may be underconstrained. In our case, since mesoGIF is a rate model, the voltage scale can be chosen freely by setting the resting ($urest$) and threshold ($uth$) potentials; if we nonetheless attempt to infer them along with the noise scale ($\Delta u$), fits are unable to converge (see sections 2.2 and 4.5). We avoided this problem by identifying the problematic parameters and fixing them to their known values. However, the development of a more systematic approach to dealing with underconstrained parameters is left for future investigations.

Since inference time is highly dependent on computational complexity, there is a trade-off between bottom-up models that attempt to match dynamics as closely as possible and simpler top-down models that aim for computational efficiency; while the latter tend to provide better scalability, the former are likely to be more interpretable and allow for extrapolation to new dynamical regimes (see section 2.6). Choosing the right model thus remains a key component of data analysis and modeling.

Inference methods based on machine learning allow for flexible model design, using known biophysical parameter values when they are available and inference to determine the others that are consistent with data. We hope this work further motivates the use of richer models in neuroscience by providing tools to fit and validate them.

## 4 Methods

### 4.1 Microscopic Model

We consider an ensemble of neurons grouped into $M$ populations; the symbols $i$, $j$ are used to label neurons and $\alpha $, $\beta $ to label populations. The neuron indices $i$, $j$ run across populations and are thus unique to each neuron.

*generalized integrate-and-fire*mechanism: leaky integration with adapting threshold, followed by an escape rate process. For each neuron $i$, the membrane potential $ui$ and firing threshold $\u03d1i$ evolve according to

### 4.2 Mesoscopic Model

### 4.3 Likelihood for the Mesoscopic Model

*not*occur around the true parameters. (See the discussion by Schwalger et al., 2017, p. 48.) We found that with double precision, a tolerance $\epsilon =1\xd710-8$ worked well.

For numerical stability, logarithms of factorials are computed with a dedicated function such as SciPy's gammaln (Jones, Oliphant, & Peterson, 2001). For optimization, the term $logN\alpha !$ can be omitted from the sum since it is constant.

### 4.4 Initializing the Model

Although the updates to the state $S$ are deterministic (see section 4.2), only the components $n\alpha (k0)$ of the initial state $S(k0)$ are known; unobserved components can easily number in the thousands. We get around this problem in the same manner as in Schwalger et al. (2017): by making an initial guess that is consistent with model assumptions (e.g., survival counts sum to $N\alpha $) and letting the system evolve until it has forgotten its initial condition. We note that the same problem is encountered when training recurrent neural networks, whereby the first data points are used to burn-in unit activations before training can begin. For the results we presented, we used a variation of the initialization scheme used by Schwalger et al. (2017), which we call the “silent initialization.” In this scheme, neurons are assumed to have never fired, and thus they are all ``free” (see algorithm 1). This results in large spiking activity in the first few time bins, which then relaxes to realistic levels.

This initialization scheme has the advantage of being simple and needing no extra computation, but with the high-dimensional internal state $S$, it also requires a large burn-in time of around 10 s. This can be largely mitigated by using sequential batches (see algorithm 2).

We also experimented with intializing the model at a stationary point (see appendix C), but in the cases we considered, it did not provide a notable improvement in computation time.

### 4.5 Estimating Parameters

To maximize the likelihood, we used Adam (Kingma & Ba, 2014), a momentum-based stochastic gradient descent algorithm, for which gradients were computed automatically with Theano (Team et al., 2016) (see section 4.9). Training parameters are listed in Table 3.

Despite the similarities, there remain important practical differences between fitting the mesoscopic model and training a recurrent neural network (RNN). Notably, RNN weights are more freely rescaled, allowing the use of single precision floating-point arithmetic. In the case of the mesoscopic model, the dynamic range is wider, and we found it necessary to use double precision.

Compared to a neural network, the mesoscopic update equations (see equations E.1 to E.23) are also more expensive to compute, in our case slowing parameter updates by at least an order of magnitude.

The sub-sequences of data (“minibatches”) used to train an RNN are usually selected at random: at each iteration, a random time step $k0$ is selected, from which the next $Bburnin$ data points are used for burn-in and the following $B$ data points form the minibatch. This becomes problematic when long burn-in times are required, not only because it requires long computation times but also because it wastes a lot of data. We addressed this problem by keeping the state across iterations (see algorithm 2), since this is a good guess of what it should be after updating the parameters, it reduces the required burn-in time by an order of magnitude. However, this requires batches to follow one another, breaking the usual assumption that they are independently selected. In practice, this seemed not to be a problem; in anecdotal comparisons, we found that training with either randomly selected batches and stationary initialization (see appendix C) or sequential batches and silent initialization (see algorithm 1) required comparable numbers of iterations to converge to similar parameter values. Computation time in the case of random batches, however, was much longer.

We also found that bounding the gradient helped make inference more robust. We set maximum values for each gradient component and rescaled the gradient so that no component exceeded its maximum (see algorithm 2, lines 7 to 10).

Maximizing the posterior rather than the likelihood by multiplying the latter by parameter priors (to obtain the MAP estimate rather than the MLE) helped prevent the fit from getting stuck in unphysical regions far from the true parameters, where the likelihood may not be informative. We used noninformative priors (see Table 7) so as to ensure that they did not artificially constrain the fits. Fits were also initialized by sampling from the prior.

Choosing adequate external inputs may also affect fit performance, as in general, sharp stimuli exciting transients on multiple timescales tend to be more informative than constant input (Iolov, Ditlevsen, & Longtin, 2017). That said, even under constant input, the fluctuations in a finite-sized neuron population still carry some information, and anecdotal evidence suggests that these can be sufficient to infer approximate model parameters. In this article, we used a sinusoidal input with frozen white noise to train the mesoGIF model; with only one dominant timescale, this input is more informative than constant input but far from optimal for the purpose of fitting. This made it a reasonable choice for computing baseline performance measures.

Finally, to allow fits to converge, it is essential to avoid fitting any ill-defined or degenerate parameters. For example, as explained in section 2.2, we fixed the parameters $urest$ and $uth$ because the mesoGIF model is invariant under a rescaling of the voltage; for simplicity we also fixed $ur$ and $R$ even though this was not strictly necessary. The parameters $w$ and $p$ are similarly degenerate (see equation E.7), and we fixed $p$. The parameters $N$, $\Delta $, and $tref$ are effectively discrete (either in numbers of neurons or time bins), and they were also fixed to simplify the implementation. Table 2 summarizes the inferred and noninferred parameters.

### 4.6 Estimating the Posterior

The posteriors in section 2.5 were obtained using Hamiltonian Monte Carlo (Betancourt & Girolami, 2013; Neal, 2012) (HMC). Having expressed the likelihood with Theano made it straightforward to use the implementation in PyMC3 (Salvatier, Wiecki, & Fonnesbeck, 2016)—HamiltonianMC—to sample the likelihood; the sampling parameters we used are listed in Table 4.

Although straightforward, this approach pushes the limit of what can be achieved with currently implemented samplers: because the likelihood of this model is expensive to evaluate, even coarse distributions can take hours to obtain. In addition, the large state vector required sufficiently large amounts of memory to make the automatically tuned NUTS (Hoffman & Gelman, 2014) sampler impractical. (NUTS stores the most recent states in order to tune the sampling parameters.) In an application with experimental data, one would want to reserve sufficient computational resources to perform at least basic validation of the obtained that posterior, using, for example, the methods described in Gelman et al. (2014) and Talts, Betancourt, Simpson, Vehtari, and Gelman (2018).

In order for samplers to find the high-probability-density region in finite time, we found it necessary to initialize them with the MAP estimate. This also ensured that their mass matrix was tuned on an area of the posterior with appropriate curvature. In applications where the posterior has multiple modes, one should be able to identify them from the collection of fits. The high-probability-density region around each mode should then be sampled separately, integrated, and combined with the others to obtain the full posterior. (See, e.g., van Haasteren, 2014, for integration methods for MCMC chains.)

Finally, as with parameter optimization, we found that the use of at least double-precision floats was required in order to obtain consistent results.

### 4.7 Measuring Performance

In order to assess the performance of our inference method, we quantified the discrepancy between a simulation using ground-truth parameters and another using inferred parameters; the same input was used for both simulations and was different from the one used for training. Following Augustin et al. (2017), discrepancy was quantified using both correlation ($\rho $) and root mean square error ($RMSE$); these are reported according to the amount of data $L$ used to train the model, which may be given in either time bins or seconds.

Because the models are stochastic, equation 4.24 and 4.25 describe random variables. Thus, for each of our results, we generated ensembles of realizations ${Atrue,r}r=1R1$, ${Atrue,r'}r=1R2$, and ${A^r}r=1R3$, each with a different set of random seeds. We compute the $\rho $ and $RMSE$ for the $R1\xd7R2$ pairs $(Atrue,r,A^r)$, as well the $R1\xd7R3$ combinations $(Atrue,r,Atrue,r')$, from which we empirically estimate the mean and standard deviation of those measures. Values for the pairs $(Atrue,r,Atrue,r')$ provide an estimate of the best achievable value for a given measure.

*trial-averaged*traces:

We found the pair of measures $(\rho \xaf,RMSE)$ (see equations 4.25 and 4.26) to provide a good balance between information and conciseness (see section 2.3). We generally used $R1=R2=50$ and $R3=100$ for the ensembles, with the exception of Figure 4, where $R1=R2=R3=20$. We also ensured that sets of trial-averaged measures use the same number of trials, to ensure comparability.

### 4.8 Stimulation and Integration Details

#### 4.8.1 Sine-Modulated White Noise Input

#### 4.8.2 OU Process Input

#### 4.8.3 Impulse Input

#### 4.8.4 Numerical Integration

For all simulations of the mesoGIF model, we used a time step of 1 ms. We also used a 1 ms time step when inferring parameters. Simulations of the microscopic GIF model require finer temporal resolution, and for those we used time steps of 0.2 ms. In order to have the same inputs at both temporal resolutions, they were generated using the finer time step and coarse-grained by averaging.

We used the Euler-Maruyama scheme to integrate inputs; the GIF and mesoGIF models are given as update equations of the form $A(t+\Delta t)=F(A(t))$, and thus already define an integration scheme.

### 4.9 Software

We developed software for expressing likelihoods of dynamical systems by building on general-purpose machine learning libraries: Theano_shim (https://github.com/mackelab/theano_shim) is a thin layer over the numerical back-end, allowing one to execute the same code either using Theano (Team et al., 2016) or Numpy (Jones et al., 2001). Sinn (https://github.com/mackelab/sinn) makes use of theano_shim to provide a back-end-agnostic set of high-level abstractions to build dynamical models. Finally, a separate repository (https://github.com/mackelab/fsGIF) provides the code specific to this article.

## Appendix A: Priors and Parameter Values

For both microscopic and mesoscopic models, unless otherwise specified in the text, we used the same parameter values as our ground-truth values. Values are listed in Table 7 and are based on those given in Schwalger et al. (2017), and we follow the recommendation there of adjusting resting potentials $urest$ to maintain realistic firing rates. To facilitate simulations, we also reduced the population sizes by a factor of 50 and correspondingly upscaled the connectivity weights by a factor of $50$, to maintain a balanced E-I network (Vogels, Rajan, & Abbott, 2005).

Prior distributions on inferred parameters were set sufficiently broad to be considered noninformative. Prior distributions are independent of the population so as to ensure that any inferred feature (e.g., excitatory versus inhibitory connections) is due to the data.

## Appendix B: Inferred Parameters

The inferred parameters for the heteregeneous two-population model (see section 2.4) show an overall reduction of connection strengths and time constants, compared with ground-truth values (see Table 9). For the homogeneous four-population model (see section 2.6), the tendency is less systematic (see Table 10).

## Appendix C: Alternative Initialization Scheme

Compared to the silent initialization (see section 4.4), the stationary initialization described by algorithm 3 finds a more realistic initial state, which reduces the burn-in time required by about an order of magnitude. This makes it more practical when minibatches are selected at random, and we used this scheme to validate algorithm 2 (see section 4.5). However, in general, we found the computational gain to be offset by the added cost of solving a self-consistent equation for each batch.

### C.1 Stationary Initialization

Assuming zero external input, we find a self-consistent equation for the stationary activity $A*$ (see appendix I). After solving numerically for $A*$, the other state variables are then easily computed.

## Appendix D: Data Requirements for Different Parameter Sets

In section 2.3, we showed that less than 10 s of data were sufficient to infer the parameters of the two-population mesoGIF model. Of course, the exact data requirements will depend on how many parameters we need to infer and which they are (e.g., $w$ versus $\tau m$).

To explore this issue, we repeated the inference procedure for the parameter subsets listed in Table 11, performing 24 fits for each subset using different amounts of data. Subsets $\eta 1$ and $\eta 2$ parameterize, respectively, the connectivity and the adaptation, while $\eta 3$ is the full set used for Figure 4. A similar figure to Figure 4 with all three subsets is shown in Figure 9A.

With the smaller subsets ($\eta 1$, $\eta 2$), 1.25 s of data was sufficient to get good accuracy of the inferred dynamics (see Figure 9A). However working with such small amounts of data incurs a substantial computational cost. First, the fits converge less consistently, thus requiring more fits to find a good estimate of the MAP (see Figures 9B–D, left). Second, the algorithm optimizations making use of the longer traces (see section 4.5) are no longer as effective, making each iteration slower on average.

## Appendix E: Mesoscopic Update Equations

This section first describes the quantities composing the state vector for the mesoGIF model and then lists the equations used for this article. All equations are for discretized time, and we use a superscript $(k)$ to indicate the $k$th time step. For derivations and a fuller discussion of the variables involved, see Schwalger et al. (2017).

### E.1 Construction of the State Vector

In order to obtain a finite state vector (see section 4.2), neurons are divided into two categories: free and refractory; the assignment of neurons to either category changes over time, following a discretized form of the transport equation, 4.10.

*Refractory neurons* are still in the absolute or relative refractory period caused by their last spike, and thus have a higher firing threshold. Since the height of the threshold is dependent on that spike's time, we track a vector $m\alpha (k)$, indexed by the age $l$. We define the scalar $m\alpha ,l(k)$ as our estimate of the number of neurons at time $tk$ that last fired at time $tk-l$. A vector $v\alpha (k)$ similarly tracks the variance of that estimate. The adaptation of the neurons depends on their age, such that their firing rate is also given by a vector, $\lambda \alpha (k)$. With an adaptation timescale $\tau \theta $ of 1 s and time steps of 1 ms, these vectors each comprise around $K=1000$ age bins. For a more detailed discussion on properly choosing $K$, see equation 86 in Schwalger et al. (2017).

*Free neurons*, meanwhile, have essentially forgotten their last spike: their firing threshold has relaxed back to its resting state, and so they can be treated as identical, independent of when that last spike was. One scalar per population, $\lambda free,\alpha (k)$, suffices to describe their firing rate. Scalars $x\alpha (k)$ and $z\alpha (k)$ respectively track the estimated mean and variance of the number of free neurons.

In the case of an infinite number of neurons, the firing rates $\lambda \alpha (k)$ and $\lambda free,\alpha (k)$ would be exact, but for finite populations, a further correction $P\Lambda ,\alpha (k)$ must be made to account for statistical fluctuations. Combining $\lambda free,\alpha (k)$, $\lambda \alpha (k)$, and $P\Lambda ,\alpha (k)$, one can compute $n\xaf\alpha (k)$, the expected number of spikes at $tk$. The definition of $n(k)$ then follows as described in section 4.2.

### E.2 Kernels and Link Function

### E.3 Update Equations

## Appendix F: Performance of Four Population Models

## Appendix G: Posterior for the Two-Population Mesoscopic Model

## Appendix H: Fit Dynamics

## Appendix I: Self-Consistent Equation for the Mesoscopic Stationary State

We derive the stationary state for the case where $Iext\u22610$. For analytical tractability, we assume that finite-size fluctuations are negligible (effectively, that $N\alpha $ is very large), such that in the stationary state, the activity is constant. We denote this activity $A*$.

Equation I.5 can be solved numerically for $A*$, after which the other state variables are easily calculated from the expressions in appendix E. We used SciPy's (Jones, Oliphant, & Peterson, 2001) root function with an initial guess of $A\alpha *=1$ to solve for $A*$. Since the stationary initialization was ultimately only used this to validate algorithm 1 (see appendix C), we did no further analysis of equation I.5, and in particular leave the determination of conditions for which its solutions are unique to future work.

## Acknowledgments

We thank the reviewers for improving the manuscript through many valuable comments. We are especially grateful to Tilo Schwalger for his critical comments prior to submission. We also thank Pedro Gonçalves, Giacomo Bassetto, and David Dahmen for further comments and discussions. A.R. and A.L. were supported by NSERC (Canada); A.L. also acknowledges support from the Humboldt Foundation. J.H.M. was supported by the German Research Foundation (DFG) through SFB 1089 and the German Federal Ministry of Education and Research (BMBF, project ADMIMEM, FKZ 01IS18052 A-D).