## Abstract

Winner-take-all (WTA) refers to the neural operation that selects a (typically small) group of neurons from a large neuron pool. It is conjectured to underlie many of the brain's fundamental computational abilities. However, not much is known about the robustness of a spike-based WTA network to the inherent randomness of the input spike trains. In this work, we consider a spike-based $k$-WTA model wherein $n$ randomly generated input spike trains compete with each other based on their underlying firing rates and $k$ winners are supposed to be selected. We slot the time evenly with each time slot of length 1 ms and model the $n$ input spike trains as $n$ independent Bernoulli processes. We analytically characterize the minimum waiting time needed so that a target minimax decision accuracy (success probability) can be reached.

## 1 Introduction

Humans and animals can form a stable perception and make robust judgments under ambiguous conditions. For example, we can easily recognize a dog in a picture regardless of its posture, hair color, and whether it stands in the shadow or is occluded by other objects. One fundamental feature of brain computation is its robustness to the randomness introduced at different stages, such as sensory representations (Kinoshita & Komatsu, 2001; Hubel & Wiesel, 1959), feature integration (Kourtzi, Tolias, Altmann, Augath, & Logothetis, 2003; Majaj, Carandini, & Movshon, 2007), decision formation (Platt & Glimcher, 1999; Shadlen & Newsome, 2001), and motor planning (Harris & Wolpert, 1998; Li, Chen, Guo, Gerfen, & Svoboda, 2015). It has been shown that neurons encode information in a stochastic manner in the brain (Baddeley et al., 1997; Kara, Reinagel, & Reid, 2000; Maimon & Assad, 2009; Ferrari, Deny, Marre, & Mora, 2018); even when the exact same sensory stimulus is presented or when the same kinematics are achieved, no deterministic patterns in the spike trains exist. Facing environmental ambiguity, humans and animals adaptively refine their behaviors by incorporating prior knowledge with their current sensory measurements (Faisal, Selen, & Wolpert, 2008; Knill & Pouget, 2004; Stocker & Simoncelli, 2006; Ernst & Banks, 2002; Körding & Wolpert, 2004). Nevertheless, it remains relatively unclear how neurons carry out robust computation facing ambiguity. Sparse coding is a common strategy in brain computation; to encode a task-relevant variable, often only a small group of neurons from a large neuron pool are activated (Olshausen & Field, 2004; Perez-Orive et al., 2002; Hromádka, DeWeese, & Zador, 2008; Quiroga, Kreiman, Koch, & Fried, 2008; Karlsson & Frank, 2008; Redgrave, Prescott, & Gurney, 1999). Understanding the underlying neuron selection mechanism is highly challenging.

Winner-take-all (WTA) is a hypothesized mechanism to select the proper neurons from a competitive network of neurons and is conjectured to be a fundamental primitive of cognitive functions such as attention and object recognition (Riesenhuber & Poggio, 1999; Itti, Koch, & Niebur, 1998; Yuille & Geiger, 1998; Maass, 2000; Hertz, Krogh, Palmer, & Horner, 1991; Shamir, 2006). Among these studies, it is commonly assumed that neurons transmit information with a continuous variable such as the firing rate. This assumption, however, ignores how temporal coding may also contribute to cortical computations. For example, some neurons in the auditory cortex respond to auditory events with bursts at a fixed latency (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Nelken, 2004). This phase-locking property is also observed in the hippocampus as well as the prefrontal cortex (Siapas, Lubenov, & Wilson, 2005; Hahn, Sakmann, & Mehta, 2006; Buzsáki & Chrobak, 1995). Another feature that has been neglected in rate-based models is the inherent noise in the inputs. Although some studies used additive gaussian noise (Kriener, Chaudhuri, & Fiete, 2017; Li, Li, & Wang, 2013; Lee, Itti, Koch, & Braun, 1999; Rougier & Vitay, 2006) to account for input randomness, such WTA circuits are very sensitive to noise and could not successfully select even a single winner unless an extra robustness strategy, such as an additional nonlinearity, is introduced into the dynamics (Kriener et al., 2017). Finally, neurons have a refractory period, which prevents spikes from backpropagating in axons (Berry & Meister, 1998), and such a feature is usually neglected in rate-based models. In contrast, a spike-based model may capture these neglected features. Nevertheless, how WTA computation can be implemented and its algorithmic characterization remain relatively underexplored (Shamir, 2006, 2009).

In this letter, we study a spike-based $k$-WTA model wherein $n$ randomly generated input spike trains compete with each other based on their underlying firing rates, and the true winners are the $k$ input spike trains whose underlying firing rates are higher than the others' (Hertz et al., 1991). A desired WTA circuit should quickly respond to these random input spike trains and should successfully select the $k$ true winners with high probability. We analytically characterize the minimum amount of waiting time needed so that a target minimax decision accuracy (defined in section 3.2) can be reached. More precisely, we slot the time evenly with each time slot of length 1 ms and assume that these $n$ input spike trains are generated by $n$ independent Bernoulli processes with different rates. We use Bernoulli processes to capture the randomness in the input spike trains rather than the popular Poisson processes because a Bernoulli process can be viewed as the time-slotted version of a refractory-period-modified Poisson process. Notably, a Bernoulli process with a 1 ms time slot is just a simplified approximation to the real dynamics in the brain, given that in the brain, the refractory period varies across neurons and the refractory periods of some neurons could extend beyond 1 ms. In our model, we implicitly assume that the absolute refractory period is 1 ms, a value commonly reported in the literature (Teleńczuk, Kempter, Curio, & Destexhe, 2017; Nicholls, Martin, Wallace, & Fuchs, 2001).^{1} A WTA circuit contains $n$ output neurons, each of which is paired with an input spike train. In addition, the behaviors (spike patterns) of these output neurons encode which input spike trains are declared to be the winners.
For special cases where $k=1$, different winner declaration strategies are considered in the literature (Shamir, 2006, 2009; Lynch, Musco, & Parter, 2016; Kriener et al., 2017), such as the identity of an output neuron that spikes much more frequently than the other output neurons (Kriener et al., 2017), of the neuron that fires the first spike in a population of neurons (Shamir, 2009, 2006), and of the output neuron that fires alone for a sufficiently long time (Lynch et al., 2016). Clearly, the minimum amount of waiting time needed to achieve a given accuracy varies with the choice of winner declaration strategy. Nevertheless, in order to derive a lower bound that holds for all winner declaration strategies, at this point, we do not specify the winner declaration strategy used in our circuit construction; this specification is postponed to section 5. In this letter, we investigate two closely related problems: (1) the fundamental limits of any WTA circuit in selecting $k$ true winners from $n$ independent Bernoulli input spike trains (in terms of waiting time to achieve a target accuracy) and (2) the existence of WTA circuits that can achieve the above fundamental limits.
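The slotted Bernoulli input model can be sketched in a few lines of Python (the function name and parameters here are ours, for illustration only):

```python
import random

def bernoulli_spike_trains(rates, T, seed=0):
    """Generate n independent Bernoulli spike trains over T time slots (1 ms each).

    rates[i] is the per-slot firing probability of the i-th input neuron;
    the implicit 1 ms refractory period means at most one spike per slot."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(T)] for p in rates]

# Two input spike trains competing over a 1-second window (1000 slots).
trains = bernoulli_spike_trains([0.6, 0.8], T=1000)
empirical_rates = [sum(s) / len(s) for s in trains]
```

As the window grows, the empirical rates concentrate around the underlying firing rates, which is exactly what a WTA circuit must exploit to separate winners from losers.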

^{2}In this circuit, there are $n$ pairs of input and output neurons and no hidden neurons. Each input neuron is connected to the corresponding output neuron, and the $n$ output neurons mutually inhibit each other. Each output neuron has a local memory of length $m$ (formally defined in section 2.2) and adopts a simple threshold activation function (specified in section 5.1.3). The first $k$ output neurons that spike in the same time slot are declared to be the winners; the identities of such $k$ output neurons are the circuit's estimate of the $k$ true winners. The formal circuit construction can be found in section 5. We show that for any fixed $\delta \in (0,1)$, provided that

In addition, our results give a set of testable hypotheses on neural recordings and human and animal behaviors in decision making (detailed discussion is found in section 6).

## 2 Computational Model: Spiking Neuron Networks

In this section, we provide a general description of our computation model; there is much freedom in choosing the detailed specification of the model. We consider such a general model so that our derived lower bound applies to WTA circuits with, for example, many alternative network architectures, activation functions, and winner declaration strategies (i.e., the desired behaviors of the output neurons). In section 5, we provide a circuit construction (for solving the $k$-WTA competition) under this computation model but with specific choices for the adopted network architecture, activation function, and winner declaration strategy.

### 2.1 Network Structure

A spiking neuron network (SNN) $N=(U,E)$ consists of a collection of neurons $U$ that are connected through synapses $E$ (see Figure 1). We assume that an SNN can be conceptually partitioned into three nonoverlapping layers: input layer $N_{\text{in}}$, hidden layer $N_{\text{h}}$, and output layer $N_{\text{out}}$. The neurons in each of these layers are referred to as input neurons, hidden neurons, and output neurons, respectively. The synapses $E$ are essentially directed edges: $E\subseteq \{(\nu ,\nu '):\nu ,\nu '\in U\}$. For each $\nu \in U$, define $\mathrm{PRE}_{\nu }:=\{\nu ':(\nu ',\nu )\in E\}$ and $\mathrm{POST}_{\nu }:=\{\nu ':(\nu ,\nu ')\in E\}$. Intuitively, $\mathrm{PRE}_{\nu }$ is the collection of neurons that can directly influence neuron $\nu $; similarly, $\mathrm{POST}_{\nu }$ is the collection of neurons that can be directly influenced by neuron $\nu $.^{3} We assume that the input neurons cannot be influenced by other neurons in the network: $\mathrm{PRE}_{\nu }=\emptyset $ for all $\nu \in N_{\text{in}}$. Each edge $(\nu ,\nu ')$ in $E$ has a weight, denoted by $w(\nu ,\nu ')$, which captures the strength of the interaction between neuron $\nu $ and neuron $\nu '$. The sign of $w(\nu ,\nu ')$ indicates whether neuron $\nu $ excites or inhibits neuron $\nu '$: if neuron $\nu $ excites neuron $\nu '$, then $w(\nu ,\nu ')>0$; if neuron $\nu $ inhibits neuron $\nu '$, then $w(\nu ,\nu ')<0$. The set $E$ might contain self-loops, with $w(\nu ,\nu )$ capturing the self-excitatory or self-inhibitory effect. Typically, in neuroscience, a neuron is either excitatory or inhibitory: $\mathrm{sign}(w(\nu ,\nu _1))=\mathrm{sign}(w(\nu ,\nu _2))$ for all $\nu _1,\nu _2\in \mathrm{POST}_{\nu }$. Our order-optimal WTA circuit in section 5 indeed assumes this common sign restriction. Nevertheless, our lower bound holds even for the general case where there exist $\nu _1,\nu _2\in \mathrm{POST}_{\nu }$ such that $\mathrm{sign}(w(\nu ,\nu _1))\neq \mathrm{sign}(w(\nu ,\nu _2))$.
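A minimal sketch of this graph structure in Python (the class and method names are ours), maintaining the weights together with the $\mathrm{PRE}$ and $\mathrm{POST}$ sets:

```python
from collections import defaultdict

class SNN:
    """Minimal directed-graph view of a spiking neuron network N = (U, E).

    Weights w(v, v') follow the sign convention in the text:
    positive for excitation, negative for inhibition."""
    def __init__(self):
        self.weights = {}             # (v, v') -> w(v, v')
        self.pre = defaultdict(set)   # PRE_v: neurons that directly influence v
        self.post = defaultdict(set)  # POST_v: neurons directly influenced by v

    def connect(self, v, v_prime, w):
        self.weights[(v, v_prime)] = w
        self.pre[v_prime].add(v)
        self.post[v].add(v_prime)

net = SNN()
net.connect("u1", "v1", 1.0)   # excitatory input synapse
net.connect("v2", "v1", -0.5)  # lateral inhibition between output neurons
```

Maintaining both adjacency sets explicitly matches the text's usage: updates of a neuron's state only ever consult its own $\mathrm{PRE}$ set.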

#### 2.1.1 Generic Network Structure for WTA Circuits

### 2.2 Network State

In most neurons, the synaptic plasticity time window is about 80 to 120 ms, but it could also vary across brain regions and across timescales under different behavioral contexts. In a sense, the synaptic plasticity time window is closely related to $m$. As can be seen in section 5, our order-optimal WTA circuit construction requires $m$ to be sufficiently large. Nevertheless, this does not exclude the application of our WTA circuit to contexts where $m$ is small, because the memory variable can be implemented by a chain of hidden neurons near neuron $\nu $. The detailed implementation of the local memory does not affect the order optimality of our WTA circuit.

## 3 Minimax Decision Accuracy and Success Probability

### 3.1 Random Input Spike Trains

We study the $k$-WTA model, wherein $n$ randomly generated input spike trains compete with each other, and as a result of this competition, $k$ of them are selected to be the winners. In contrast, most existing works (Verzi et al., 2018; Maass, 1997; Lynch et al., 2016) assume deterministic input spike trains.

Recall that time is slotted into intervals of length 1 ms. We assume that the $n$ input spike trains are generated from $n$ independent Bernoulli processes with unknown parameters $p_1,\ldots ,p_n$, respectively. We refer to $p=(p_1,\ldots ,p_n)$ as a rate assignment of the WTA competition for a given external stimulus. For example, suppose an external stimulus induces two input spike trains with rates 0.6 and 0.8, respectively: $n=2$ and $p=(0.6,0.8)$. In each time slot, with probability 0.6, the first input spike train has a spike, independently of whether the second input spike train has a spike, and similarly for the second input spike train. Notably, different external stimuli induce different rate assignment vectors $p$. Henceforth, we use the terms *rate assignment* and *external stimulus* interchangeably.
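The per-slot independence in the two-train example can be checked empirically; this sketch (all variable names ours) estimates the probability that both trains spike in the same slot:

```python
import random

# Hypothetical two-train example from the text: rate assignment p = (0.6, 0.8).
rng = random.Random(1)
p = (0.6, 0.8)
T = 20000  # number of 1 ms time slots observed

both = 0
for _ in range(T):
    s1 = rng.random() < p[0]  # does train 1 spike in this slot?
    s2 = rng.random() < p[1]  # does train 2 spike, independently?
    both += int(s1 and s2)
frac_both = both / T
# By independence, the fraction of slots where both spike is near 0.6 * 0.8 = 0.48.
```

The product form of the joint spiking probability is exactly what the independence assumption buys analytically; correlated inputs (discussed in section 6) would break it.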

Note that in the most general scenario, the spikes of the input neurons might be correlated (see section 6 for detailed comments). We would like to explore the more general input spikes in our future work.

### 3.2 Minimax Performance Metric

We adopt the minimax framework (Wu, 2017) to evaluate the performance of a WTA circuit.

^{4}Later, we use the same notation to denote the $n$ spike trains with random rate assignment (i.e., where $p$ is randomly generated). This slight abuse of notation significantly simplifies the exposition without sacrificing clarity.

## 4 Information-Theoretic Lower Bound on Decision Time

In this section, we provide a lower bound on the decision time for a given decision accuracy. The lower bounds derived in this section hold universally for all possible network structures (including the hidden layer), synapse weights, activation functions, and winner declaration strategies.

^{5}The KL divergence between Bernoulli random variables with parameters $r$ and $r'$, respectively, is defined as
$$d(r\,\|\,r'):=r\log \frac{r}{r'}+(1-r)\log \frac{1-r}{1-r'}.$$

Note that $D(P\,\|\,Q)\ge 0$, and $D(P\,\|\,Q)=0$ if and only if $P=Q$ except on a set of measure 0. Similar to $d(\cdot \,\|\,\cdot )$, $D(P\,\|\,Q)$ is not symmetric in $P$ and $Q$. In this letter, we choose the base of the logarithm to be 2.^{6} Recall that the set of admissible rate assignments $A_R$ is defined in equation 3.2.
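The Bernoulli divergence $d(\cdot \|\cdot )$ from the footnote can be computed directly in base 2 (the helper name is ours):

```python
from math import log2

def bern_kl(r, r_prime):
    """KL divergence d(r || r') between Bernoulli(r) and Bernoulli(r'), in bits."""
    def term(a, b):
        # Convention 0 * log(0/b) = 0 handles the boundary cases r in {0, 1}.
        return 0.0 if a == 0.0 else a * log2(a / b)
    return term(r, r_prime) + term(1.0 - r, 1.0 - r_prime)
```

The helper exhibits the properties noted in the text: it is nonnegative, vanishes only at $r=r'$, and is asymmetric, for example `bern_kl(0.6, 0.8)` differs from `bern_kl(0.8, 0.6)`.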

The two different rate assignments $p$ and $q$ correspond to two different external stimuli, and $D(P_S\,\|\,Q_S)$ is the “distance” between the $n$ input spike trains of length $T$ induced by the first external stimulus and those induced by the second external stimulus. Lemma 9 is proved in appendix B.

By theorem 3, if the decision time falls below the stated lower bound, then no matter how elegant the design of a WTA circuit is (no matter which activation function we choose, how many hidden neurons we use, and how we connect the hidden neurons and output neurons), its minimax decision accuracy is always lower than the target decision accuracy $(1-\delta )$.

Theorem 3 says that if $T<\frac{(1-\delta )\log (k(n-k)+1)-1}{T_R}$, the worst-case probability of error of any WTA circuit is greater than $\delta $:

Theorem 3 is proved in appendix C.

(Tightness of the Lower Bound in Theorem 3). The proof of theorem 3 uses a technical supporting lemma (lemma 13, presented in appendix C). Following our line of argument, by considering a richer family of critical rate assignments in lemma 13, we might be able to obtain a tighter lower bound. Nevertheless, the constructed WTA circuit in section 5 turns out to be order-optimal; its decision time matches the lower bound in theorem 3 up to a multiplicative constant factor. This immediately implies that the lower bound obtained in theorem 3 is tight up to a multiplicative constant factor.
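Under our reading of the lower bound (the fraction structure below is reconstructed from the garbled source, and `T_R` as well as all numeric values are placeholders, not the letter's), the required waiting time grows logarithmically in the pair count $k(n-k)$:

```python
from math import log2

def decision_time_lower_bound(n, k, delta, T_R):
    """Reconstructed reading of the bound in theorem 3: any WTA circuit whose
    decision time falls below this value has worst-case error exceeding delta.
    T_R is the rate-separation constant of the admissible rate set (placeholder)."""
    return ((1.0 - delta) * log2(k * (n - k) + 1) - 1.0) / T_R

# The bound grows logarithmically in k(n-k), so it is largest near k = n/2.
b_edge = decision_time_lower_bound(n=100, k=1, delta=0.1, T_R=0.05)
b_mid = decision_time_lower_bound(n=100, k=50, delta=0.1, T_R=0.05)
```

This makes the scaling claim concrete: for fixed $n$, the bound peaks when $k$ is near $n/2$, where the number of winner/loser pairs to separate is largest.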

## 5 Order-Optimal WTA Circuits

In section 2, we provided a general description of the computation model we are interested in. In this section, we construct a specific WTA circuit whose decision time is order-optimal among the WTA circuits under the general computation model. To do that, we need to specify (1) the network structure, including the number of hidden neurons, the collection of synapses (directed communication links) between neurons, and the weights of these synapses; (2) the memorization capability of each neuron, that is, the magnitude of $m$; and (3) $\phi _\nu $, the activation function used by neuron $\nu $. In the constructed circuit, we declare the first $k$ output neurons that spike simultaneously as the winners.

### 5.1 Circuit Design

In our designed circuit, there are four parameters, $R$, $m$, $b$, and $\delta $, where $R\subseteq [c,C]$^{7} is a finite set of rates from which the $p_i$'s of the input spike trains are chosen, $m$ is the memory range, $b$ is the bias at the noninput neurons, and $(1-\delta )$ is the target decision accuracy (i.e., success probability). Here, we assume that every noninput neuron has the same bias: $b_\nu =b$ for all noninput neurons $\nu $. The four parameters $R$, $m$, $b$, and $\delta $ can be viewed as prior knowledge of the WTA circuit; they might be learned through some unknown network development procedure outside the scope of this work. In sections 5.1.1, 5.1.3, and 5.1.4, we present the network structure, the activation functions adopted, and the requirement on $m$. For completeness, we specify the local memory update (in particular, the vector $V$) separately in section 5.1.2. The dynamics of our WTA circuit are summarized in section 5.1.5.

#### 5.1.1 Network Structure

We propose a WTA circuit with the following network structure:

- All output neurons are connected to each other by a complete graph. That is, $(v_i,v_j)\in E$ for all $v_i,v_j\in N_{\text{out}}$ such that $v_i\neq v_j$.
- Each edge from an input neuron to its paired output neuron has weight 1: $w(u_i,v_i)=1$ for all $u_i\in N_{\text{in}}$ and the paired $v_i\in N_{\text{out}}$.
- All edges among the output neurons have weight $-\frac{1}{k}$; that is, $w(v_i,v_j)=-\frac{1}{k}$ for all $v_i,v_j\in N_{\text{out}}$ such that $v_i\neq v_j$.
- There are no hidden neurons: $N_{\text{h}}=\emptyset $.
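The structural rules above fix every synaptic weight; a brief sketch (the function name is ours):

```python
def wta_weights(n, k):
    """Weights of the constructed circuit: each input neuron u_i excites its
    paired output neuron v_i with weight 1, output neurons mutually inhibit
    one another with weight -1/k, and there are no hidden neurons."""
    w_input = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    w_lateral = [[0.0 if i == j else -1.0 / k for j in range(n)]
                 for i in range(n)]
    return w_input, w_lateral

w_input, w_lateral = wta_weights(n=4, k=2)
```

The $-\frac{1}{k}$ lateral weight is the key design choice: $k$ simultaneous output spikes exactly cancel one unit of input excitation, which is what makes a coalition of exactly $k$ winners self-consistent.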

#### 5.1.2 Update Local Charge Vector

It is easy to see that the following claims hold; for brevity, their proofs are omitted:

Claim 5. For $t\ge 1$ and for $i=1,\ldots ,n$, $V_{t-1}(v_i)>0$ if and only if $S_{t-1}(u_i)=1$ and $\sum_{j:1\le j\le n,\,j\neq i}S_{t-1}(v_j)\le k-1$; that is, at time $t-1$, input neuron $u_i$ spikes, and at most $k-1$ other output neurons spike.

Claim 6. For $t\ge 1$ and for $i=1,\ldots ,n$, $V_{t-1}(v_i)\le -1$ only if $\sum_{j:1\le j\le n,\,j\neq i}S_{t-1}(v_j)\ge k$; that is, at time $t-1$, at least $k$ other output neurons spike. Note that $\sum_{j:1\le j\le n,\,j\neq i}S_{t-1}(v_j)\ge k$ is not a sufficient condition for $V_{t-1}(v_i)\le -1$. To see this, suppose $\sum_{j:1\le j\le n,\,j\neq i}S_{t-1}(v_j)=k$ and $S_{t-1}(u_i)=1$. In this case, it holds that $V_{t-1}(v_i)=0$.

Claim 7. For $t\ge 1$ and for $i=1,\ldots ,n$, if $V_{t-1}(v_i)=0$, then one of the following holds: (1) $S_{t-1}(u_i)=1$ and $\sum_{j:1\le j\le n,\,j\neq i}S_{t-1}(v_j)=k$; that is, at time $t-1$, input neuron $u_i$ spikes, and exactly $k$ other output neurons spike; or (2) $S_{t-1}(u_i)=0$ and $\sum_{j:1\le j\le n,\,j\neq i}S_{t-1}(v_j)=0$; that is, at time $t-1$, input neuron $u_i$ does not spike, and no other output neuron spikes.
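Claims 5 to 7 are all consistent with reading the local charge as the weighted sum of the incoming spikes, $V_{t-1}(v_i)=S_{t-1}(u_i)-\frac{1}{k}\sum_{j\neq i}S_{t-1}(v_j)$; this reading is our inference from the claims, and the checks below exercise it numerically:

```python
def local_charge(input_spike, other_output_spikes, k):
    """Local charge of output neuron v_i under our reading of claims 5-7:
    the input spike weighted by 1, plus lateral inhibition of -1/k per spike."""
    return input_spike - sum(other_output_spikes) / k

k = 3
# Claim 5: V > 0 iff the input spikes and at most k-1 other outputs spike.
assert local_charge(1, [1, 1], k) > 0
# Claim 7, case 1: the input spikes and exactly k other outputs spike -> V = 0.
assert local_charge(1, [1, 1, 1], k) == 0
# Claim 6: k other output spikes with no input spike already give V <= -1.
assert local_charge(0, [1, 1, 1], k) <= -1
```

The counterexample in claim 6 corresponds to the second assertion: $k$ inhibitory spikes are exactly canceled by one input spike.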

#### 5.1.3 Activation Functions

Various activation functions have been considered in the literature (see *Activation function*, n.d., for a detailed list). In our construction, we use a simple threshold activation function.

#### 5.1.4 Local Memorization Capability

Intuitively, when the other parameters are fixed, the higher the desired accuracy (i.e., the smaller $\delta $), the larger the required minimum memory $m^*$; that is, more memory is needed for selecting the winners in our WTA circuit. Similarly, the easier it is to distinguish two independent spike trains with different rates (i.e., the lower $T_R$), the smaller $m^*$ is. Interestingly, with other parameters fixed, $m^*$ depends on $k$ as follows: $m^*$ is increasing in $k$ when $k\in \{1,\ldots ,\lfloor n/2\rfloor \}$, and $m^*$ is decreasing in $k$ when $k\in \{\lceil n/2\rceil ,\ldots ,n-1\}$. In many practical settings, we care about the region where $k\ll n$. Besides, with the choice of bias $b=cm^*$, a larger $m^*$ also implies that a longer time is needed for our WTA circuit to declare $k$ winners (details can be found in statement 1 of theorem 6).
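The stated dependence of $m^*$ on $k$ mirrors the unimodality of the pair count $k(n-k)$ that appears in the lower bound; the quick numerical check below (our reading, for illustration only) confirms that this count rises until $k=\lfloor n/2\rfloor $ and falls afterward:

```python
# Pair count k(n - k): winner/loser pairs the circuit must separate (our reading).
n = 10
pair_counts = [k * (n - k) for k in range(1, n)]

# Rises on k = 1, ..., floor(n/2) and falls on k = ceil(n/2), ..., n - 1,
# mirroring the stated monotonicity of the required minimum memory m*.
rising = all(pair_counts[i] < pair_counts[i + 1] for i in range(n // 2 - 1))
falling = all(pair_counts[i] > pair_counts[i + 1] for i in range(n // 2, n - 2))
```

The symmetry $k(n-k)=(n-k)\,k$ also explains why selecting $k$ winners and rejecting $n-k$ losers are equally hard.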

On the other hand, in most neurons the synaptic plasticity time window is about 80 to 120 ms, and it is unclear whether equation 5.2 can be immediately satisfied. Fortunately, even if equation 5.2 is not immediately satisfied by a single neuron due to local biophysical constraints, its local memory might be realized via some population codes, such as a chain of hidden neurons.

#### 5.1.5 Algorithm 1

### 5.2 Circuit Performance

Fix $\delta \in (0,1]$ and $1\le k\le n-1$. Choose $m\ge m^*$ and $b=\max \{cm^*,2\}$. Then for any admissible rate assignment $p$, with probability at least $1-\delta $, the following statements hold:

1. There exist $k$ output neurons that spike simultaneously by time $m^*$.

2. The first set of such $k$ output neurons are the true winners $W(p)$.

3. From the first time at which these $k$ output neurons spike simultaneously, they spike consecutively for at least $b$ time slots, and no other output neuron can spike within those $b$ time slots.

The proof of theorem 6 can be found in appendix D. The first statement in the theorem implies that our WTA circuit can provide an output (a selection of $k$ output neurons) by time $m^*$; the second statement says that the circuit's output indeed corresponds to the $k$ true winners; and the third says that the $k$ simultaneous spikes of the selected winners are stable: the $k$ selected winners continue to spike consecutively for at least $b$ time slots. The proof of theorem 6 essentially says that with high probability, under algorithm 1, the number of output neurons that spike simultaneously is monotonically increasing until it reaches $k$. Upon the simultaneous spiking of $k$ output neurons, by our threshold activation rule, the other output neurons are likely to be inhibited. In particular, if these $k$ output neurons are the first $k$ output neurons that spike simultaneously, then the activation of the other output neurons is likely to be inhibited for at least $b$ time slots.
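Purely as an illustration of the intended end-to-end behavior, the following toy (names and mechanism ours; it is a naive rate estimator, not the letter's algorithm 1, which implements the competition through lateral inhibition and thresholds) recovers the true winners with high probability once the observation window is long enough:

```python
import random

def select_winners(rates, k, m, seed=0):
    """Toy k-WTA sketch: observe m slots of independent Bernoulli spikes and
    declare the k most active inputs the winners."""
    rng = random.Random(seed)
    counts = [sum(1 for _ in range(m) if rng.random() < p) for p in rates]
    ranked = sorted(range(len(rates)), key=lambda i: counts[i], reverse=True)
    return set(ranked[:k])

# With well-separated rates and a long enough window, the true winners
# (the inputs with the top-k rates) are recovered with high probability.
winners = select_winners([0.2, 0.3, 0.7, 0.8], k=2, m=500)
```

The window length $m$ plays the role of the decision time: shrinking it increases the chance that a low-rate train outspikes a high-rate one, which is exactly the accuracy-time trade-off that theorems 3 and 6 quantify.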

By theorem 6, in the activation function of algorithm 1,

In fact, we can increase the stability period by introducing a stability parameter $s$ such that $1<s\le m$ and modifying the activation rule; details can be found in algorithm 2. It is easy to see that the modified activation function falls under the general form in statement 3. Under the new activation function in algorithm 2, once output neuron $v_i$ spikes, it continues to spike for at least $s$ times. Following our line of analysis in the proof of theorem 6, it can be seen that the declared $k$ winners, from the first time they spike simultaneously, continue to spike consecutively for at least $s$ times.

(Order-Optimality). The decision time performance in statement 1 of theorem 6 matches the information-theoretic lower bound in theorem 3 up to a multiplicative constant factor, both when $\delta $ is sufficiently small and does not depend on $n$, $k$, $T_R$, $c$, and $C$, and when $\delta $ decays to zero at a speed at most $\frac{1}{(k(n-k))^{c_0}}$, where $c_0>0$ is some fixed constant. The detailed order-optimality argument is given next.

*Suppose that $\delta $ is sufficiently small and does not depend on $n$, $k$, $T_R$, $c$, and $C$.* Here, for ease of exposition, we illustrate the order-optimality with a specific choice of $\delta $. In fact, the order-optimality holds generally for constant $\delta \in (0,1)$ provided that it does not depend on $n$, $k$, $T_R$, $c$, and $C$.

It follows from theorem 3 that to have $\delta =0.1$, the decision time is no less than

This matches the lower bound in theorem 3 up to a multiplicative constant factor.

*Suppose $\delta $ decays to zero at a moderate speed*. The decision time of our WTA circuit is order-optimal even for diminishing decision error $\delta $, provided that $\delta =\Omega \left(\frac{3}{(k(n-k))^{c_0}}\right)$, where $c_0>0$; that is, $\delta $ does not decay to zero “too fast” in $k(n-k)$. To see this, let $\delta =\frac{3}{(k(n-k))^{c_0}}$ for some constant $c_0>0$. We have

*Resetting the circuit when the input spike trains become quiescent.* In algorithm 1, if the input spike trains become quiescent, then the corresponding circuits also become quiescent despite some delay in this response.

If all input neurons are quiescent at time $t_0$ and remain quiescent for all $t\ge t_0$, then $V_t(v_i)=0$ and $S_t(v_i)=0$ for any $t>t_0+m$.

Lemma 9 is proved in appendix E.

## 6 Discussion

In this letter, we investigated how $k$-WTA computation is robustly achieved in the presence of inherent noise in the input spike trains. In a spike-based $k$-WTA model, $n$ randomly generated input spike trains compete with each other, and the neurons with the top $k$ underlying firing rates are the true winners. Given the stochastic nature of the spike trains, it is not trivial to properly select winners among a group of neurons. We derived an information-theoretic lower bound on the decision time for a given decision accuracy. Notably, this lower bound holds universally for any WTA circuit that falls within our model framework, regardless of its circuit architecture or its adopted activation function. Furthermore, we constructed a circuit whose decision time matches this lower bound up to a constant multiplicative factor, implying that the lower bound is tight and that our circuit is order-optimal. Here the order-optimality is stated in terms of the scaling in $n$, $k$, and $T_R$. In addition, our results give a set of testable hypotheses on neural recordings and on human and animal behaviors in decision making.

### 6.1 Comparison to Previous WTA Models

Randomness is introduced at different stages of brain computation, and the stochastic nature of spike trains is well observed (Baddeley et al., 1997; Kara et al., 2000; Maimon & Assad, 2009; Shamir, 2009, 2006; Hertz et al., 1991; Ferrari et al., 2018). In our work, we focused on how to robustly achieve $k$-WTA computation in the face of the intrinsic randomness in the spike trains. A common WTA model assumes that neurons transmit information by a continuous variable such as the firing rate (Dayan & Abbott, 2001; Hertz et al., 1991), which often ignores the intrinsic randomness in spike trains. Although some studies used additive gaussian noise (Kriener et al., 2017; Li et al., 2013; Lee et al., 1999; Rougier & Vitay, 2006) in their rate-based WTA circuits to account for input randomness, these circuits are usually very sensitive to noise and could not successfully select even a single winner unless an additional nonlinearity is added (Kriener et al., 2017). In fact, a neuron with a second nonlinearity is similar to an output neuron in our constructed WTA circuit in that both integrate their local inputs. Unfortunately, only simulation results were provided in Kriener et al. (2017); a theoretical justification of why such a second nonlinearity makes their WTA circuit robust to input noise is lacking. The random response of a rate-based WTA is also considered in Shamir (2006), with a focus on characterizing the scaling of WTA accuracy with the population size for a two-interval, two-alternative forced choice (2I2AFC) discrimination task.

Though we focused on a spike-based model, we hope our results can provide some insight into rate-based models as well. On top of that, a rate-based model would require a high communication bandwidth, yet communication bandwidth is limited in the brain. Our spiking neural network model captures this feature by having a low communication cost, since each neuron broadcasts only 1 bit. However, we did not try to model every biologically relevant feature. In several studies using spiking network models, individual units are modeled with details such as ion channels and specific synaptic connectivity. Though more biologically relevant than our spiking neuron network model, those details significantly complicate the analysis. In fact, it could be challenging to move beyond computer simulation and characterize the model dynamics (e.g., the spiking nature of each unit, the time it takes to stabilize) analytically.

Spike-based WTA is also considered in the insightful work (Shamir, 2009) under a statistical model for a two-alternative forced-choice (2AFC) discrimination task. In particular, Shamir (2009) undertook an elegant study on the accuracy of his WTA mechanism focusing on the effects of population size, noise correlations, and baseline firing. Compared to Shamir (2009), our model is more restrictive in the sense that we do not consider the effects of population size, noise correlations, and baseline firing, yet it is more general in the sense that we consider $n\u22652$ alternatives. In addition, we take a slightly different but closely related angle; instead of focusing on characterizing the accuracy with regard to a particular WTA circuit, we provide a general lower bound that provides insight into the fundamental limits of a WTA circuit on the waiting time in deciding among independent Bernoulli input spike trains. Nevertheless, all of the features studied in Shamir (2009)—population size, noise correlations, and baseline firing—are interesting, and we would like to try to extend our results to incorporate these features in our future work.

### 6.2 Potential Applications for Physiological Experiments

Our work might further provide hypotheses for inferring changes in network size, in the similarity between input spike trains, and in synaptic memory capacity based on changes in performance accuracy. For example, in behavioral experiments using electrolytic lesions or pharmacological inhibition (Clark, Manes, Antoun, Sahakian, & Robbins, 2003; Hanks, Ditterich, & Shadlen, 2006; Yttri, Liu, & Snyder, 2013; Katz, Yates, Pillow, & Huk, 2016), the changes in performance are often highly variable and nonlinear. Such variability and nonlinearity might arise from the experimental difficulties in precisely manipulating network size and in disentangling sensory perception and motor planning from a core decision-making (winner-selecting) process. With an analytical characterization, one might be able to estimate changes in the network size given its performance changes. Several pioneering works studied the impact of the network size on accuracy (Seung & Sompolinsky, 1993; Shamir, 2009, 2006). While these works characterized this trade-off by investigating specific WTA circuits, our work provides a complementary viewpoint by characterizing a lower bound over a large family of WTA circuits.

Besides the effect of network size, the distribution of feature representations (i.e., the different sets $R$ of different individual animals) could be used to account for between-subject variability in decision making. Consider a random-dot coherent-motion task where animals need to decide in which of two directions the majority of dots are moving (Shadlen & Newsome, 2001). In this task, performance accuracy and reaction time vary across animals. If we perform neural recordings in their visual cortex (i.e., record their $R$s), we might be able to decode their reaction time or accuracy, given population representations of dot motion in these cortical neurons (Shadlen & Newsome, 1996; Jazayeri & Movshon, 2006). For example, an animal whose stimulus-evoked responses are more heterogeneous in the visual cortex might be able to react faster at the same accuracy, as governed by our derived lower bound.

Our work also offers predictions on how local memory capacity could affect performance in decision making. For example, when there is more ambiguity in the input representations, a larger minimum time window for memory storage in synapses (Knoblauch, Palm, & Sommer, 2010) is required to achieve the same accuracy. From previous experimental work (Bittner, Milstein, Grienberger, Romani, & Magee, 2017), we know that synaptic plasticity has timescales ranging from milliseconds to seconds across different brain regions, and such plasticity could efficiently store entire behavioral sequences within synaptic weights. Combined with our analytical characterization, when performance accuracy changes over time, assuming other parameters such as input rates, decision time, and network size are fixed, one might be able to predict how synaptic plasticity changes.

### 6.3 Limitations and Extensions

When $\delta $ is a constant, our lower bound is order-optimal in terms of its scaling in $n$, $k$, and $TR$. Nevertheless, the scaling of the derived lower bound in $\delta $ is not tight. It would be interesting to know the optimal scaling in $\delta $ when the other parameters ($n$, $k$, and $TR$) are fixed; we leave this as a future direction.
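To make the trade-off between observation time and accuracy concrete, the following minimal simulation (our own sketch, not the circuit construction of this letter: a centralized top-$k$ count readout on Bernoulli spike trains, with illustrative parameters) estimates the success probability as a function of the observation time $T$:

```python
import random

def simulate_success(n=20, k=3, p=0.1, gap=0.05, T=1000, trials=100, seed=0):
    """Estimate the probability that a centralized top-k readout recovers
    the k true winners. Neurons 0..k-1 fire at rate p+gap, the rest at
    rate p; each of the T time slots is an independent Bernoulli draw."""
    rng = random.Random(seed)
    rates = [p + gap] * k + [p] * (n - k)
    successes = 0
    for _ in range(trials):
        # Spike counts over T slots for each of the n input neurons.
        counts = [sum(rng.random() < r for _ in range(T)) for r in rates]
        # Select the k neurons with the largest empirical counts.
        top_k = set(sorted(range(n), key=lambda i: counts[i], reverse=True)[:k])
        successes += top_k == set(range(k))
    return successes / trials
```

Increasing $T$ (or the rate gap) drives the empirical success probability toward 1, consistent with the qualitative behavior of the bounds.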

For tractability, our model makes a few assumptions that ignore some features of the brain (Shamir, 2009). One of these assumptions is that the input neurons are independent. However, various degrees of average noise correlation between cortical neurons have been reported. For example, the average noise correlation in primary visual cortex can be close to 0.1 (Schölvinck, Saleem, Benucci, Harris, & Carandini, 2015), 0.18 (Smith & Kohn, 2008), or even much larger, around 0.35 (Gutnisky & Dragoi, 2008). Similarly, noise correlations have been observed in other sensory brain regions (Cohen & Kohn, 2011). In our work, we ignore correlations between these neurons; incorporating them into our spiking network model would be an interesting future direction. Unfortunately, the impact of noise correlation on the lower bound is unclear at first glance. One challenge in answering such questions is that, in general, the details of the correlations might matter, especially when there is more than one true winner, and it is unclear whether general statements such as "correlations always hurt" or "correlations always help" can be concluded in the end. Specifically, on the one hand, the insightful work of Shamir (2009) showed that, similar to the effect of noise correlation observed in population coding theory, noise correlations in their proposed temporal winner-take-all (tWTA) limit and harm the accuracy of the tWTA readout. Indeed, in population coding theory it is commonly reported that noise correlation harms decoding accuracy (Eyherabide & Samengo, 2013). On the other hand, correlations in the variability of neuronal firing rates do not, in general, limit the increase in coding accuracy; in some cases, correlations improve the accuracy of a population code (Abbott & Dayan, 1999; Averbeck, Latham, & Pouget, 2006).
Additionally, for the problem of $k$-WTA with $k \ge 2$, the noise correlation may be neither purely positive nor purely negative. In particular, one true winner could be positively correlated with the other true winners and negatively correlated with the nonwinners, while another true winner is negatively correlated with the other true winners and positively correlated with the nonwinners. Thus, extra care is needed when making claims about the impact of noise correlation on a WTA circuit.
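To illustrate the kind of correlation structure discussed above, the following toy sketch (our own illustration, not a mechanism proposed in this letter) generates positively correlated Bernoulli spike trains through a shared gain and estimates the induced pairwise noise correlation:

```python
import random

def correlated_trains(n=5, T=5000, p=0.2, d=0.1, seed=1):
    """Toy common-input model: each time slot draws a shared gain of +d or
    -d (equally likely) that shifts every neuron's firing probability p,
    inducing positive noise correlation across all neurons."""
    rng = random.Random(seed)
    trains = [[0] * T for _ in range(n)]
    for t in range(T):
        g = d if rng.random() < 0.5 else -d
        for i in range(n):
            trains[i][t] = 1 if rng.random() < p + g else 0
    return trains

def pairwise_corr(x, y):
    """Empirical Pearson correlation between two equal-length 0/1 sequences."""
    T = len(x)
    mx, my = sum(x) / T, sum(y) / T
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / T
    vx = sum((a - mx) ** 2 for a in x) / T
    vy = sum((b - my) ** 2 for b in y) / T
    return cov / (vx * vy) ** 0.5
```

Under this model the expected pairwise correlation is $d^2 / \big(p(1-p)\big)$; flipping the sign of the shared gain for a subset of neurons would make that subset negatively correlated with the rest, as in the mixed-sign scenario described above.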

Second, our model uses a threshold activation function, assuming in this letter that synaptic transmission is essentially noise free and that the only noise source is the input. However, synaptic transmission is highly unreliable in biological networks (Allen & Stevens, 1994; Faisal et al., 2008; Borst, 2010), and a deterministic activation function fails to capture this feature, unlike a stochastic activation function. Nevertheless, our lower bound in theorem 3 holds even if the activation functions are random. This is because the probability $P\{\widehat{W}(S) \neq W(p)\}$ incorporates the possible randomness in the activation functions, and our lower-bound characterization is independent of the activation functions used.

Another assumption in our circuit is that the output neurons can inhibit one another. In common scenarios, an output neuron is usually excitatory and does not inhibit other neurons directly without recruiting inhibitory cells. We obtain stability in our circuit implementation by assuming the output neurons can inhibit one another. For a model in which an output neuron is limited to being excitatory only, we can add a chain of inhibitory neurons to achieve stable WTA computation.

Additionally, for our lower bound to hold, we need the initial memory of each neuron, $M_1(\nu)$, to contain no information about the system's state in the past $t \le 0$. That is, except for the input spike trains, no side information (especially information on previous network dynamics) is available to a WTA circuit, and nothing happens before the start of the WTA competition that affects the WTA dynamics. We impose this assumption on $M_1(\nu)$ in order to derive an information-theoretic lower bound on the observation time. On the other hand, spontaneous firing before the presentation of an external stimulus might affect the initial states of the neurons' local memory. For those scenarios, our results remain applicable provided that the spontaneous firing is very sparse or even negligible. Nevertheless, it would be interesting to relax this assumption and study how spontaneous firing of the neurons in the past (i.e., $t \le 0$) could affect $M_1(\nu)$ in general.

Finally, in our $k$-WTA circuit, the number of output neurons that spike simultaneously increases monotonically until exactly $k$ output neurons spike simultaneously. We acknowledge that this might not be biologically plausible in most cases, especially considering the possibility of spontaneous firing. From large-scale neural recordings, we know that the number of neurons that spike simultaneously is usually variable, so constructing a circuit that better matches experimental observations is a possible future direction.

## Appendix A: Preliminaries

In this section, we present some preliminaries on information measures and Fano's inequality. Interested readers are referred to Polyanskiy and Wu (2014) for comprehensive background.

### A.1 Information Measures

Let $X$ and $Y$ be two random variables. The mutual information between $X$ and $Y$, denoted by $I(X;Y)$, measures the dependence between $X$ and $Y$, or, equivalently, the information about $X$ (resp. $Y$) provided by $Y$ (resp. $X$).
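For discrete $X$ and $Y$, this dependence measure admits the standard explicit form (see Polyanskiy & Wu, 2014):

```latex
I(X;Y) \;=\; \sum_{x,y} P_{XY}(x,y)\,\log \frac{P_{XY}(x,y)}{P_X(x)\,P_Y(y)}
       \;=\; H(X) - H(X \mid Y),
```

so $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent, and $I(X;Y)$ grows as knowing $Y$ removes more of the uncertainty in $X$.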

In the following, we use the notation $X \to Y$ to denote that $Y$ is a (possibly random) function of $X$. Thus, $W \to X \to Y \to \widehat{W}$ means that $X$ is a (possibly random) function of $W$, $Y$ is a (possibly random) function of $X$, and $\widehat{W}$ is a (possibly random) function of $Y$. Our lower bound argument relies on Fano's inequality.
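The precise statement used here is the standard one (our phrasing follows Polyanskiy & Wu, 2014): if $W$ is uniformly distributed on a finite set $\mathcal{W}$ and $W \to X \to Y \to \widehat{W}$, then

```latex
P\{\widehat{W} \neq W\} \;\ge\; 1 - \frac{I(X;Y) + \log 2}{\log |\mathcal{W}|}.
```

Here the data-processing inequality is used to bound $I(W;\widehat{W})$ by $I(X;Y)$.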

(Chernoff Bound). Let $X_1, \ldots, X_n$ be i.i.d. with $X_i \in \{0,1\}$ and $P\{X_1 = 1\} = p$. Set $X = \sum_{i=1}^{n} X_i$. Then:

For any $t \in [0, 1-p]$, we have $P\{X \ge (p+t)n\} \le \exp\left(-n\, d(p+t \,\|\, p)\right)$.

For any $t \in [0, p]$, we have $P\{X \le (p-t)n\} \le \exp\left(-n\, d(p-t \,\|\, p)\right)$.
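As a numerical sanity check (not part of the original analysis), the following sketch compares the empirical upper-tail probability of a binomial with the first bound above; `kl_bern` implements the Bernoulli KL divergence $d(\cdot \,\|\, \cdot)$, and the parameter values are illustrative:

```python
import math
import random

def kl_bern(a, b):
    """Bernoulli KL divergence d(a || b) in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def chernoff_check(n=200, p=0.3, t=0.1, trials=5000, seed=0):
    """Compare the empirical tail P{X >= (p+t)n} for X ~ Binomial(n, p)
    against the Chernoff bound exp(-n * d(p+t || p))."""
    rng = random.Random(seed)
    bound = math.exp(-n * kl_bern(p + t, p))
    threshold = (p + t) * n
    hits = sum(
        sum(rng.random() < p for _ in range(n)) >= threshold
        for _ in range(trials)
    )
    return hits / trials, bound
```

The empirical tail frequency should stay below the analytic bound; the bound is loose for moderate $n$ but decays exponentially in $n$.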

## Appendix B: Proof of Lemma 2

Lemma 2 follows easily from the independence between the input spike trains and the assumption that the spikes in each input spike train are i.i.d. For completeness, we present the proof as follows.

Proof of Lemma 2.

## Appendix C: Proof of Theorem 3

The following lemma is used in the proof of our information-theoretic lower bound. This is a technical supporting lemma, and the choice of the specific rate assignments is due to technical convenience in proving theorem 3. See appendix A for the definition of $I(\cdot\,;\cdot)$.


We prove this via a genie-aided argument (Jacobs & Berlekamp, 1967) by assuming that there is a genie that can access the firing sequences of all $n$ input neurons. By assuming the existence of a genie, we are essentially considering the centralized setting. Clearly, if the error probability is high even in the centralized setting, then no SNN (which is a distributed algorithm) can achieve a lower error probability.

13. Let $P$ be the set of such rate assignments. By Yao's minimax principle, we know the minimax probability of error is always lower-bounded by the Bayes probability of error under any prior distribution:

^{11}), we have

^{13}, we get

## Appendix D: Proof of Theorem 6

The proof of theorem 6 uses the following technical fact and lemma.

This fact follows immediately from simple algebra:

Now we are ready to prove theorem 6.

Proof of Theorem 6.

To prove theorem 6, it is enough to show that with probability $1-\delta $,

^{6}, and

6. By the choice of $t_0$, we know that on $E$, $t_0+1$ is the first time that $k$ output neurons spike simultaneously, and no other $k$ output neurons ever spike simultaneously, proving statement 2 in theorem 6.

^{6}.

which is the focus of the remainder of our proof.

^{12}), the first term in equation D.5 is bounded as

^{14}, we know

6. $\square$

## Appendix E: Proof of Lemma 9

## Notes

1. We plan to investigate the impact of the heterogeneity in the refractory period on the waiting time in our future work.

2. In this letter, the notations $O(\cdot)$ and $\Omega(\cdot)$ are used to describe the limiting behavior of a function as its argument tends toward a particular value or infinity. In our case, the waiting time can be viewed as a function of several other parameters such as $\delta $, $R$, $n$, and $k$. Formally, for any sequences $\{a_N\}$ and $\{b_N\}$, we say $a_N = O(b_N)$ if there exists an absolute constant $c > 0$ such that $a_N \le c \times b_N$. Similarly, we say $a_N = \Omega(b_N)$ if there exists an absolute constant $c > 0$ such that $a_N \ge c \times b_N$.

3. In the language of computational neuroscience, the incoming neighbors and outgoing neighbors are often referred to, respectively, as presynaptic units and postsynaptic units.

4. A more rigorous notation would be $S(T,p) := \left(\{S_t(u_1)\}_{t=1}^{T}, \ldots, \{S_t(u_n)\}_{t=1}^{T}\right)$. We write $S$ for $S(T,p)$ for ease of exposition.

5. The Kullback-Leibler (KL) divergence gauges the dissimilarity between two distributions.
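For the Bernoulli distributions used throughout this letter (e.g., in the Chernoff bound of appendix A), for parameters $p, q \in (0,1)$ it takes the explicit form

```latex
d(p \,\|\, q) \;=\; p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q},
```

which is nonnegative and equals zero if and only if $p = q$.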

6. Note that any base would work; see Polyanskiy and Wu (2014, chap. 1.1).

7. Recall that $c, C \in (0,1)$ are two absolute constants; they do not change with other parameters of the WTA circuit such as $n$, $k$, and $\delta $.

## Acknowledgments

We thank Christopher Quinn at Purdue University and Zhi-Hong Mao at the University of Pittsburgh for the helpful discussions and references.