## Abstract

Information transfer through a single neuron is a fundamental component of information processing in the brain, and computing the information channel capacity is important to understand this information processing. The problem is difficult since the capacity depends on coding, characteristics of the communication channel, and optimization over input distributions, among other issues. In this letter, we consider two models. The temporal coding model of a neuron as a communication channel assumes the output is τ where τ is a gamma-distributed random variable corresponding to the interspike interval, that is, the time it takes for the neuron to fire once. The rate coding model is similar; the output is the actual rate of firing over a fixed period of time. Theoretical studies prove that the distribution of inputs, which achieves channel capacity, is a discrete distribution with finite mass points for temporal and rate coding under a reasonable assumption. This allows us to compute numerically the capacity of a neuron. Numerical results are in a plausible range based on biological evidence to date.

## 1.  Introduction

It is widely believed that neurons send information to other neurons in the form of spike trains. Although precise timings of spikes are important for information transfer, it appears that spike patterns are not deterministic but noisy (Mainen & Sejnowski, 1995). Information theory shows that when a communication channel is corrupted with noise, the rate at which the information can be transmitted reliably through the channel is limited. The upper bound on the rate is known as the channel capacity (Shannon, 1948) (in the rest of the letter, it is referred to simply as “capacity”). When a single neuron is considered as a channel, the capacity is one of the fundamental problems in neuroscience.

The problem has been studied theoretically (MacKay & McCulloch, 1952; Rapoport & Horvath, 1960; Stein, 1967) and biologically (Borst & Theunissen, 1999). Computing capacity is difficult since it depends on multiple factors—type of coding, characteristics of the channel, and input distributions. The type of coding has long been a subject of discussion (MacKay & McCulloch, 1952; Baker & Lemon, 2000; Rullen & Thorpe, 2001). Mainly two types of coding, temporal and rate coding, have been considered. Temporal coding uses interspike intervals (ISIs) to code information, and rate coding uses the number of spikes in a fixed interval. This letter examines both of them.

The channel model is deeply related to the noise of ISIs. Baker and Lemon (2000) reported that the statistical properties of ISIs recorded from primary motor cortex and supplementary motor area (SMA) of monkeys are similar to the gamma distribution. Shinomoto, Shima, and Tanji (2003) and Shinomoto, Miyazaki, Tamura, and Fujita (2005) studied spike trains from multiple areas and proposed a statistical index that describes the randomness of ISIs. The index is deeply related to the gamma distribution (Shinomoto et al., 2003; Ikeda, 2005). In this letter, ISIs are modeled with a gamma distribution. The model is different from the channel model in MacKay and McCulloch (1952), where spikes are assumed to be aligned within a fixed time precision.

The capacity is defined as the supremum of mutual information over possible input distributions. In this letter, a natural assumption is posed, that is, the average firing rate of a single neuron is restricted to an interval. Under this assumption, we consider all possible input distributions and prove that the capacity of each coding is achieved by a discrete distribution that has only a finite number of mass points. The proof of the discreteness of capacity-achieving distributions for each coding shares its steps with other studies of information theory (Smith, 1971; Shamai (Shitz), 1990; Abou-Faycal, Trott, & Shamai (Shitz), 2001; Gursoy, Poor, & Verdú, 2002, 2005). These studies have shown the discreteness for some channels with appropriate assumptions on the input distributions. Our result shows that the information is maximally transmitted through a single neuron when the inputs to the neuron have only a fixed number of “modes.” This is important for biological experiments, since if the input distribution is discrete, the experimentalists have to consider only discrete and finite modes of inputs or stimuli. After the proof, the capacity and the capacity-achieving distribution for each coding are computed. Unfortunately, we have not obtained any analytical solution, and they are computed numerically. The results show that the capacity is around 15 to 50 bits per sec, of the same order as the values reported in Borst and Theunissen (1999).

The problem is formulated mathematically in section 2, and the discreteness for each coding is proved in section 3. Section 4 shows numerical studies, and the final section concludes with some discussion. Most of the mathematical proofs are summarized in the appendix.

## 2.  Single Neuron Channel

### 2.1.  ISIs and Communication Channel.

It has been reported that a gamma distribution is a suitable model to describe the stochastic nature of ISIs (Baker & Lemon, 2000; Shinomoto et al., 2003). The gamma distribution has two parameters: the shape parameter κ and the scale parameter θ. From some studies, κ of an individual neuron appears to be constant (the value of κ may depend on the type of neuron), while θ changes dynamically over time.

Figure 1 shows simulated spike trains with two different shape parameter κ's. It is 0.75 in Figure 1A and 4.5 in Figure 1B. When κ is small, spike trains become more irregular. Ikeda (2005) and Miura, Okada, and Amari (2006) studied the estimation methods of κ from spike trains. Estimation of κ is regarded as the semiparametric statistical estimation (Bickel, Klaassen, Ritov, & Wellner, 1993).
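Spike trains of this kind can be reproduced with a few lines of code. The following is a minimal sketch, assuming NumPy is available; the function name `simulate_spike_train`, the window length, and the random seed are illustrative choices and not taken from the letter. It uses the standard fact that a gamma distribution with shape κ has coefficient of variation 1/√κ, so a smaller κ gives more irregular trains.

```python
import numpy as np

def simulate_spike_train(kappa, mean_isi_ms, duration_ms, rng):
    """Return spike times (msec) of a renewal process with gamma-distributed ISIs."""
    theta = mean_isi_ms / kappa              # scale chosen so that E[T] = kappa * theta
    # Draw clearly more ISIs than fit in the window, then trim to the window.
    n_draw = int(4 * duration_ms / mean_isi_ms) + 50
    isis = rng.gamma(shape=kappa, scale=theta, size=n_draw)
    spikes = np.cumsum(isis)
    return spikes[spikes <= duration_ms]

rng = np.random.default_rng(0)
for kappa in (0.75, 4.5):                    # shape parameters of Figures 1A and 1B
    for mean_isi in (5.0, 50.0):             # expected ISIs of the upper and lower trains
        spikes = simulate_spike_train(kappa, mean_isi, 20000.0, rng)
        isis = np.diff(spikes)
        cv = isis.std() / isis.mean()        # theory: 1 / sqrt(kappa)
        print(f"kappa={kappa}, mean ISI={isis.mean():.2f} ms, CV={cv:.2f}")
```

The printed coefficient of variation should be close to 1.15 for κ = 0.75 and 0.47 for κ = 4.5, independent of the mean ISI.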

Figure 1:

Simulated spike trains. ISIs follow a gamma distribution, where the shape parameter κ is 0.75 for A and 4.5 for B. The expected values of ISI are 5 msec in the upper trains and 50 msec in the lower trains for both A and B.

In this letter, we focus not on the estimation but on the information processing of a single neuron. Based on the gamma distribution model, the capacity of a neuron is investigated in the following sections.

### 2.2.  Communication Channel and Capacity.

Let X be the input to a noisy channel and Y be the output. In the following, we assume X ∈ 𝒳 ⊆ ℝ is a one-dimensional stochastic variable, and let F(·) be a cumulative distribution function of X. A communication channel is defined as a stochastic model described by p(y ∣ x), and the mutual information is defined as

$$I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(y \mid x) \log \frac{p(y \mid x)}{p(y; F)} \, d\mu(y) \, dF(x), \qquad p(y; F) = \int_{\mathcal{X}} p(y \mid x) \, dF(x). \tag{2.1}$$

Here, μ(y) denotes the measure on the output space 𝒴. Since the channel is defined as p(y ∣ x), I(X; Y) is a functional of F(·), and we denote it as I(F).

Let ℱ be the set of cumulative distribution functions of X. The channel capacity is defined as

$$C = \sup_{F \in \mathcal{F}} I(F). \tag{2.2}$$

For a noisy channel, one interesting fundamental problem is to compute the capacity C. Another interesting problem is to obtain the distribution, if it exists, which achieves the capacity.

### 2.3.  Single Neuron: Channel and Coding.

Let us discuss a neuron model. First, we have to define X and Y of a neuron communication channel.

The distribution of each ISI is assumed to be independent and to follow a gamma distribution. Let T denote an ISI, a stochastic variable following a gamma distribution, that is, T ∼ Γ(κ, θ), where κ>0 and θ>0 are the shape and the scale parameter, respectively.

We assume κ of each neuron is fixed and known. Shinomoto et al. (2003) define a statistical index LV (local variation) to characterize each neuron. For T ∼ Γ(κ, θ), E[LV] = 3/(2κ + 1) holds. From their investigation with biological data, it seems that the LV of most cells lies in the interval (0.3–1.2), and κ is thus assumed to be in an interval κ ∈ [κm, κM] (κm and κM are set to 0.75 and 4.5, respectively, in section 4).
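The link between LV and κ can be checked by simulation. The sketch below assumes the usual definition of the local variation from Shinomoto et al. (2003), LV = (1/(n − 1)) Σ 3(T_i − T_{i+1})² / (T_i + T_{i+1})², and the relation E[LV] = 3/(2κ + 1) for gamma-distributed ISIs; the function name `local_variation` and the sample size are illustrative. Since LV is invariant to the scale θ, the scale is set to 1.

```python
import numpy as np

def local_variation(isis):
    """LV: mean of 3 (T_i - T_{i+1})^2 / (T_i + T_{i+1})^2 over consecutive ISI pairs."""
    t0, t1 = isis[:-1], isis[1:]
    return np.mean(3.0 * (t0 - t1) ** 2 / (t0 + t1) ** 2)

rng = np.random.default_rng(0)
for kappa in (0.75, 1.0, 4.5):
    isis = rng.gamma(shape=kappa, scale=1.0, size=200_000)
    # Empirical LV next to the theoretical value 3 / (2 kappa + 1)
    print(kappa, local_variation(isis), 3.0 / (2.0 * kappa + 1.0))
```

For κ = 0.75 and κ = 4.5 the theoretical values are 1.2 and 0.3, the two ends of the LV interval quoted above.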

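Under the assumption, the scale parameter θ is the only variable parameter that plays the role of input, that is, X in section 2.2. The density function of T is

$$p(t \mid \theta; \kappa) = \frac{1}{\Gamma(\kappa)\,\theta} \left(\frac{t}{\theta}\right)^{\kappa - 1} \exp\left(-\frac{t}{\theta}\right), \quad t > 0,$$

where we denote it as p(t ∣ θ; κ) to show θ is a stochastic variable and κ is a parameter. The gamma distribution is an exponential family:

$$p(t \mid \theta; \kappa) = \exp\left[(\kappa - 1) \log t - \frac{t}{\theta} - \log \Gamma(\kappa) - \kappa \log \theta\right]. \tag{2.3}$$

The sufficient statistics are T and log T. Their expectations are

$$E[T] = \kappa\theta, \qquad E[\log T] = \psi(\kappa) + \log \theta,$$

where ψ(·) is the digamma function defined as ψ(x) = Γ′(x)/Γ(x) for x > 0. The conditional entropy becomes

$$H(T \mid \theta; \kappa) = \kappa + (1 - \kappa)\,\psi(\kappa) + \log \Gamma(\kappa) + \log \theta.$$

Next, let us consider the family of all the possible distributions of input θ. Noting that ISI is positive and is not infinite if the neuron is active, it is natural to assume that the average ISI, which depends on θ and κ, is limited between a0 and b0 (a0 and b0 are set to 5 msec and 50 msec, respectively, in section 4), that is,

$$a_0 \le E[T] = \kappa\theta \le b_0.$$

Thus, θ is bounded in Θ(κ) = {θ ∣ a(κ) ⩽ θ ⩽ b(κ)}, where a(κ) and b(κ) are defined as

$$a(\kappa) = \frac{a_0}{\kappa}, \qquad b(\kappa) = \frac{b_0}{\kappa}.$$

In the following, a(κ), b(κ), and Θ(κ) are denoted as a, b, and Θ, respectively, as long as no confusion arises. Let us define F(θ) as the cumulative distribution function of θ and ℱ as the set of all possible F(θ), that is,

$$\mathcal{F} = \left\{ F : \mathbb{R} \to [0, 1] \;\middle|\; F(\theta) = 0 \ (\forall \theta < a), \ F(\theta) = 1 \ (\forall \theta \ge b) \right\}. \tag{2.4}$$

Note that every F ∈ ℱ is right-continuous and nondecreasing on Θ, and ℱ includes continuous and discrete distributions of θ.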
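The moments and the entropy of a gamma-distributed ISI can be checked numerically. The following sketch assumes SciPy and uses the standard closed forms E[T] = κθ, E[log T] = ψ(κ) + log θ, and entropy κ + (1 − κ)ψ(κ) + log Γ(κ) + log θ (in nats) for a fixed θ; the values κ = 2.5 and θ = 8 msec are arbitrary illustrative choices.

```python
import numpy as np
from scipy import integrate, special, stats

kappa, theta = 2.5, 8.0   # illustrative values only; theta in msec
dist = stats.gamma(a=kappa, scale=theta)

# Expectations of the sufficient statistics T and log T by numerical integration
mean_t, _ = integrate.quad(lambda t: t * dist.pdf(t), 0.0, np.inf)
mean_log_t, _ = integrate.quad(lambda t: np.log(t) * dist.pdf(t), 0.0, np.inf)

# Closed-form differential entropy of T for a fixed theta (nats)
entropy = (kappa + (1.0 - kappa) * special.digamma(kappa)
           + special.gammaln(kappa) + np.log(theta))

print(mean_t, kappa * theta)
print(mean_log_t, special.digamma(kappa) + np.log(theta))
print(entropy, dist.entropy())
```

Each printed pair should agree up to the tolerance of the numerical integration.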

Next, let us consider Y, the output of the neuron communication channel. There are mainly two different ideas in neuroscience. One idea is that Y is the ISI, T, itself (see MacKay & McCulloch, 1952, for example). This is called temporal coding (see Figure 2). The other is that Y is the rate, which is the number of spikes in a fixed time interval (see Stein, 1967). This is called rate coding (see Figure 2). In communication theory, “coding” is often used for “source coding,” “error-control coding,” and “cryptography coding.” It seems that modulation is a more suitable term for the above definition. However, we follow the standard usage of the neuroscience community. How to encode (or to modulate) the input θ to the neuron channel depends on which coding is used. For temporal coding, θ is fixed during the interval t, while θ is fixed during Δ for the rate coding. We discuss this in section 5.

Figure 2:

Two types of coding: temporal coding and rate coding.

Mutual information and the capacity also depend on coding. The capacity of each coding is formally defined in the following.

#### 2.3.1.  Temporal Coding.

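In temporal coding, the received information is T. For F ∈ ℱ, we define the marginal distribution as

$$p(t; F, \kappa) = \int_a^b p(t \mid \theta; \kappa) \, dF(\theta), \tag{2.5}$$

where p(t ∣ θ; κ) is defined in equation 2.3. The existence of p(t; F, κ) follows from the existence of p(t ∣ θ; κ). The mutual information of T and θ is defined as

$$I_T(F) = \int_a^b \int_0^\infty p(t \mid \theta; \kappa) \log \frac{p(t \mid \theta; \kappa)}{p(t; F, \kappa)} \, d\mu(t) \, dF(\theta). \tag{2.6}$$

Let us define g(t; F, κ) and rewrite p(t; F, κ) as

$$g(t; F, \kappa) = \int_a^b \frac{\exp(-t/\theta)}{\theta^{\kappa}} \, dF(\theta), \qquad p(t; F, \kappa) = \frac{t^{\kappa - 1}}{\Gamma(\kappa)} \, g(t; F, \kappa). \tag{2.7}$$

The mutual information IT(F) is rewritten as

$$I_T(F) = \int_a^b i_T(\theta; F) \, dF(\theta) - \kappa,$$

where

$$i_T(\theta; F) = -\int_0^\infty p(t \mid \theta; \kappa) \log g(t; F, \kappa) \, d\mu(t) - \kappa \log \theta.$$

Hence, the capacity per channel use or, equivalently, per spike is defined as

$$C_T = \sup_{F \in \mathcal{F}} I_T(F).$$

The capacity CT and the distribution that achieves CT are studied in the next section.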
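For a discrete input distribution, the mutual information of temporal coding reduces to a finite sum of one-dimensional integrals. The sketch below is one way to evaluate it, assuming SciPy; the function name `temporal_mutual_information`, the finite upper integration limit, and the example two-point distribution are assumptions made for illustration and are not the procedure or the results of the letter.

```python
import numpy as np
from scipy import integrate, special, stats

def temporal_mutual_information(thetas, probs, kappa):
    """I_T(F) in bits for a discrete F that puts mass probs[j] on thetas[j]."""
    thetas = np.asarray(thetas, dtype=float)
    probs = np.asarray(probs, dtype=float)
    upper = thetas.max() * (kappa + 50.0)   # gamma tail mass beyond this point is negligible

    def log_marginal(t):
        # log p(t; F, kappa), evaluated stably in the log domain
        return special.logsumexp(stats.gamma.logpdf(t, a=kappa, scale=thetas), b=probs)

    total = 0.0
    for theta_j, pi_j in zip(thetas, probs):
        def integrand(t, theta_j=theta_j):
            log_p = stats.gamma.logpdf(t, a=kappa, scale=theta_j)
            return np.exp(log_p) * (log_p - log_marginal(t))
        value, _ = integrate.quad(integrand, 0.0, upper, limit=200)
        total += pi_j * value
    return total / np.log(2.0)              # convert nats to bits

kappa = 2.0
a, b = 5.0 / kappa, 50.0 / kappa            # bounds on theta for a mean ISI in [5, 50] msec
print(temporal_mutual_information([a, b], [0.5, 0.5], kappa))
```

For a two-point input the value cannot exceed 1 bit, the entropy of the input, and a single-point input gives zero.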

#### 2.3.2.  Rate Coding.

In rate coding, a time window is set, and the spikes in the interval are counted. Let us denote the interval and the rate as Δ and R, respectively, and define the distribution of R as p(r ∣ θ; κ, Δ). The form of the distribution of R is shown in the following lemma:

Lemma 1.
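The distribution p(r ∣ θ; κ, Δ) has the following form,

$$p(r \mid \theta; \kappa, \Delta) = P\left(r\kappa, \frac{\Delta}{\theta}\right) - P\left((r + 1)\kappa, \frac{\Delta}{\theta}\right), \quad r \in \mathbb{Z}_{+}, \tag{2.8}$$

where ℤ₊ denotes the set of nonnegative integers and P(α, x) is the regularized incomplete gamma function:

$$P(0, x) = 1, \qquad P(\alpha, x) = \frac{1}{\Gamma(\alpha)} \int_0^x t^{\alpha - 1} e^{-t} \, dt \quad (\alpha > 0, \ x > 0).$$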

Proof.

See Appendix A. The same distribution is discussed in Pawlas, Klevanov, and Prokop (2008).

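When κ = 1, a gamma distribution is an exponential distribution, and the distribution of R becomes a Poisson distribution:

$$p(r \mid \theta; \kappa = 1, \Delta) = \frac{(\Delta/\theta)^r}{r!} \exp\left(-\frac{\Delta}{\theta}\right).$$

For an F ∈ ℱ, let us define the following marginal distribution p(r; F, κ, Δ):

$$p(r; F, \kappa, \Delta) = \int_a^b p(r \mid \theta; \kappa, \Delta) \, dF(\theta).$$

The existence of the integral follows from the existence of p(r ∣ θ; κ, Δ). The mutual information of R and θ is defined as

$$I_R(F) = \int_a^b i_R(\theta; F) \, dF(\theta), \qquad i_R(\theta; F) = \sum_{r = 0}^{\infty} p(r \mid \theta; \kappa, \Delta) \log \frac{p(r \mid \theta; \kappa, \Delta)}{p(r; F, \kappa, \Delta)}. \tag{2.9}$$

Hence, the capacity per channel use or, equivalently, per Δ is defined as

$$C_R = \sup_{F \in \mathcal{F}} I_R(F).$$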
The capacity CR and the distribution that achieves CR are studied in the next section.
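The spike-count distribution of lemma 1 can be checked numerically. The sketch below assumes the form p(r ∣ θ; κ, Δ) = P(rκ, Δ/θ) − P((r + 1)κ, Δ/θ) with P(0, x) = 1, and that the counting window starts at a spike; `scipy.special.gammainc` is the regularized lower incomplete gamma function. The function name `rate_pmf` and the parameter values are illustrative.

```python
import numpy as np
from scipy import special, stats

def rate_pmf(theta, kappa, delta, r_max):
    """p(r | theta; kappa, delta) for r = 0..r_max."""
    r = np.arange(r_max + 2)
    cdf = np.ones(r_max + 2)                 # convention P(0, x) = 1
    cdf[1:] = special.gammainc(r[1:] * kappa, delta / theta)
    return cdf[:-1] - cdf[1:]

delta, theta = 25.0, 4.0                     # msec, illustrative values

# kappa = 1: the counts should be Poisson with mean delta / theta
pmf_exp = rate_pmf(theta, 1.0, delta, 60)
print(np.allclose(pmf_exp, stats.poisson.pmf(np.arange(61), delta / theta)))

# kappa != 1: compare with a Monte Carlo count of gamma ISIs in the window
rng = np.random.default_rng(0)
kappa = 2.5
arrivals = np.cumsum(rng.gamma(kappa, theta, size=(50_000, 30)), axis=1)
counts = (arrivals <= delta).sum(axis=1)
empirical = np.bincount(counts, minlength=8)[:8] / counts.size
print(np.round(empirical, 3))
print(np.round(rate_pmf(theta, kappa, delta, 7), 3))
```

The two printed rows for κ = 2.5 should agree up to Monte Carlo error.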

## 3.  Theoretical Studies

The cumulative distribution function F is a right-continuous nondecreasing function on an interval Θ. Thus, θ can be a discrete or continuous random variable over Θ. In this section, the capacity-achieving distribution of a single neuron channel is proved to be a discrete distribution with a finite number of mass points for both temporal and rate coding.

For some channels, the capacity-achieving distributions have been shown to be discrete under some conditions (Smith, 1971; Shamai (Shitz), 1990; Abou-Faycal et al., 2001; Gursoy et al., 2002; Tchamkerten, 2004). The neuron channel with temporal coding is different from those because it does not have additive noise, and the proof must be provided independently. The rate coding with κ = 1 is equivalent to the Poisson channel, and the discreteness of the capacity-achieving distribution is proved in Shamai (Shitz) (1990). The proof is easily extended to the case where κ is a positive integer, but we need to prove it for positive real κ. Note that although the proofs of the discreteness in this letter are original, they follow the same steps as those papers.

### 3.1.  Steps to Prove the Discreteness of the Capacity-Achieving Distribution.

The common steps of the proof for the discreteness of the capacity-achieving distributions are shown in this section. In the following, the results of optimization theory and probability theory will be used. Suppose X is a normed linear space. In optimization theory, the space of all bounded linear functionals of X is called the normed dual of X and is denoted X*. The weak* convergence is defined as follows:

Definition 1.

A sequence {x*n} in X* is said to converge weak* to the element x* if for every x ∈ X, x*n(x) → x*(x). In this case, we write $x_n^* \xrightarrow{w^*} x^*$ (Luenberger, 1969, 5.10).

If X is the real normed linear space of all bounded continuous functions on ℝ, X* includes the set of all probability measures, and it is clear that “weak convergence” of probability measures is “weak* convergence” on X*. The results of optimization theory are applied to probability measures with this equivalence. The following theorem is used to prove the existence and the uniqueness of the capacity-achieving distribution:

Theorem 1.
Let J be a weak* continuous real-valued functional on a weak* compact subset S of X*. Then J is bounded on S and achieves its maximum on S. If S is convex and J is strictly concave, then the maximum,

$$C = \max_{x^* \in S} J(x^*), \tag{3.1}$$

is achieved by a unique x* in S.

Proof.

See Luenberger (1969, 5.10), Abou-Faycal et al. (2001), and Gursoy et al. (2002).

From the above discussion, ℱ in equation 2.4 is a subset of X*. It is clear that ℱ is convex. Thus, if ℱ is weak* compact and IT(F) (or IR(F)) is a weak* continuous function on ℱ and strictly concave in ℱ, the capacity is achieved by a unique distribution F0 in ℱ. This is the first step of the proof. The following proposition states that ℱ is compact.

Proposition 1.

ℱ in equation 2.4 is compact in the Lévy metric topology.

Proof.

For the proof of compactness, see Smith (1971) (proof of proposition 1). The proof is a direct application of Helly's compactness theorem (Doob, 1994, sec. X).

The Kuhn-Tucker (K-T) condition on the mutual information is used for the next step of the proof. Before showing the condition, we define the weak differentiability:

Definition 2.
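Let J be a function on a convex set ℱ. Let F0 be a fixed element of ℱ and η ∈ [0, 1]. Suppose there exists a map J′F0: ℱ → ℝ such that

$$J'_{F_0}(F) = \lim_{\eta \downarrow 0} \frac{J\big((1 - \eta) F_0 + \eta F\big) - J(F_0)}{\eta}, \quad F \in \mathcal{F}.$$

Then J is said to be weakly differentiable in ℱ at F0, and J′F0(F) is the weak derivative in ℱ at F0. If J is weakly differentiable in ℱ at F0 for all F0 ∈ ℱ, J is said to be weakly differentiable in ℱ.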

The K-T condition is described as follows:

Proposition 2.
Assume J is a weakly differentiable, concave functional on a convex set ℱ. If J achieves its maximum on ℱ at F0, then a necessary and sufficient condition for F0 to attain the maximum is to satisfy the following inequality for all F ∈ ℱ:

$$J'_{F_0}(F) \le 0.$$

Proof.

See proposition 1 in Smith (1971).

If IT(F) (or IR(F)) is weakly differentiable, the K-T condition is derived immediately with the theorem. Finally, the discreteness is proved by deriving a contradiction based on the K-T condition and the assumption that F0 has infinite mass points as its support. Thus, in order to show the discreteness of the capacity-achieving distribution for temporal and rate codings, the following properties must be shown:

1. IT(F) and IR(F) are weak* continuous on ℱ and strictly concave.

2. IT(F) and IR(F) are weakly differentiable.

After these are shown, the K-T condition is derived, and the discreteness and the finiteness will be checked.

### 3.2.  Discreteness of the Capacity-Achieving Distribution for Temporal Coding.

In this section, the capacity-achieving distribution for temporal coding is shown to be a discrete distribution with a finite number of points. We start with the following lemma:

Lemma 2.

IT(F) in equation 2.6 is a weak* continuous function on ℱ and strictly concave in ℱ.

Proof.

Section B.1 proves IT(F) is a weak* continuous function. IT(F) can be proved to be strictly concave following the proof of lemma 2 in Abou-Faycal et al. (2001).

Lemma 2 and theorem 1 imply that the capacity for temporal coding is achieved by a unique distribution in ℱ. In order to show it is a discrete distribution, the following lemma and corollary are used:

Lemma 3.
IT(F) in equation 2.6 is weakly differentiable in ℱ. The weak derivative at F0 ∈ ℱ has the form

$$I'_{T, F_0}(F) = \int_a^b i_T(\theta; F_0) \, dF(\theta) - I_T(F_0) - \kappa. \tag{3.2}$$

Proof.

See section B.2.

Corollary 1.
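Let E0 denote the points of increase of F0(θ) on θ ∈ [a, b]. F0 is optimal if and only if

$$i_T(\theta; F_0) \le I_T(F_0) + \kappa \quad \forall \theta \in \Theta, \qquad i_T(\theta; F_0) = I_T(F_0) + \kappa \quad \forall \theta \in E_0. \tag{3.3}$$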

Proof.

This is proved following the same steps in Smith (1971, corollary 1), with equation 3.2.

The main result of this section is summarized in the following theorem:

Theorem 2.

Under the constraint θ ∈ Θ, the channel capacity of a single neuron channel with temporal coding is achieved by a discrete distribution with a finite number of mass points.

Proof.
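The extension of iT(θ; F0) to the complex plane z, which is defined as

$$i_T(z; F_0) = -\int_0^\infty p(t \mid z; \kappa) \log g(t; F_0, \kappa) \, d\mu(t) - \kappa \log z,$$

is analytic for Re z > 0. If E0 in corollary 1 has infinitely many points, then, since Θ is bounded and closed, E0 has a limit point. Hence, from corollary 1, the identity theorem implies iT(z; F0) = IT(F0) + κ for the region Re z > 0. This region includes the positive real line ℝ₊, and

$$-\int_0^\infty p(t \mid \theta; \kappa) \log g(t; F_0, \kappa) \, d\mu(t) = \kappa \log \theta + I_T(F_0) + \kappa, \quad \theta \in \mathbb{R}_{+}, \tag{3.4}$$

is implied. The left-hand side of equation 3.4 is bounded as follows (see section B.1, equation B.4):

$$-\int_0^\infty p(t \mid \theta; \kappa) \log g(t; F_0, \kappa) \, d\mu(t) \ge \frac{1}{b} \int_0^\infty t \, p(t \mid \theta; \kappa) \, d\mu(t) + \kappa \log a. \tag{3.5}$$

Since the expectation of T with regard to p(t ∣ θ; κ) is κθ, equation 3.5 shows that the left-hand side of equation 3.4 grows at least linearly with θ. Since the right-hand side increases only with log θ, equation 3.4 cannot hold for all θ ∈ ℝ₊. This is a contradiction, and the optimal distribution has a finite number of mass points.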

### 3.3.  Discreteness of the Capacity-Achieving Distribution for Rate Coding.

The capacity-achieving distribution for rate coding is shown to be a discrete distribution with a finite number of points. Shamai (Shitz) (1990) proved that the capacity-achieving distribution of a Poisson channel under peak and average power constraints is a discrete distribution with a finite number of support points. Since θ ∈ Θ is a peak constraint, this directly proves the case κ = 1. For κ ≠ 1, further study is needed.

Lemma 4.

IR(F) in equation 2.9 is a weak* continuous function on ℱ and strictly concave in ℱ.

Proof.

Section C.1 proves IR(F) is a weak* continuous function. The concavity of IR(F) can be proved as in Abou-Faycal et al. (2001). The proof for the strict concavity follows the proof in section 7.2 of Shamai (Shitz) (1990), which is an application of Carleman's theorem (Akhiezer, 1965).

Lemma 4 and theorem 1 imply that the capacity for rate coding is achieved by a unique distribution in ℱ.

Lemma 5.
IR(F) in equation 2.9 is weakly differentiable in ℱ. The weak derivative at F0 ∈ ℱ has the form

$$I'_{R, F_0}(F) = \int_a^b i_R(\theta; F_0) \, dF(\theta) - I_R(F_0). \tag{3.6}$$

Proof.

The proof is identical to the proof of lemma 3 in section B.2.

Corollary 2.
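Let E0 denote the points of increase of F0(θ) on θ ∈ [a, b]. F0 is optimal if and only if

$$i_R(\theta; F_0) \le I_R(F_0) \quad \forall \theta \in \Theta, \qquad i_R(\theta; F_0) = I_R(F_0) \quad \forall \theta \in E_0. \tag{3.7}$$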

Proof.

This is proved following the same steps in Smith (1971, corollary 1) with equation 3.6.

Finally, the following theorem proves that the capacity-achieving distribution is a discrete distribution with a finite number of mass points:

Theorem 3.

Under a peak constraint, the channel capacity of a single neuron channel with the rate coding is achieved by a discrete distribution with a finite number of mass points.

Outline of proof.
The proof follows the same steps as that of theorem 2. The extension of iR(θ; F0) to the complex plane z is defined as

$$i_R(z; F_0) = \sum_{r = 0}^{\infty} p(r \mid z; \kappa, \Delta) \log \frac{p(r \mid z; \kappa, \Delta)}{p(r; F_0, \kappa, \Delta)}.$$

Since P(α, z) and log z are analytic for Re z > 0, iR(z; F0) is analytic for Re z > 0.

If E0 in corollary 2 has infinitely many points, then, since Θ is bounded and closed, E0 has a limit point and, hence, from equation 3.7, the identity theorem implies iR(z; F0) = IR(F0) for the region Re z > 0. This region includes the positive real line ℝ₊, and

$$i_R(\theta; F_0) = I_R(F_0), \quad \theta \in \mathbb{R}_{+}, \tag{3.8}$$

is implied. The proof (see section C.2) is completed by deriving a contradiction for equation 3.8. The contradiction is derived for κ ≥ 1 and κ < 1 separately.

## 4.  Numerical Studies

Although the capacity-achieving distribution of each coding has been proved to be discrete with a finite number of mass points, the position and probability of each point are not provided. Unfortunately, we do not have an analytic solution. This is also the case for related work (Smith, 1971; Shamai (Shitz), 1990; Abou-Faycal et al., 2001; Gursoy et al., 2005). In this section, the capacity and the capacity-achieving distribution are computed numerically for temporal and rate coding.

### 4.1.  Common Steps of Numerical Experiments.

Computing the capacity and the capacity-achieving distribution of the neuron channel is difficult since the closed-form expression of iT(θ;F) in equation 2.6 and iR(θ;F) in equation 2.9 is not provided for a general discrete F(θ). Instead, we need to evaluate integrals for iT(θ;F) and summations of infinite series for iR(θ;F). For the numerical studies, integrals for iT(θ;F) are evaluated with the Gauss-Laguerre quadrature, and infinite series for iR(θ;F) are truncated to sufficiently long finite series.
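Expectations under p(t ∣ θ; κ) are well suited to Gauss-Laguerre quadrature. The sketch below is one possible realization, assuming the generalized Gauss-Laguerre rule with weight u^(κ−1) e^(−u) after the substitution u = t/θ; the letter does not say which variant was used, and the function name `expect_gamma` and the number of nodes are illustrative. `scipy.special.roots_genlaguerre` returns the nodes and weights of that rule.

```python
import numpy as np
from scipy import special

def expect_gamma(f, theta, kappa, n_nodes=64):
    """Approximate E[f(T)] for T ~ Gamma(kappa, theta) by generalized Gauss-Laguerre quadrature."""
    # Nodes and weights for the weight function u^(kappa - 1) * exp(-u) on (0, inf)
    nodes, weights = special.roots_genlaguerre(n_nodes, kappa - 1.0)
    # With u = t / theta, the density becomes u^(kappa - 1) exp(-u) / Gamma(kappa)
    return np.sum(weights * f(theta * nodes)) / special.gamma(kappa)

kappa, theta = 2.5, 8.0   # illustrative values only
print(expect_gamma(lambda t: t, theta, kappa), kappa * theta)
print(expect_gamma(np.log, theta, kappa), special.digamma(kappa) + np.log(theta))
```

The rule is exact for polynomial integrands such as f(t) = t. For f(t) = log t the singularity at t = 0 slows convergence, so the second pair agrees only approximately.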

The strategy to compute the capacity and the capacity-achieving distributions for temporal and rate coding is as follows. Note that related work uses similar methods (Smith, 1971; Abou-Faycal et al., 2001; Gursoy et al., 2005).

1. Initialize N, the number of points, as 2.

2. Set the position and probability of each point as θj and πj (j = 1,…,N), respectively, where

   $$a \le \theta_1 < \theta_2 < \cdots < \theta_N \le b, \qquad \pi_j > 0, \qquad \sum_{j = 1}^{N} \pi_j = 1.$$

3. Starting from some initial values, maximize the corresponding mutual information (IT(F) or IR(F)) with respect to {θj} and {πj} with a gradient method until convergence.

4. When it converges, check the corresponding K-T condition (equation 3.3 or 3.7) to see if it is the capacity-achieving distribution.

5. If the K-T condition is satisfied, the capacity and the capacity-achieving distribution are obtained. Otherwise, increase N by 1 and go to step 2.
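The five steps above can be sketched for rate coding as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: L-BFGS-B with numerical gradients and random restarts stands in for the gradient method, the probabilities are parameterized by a softmax, the infinite series is truncated at `r_max`, and the K-T condition is checked on a grid of θ with a tolerance `tol`. All function names, the grid size, and the tolerance are hypothetical choices.

```python
import numpy as np
from scipy import optimize, special

def rate_pmf(theta, kappa, delta, r_max=80):
    """Rows: values of theta; columns: p(r | theta; kappa, delta) for r = 0..r_max."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))[:, None]
    r = np.arange(1, r_max + 2)[None, :]
    tail = special.gammainc(r * kappa, delta / theta)        # P(r*kappa, delta/theta), r >= 1
    cdf = np.hstack([np.ones((theta.shape[0], 1)), tail])    # convention P(0, x) = 1
    return np.clip(cdf[:, :-1] - cdf[:, 1:], 1e-300, None)   # clipping avoids log(0)

def info_density(theta, thetas, probs, kappa, delta):
    """i_R(theta; F) in nats, with the series truncated at r_max."""
    p = rate_pmf(theta, kappa, delta)
    marginal = probs @ rate_pmf(thetas, kappa, delta)
    return np.sum(p * np.log(p / marginal), axis=1)

def mutual_info(thetas, probs, kappa, delta):
    """I_R(F) in nats for the discrete F given by (thetas, probs)."""
    return probs @ info_density(thetas, thetas, probs, kappa, delta)

def fit(n_points, kappa, delta, a, b, rng, n_starts=5):
    """Step 3: maximize I_R over positions and probabilities for a fixed N."""
    def unpack(x):
        return x[:n_points], special.softmax(x[n_points:])
    def loss(x):
        thetas, probs = unpack(x)
        return -mutual_info(thetas, probs, kappa, delta)
    bounds = [(a, b)] * n_points + [(-20.0, 20.0)] * n_points
    best = None
    for _ in range(n_starts):
        x0 = np.concatenate([rng.uniform(a, b, n_points), rng.normal(size=n_points)])
        res = optimize.minimize(loss, x0, method="L-BFGS-B", bounds=bounds,
                                options={"ftol": 1e-13, "gtol": 1e-9})
        if best is None or res.fun < best.fun:
            best = res
    thetas, probs = unpack(best.x)
    return thetas, probs, -best.fun

def rate_capacity(kappa, delta, a0=5.0, b0=50.0, tol=1e-3, max_points=6, seed=0):
    """Steps 1 to 5: grow N until the K-T condition holds on a grid of theta."""
    a, b = a0 / kappa, b0 / kappa
    grid = np.linspace(a, b, 401)
    rng = np.random.default_rng(seed)
    for n_points in range(2, max_points + 1):
        thetas, probs, value = fit(n_points, kappa, delta, a, b, rng)
        check = np.concatenate([grid, thetas])
        gap = info_density(check, thetas, probs, kappa, delta).max() - value
        if gap <= tol:           # K-T: i_R(theta; F) <= I_R(F) for all theta
            break
    return thetas, probs, value / np.log(2.0), gap

thetas, probs, bits, gap = rate_capacity(kappa=1.0, delta=25.0)
print(thetas, probs, bits, gap)
```

The returned `gap` is the largest violation of the K-T inequality found on the grid, in nats. Because the maximum of iR(θ; F) over θ is an upper bound on the capacity for any F, the gap also bounds how far the returned value can be from the capacity of the truncated model.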

The range of θ must be specified for the numerical studies. The expected ISI κθ is restricted to the range from 5 msec to 50 msec, that is, 5/κ ≤ θ ≤ 50/κ. The choice of the range is discussed in section 5.

The capacity and the capacity-achieving distribution for temporal and rate coding are computed for multiple values of κ. As described in section 2.3, a statistical index LV has been proposed that characterizes spike trains (Shinomoto et al., 2003). Its expectation is related to κ as E[LV] = 3/(2κ + 1). In the following numerical studies, we vary κ from 0.75 to 4.5 in steps of 0.05 (the corresponding E[LV] ranges from 1.2 down to 0.3). This range covers the LV values of most of the cells in Shinomoto et al. (2003, 2005).

### 4.2.  Temporal Coding.

Figure 3A shows the computed capacity for each κ. The capacity CT (bit per channel use) increases monotonically as κ increases. This is natural since as κ increases, ISIs become more regular, and more information can be transferred. The capacity becomes larger than 1 bit when κ becomes 3.85.

Figure 3:

Numerical results of temporal coding. (A) Capacity CT (bit per channel use) for each κ. (B) Information rate CT (bit per sec) for each κ (see equation 4.1). C and D show the capacity-achieving distribution computed for each κ. (C) The probability mass points. For every κ, two points are on the edges of Θ(κ) (a0/κ and b0/κ, shown as ○ and ⊲, respectively). The third point × appears as κ becomes 2.10. (D) Probability of each point shown as the height. The axis for κ θ is logarithmically scaled for visual clarity.

The capacity-achieving distributions are shown in Figures 3C and 3D. For each κ, the distribution has only two or three points. Moreover, two of them are ends of the range Θ(κ) (a0/κ and b0/κ). If κ is smaller than 2.10, there are only two points. When it is equal to 2.10, the number of points becomes three. The position of the third point is very stable for different κ's. The probability of each point is shown in Figure 3D. The probabilities of both ends tend to be similar, while the probability of the third point increases gradually as κ increases.

The capacity CT is the maximum information transferred per spike. It is also important to show the information rate. Since the capacity-achieving distribution is computed, the following C′T (bit per sec) is defined:

$$C'_T = \frac{C_T}{\kappa \bar{\theta}}, \qquad \bar{\theta} = \int_a^b \theta \, dF_0(\theta), \tag{4.1}$$

where F0 is the capacity-achieving distribution and κθ̄ is the expected ISI under F0. Note that κθ̄ is around 25 msec for all κ in the experiments. The information rate is shown as a function of κ in Figure 3B. Further discussion is provided in section 5.
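The conversion from bits per spike to bits per second is simple arithmetic. The sketch below assumes that the information rate is the capacity per spike divided by the expected ISI under the capacity-achieving distribution; the function name `information_rate` and the two-point distribution with a capacity of 0.8 bit are hypothetical values for illustration, not results from the letter.

```python
import numpy as np

def information_rate(capacity_bits, thetas_ms, probs, kappa):
    """Bits per second: capacity per spike divided by the mean ISI under F0."""
    mean_isi_s = kappa * np.dot(probs, thetas_ms) / 1000.0   # E[T] = kappa * E[theta], in seconds
    return capacity_bits / mean_isi_s

# Hypothetical two-point distribution on the ends of Theta(kappa)
kappa = 2.0
rate = information_rate(0.8, [5.0 / kappa, 50.0 / kappa], [0.55, 0.45], kappa)
print(rate)
```

With these numbers the expected ISI is 25.25 msec, so a capacity of 0.8 bit per spike corresponds to roughly 32 bits per sec.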

### 4.3.  Rate Coding.

In rate coding, the time window Δ must be defined. Since the average time for sending a symbol with temporal coding is around 25 msec, Δ is set to 25 msec in the numerical experiment.

Figure 4A shows the computed channel capacity for each κ. CR increases monotonically as κ increases. The value is larger than CT for the same κ. It becomes larger than 1 bit when κ becomes 2.15.

Figure 4:

Numerical results of rate coding. (A) Capacity CR (bit per channel use) for each κ. (B) Information rate CR (bit per sec) for each κ. C and D show the capacity-achieving distribution computed for each κ. (C) The probability mass points. For every κ, two points are on the ends of the range (a0 and b0, shown as ○ and ⊲, respectively). The third point × appears as κ becomes larger than 1.20. The fourth point * appears as κ becomes 4.0. (D) Probability of each point shown as the height. The axis for κ θ is logarithmically scaled for visual clarity.

The capacity-achieving distributions are shown in Figures 4C and 4D. For each κ, the distribution has two to four discrete points, and two of them are the ends of the range Θ(κ) (a0/κ and b0/κ). For κ < 1.25, there are only two points. For 1.25 ≤ κ < 4, there are three points, and the number becomes four for κ ≥ 4.0. The probability of each point is shown in Figure 4D. The probabilities of both ends tend to be similar, while the probability of the third point increases gradually as κ increases. When the number of mass points is four, the two middle points have similar probabilities.

In rate coding, the information rate is easily computed. Since Δ is fixed, the rate is computed as C′R = CR/Δ (bit per sec), which is shown in Figure 4B.

## 5.  Discussion and Conclusion

We have proved the channel capacities of a single neuron with temporal and rate coding are achieved with discrete distributions. Numerical studies show that the number of mass points is from two to four depending on coding and κ. The capacity of a single neuron evaluated in this letter is lower than what has been reported in MacKay and McCulloch (1952) and Rapoport and Horvath (1960) (1000 to 4000 bits per sec), and its order is similar to biologically measured capacities of sensory neurons (Borst & Theunissen, 1999). However, this does not mean the capacity can be achieved biologically. The problem has been simplified in our study, and the details should be discussed. Since channel capacity depends on various factors, each factor is discussed separately in the rest of this section.

### 5.1.  Encoding: Input Distribution of θ.

First, we discuss the input θ. Since the ISI is positive and is not infinite if the neuron is active, the constraint (a ⩽ θ ⩽ b) seems to be natural. The range of the expected ISI κθ has been set to [5 msec, 50 msec] throughout the letter. The firing rate of each neuron depends on its type, and this range may not be plausible for some neurons. Note that for temporal coding, if the “dynamic range” of the firing rate is 10 dB, the capacity per channel use is identical to the result of this letter. The capacity of rate coding depends on the dynamic range and Δ; therefore, the capacity result of this letter may not be appropriate for some neurons.

In the range Θ(κ), the distribution of θ has been assumed to be memoryless, that is, θ can be different for every channel use. Scale parameter θ must be changed every 5 msec at most in temporal coding and 25 msec in rate coding. Biologically speaking, θ corresponds to the input to a neuron, and it cannot be changed quickly since the neuron has capacitance. Thus, the source would have memory. This implies the biologically achievable rate should be smaller than the capacity obtained in the numerical studies.

Another problem is the duration to keep the input θ, especially for temporal coding. When θ is fixed for some duration, the neuron fires according to the gamma distribution; however, the “sender” cannot know when the “receiver” receives the spike. In order to detect an ISI, the receiver must receive two spikes, and it is not clear how the sender can be synchronized with the receiver. One idea is to have a common clock and fix θ in an interval. This situation turns out to be rate coding. Another idea is to fix θ for a time proportional to the expected ISI, κ θ. In this case, the receiver may miss some spikes. In either case, the transmitted information will be lower than the numerically computed capacity.

When κ = 1, the rate coding becomes identical to the “Poisson channel” (Bar-David, 1969; Shamai (Shitz), 1990; Guo, Shamai (Shitz), & Verdú, 2008). There is a great deal of work on Poisson channel communication, and many types of constraints on the input distributions have been considered (Verdú, 1999, provides a summary of Poisson channel communications). Our constraint is a memoryless peak energy constraint, and other constraints can be added. One of the commonly used constraints is the average energy constraint, in which an average over the input distribution is bounded by a constant C. Even if we add an average energy constraint to a peak power constraint, we believe the optimal distribution is still discrete for each coding. This has been proved for the Poisson channel in Shamai (Shitz) (1990), and its extension to general values of κ seems possible. For the temporal coding, the proof can be straightforwardly extended, as in Smith (1971). However, we do not know how to set C, which prevents us from employing an average energy constraint. Note that adding an average energy constraint possibly makes the set ℱ, and thus the capacity, smaller, and our result is an upper bound of the capacity with an average energy constraint.

The capacity-achieving distributions are discrete distributions with a finite number of mass points. Although this is good in the sense that neurons can transfer the maximum amount of information with a discrete number of “firing modes,” this does not imply neurons are using only discrete modes. The input of each neuron may vary continuously. The result in this letter shows that even if the input has rich information, the sender cannot send more information than a Markovian source with finite discrete states.

### 5.2.  Noisy Channel Model.

Characteristics of neurons strongly depend on their types. MacKay and McCulloch (1952) assumed that a neuron is able to fire within a fixed time precision. They concluded that each spike can carry up to 9 bits of information, and approximately 1000 to 3000 bits per second could be transferred theoretically. Compared to some biological studies summarized in Borst and Theunissen (1999), this observation might be optimistic. We modeled the stochastic property of ISIs with a gamma distribution. This is quite different from the model in MacKay and McCulloch (1952).

We set the value of κ between 0.75 and 4.5, as indicated in Shinomoto et al. (2003); however, in Baker and Lemon (2000), κ is set to 16, which is much larger than our choice (see note 4). As κ increases, the capacity and the number of mass points of the capacity-achieving distribution increase; therefore, the capacity and the number of mass points for κ = 16 would be much larger than our numerical results. We have not shown numerical results for κ = 16, since it is difficult to carry out numerical experiments with a large κ because of limited numerical precision. This may be solved in the future.

It is also interesting to consider a communication channel with multiple neurons. If there are m neurons, each following the same gamma distribution Γ(κ, θ), the sum of their ISIs follows Γ(mκ, θ) and the average of their ISIs follows Γ(mκ, θ/m). Since the channel capacities CT and CR increase as κ increases, the channel capacity will be larger with multiple neurons. Note that the capacity-achieving distribution is still a discrete distribution with a finite number of probability mass points.
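The statement about sums and averages of ISIs can be checked numerically. The following is a minimal sketch, not part of the original study: it assumes that the m neurons fire independently, uses hypothetical parameter values, and compares only the first two moments of the simulated sum and average with those of Γ(mκ, θ) and Γ(mκ, θ/m). The function name `summed_isi_samples` is introduced here for illustration.

```python
import random
import statistics

def summed_isi_samples(m, kappa, theta, n_samples, seed=0):
    """Draw n_samples of the sum of m independent Gamma(kappa, theta) ISIs."""
    rng = random.Random(seed)
    return [
        sum(rng.gammavariate(kappa, theta) for _ in range(m))
        for _ in range(n_samples)
    ]

# Hypothetical parameter values chosen only for illustration.
m, kappa, theta = 4, 2.0, 1.5
sums = summed_isi_samples(m, kappa, theta, n_samples=20000)
averages = [s / m for s in sums]

# A gamma distribution with shape k and scale s has mean k*s and variance k*s**2.
# Sum:     shape m*kappa, scale theta.
# Average: shape m*kappa, scale theta/m.
print("sum mean    ", statistics.mean(sums), "expected", m * kappa * theta)
print("sum variance", statistics.pvariance(sums), "expected", m * kappa * theta**2)
print("avg mean    ", statistics.mean(averages), "expected", kappa * theta)
print("avg variance", statistics.pvariance(averages), "expected", kappa * theta**2 / m)
```

Matching moments do not prove that the distributions agree, but they are consistent with the shape parameter being multiplied by m, which is the property the capacity argument relies on.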

### 5.3.  Decoding.

The capacity is the maximum of transferred information. In order to achieve the capacity, the receiver must act as an optimal decoder. Let us define the positions and probabilities of the mass points of the capacity-achieving distributions for temporal and rate coding as {θT,i, πT,i} and {θR,i, πR,i}, i = 1, …, N, respectively. The optimal decoder for temporal coding computes the following posterior probability when t is observed,

$$\varpi_{T,i} = \frac{\pi_{T,i}\, p(t \mid \theta_{T,i}; \kappa)}{\sum_{j=1}^{N} \pi_{T,j}\, p(t \mid \theta_{T,j}; \kappa)},$$

while the optimal decoder for rate coding computes the following posterior probability when r is observed:

$$\varpi_{R,i} = \frac{\pi_{R,i}\, p(r \mid \theta_{R,i}; \kappa, \Delta)}{\sum_{j=1}^{N} \pi_{R,j}\, p(r \mid \theta_{R,j}; \kappa, \Delta)}.$$

The discrete distributions ϖT,i and ϖR,i are the posterior distributions of the input θ conditioned on the observations. This “soft decoding” is natural from a mathematical viewpoint; however, it may not be plausible to assume that the postsynaptic neuron is computing ϖT,i and ϖR,i since the computation is complicated and the value of κ must be known by the neuron.
Another natural decoding is hard decoding; depending on t or r, only a single θ is considered as the decoding result. The Bayes optimal hard decoding is to choose the θi that maximizes the posterior distribution. In the case of single-neuron information channels, the hard decoding results for temporal and rate coding are defined as, respectively,

$$\hat{\theta}_{T} = \operatorname*{arg\,max}_{\theta_{T,i}} \varpi_{T,i}, \qquad \hat{\theta}_{R} = \operatorname*{arg\,max}_{\theta_{R,i}} \varpi_{R,i}.$$
Each decoder becomes a simple threshold function. Figure 5 shows the hard decoding boundaries for temporal and rate coding. In temporal coding, t is a nonnegative real number, and decision boundaries are shown in Figure 5A. In rate coding, r is a nonnegative integer, and decisions for integers are shown in Figure 5B.
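The soft and hard decoders for temporal coding can be sketched in a few lines. This is an illustrative sketch only: the two mass points, their probabilities, and the value of κ below are hypothetical and are not the capacity-achieving values reported in this letter, and the function names are introduced here for illustration.

```python
import math

def gamma_pdf(t, kappa, theta):
    """Gamma density with shape kappa and scale theta, evaluated at t > 0."""
    log_p = ((kappa - 1) * math.log(t) - t / theta
             - math.lgamma(kappa) - kappa * math.log(theta))
    return math.exp(log_p)

def soft_decode(t, thetas, priors, kappa):
    """Posterior probability of each mass point given the observed ISI t."""
    weights = [pi * gamma_pdf(t, kappa, th) for th, pi in zip(thetas, priors)]
    total = sum(weights)
    return [w / total for w in weights]

def hard_decode(t, thetas, priors, kappa):
    """Index of the mass point with the largest posterior probability."""
    post = soft_decode(t, thetas, priors, kappa)
    return max(range(len(thetas)), key=lambda i: post[i])

# Hypothetical two-point input distribution (not taken from the letter).
thetas = [1.0, 5.0]
priors = [0.5, 0.5]
kappa = 2.0

for t in (0.5, 3.0, 4.5, 20.0):
    print(t, soft_decode(t, thetas, priors, kappa), hard_decode(t, thetas, priors, kappa))
```

With two mass points and equal probabilities, equating the two posterior terms gives a single decision boundary at t = κ log(θ2/θ1) / (1/θ1 − 1/θ2), which is about 4.02 for the values above. This is the sense in which the hard decoder reduces to a threshold on the observed ISI.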
Figure 5:

Hard decoding. (A) Hard decoding boundaries for temporal coding. The decision depends on the received ISI and κ. ○, ⊲, and × correspond to θ's in Figure 3. (B) Hard decoding for rate coding. The decision depends on the received rate and κ. ○, ⊲, and × correspond to θ's in Figure 4.

Note that the boundary in Figure 5A is stable for κ between 0.75 and 2.6: even though the capacity-achieving distribution has three mass points for κ ⩾ 2.10 (see Figure 3), the third point does not appear in the decisions until κ > 2.6. Similar results are observed for rate coding. Although the number of points is more than 3 if κ > 1.20 (see Figure 4), the decision becomes three points only when κ > 1.55. Even though the number of points is four for κ ⩾ 4.00, the fourth point does not appear in the hard decisions. Decision boundaries are not sensitive to small changes of κ.

When hard decoding is employed, both input and output are discrete, and the transferred information can be computed easily. The transferred information with the capacity-achieving distribution and the optimal hard decoders is shown in Figure 6. It shows that the transferred information is degraded relative to the optimal soft decoder; however, the lost information is not very large.
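To illustrate why the computation is easy once both ends are discrete, the sketch below evaluates the mutual information of a discrete channel from its transition matrix. It is a hedged example under simplifying assumptions: κ = 1 is chosen so that the ISI is exponentially distributed and its cumulative distribution has a closed form, the two mass points are hypothetical, and the decoder uses a single threshold on the ISI. None of these values are the ones used for Figure 6.

```python
import math

def mutual_information_bits(priors, channel):
    """I(X;Y) in bits for a discrete channel; channel[i][j] = Pr(Y = j | X = i)."""
    n_out = len(channel[0])
    p_y = [sum(priors[i] * channel[i][j] for i in range(len(priors)))
           for j in range(n_out)]
    info = 0.0
    for i, p_x in enumerate(priors):
        for j in range(n_out):
            p = channel[i][j]
            if p_x > 0 and p > 0:
                info += p_x * p * math.log2(p / p_y[j])
    return info

def hard_channel_exponential(thetas, threshold):
    """Transition matrix for kappa = 1 when the decoder thresholds the ISI once."""
    rows = []
    for theta in thetas:
        p_low = 1.0 - math.exp(-threshold / theta)  # Pr(T <= threshold | theta)
        rows.append([p_low, 1.0 - p_low])
    return rows

# Hypothetical two-point input with equal probabilities (not from the letter).
thetas, priors = [1.0, 5.0], [0.5, 0.5]
# Maximum a posteriori threshold for kappa = 1 and equal probabilities.
threshold = math.log(thetas[1] / thetas[0]) / (1 / thetas[0] - 1 / thetas[1])
channel = hard_channel_exponential(thetas, threshold)
print("transferred information (bits):", mutual_information_bits(priors, channel))
```

The same `mutual_information_bits` helper applies to rate coding, where the rows of the transition matrix would be obtained by summing p(r ∣ θ; κ, Δ) over the integers assigned to each decision.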

Figure 6:

Transferred information per channel use with hard decoders. (A) Transferred information with the capacity-achieving distribution and the optimal hard decoder for temporal coding. The dotted line shows the capacity in Figure 3A. (B) Transferred information with the capacity-achieving distribution and the optimal hard decoder for rate coding. The dotted line shows the capacity in Figure 4A.

### 5.4.  Related Work.

Stein (1967) has discussed the channel capacity of the rate coding where a gamma distribution with a fixed κ was the ISI model. The input was assumed to be a discrete distribution of the scale parameter θ on an interval, which happened to be optimal, and the capacity was computed numerically in a similar manner.

Although the assumption corresponds to the optimal distribution, the discreteness had not been proved. We believe this letter is the first to prove the discreteness of the optimal distribution for general κ, not only for rate coding but also for temporal coding.

Finally we note that similar work has emerged recently in a slightly different context (McDonnell & Stocks, 2008; Nikitin, Stocks, Morse, & McDonnell, 2008).

### 5.5.  Conclusion.

The channel capacity and the capacity-achieving distribution are obtained for a single neuron information channel. ISIs are modeled with a gamma distribution, and two types of coding, temporal and rate, are considered. Capacity-achieving distributions are proved to be discrete distributions with a finite number of points. Numerical studies show that the number of the points is relatively small for a moderate choice of κ. It should also be noted that neurons may not use efficient error-control codes, which require a fairly long delay. Instead, the actual encoding and decoding may be very simple and far from optimal as far as the rate is concerned.

The result does not necessarily imply that the neuron is using discrete states as ISIs or that the decoding is soft decoding. However, the information capacity gives the upper bound of the information, which can be transferred through a single neuron. This limit has implications. If the input is a continuous distribution, the transferred information is lower than the capacity, and if hard decoding is employed, the transferred information is lower than the capacity.

In neurophysiological experiments, many trials are accumulated because signals are generally noisy. The results of this letter provide a general guide for how much information could be obtained through a single recording. They also offer suggestions for the field of brain-machine interface or brain-computer interface (BCI), which tries to extract information from neurons' spikes.

## Appendix A:  Proof of Lemma 1

The lemma is proved by induction.

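p(0 ∣ θ; κ, Δ) is the probability that T is larger than Δ. Since T ∼ Γ(κ, θ),

$$p(0 \mid \theta; \kappa, \Delta) = \int_{\Delta}^{\infty} \frac{t^{\kappa-1} e^{-t/\theta}}{\Gamma(\kappa)\,\theta^{\kappa}}\,dt = 1 - P(\kappa, \Delta_\theta),$$

where Δθ = Δ/θ. Assuming equation 2.8 is true for some m, p(m + 1 ∣ θ; κ, Δ) is written as follows: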
A.1
If the following relation holds for , it completes the proof:
A.2
Equation A.2 is easily checked for m = 0: ( denotes the set of positive integers) is justified as follows:
where the following relations of P(α, x) and the beta function have been used:
A.3
Note that equation A.3 follows from the following equation:
A.4
Since equation A.2 holds for , equation A.1 becomes
Equation 2.8 holds for every .
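The lemma can also be checked numerically. The sketch below is not part of the proof. It assumes that equation 2.8 has the form p(r ∣ θ; κ, Δ) = P(rκ, Δθ) − P((r + 1)κ, Δθ), with P(α, x) the regularized lower incomplete gamma function and P(0, x) = 1, and that the counting window of length Δ starts immediately after a spike. The parameter values and function names are hypothetical.

```python
import math
import random

def reg_lower_gamma(a, x, terms=200):
    """Regularized lower incomplete gamma function P(a, x) for a >= 0, x > 0.

    Uses the series P(a, x) = exp(-x) * x**a * sum_n x**n / Gamma(a + n + 1),
    with the convention P(0, x) = 1.
    """
    if a == 0:
        return 1.0
    log_x = math.log(x)
    return sum(
        math.exp(-x + (a + n) * log_x - math.lgamma(a + n + 1))
        for n in range(terms)
    )

def rate_pmf(r, kappa, x):
    """Assumed form of equation 2.8, with x = delta / theta."""
    return reg_lower_gamma(r * kappa, x) - reg_lower_gamma((r + 1) * kappa, x)

def simulate_count(kappa, theta, delta, rng):
    """Count spikes in (0, delta] for Gamma(kappa, theta) ISIs after a spike at time 0."""
    t, r = rng.gammavariate(kappa, theta), 0
    while t <= delta:
        r += 1
        t += rng.gammavariate(kappa, theta)
    return r

# Hypothetical parameter values chosen only for illustration.
kappa, theta, delta = 2.5, 1.0, 4.0
rng = random.Random(0)
counts = [simulate_count(kappa, theta, delta, rng) for _ in range(20000)]
for r in range(4):
    print(r, rate_pmf(r, kappa, delta / theta), counts.count(r) / len(counts))
```

For κ = 1 the same expression reduces to the Poisson probability exp(−Δθ) Δθ^r / r!, since P(n, x) for a positive integer n is the probability that a Poisson variable with mean x is at least n.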

## Appendix B:  Capacity-Achieving Distribution for Temporal Coding

### B.1.  Proof of Lemma 2: IT(F) is Weak* Continuous.

IT(F) is weak* continuous if the following relation holds,
B.1
since IT(F) = hT(F; κ) − κ hT∣θ(F; κ) − κ; more precisely,
hT∣θ(Fn; κ) → hT∣θ(F; κ) holds since hT∣θ(Fn; κ) = ∫_a^b log θ dFn(θ) and log θ is a bounded continuous function for θ ∈ Θ.
Next we show the following equalities:
B.2
B.3
The interchange of integral and limit in equation B.2 is justified as follows. From equations 2.5 and 2.7, p(t; F, κ) and g(t;F, κ) are bounded as follows:
B.4
From these bounds, p(t; Fn, κ)log g(t; Fn, κ) is bounded for all Fn with finite A1 and A2 as follows:
B.5
The right-hand side of equation B.5 is integrable as
Since the quantity in equation B.5 is bounded from above by an integrable function, equation B.2 is justified by the Lebesgue dominated convergence theorem. Since p(t ∣ θ; κ) and exp[−t/θ]/θ^κ are continuous bounded functions of θ ∈ Θ, p(t; F, κ) and g(t; F, κ) are continuous functions of F, and p(t; Fn, κ) log g(t; Fn, κ) is also continuous for every . These arguments justify equation B.3, and equation B.1 is justified.

### B.2.  Proof of Lemma 3.

Let us define Fη and rewrite iT(θ;F) in equation 2.6 as follows:
Then
B.6
B.7
The weak derivative of IT(F) at F0 is defined as IT, F0(F) = limη ↓ 0(IT(Fη) − IT(F0))/η. By dividing the term in equation B.6 with η and by taking η ↓ 0, it becomes
By noting g(t; Fη, κ) = (1 − η) g(t; F0, κ) + η g(t; F, κ), the term in equation B.7 becomes 0. Thus, the weak derivative becomes
which does exist and IT(F) is weakly differentiable.

## Appendix C:  Capacity-Achieving Distribution for Rate Coding

### C.1.  Proof of Lemma 4.

First, the following proposition is shown:

Proposition 3.

The expectation of R with respect to p(r ∣ θ; κ, Δ) is finite.

Proof.
The expectation of R is
Since P(α, x) is a strictly decreasing function of α for α>0, x>0, if κ ≥ 1
Thus, the upper bound is given as
where R1, Δθ = Δθ holds from the fact that p(r ∣ θ; 1, Δθ) is a Poisson distribution. For κ < 1, P(rκ, Δθ) ⩽ P(⌊rκ⌋, Δθ) holds, and is bounded as follows:

IR(F) is weak* continuous if the following relation holds:
C.1
From the definitions of IR(F) and iR(θ, F) in equation 2.9,
Since iR(θ, F) is a positive continuous function of θ, if it is bounded from above, this is justified from the Helly-Bray theorem. It will be shown separately for κ ≥ 1 and κ < 1.
For κ ⩾ 1: Since P(α, Δθ) is a decreasing function of α, the following inequality holds from equation A.4.
C.2
With the above equation, p(r;F, κ, Δ) is bounded from below as follows:
where Δm = Δ/b and ΔM = Δ/a are the minimum and the maximum of Δθ, respectively. Thus,
where B is the following upper bound:
With the result of proposition 3, iR(θ, F) is bounded from above:
For κ < 1: When κ < 1, the following relation holds from equation A.4:
Since P(α, x) is a decreasing function, the following relation holds:
C.3
The above equation gives the following bound of p(r;F, κ, Δ):
From the property of the gamma function, Γ(rκ + 1)/Γ(rκ + κ + 1) decreases as r increases for r > 1/κ, and there exists a finite positive integer r0 ⩾ 1/κ such that, for all r ⩾ r0, the following inequality holds for a positive real number C1:
Thus,
With the result of proposition 3,
where S1 is finite. It can be shown that there exists a real number C2>0, such that p(r ∣ θ; κ, Δ)>C2 for all θ ∈ Θ, r ∈ {0, …, r0 − 1} and the following sum is finite:
Thus iR(θ, F) = S1 + S2 is bounded from above.

### C.2.  Proof of Theorem 3.

First, the following proposition is shown:

Proposition 4.
As x → ∞ (), the following equation holds:
C.4

Proof of proposition 4.
From proposition 3 in section C.1, ∑_{r=1}^∞ P(rm, x) is bounded from above by a linear function of x. Let us define the sum as Sm(x). From equation A.3,
It is easily checked that
Thus, the following linear differential equation is derived:
When the differential equation is solved, the general solution gives the following form of Sm(x):
C.5
Since ∣Re αk∣ < 1, Re(−1 + αk) < 0 holds for k ∈ {1, …, m − 1}, and lim_{x→∞} Sm(x)/x = 1/m.
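The limit stated at the end of the proof can be checked numerically for small integer m. The sketch below is an illustration, not part of the proof. It relies on the identity that, for a positive integer n, P(n, x) equals the probability that a Poisson variable with mean x is at least n, and it truncates the Poisson tail at a cutoff far above the mean. The function names and the cutoff rule are choices made here for illustration.

```python
import math

def poisson_pmf(k, x):
    """Pr(N = k) for N ~ Poisson(x), computed in log space."""
    return math.exp(-x + k * math.log(x) - math.lgamma(k + 1))

def sum_of_P(m, x):
    """Approximate S_m(x) = sum over r >= 1 of P(r*m, x) for a positive integer m.

    For integer n, P(n, x) = Pr(N >= n) with N ~ Poisson(x), so tail
    probabilities are accumulated downward from a cutoff far above the mean.
    """
    k_max = int(x + 20 * math.sqrt(x) + 50)
    tail, total = 0.0, 0.0
    for n in range(k_max, 0, -1):
        tail += poisson_pmf(n, x)  # tail now equals Pr(n <= N <= k_max)
        if n % m == 0:
            total += tail
    return total

for x in (10.0, 100.0, 1000.0):
    print(x, [sum_of_P(m, x) / x for m in (1, 2, 3)])
```

For m = 1 the sum equals the mean of the Poisson distribution, so the ratio is 1 for every x. For m = 2 and m = 3 the printed ratios approach 1/2 and 1/3 as x grows, in line with the limit Sm(x)/x → 1/m.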

Corollary 3.

As θ ↓ 0, the expectation of R with respect to p(r ∣ θ; κ, Δ) grows proportional to Δθ = Δ/θ.

Proof of corollary 3.
The expectation is bounded as follows:
From proposition 4, ∑_{r=1}^∞ P(r⌈κ⌉, Δθ) and ∑_{r=1}^∞ P(r⌊κ⌋, Δθ) grow in proportion to Δθ, which proves the corollary.

Let us prove theorem 3.

For κ ≥ 1: From equation C.2, p(r;F, κ, Δ) is bounded from above as follows:
and
where D is the following lower bound:
This shows iR(θ, F) is bounded from below as
Since grows with Δθ as θ ↓ 0, the lower bound of iR(θ, F) grows with Δθlog Δθ. Thus, iR(θ, F) cannot be finite and constant for , which brings the contradiction.
For κ < 1: From equation C.3, p(r;F, κ, Δ) is bounded from above as follows:
C.6
Let us denote r as
r and b can be considered as stochastic variables, and the following relation holds:
Let HR and HR′ be the entropy of R and R′, respectively, and HB∣R′ be the conditional entropy of B given R′. The following relation holds:
which is justified from 0 ≤ HB∣R′ ≤ log K. With this result,
Since ⌊κK⌋ = 1 holds, the probability q(r′ ∣ θ; κ, Δ) is bounded as follows:
C.7
With equations C.6 and C.7,
where E is the following lower bound:
This shows iR(θ, F) is bounded from below as
Since ∑r = 0rq(r ∣ θ; κ, Δ) is equivalent to , proposition 4 shows that it grows proportional to Δθ as θ ↓ 0. Thus, iR(θ, F) is lower bounded with a term that grows with Δθlog Δθ and iR(θ, F) cannot be finite and constant for , which brings the contradiction.

## Acknowledgments

We are grateful for helpful discussions with Mark D. McDonnell. We also thank the anonymous reviewers for valuable feedback. This work was supported by Grant-in-Aid for Scientific Research No. 18079013, MEXT, Japan.

## Notes

1. Data from pre-SMA, SMA, rostral cingulate motor area (CMAr), and prefrontal cortical area (PF) of monkeys are studied in Shinomoto et al. (2003), while data from different layers of area TE of monkeys are studied in Shinomoto et al. (2005).

2. In Stein (1967), the distribution of θ was assumed to be discrete. We do not assume it in this letter.

3. We used bit instead of nat by dividing capacity defined in equation 2.2 by log 2.

4. One of the reviewers indicated the value of κ might be much larger than that given by current references in the literature.

## References

Abou-Faycal, I. C., Trott, M. D., & Shamai (Shitz), S. (2001). The capacity of discrete-time memoryless Rayleigh-fading channels. IEEE Transactions on Information Theory, 47, 1290–1301.

Akhiezer, N. (1965). The classical moment problem. (N. Kemmer, Trans.). Edinburgh: Oliver & Boyd.

Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. J. Neurophysiol., 84, 1770–1780.

Bar-David, I. (1969). Communication under the Poisson regime. IEEE Transactions on Information Theory, 15, 31–37.

Bickel, P. J., Klaassen, C., Ritov, Y., & Wellner, J. (1993). Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press.

Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience, 2, 947–957.

Doob, J. (1994). Measure theory. Berlin: Springer-Verlag.

Guo, D., Shamai (Shitz), S., & Verdú, S. (2008). Mutual information and conditional mean estimation in Poisson channels. IEEE Transactions on Information Theory, 54, 1837–1849.

Gursoy, M. C., Poor, H. V., & Verdú, S. (2002). The capacity of the noncoherent Rician fading channel (Tech. Rep.). Princeton, NJ: Princeton University.

Gursoy, M. C., Poor, V., & Verdú, S. (2005). The noncoherent Rician fading channel, Part I: Structure of the capacity-achieving input. IEEE Transactions on Wireless Communications, 4, 2193–2206.

Ikeda, K. (2005). Information geometry of interspike intervals in spiking neurons. Neural Computation, 17, 2719–2735.

Luenberger, D. G. (1969). Optimization by vector space method. Hoboken, NJ: Wiley.

MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bull. Math. Biophys., 14, 127–135.

Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.

McDonnell, M. D., & Stocks, N. G. (2008). Maximally informative stimuli and tuning curves for sigmoidal rate-coding neurons and populations. Physical Review Letters, 101, 058103.

Miura, K., , M., & Amari, S. (2006). Estimating spiking irregularities under changing environments. Neural Computation, 18, 2359–2386.

Nikitin, A. P., Stocks, N. G., Morse, R. P., & McDonnell, M. D. (2008). Neural population coding is optimized by discrete tuning curves. Manuscript submitted for publication.

Pawlas, Z., Klevanov, L. B., & Prokop, M. (2008). Parameters of spike trains observed in a short time window. Neural Computation, 20, 1325–1343.

Rapoport, A., & Horvath, W. J. (1960). The theoretical channel capacity of a single neuron as determined by various coding systems. Information and Control, 3, 335–350.

Rullen, R. V., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Computation, 13, 1255–1283.

Shamai (Shitz), S. (1990). Capacity of a pulse amplitude modulated direct detection photon channel. IEE Proceedings, 137, 424–430.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.

Shinomoto, S., Miyazaki, Y., Tamura, H., & Fujita, I. (2005). Regional and laminar difference in in vivo firing patterns of primate cortical neurons. J. Neurophysiol., 94, 567–575.

Shinomoto, S., Shima, K., & Tanji, J. (2003). Differences in spiking patterns among cortical neurons. Neural Computation, 15, 2823–2842.

Smith, J. G. (1971). The information capacity of amplitude- and variance-constrained scalar gaussian channels. Information and Control, 18, 203–219.

Stein, R. B. (1967). The information capacity of nerve cells using a frequency code. Biophysical Journal, 797–826.

Tchamkerten, A. (2004). On the discreteness of capacity-achieving distributions. IEEE Transactions on Information Theory, 50, 2773–2778.

Verdú, S. (1999). Poisson communication theory. Invited talk in the International Technion Communication Day in honor of Israel Bar-David. Available online at http://www.princeton.edu/~verdu.