## Abstract

Signal-to-noise ratios in physical systems can be significantly degraded if the outputs of the systems are highly variable. Biological processes for which highly stereotyped signal generation is a necessary feature appear to have reduced their signal variability by employing multiple processing steps. To better understand why this multistep cascade structure might be desirable, we prove that the reliability of a signal generated by a multistate system with no memory (i.e., a Markov chain) is maximal if and only if the system topology is such that the process steps irreversibly through each state, with transition rates chosen such that an equal fraction of the total signal is generated in each state. Furthermore, our result indicates that by increasing the number of states, it is possible to arbitrarily increase the reliability of the system. In a physical system, however, an energy cost is associated with maintaining irreversible transitions, and this cost increases with the number of such transitions (i.e., the number of states). Thus, an infinite-length chain, which would be perfectly reliable, is infeasible. To model the effects of energy demands on the maximally reliable solution, we numerically optimize the topology under two distinct energy functions that penalize irreversible transitions and incommunicability between states, respectively. In both cases, the solutions are essentially irreversible linear chains, but with upper bounds on the number of states set by the amount of available energy. We therefore conclude that a physical system for which signal reliability is important should employ a linear architecture, with the number of states (and thus the reliability) determined by the intrinsic energy constraints of the system.

## 1. Introduction

In many physical systems, a high degree of signal stereotypy is desirable. In the retina, for example, the total number of G proteins turned on during the lifetime of activated rhodopsin following a photon absorption event needs to have a low variability to ensure that the resulting neural signal is more or less the same from trial to trial (Rieke & Baylor, 1998). If this were not the case, accurate vision in low-light conditions would not be possible. Biology offers a myriad of other examples where signal reproducibility or temporal reliability is necessary for proper function: muscle fiber contraction (Edmonds, Gibb, & Colquhoun, 1995b), action potential generation and propagation (Kandel, Schwartz, & Jessell, 2000), neural computations underlying motor control (Olivier, Davare, Andres, & Fadiga, 2007) or time estimation (Buhusi & Meck, 2005), ion channel and pump dynamics (Edmonds, Gibb, & Colquhoun, 1995a), circadian rhythm generation (Reppert & Weaver, 2002), cell-signaling cascades (Locasale & Chakraborty, 2008), and others. In some cases, it may be possible to reduce signal variability by making a system exceedingly fast, but in many cases, a nonzero mean processing time is necessary. The mechanism involved in the inactivation of rhodopsin, for example, needs to have some latency in order for enough G proteins to accumulate to effect the neural signal. In this article, we address the question of how to design a physical system that has a low signal variability while maintaining some desired nonzero mean total signal (and thus a nonzero mean processing time).

A previous numerical study of the variability of the signal generated during the
lifetime of activated rhodopsin found that a multistep inactivation procedure (with
the individual steps proposed to be sequential phosphorylations) was required to
account for the low variability observed experimentally (Hamer, Nicholas, Tranchina,
Liebman, & Lamb, 2003). This theoretical
prediction was borne out when knockouts of phosphorylation sites in the rhodopsin
gene were seen to result in increased variability (Doan, Mendez, Detwiler, Chen,
& Rieke, 2006). These results led us to
consider more generally whether a multistep system is optimal in terms of the
reliability of an accumulated signal. Specifically, we limit our consideration to
memoryless systems, where the future evolution of the system dynamics depends on the
current configuration of the system but not additionally on the history of past
configurations. If such a memoryless system has a finite or countable number of
distinct configurations (states) with near-instantaneous transition times between
them, it can be modeled as a continuous-time Markov chain. This class of models,
though somewhat restricted, is sufficiently rich to adequately approximate a wide
variety of physical systems, including the phosphorylation cascade employed in the
inactivation of rhodopsin. By restricting ourselves to systems that can be modeled
by Markov chains, our goal of identifying the system design that minimizes the
variance of the total generated signal while maintaining some nonzero mean may be
restated as the goal of determining the Markov chain network topology that meets
these requirements given a set of state-specific signal accumulation rates.^{1} This is the primary focus of the work presented here.

The article is organized as follows. In section 2, we review basic continuous-time Markov chain theory, introduce our
notation, and review the necessary theory of the “hitting time,” or
first passage time, between two states in a Markov network. We then define a random
variable to represent the total signal generated during the path between the first
and last states in the network and show that this is a simple generalization of the
hitting time itself. The squared coefficient of variation of this variable (the
CV^{2}, or ratio of the variance to the square of the mean) will be our
measure of the variability of the system modeled by the Markov chain. In section 3, we present our main theoretical result
regarding the maximally reliable network topology. Simply stated, we prove that a
linear Markov chain with transition rates between pairs of adjacent states that are
proportional to the state-specific signal accumulation rates is optimal in that it
minimizes the CV^{2} of the total generated signal. In the special case that
the state-specific signal accumulation rates are all equal to one, the total
generated signal is the hitting time, and the optimally reliable solution is a
linear chain with the same transition rate between all adjacent states (see Figure 1b). As an intermediate step, we also prove
a general bound regarding the signal reliability of an arbitrary Markov chain (see
equation 3.3), which we show to be
saturated only for the optimal topology. In section 4, we numerically study the deviations from the optimal solution when
additional constraints are applied to the network topology. Specifically, we develop
cost functions that are meant to represent the energy demands that a physical system
might be expected to meet. As the available “energy” is reduced, the
maximally reliable structure deviates further and further from the optimal (i.e.,
infinite energy) solution. If the cost function penalizes a quantity analogous to
the Gibbs free energy difference between states, then the resulting solution is
composed of two regimes: a directed component, which is essentially an irreversible
linear subchain, followed by a diffusive component, where the forward and backward
transition rates between pairs of states along the chain become identical (see
section 4.2). In the zero energy limit, the
maximally reliable solution is purely diffusive, which is a topology amenable to
analytic interpretation (see section 4.2.1). If instead the cost function penalizes all near-zero transition
rates, then states are seen to merge until, at the minimum energy limit, the
topology reduces to a simple two-state system (see section 4.3). In sections 4.4 and 4.5, we present a brief
analytic comparison of the solutions given by the two energy functions to show that
although they superficially seem quite different, they are in fact analogous. In
both cases, the amount of available energy sets a maximum or effective maximum
number of allowable states, and within this state space, the maximally reliable
Markov chain architecture is a linear chain with transition rates between each pair
of adjacent states that are proportional to the state-specific signal accumulation
rates. Finally, in section 4.6, we argue that
structure is necessary for reliability and that randomly connected Markov chains do
not confer improved reliability with increased numbers of states. From this, we
conclude that investing the energy resources needed to construct a linear Markov
chain would be advantageous to a physical system.

## 2. Continuous Time Markov Chains

A Markov chain is a simple model of a stochastic dynamical system that is assumed to transition between a finite or countable number of states (see Figure 1a). Furthermore, it is memoryless—the future is independent of the past given the present. This feature of the model is called the Markov property.

An *N*-state continuous-time Markov chain is defined by a set of transition rates {λ_{ij} : ∀*i*, ∀*j* ≠ *i*} that describe the dynamics of the network, where λ_{ij} is the rate of transition from state *j* to state *i*. The dwell time in each state prior to a transition to a new state is given by an exponentially distributed random variable, which appropriately captures the Markovian nature of the system. Specifically, the dwell time in state *j* is given by an exponential distribution with time constant τ_{j}, where

$$\tau_j = \Big( \sum_{i \neq j} \lambda_{ij} \Big)^{-1}, \tag{2.1}$$

the inverse of the sum of all of the transition rates away from state *j*. Once a transition away from state *j* occurs, the probability *p*_{j→i} that the transition is to state *i* is given by the relative value of λ_{ij} compared to the other rates of transitions leaving state *j*. Specifically,

$$p_{j \to i} = \frac{\lambda_{ij}}{\sum_{k \neq j} \lambda_{kj}} = \tau_j \lambda_{ij}. \tag{2.2}$$

It is convenient to construct an *N* × *N* transition rate matrix **A** to describe a homogeneous continuous-time Markov chain as follows:

$$A_{ij} = \begin{cases} \lambda_{ij}, & i \neq j, \\ q_j, & i = j, \end{cases} \tag{2.3}$$

where *q*_{j} = −1/τ_{j}. Note that each column of **A** sums to zero and that all off-diagonal elements (the transition rates) are nonnegative. The set of all *N* × *N* matrices of this form corresponds to the full set of *N*-state continuous-time Markov chains. For an introduction to the general theory of Markov chains, see, for example, Norris (2004).
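The construction above can be sketched in code. The following is an illustrative implementation (not from the article; the function names are our own): it builds the column-convention rate matrix **A**, checks that each column sums to zero, and samples the exponential dwell time and next state for a given current state.

```python
import random

def rate_matrix(rates):
    """Build A from rates[(i, j)] = lambda_ij (rate of the transition j -> i)."""
    N = 1 + max(max(pair) for pair in rates)
    A = [[0.0] * N for _ in range(N)]
    for (i, j), lam in rates.items():
        A[i][j] = lam
    for j in range(N):  # diagonal A[j][j] = -1/tau_j, so each column sums to zero
        A[j][j] = -sum(A[i][j] for i in range(N) if i != j)
    return A

def dwell_and_jump(A, j, rng=random):
    """Sample the exponential dwell time in state j and the next state."""
    out = [(i, A[i][j]) for i in range(len(A)) if i != j and A[i][j] > 0]
    total = sum(lam for _, lam in out)          # = 1/tau_j
    t = rng.expovariate(total)                  # dwell time ~ Exp with mean tau_j
    r, acc = rng.random() * total, 0.0
    for i, lam in out:                          # next state i w.p. tau_j * lambda_ij
        acc += lam
        if r <= acc:
            return t, i
    return t, out[-1][0]

# A 3-state linear chain (0-indexed) with unit forward rates; the last
# column is all zeros, so the final state is collecting.
A = rate_matrix({(1, 0): 1.0, (2, 1): 1.0})
assert all(abs(sum(col)) < 1e-12 for col in zip(*A))
```

Repeatedly calling `dwell_and_jump` until the final state is reached yields one sample trajectory of the chain.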

### 2.1. Hitting Times and Total Generated Signals.

We wish to study the statistics of the time required for the system to first arrive in state *N* given that it starts in state 1.^{2} This time is often referred to as the hitting time, which can be represented by the random variable *t*_{1N}. The total generated signal, *F*_{1N}, can subsequently be defined in terms of the hitting time and the state-specific signal accumulation rates. If the rate of signal accumulation in state *i* is given by the coefficient *f*_{i}, and the current state at time *t* is denoted *q*(*t*), then we can define the total signal as

$$F_{1N} \equiv \int_0^{t_{1N}} f_{q(t)}\, dt. \tag{2.4}$$

Note that in the case that *f*_{i} = 1, ∀*i*, the total generated signal equals the hitting time. The statistics of the random variables *t*_{1N} and *F*_{1N} are governed by the topology of the network, namely, the transition matrix **A**. Our goal is to identify the network topology that minimizes the variance of the total signal and thus maximizes the reliability of the system, while holding the mean total signal constant.

The probability that the system is in state *N* at time *t* is given as

$$p(q(t) = N) = \mathbf{e}_N^{\mathsf{T}} e^{\mathbf{A}t} \mathbf{e}_1, \tag{2.5}$$

where **e**_{1} and **e**_{N} are the first and *N*th standard basis vectors, respectively (Norris, 2004). If the *N*th column of **A** is a zero vector so that transitions away from state *N* are disallowed, making *N* a so-called collecting state of the system, then *p*(*q*(*t*) = *N*) is equivalent to the probability that the hitting time *t*_{1N} is less than *t*. Assuming that this is true,^{3} then the time derivative of equation 2.5 is the probability density of the hitting time itself:

$$p(t_{1N}) = \frac{d}{dt}\, \mathbf{e}_N^{\mathsf{T}} e^{\mathbf{A}t} \mathbf{e}_1 = \mathbf{e}_N^{\mathsf{T}} \mathbf{A}\, e^{\mathbf{A}t} \mathbf{e}_1. \tag{2.6}$$

Note that state *N* must be collecting in order for this distribution to integrate to 1. Additionally, *p*(*t*_{1N}) is well defined only if state *N* is both accessible from state 1 and is the only collecting state or collecting set of states accessible from state 1. We consider only topologies for which these three properties hold.

Studying the statistics of the hitting time *t*_{1N} is equivalent to studying the statistics of the total generated signal *F*_{1N} since the two random variables are simple transforms of each other. To determine the probability distribution of *F*_{1N}, we can consider the signal accumulation rates to simply rescale time. The transition rate λ_{ij} can be stated as the number of transitions to state *i* per unit of accumulated time when the system is in state *j*, and so the ratio λ_{ij}/*f*_{j} can similarly be stated as the number of transitions to state *i* per unit of accumulated signal when the system is in state *j*. Thus, by dividing each column of **A** by the corresponding signal accumulation rate, we can define the matrix **Φ** with elements φ_{ij} ≡ λ_{ij}/*f*_{j}. Then the probability distribution of *F*_{1N} is given, by analogy with the hitting time distribution (see equation 2.6), as

$$p(F_{1N}) = \mathbf{e}_N^{\mathsf{T}} \mathbf{\Phi}\, e^{\mathbf{\Phi} F_{1N}} \mathbf{e}_1. \tag{2.7}$$

Thus, for the remainder of the article, we focus solely on the reliability of a Markovian system as measured by the statistics of the hitting time rather than of the total generated signal (i.e., we consider *f*_{i} = 1, ∀*i*). This can be done without loss of generality since the results we present regarding the reliability of the hitting time can be translated to an equivalent set of results regarding the reliability of the total generated signal by a simple column-wise rescaling of the transition rate matrices.
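As a sanity check on the definition of the total generated signal, a small Monte Carlo sketch (our own, with arbitrary example rates) for the linear-chain case: each dwell time is an independent exponential, so the mean of *F*_{1N} should equal the sum over states of *f*_{i}/λ_{i}.

```python
import random

def sample_F(lams, fs, rng):
    """One draw of F_1N for a linear chain: sum of (signal rate) x (dwell time)."""
    return sum(f * rng.expovariate(lam) for lam, f in zip(lams, fs))

rng = random.Random(0)
lams, fs = [1.0, 2.0, 4.0], [1.0, 3.0, 2.0]     # arbitrary example values
n = 20000
mean_F = sum(sample_F(lams, fs, rng) for _ in range(n)) / n
exact = sum(f / lam for lam, f in zip(lams, fs))  # = 1 + 1.5 + 0.5 = 3.0
assert abs(mean_F - exact) < 0.1                  # loose statistical tolerance
```

With all `fs` equal to 1, `sample_F` reduces to a draw of the hitting time itself, mirroring the reduction argued in the text.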

### 2.2. The *CV*^{2} as a Measure of Reliability.


It is clear that given a Markov chain with a set of fixed relative transition
rates, the reliability of the system should be independent of the absolute
values of the rates (i.e., the scale) since a scaling of the rates would merely
rescale time (e.g., change the units from seconds to minutes). Furthermore, the
variance and the square of the mean of the hitting time *t*_{1N} would be expected to vary in proportion to each other given a scaling of
the transition rates, since again this is just a rescaling of time. This can be
demonstrated by noting, from equation 2.3, that scaling the rates of a Markov chain by the same factor is
equivalent to scaling **A**, since **A** is linear in the rates λ_{ij}, and, from equation 2.6,
that scaling **A** is equivalent to rescaling *t*_{1N} and thus the statistics of *t*_{1N}. Therefore, we use the squared coefficient of variation (CV^{2},
or the dimensionless ratio of the variance to the square of the mean) to measure
the reliability of a Markov chain and seek to determine the network topology
(i.e., with fixed relative rates, but not fixed absolute rates), which minimizes
the CV^{2} and thus is maximally reliable.
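This scale invariance is easy to verify in the special case of a linear chain, where the CV^{2} has a closed form (variance and squared mean are sums over independent exponential dwell times). The sketch below is our own illustration with arbitrary example rates:

```python
def cv2_linear(lams):
    """CV^2 of the hitting time of a linear chain with forward rates lams."""
    mean = sum(1.0 / lam for lam in lams)
    var = sum(1.0 / lam ** 2 for lam in lams)
    return var / mean ** 2

lams = [1.0, 2.0, 5.0]
for c in (0.1, 1.0, 60.0):            # e.g., rescaling seconds to minutes
    scaled = [c * lam for lam in lams]
    # CV^2 depends only on the relative rates, not the absolute scale.
    assert abs(cv2_linear(scaled) - cv2_linear(lams)) < 1e-12
```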

## 3. Optimal Reliability

Intuition suggests that a linear chain with a constant transition rate between adjacent states (i.e., λ_{i+1,i} = λ for all *i* and for some constant rate λ, and λ_{ij} = 0 for *j* ≠ *i* − 1; see Figure 1b) may be optimal. For such a chain, the hitting time *t*_{1N} equals the sum from 1 to *M* (where we define *M* ≡ *N* − 1 for convenience) of the dwell times in each state of the chain *t*_{i,i+1}. This gives the CV^{2} as

$$\mathrm{CV}^2 = \frac{\operatorname{var}(t_{1N})}{\langle t_{1N} \rangle^2} = \frac{\sum_{i=1}^{M} \operatorname{var}(t_{i,i+1})}{\left( \sum_{i=1}^{M} \langle t_{i,i+1} \rangle \right)^2}, \tag{3.1}$$

where we use the fact that the dwell times are independent random variables, and so their means and variances simply add. Since the *t*_{i,i+1} are drawn from an exponential distribution with mean 1/λ and variance 1/λ^{2}, the CV^{2} reduces further as

$$\mathrm{CV}^2 = \frac{M/\lambda^2}{(M/\lambda)^2} = \frac{1}{M}. \tag{3.2}$$

Whether some other topology can achieve a CV^{2} below 1/*M* appears to be less obvious. The main mathematical result of this article is that no other topologies reach a CV^{2} of 1/*M*. Our proof, detailed in appendix A, proceeds in two steps. First, we prove that the following bound holds for all *N*-state Markov chains:

$$\mathrm{CV}^2 \geq \frac{1}{N-1} = \frac{1}{M}. \tag{3.3}$$

Second, we show that our proposed constant-rate linear chain is the unique solution that saturates this bound.
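The full proof covers arbitrary topologies, but the bound is easy to check numerically within the sub-family of linear chains, where CV^{2} = (Σ 1/λ_{i}^{2}) / (Σ 1/λ_{i})^{2}. This sketch (our own) confirms that constant rates saturate 1/*M* and that randomly chosen unequal rates always do worse:

```python
import random

def cv2_linear(lams):
    """CV^2 of the hitting time of a linear chain with forward rates lams."""
    mean = sum(1.0 / lam for lam in lams)
    var = sum(1.0 / lam ** 2 for lam in lams)
    return var / mean ** 2

M = 7
# Constant rates saturate the bound of equation 3.3: CV^2 = 1/M.
assert abs(cv2_linear([3.0] * M) - 1.0 / M) < 1e-12

rng = random.Random(1)
for _ in range(1000):                  # unequal rates never beat the bound
    lams = [rng.uniform(0.1, 10.0) for _ in range(M)]
    assert cv2_linear(lams) >= 1.0 / M - 1e-12
```

For linear chains the inequality is just Cauchy-Schwarz applied to the reciprocal rates; the content of appendix A is that no non-linear topology can do better either.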

Confirming the relevance of this theoretical result to natural systems, the best fit
of a detailed kinetic model for rhodopsin inactivation to experimental data has
exactly the constant-rate linear chain architecture, although for the total
generated signal rather than for the lifetime of the system (i.e., in each
phosphorylation state, the rate of subsequent phosphorylation is proportional to the
state-specific G protein activation rate, and so the mean fraction of the total
signal accumulated in each state is constant) (Gibson, Parkes, & Liebman, 2000; Hamer et al., 2003). We postulate that studies of other biological
systems for which temporal or total signal reliabilities are necessary features will
uncover similar constant-rate linear topologies. Although not experimentally
validated, previous theoretical work (Miller & Wang, 2006) has shown that a constant-rate linear chain could
be implemented by the brain as a potential mechanism for measuring an interval of
time. Specifically, if a set of strongly intraconnected populations of bistable
integrate-and-fire neurons is weakly interconnected in a series, then the
total network works like a stochastic clock (i.e., in the presence of noise). By
activating the first population through some external input, each subsequent
population is activated in turn after some delay given by the strength of the
connectivity between the populations. The time from the activation of the first
population until the last is equivalent to the hitting time in a linear Markov chain
with each population representing a state. Interestingly, Miller and Wang (2006) use this timing mechanism as a way to
explain the well-known adherence to Weber's law seen in the behavioral data for
interval timing (Gibbon, 1977; Buhusi &
Meck, 2005), while our result indicates that
this timing architecture is optimal without reference to the data.^{4}

## 4. Numerical Studies of Energy Constraints

Given the inverse relationship between the CV^{2} of the hitting time (or the
total generated signal) and the number of states for a system with a linear,
unidirectional topology (see equation 3.2), it would seem that such a system may be made arbitrarily reliable
by increasing the number of states. Why, then, do physical systems not employ a
massively large number of states to essentially eliminate trial-to-trial variability
in signal generation? The inactivation of activated rhodopsin, for example, appears
to be an eight-state (Doan et al., 2006) or
nine-state (Hamer et al., 2003) system. Why
did nature not increase this number to hundreds of thousands of states? In the case
of rhodopsin, one might speculate that the reduction in variability achieved with
only eight or nine states is sufficient to render negligible the contribution that
the variability in the number of G proteins generated by rhodopsin adds to the total
noise in the visual system (i.e., it is small compared to other noise components
such as photon shot noise and intrinsic noise in the retinothalamocortical neural
circuitry); more generally, for an arbitrary system, it is reasonable to hypothesize
that a huge number of states is infeasible due to the cost incurred in maintaining
such a large state-space. We will attempt to understand this cost by defining a
measure of “energy” over the topology of the system.

The optimal solution given in the previous section consists entirely of irreversible
transitions. We can analyze the energetics of such a topology by borrowing from
Arrhenius kinetic theory and considering transitions between states in the Markov
chain to be analogous to individual reactions in a chemical process. An irreversible
reaction is associated with an infinite energy drop, and thus our optimal topology
is an energetic impossibility. Even if one deviates slightly from the optimal
solution and sets the transitions to be nearly, but not perfectly, irreversible,
then each step is associated with a large, though finite, energy drop. Thus, the
total energy required to reset the system following its progression from the first
to the final state would equal the sum of all of the energy drops between each pair
of states. In this context, it is apparent why a physical system could not maintain
the optimal solution with a large number of states *N* since each
additional state would add to the total energy drop across the length of the chain.
At some point, the cost of an additional state would outweigh the benefit in terms
of reduced variability, and thus the final topology would have a number of states
that balances the counteracting goals of variability reduction and conservation of
energy resources.

Specifically, in Arrhenius theory, energy differences are proportional to the
negative logarithms of the reaction rates (i.e., Δ*E* ∝
−ln λ). In sections 4.2 and 4.3, we define two different energy
functions, both consistent with this proportionality, though each applies it
differently and has a different interpretation. We then numerically optimize the
transition rates of an *N*-state Markov chain to minimize the
CV^{2} of the hitting time while holding the total energy *E*_{tot} constant. This process is repeated for many values of *E*_{tot} to understand the role that the energy constraints defined by the two
different energy functions play in determining the minimally variable solution. As
expected and shown in the results here, the CV^{2} of the optimal solution
increases with decreasing *E*_{tot}.

### 4.1. Numerical Methods.

Constrained optimization was performed using the optimization toolbox in Matlab.
Rather than minimize the CV^{2} of the hitting time directly, the
variance was minimized while the mean was constrained to be 1, thus making the
variance equal to the CV^{2}. Expressions for the mean and the variance
of the hitting time for arbitrary transition rate matrices are given using the
known formula for the moments of the hitting time distribution (Norris, 2004; see appendix B for a derivation). In
order to implicitly enforce the constraint that the rates must be positive, the
rates λ_{ij} were reparameterized in terms of θ_{ij}, where θ_{ij} ≡ −ln λ_{ij}. The variance was then minimized over the new parameters rather than the
rates themselves. The gradients of these functions with respect to ** θ** were also used to speed the convergence of the optimization routine. The
gradients of the mean, variance, and energy functions are given in appendix
C.
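A minimal sketch of the moment computation (our own implementation; appendix B is not reproduced here). For a chain with collecting state *N*, the hitting time is phase-type distributed, and its moments follow from the transient block of **A**: E[t^k] = k! * 1^T (-Q)^{-k} e_1, where Q is the (N−1)×(N−1) block of **A** restricted to states 1, …, N−1.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * bb for a, bb in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def hitting_moments(A):
    """(mean, variance) of t_1N for rate matrix A (column convention)."""
    n = len(A) - 1                               # transient states only
    negQ = [[-A[i][j] for j in range(n)] for i in range(n)]
    e1 = [1.0] + [0.0] * (n - 1)
    x = solve(negQ, e1)                          # (-Q)^{-1} e_1
    y = solve(negQ, x)                           # (-Q)^{-2} e_1
    mean, second = sum(x), 2.0 * sum(y)          # E[t], E[t^2]
    return mean, second - mean ** 2

# Check against a linear chain with forward rates 1 and 2, where the dwell
# times are independent exponentials: mean = 1 + 1/2, var = 1 + 1/4.
A = [[-1.0, 0.0, 0.0], [1.0, -2.0, 0.0], [0.0, 2.0, 0.0]]
mean, var = hitting_moments(A)
assert abs(mean - 1.5) < 1e-9 and abs(var - 1.25) < 1e-9
```

An optimizer can then minimize `var` subject to `mean == 1`, as described above, with the rates parameterized as λ_{ij} = e^{−θ_{ij}} to enforce positivity.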

For each value of *E*_{tot}, parameter optimization was repeated with multiple initial
conditions to both discard locally optimal solutions and avoid numerically
unstable parameter regimes. For the second energy function (see section 4.3), local optima were encountered, while
none were observed in the parameter space of the first (see section 4.2).

### 4.2. Energy Cost Function I: Constrain Irreversibility of Transitions.

Our first approach at developing a reasonable energy cost function is predicated
on the idea that if the values of the reciprocal rates between a single pair of
states (e.g., λ_{ij} and λ_{ji} for states *i* and *j*) are unequal, then
this asymmetry should be penalized, but that the penalty should not depend on
the absolute values of the rates themselves. Thus, if two states have equal
rates between them (including zero rates), no penalty is incurred, but if the
rates differ significantly, then large penalties are applied. In other words,
perfectly reversible transitions are not penalized, while perfectly irreversible
transitions are infinitely costly. From an energy standpoint, different rates
between a pair of states can be thought of as resulting from a quantity
analogous to the Gibbs free energy difference between the states. In chemical
reactions, reactants and products have intrinsic energy values defined by their
specific chemical makeups. The difference between the product and the reactant
energies is the Gibbs free energy drop of a reaction. If this value is negative,
then the forward reaction rate exceeds the backward rate, and vice versa if the
value is positive. By analogy then, we can consider each state in a Markov chain
to be associated with an energy, and thus the relative difference between the
transition rates in a reciprocal rate pair is due to the energy drop between
their corresponding states.^{5} For nonzero energy differences, one of the rates is fast because it is
associated with a drop in energy, while the other is slow since energy must be
invested to achieve the transition. If the energy difference is zero, then both
rates are identical. This idea is schematized in Figure 2a, where the energy drop
Δ*E*_{ij} between states *i* and *j* results in a faster rate λ_{ji} than λ_{ij}.

The total energy *E*_{tot} can then be given as the sum of the energy drops between every pair of states:

$$E_{\mathrm{tot}} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N-1} |\Delta E_{ij}|, \tag{4.1}$$

where we exclude pairs that contain state *N* (i.e., since the outgoing rates for transitions away from state *N* do not affect the hitting time *t*_{1N} and thus can always be set to equal the reciprocal incoming rates to state *N*, making those energy drops zero) and count each rate pair only once (because |Δ*E*_{ij}| = |Δ*E*_{ji}|). From Arrhenius kinetic theory, the Gibbs free energy difference is proportional to the logarithm of the ratio of the forward and backward reaction rates, and so we use the following definition for the energy drop (plotted in Figure 2b for a single pair of reciprocal rates):

$$|\Delta E_{ij}| = \left| \ln \frac{\lambda_{ij}}{\lambda_{ji}} \right|. \tag{4.2}$$

Therefore, the complete energy cost function is

$$E_{\mathrm{tot}} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N-1} \left| \ln \frac{\lambda_{ij}}{\lambda_{ji}} \right|. \tag{4.3}$$

Note that the individual magnitudes of the rates do not enter into the energy function, only the relative magnitudes between pairs of reciprocal rates.

The results of numerical optimization of the rates λ_{ij} to minimize the CV^{2} of the hitting time *t*_{1N} under the energy function defined in equation 4.3 are given in Figure 3a. At large values of *E*_{tot}, the optimized solution approaches the asymptotic CV^{2} limit of 1/*M* (i.e., the CV^{2} of the unconstrained or infinite
energy, ideal linear chain; see equation 3.2). As the available energy is decreased,
the CV^{2} consequently increases until, at *E*_{tot} = 0, the CV^{2} reaches a maximal level set by the function ξ(*M*), which will be defined in
section 4.2.1.

Upon inspection, for large values of the total energy, the optimized transition
rate matrix is seen, as presumed, to be essentially identical to the ideal,
infinite energy solution. Specifically, the forward transition rates along the
linear chain (i.e., the elements of the lower subdiagonal of the transition rate
matrix; see equation 2.3) are
all essentially equal to each other, while the remaining rates are all
essentially zero. Since the energy function does not penalize symmetric
reciprocal rate pairs, the reciprocal rates between nonadjacent states in a
linear chain (which are both zero and thus symmetric) would not contribute to
the energy. Thus, it would be expected that the optimal solutions found using
this energy function would be linear chains, and indeed the minimization
procedure does drive rates between nonadjacent states to exactly zero, or, more
precisely, the numerical limit of the machine. The only deviations away from the
ideal, infinite energy solution occur in the rates along the lower and upper
subdiagonals of the transition rate matrix (i.e., the forward and backward rates
between adjacent states in the linear chain). As the available energy is
decreased, these deviations become more pronounced until the lower and upper
subdiagonals become equal to each other at *E*_{tot} = 0. An analytical treatment of the structure of this zero
energy solution is given in section 4.2.1.

An inspection of the transition rate matrix at intermediate values of *E*_{tot} reveals that as the minimum CV^{2} solution deviates
between the infinite energy and the zero energy optima, the pairs of forward and
backward transition rates between adjacent states become equal, and thus give no
contribution to *E*_{tot}, in sequence, starting from the last pair in the chain (λ_{M−1,M} and λ_{M,M−1}) at some relatively
high energy value and ending with the first pair (λ_{12} and
λ_{21}) at *E*_{tot} = 0. This sequential merging of rate pairs, from final
pair to first pair, with a decrease in the available energy was a robust result
over all network sizes tested. In Figure 3b, for example, the results are shown for the optimization of an
eight-state Markov chain. It is clear from the figure that the transition rates
deviate smoothly from the infinite energy ideal as *E*_{tot} is decreased, until the final rate pair (denoted with the
•) merges together at the energy level *E*_{6}. At all lower energy values, these two rates remain identical and
thus, given the definition of the energy function (see equation 4.3), are noncontributory to *E*_{tot}. After this first merging, the rates again deviate smoothly with
decreasing *E*_{tot} until the next rate pair (★) merges. This pattern repeats
itself until, at *E*_{tot} = 0, all rate pairs have merged. The value of the unpaired
rate at the end of the chain (λ_{87} in this case) as a function
of the available energy is shown in the figure inset. At intermediate values of *E*_{tot}, the as-yet unmerged rate pairs (except for the first rate pair,
λ_{12} and λ_{21}) are all identical to each
other. That is, all of the forward rates in these unmerged rate pairs are equal,
as are all of the backward rates. For example, above *E*_{6} in Figure 3b, the forward
rates λ_{32}, …, λ_{65} are equal, as are
the backward rates λ_{23}, …, λ_{56}. In
other words, the ▴, ▾, ⧫, and ★ traces lie exactly
on top of each other. Only the • traces, corresponding to the rate pair
that is actively merging in this energy range, and the ▪ traces,
corresponding to the first rate pair, deviate from the other forward and
backward rates. Between *E*_{5} and *E*_{6}, the same unmerged rates continue to be equal except for
λ_{56} and λ_{65} (★), which have begun
to merge.^{6} Essentially this means that the behavior of the system at intermediate
values of *E*_{tot} is insensitive to the current position along the chain anywhere
within this set of states with unmerged rate pairs (i.e., the forward and
backward rates are the same between all adjacent states in this set).

We can understand why rate pairs should merge by considering the energy function
to be analogous to a prior distribution over a set of parameters in a
machine-learning-style parameter optimization (e.g., a maximum a posteriori
fitting procedure). In this case, the parameters are the logarithms of the
ratios of the pairs of rates and the prior distribution is the Laplace, which,
in log space, gives the *L*_{1}-norm of the parameters (i.e., exactly the definition of the
individual terms of *E*_{tot}; see equation 4.3). As is well known from the machine-learning literature, a Laplace
prior or, equivalently, an *L*_{1}-norm regularizer, gives rise to a sparse representation where
parameters on which the data are least dependent are driven to zero and thus
ignored, while those that capture important structural features of the data are
spared and remain nonzero (Tibshirani, 1996). In this analogy, *E*_{tot} is similar to the standard deviation of the prior distribution
(or the inverse of the Lagrange multiplier of the regularizer), in that, as it
is decreased toward zero, it allows fewer and fewer nonzero log rate ratios to
persist. Ultimately, at *E*_{tot} = 0, the prior distribution overwhelms optimization of the
CV^{2}, and all the pairs of rates are driven to be equal, thus
making the log rate ratios zero. This analogy might lead one to consider energy
functions that correspond to other prior distributions, but, unlike equation 4.3, functions based on
other priors (e.g., a quadratic that would correspond to a gaussian prior) do
not result in a clear interpretation of what the energy means, and thus they
were not pursued in this work.
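The sparsity mechanism invoked here can be illustrated with a scalar toy problem (our own sketch, unrelated to the specific CV^{2} objective): an *L*_{1} penalty soft-thresholds small parameters to exactly zero, whereas a quadratic (gaussian-prior) penalty only shrinks them.

```python
def argmin_l1(a, g):
    """Minimizer of (x - a)**2 + g * abs(x): soft thresholding."""
    if abs(a) <= g / 2.0:
        return 0.0                       # small effects driven exactly to zero
    return a - g / 2.0 if a > 0 else a + g / 2.0

def argmin_l2(a, g):
    """Minimizer of (x - a)**2 + g * x**2: shrinks but never reaches zero."""
    return a / (1.0 + g)

assert argmin_l1(0.3, 1.0) == 0.0        # L1: zeroed out
assert argmin_l1(2.0, 1.0) == 1.5        # L1: large effects merely shrunk
assert argmin_l2(0.3, 1.0) == 0.15       # L2: shrunk, still nonzero
```

In the analogy of the text, `x` plays the role of a log rate ratio and `g` grows as *E*_{tot} shrinks, so rate pairs whose asymmetry matters least for the CV^{2} are the first to be driven to symmetry.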

One interpretation of the solutions of the optimization procedure at different
energy values shown in Figure 3b is as
follows. Before a rate pair merges, the corresponding transition can be thought of
as *directed* with the probability of a forward transition
exceeding that of a backward transition. After a merger has taken place, the
probabilities of going forward and backward become equal, and we term this
behavior *diffusive*. At high values of *E*_{tot}, the solution is entirely directed, with the system marching from
the first state to the final state in sequence. At *E*_{tot} = 0, the solution is purely diffusive, with the system
performing a random walk along the Markov chain. At intermediate energy values,
both directed and diffusive regimes coexist. Interestingly, the directed regime
always precedes the diffusive regime (i.e., the rate pairs toward the end of the
chain merge at higher energy values than those toward the beginning of the
chain). Recalling our analogy from the previous paragraph, the first parameters
to be driven to zero using a Laplace prior are those that have the least impact
in accurately capturing the data. Therefore, in our case, we expect that the
first log rate ratios driven to zero are those that have the least impact on
minimizing the CV^{2} of the hitting time *t*_{1N}. Thus, our numerical results indicate that at energy levels where a
completely directed solution is not possible, it is better, in terms of
variability reduction, to first take directed steps and then diffuse rather than
diffuse and then take directed steps or mix the two regimes arbitrarily. We
present a brief interpretation as to why this structure is favored in section 4.2.2. A schematic of an
intermediate energy solution is shown in Figure 4.

#### 4.2.1. Zero Energy or Pure Diffusion Solution.

If *E*_{tot} is zero under the energy function given by equation 4.3, then all pairs of rates λ_{ij} and λ_{ji} are forced to be equal. The rates corresponding to transitions
between nonadjacent states in the linear chain (i.e., for
|*i* − *j*| ≠ 1) are driven
to zero by the optimization of the CV^{2}, while the adjacent state
transition rates remain positive. It is possible to analytically solve for
the rates in such a zero energy chain as well as find a semiclosed-form
expression for the CV^{2} of the hitting time *t*_{1N} (see appendix D for details).

Specifically, let λ_{i} for *i* ∈ {1, …, *M*} denote the remaining free rates, where λ_{i} ≡ λ_{i,i+1} = λ_{i+1,i}. Then the CV^{2} can be shown to be given by equation 4.4, where we have defined a vector of the inverse rates, the matrix *Z* with entries *Z*_{ij} ≡ min(*i*, *j*)^{2}, and, for notational convenience, *T* ≡ 〈*t*_{1N}〉. The λ_{i} that minimize this CV^{2} are given by equation 4.5, which, substituted back into equation 4.4, gives the minimum CV^{2} as 1/ξ(*M*) (equation 4.6). The function ξ(*M*) is expressed in terms of Ψ(*x*), the digamma function, defined as the derivative of the logarithm of the gamma function (i.e., Ψ(*x*) ≡ d ln Γ(*x*)/d*x*), and γ, the Euler–Mascheroni constant. Although Ψ(*x*) has no simple closed-form expression, efficient algorithms exist for determining its values. For *M* = 1, ξ(1) = 1, and, as can be shown by asymptotic expansion of Ψ(*x*), ξ(*M*) grows logarithmically with *M*.

Compared to the CV^{2} versus number-of-states relationship at infinite energy (see equation 3.2), in the zero energy setting the CV^{2} scales
inversely with log *N* (see equation 4.6) rather than with *N*,
and thus adding states gives logarithmically less advantage in terms of
variability reduction. Furthermore, to achieve even this modest improvement
with increasing *N*, the rates must scale with *i*^{2} (i.e., λ_{i} ∝ 4*i*^{2} − 1; see equation 4.5), and thus the rates toward the end
of the chain need to be *O*(*N*^{2} ln *N*), while those near the start are only *O*(ln *N*). To summarize, then, the zero
energy setting has two disadvantages compared with the infinite energy case. First,
for the optimal solution, the CV^{2} is inversely proportional to
the logarithm of *N* rather than to *N* itself, and second, even to achieve this modest variability reduction, a
dynamic range of transition rates proportional to *N*^{2} must be maintained.
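Both scalings can be checked numerically. For any finite-state chain, the mean and variance of the hitting time follow from linear solves against the generator restricted to the transient states, so no simulation is needed. A generic sketch (the helper names are ours; the diffusive rates use the λ_{i} ∝ 4*i*^{2} − 1 form of equation 4.5, up to an arbitrary overall scale):

```python
import numpy as np

def hitting_time_cv2(Q):
    """CV^2 of the hitting time from the first to the last state of a CTMC.

    Q is the generator matrix (rows sum to zero). Restricting to the
    transient block A, the mean vector solves -A m = 1 and the second
    moment vector solves -A s = 2 m (standard phase-type identities).
    """
    A = Q[:-1, :-1]
    m = np.linalg.solve(-A, np.ones(A.shape[0]))
    s = np.linalg.solve(-A, 2.0 * m)
    return (s[0] - m[0] ** 2) / m[0] ** 2

def diffusive_chain(N):
    """Zero energy chain: equal forward/backward adjacent rates, lambda_i ~ 4i^2 - 1."""
    Q = np.zeros((N, N))
    for i in range(N - 1):
        lam = 4.0 * (i + 1) ** 2 - 1.0
        Q[i, i + 1] += lam
        Q[i + 1, i] += lam
    return Q - np.diag(Q.sum(axis=1))

def directed_chain(N, T=1.0):
    """Ideal irreversible chain: N-1 forward transitions, each at rate (N-1)/T."""
    Q = np.zeros((N, N))
    for i in range(N - 1):
        Q[i, i + 1] = (N - 1) / T
    return Q - np.diag(Q.sum(axis=1))

print(hitting_time_cv2(directed_chain(5)))          # 0.25, i.e., 1/(N-1)
for N in (5, 50, 500):
    print(N, hitting_time_cv2(diffusive_chain(N)))  # shrinks, but only logarithmically
```

The directed chain's CV^{2} halves each time the number of transitions doubles, while the diffusive chain's CV^{2} barely moves over two orders of magnitude in *N*.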

#### 4.2.2. The Diffusive Regime Follows the Directed Regime.

We can understand the numerical result that, at intermediate energy values,
the diffusive regime always follows the directed regime by careful
consideration of the structure of this solution. First, let us assume that
for some value of *E*_{tot}, the directed regime consists of *M _{i}* directed transitions and that the remainder of the system consists
of a purely diffusive tail with *M*_{r} transitions (where *N* = *M*_{i} + *M*_{r} + 1). Recalling that transitions to state *N* are always unpaired, there are actually *M*_{i} + 1 directed transitions in this intermediate energy solution: *M*_{i} in the directed regime and one at the end of the diffusive tail. However, the energy resources of the system are devoted solely to maintaining the *M*_{i} transitions composing the directed regime, since the energy function (see equation 4.3) does not penalize perfectly reversible transitions or transitions leading to state *N*, such as the one at the end of the diffusive tail.

Now consider if the diffusive regime preceded the directed regime. Then, although there would still be *M*_{i} + 1 directed transitions (one at the end of the diffusive regime leading into the directed regime and *M*_{i} in the directed regime), the energy resources would be apportioned in a new manner. The final transition of the directed regime, since it leads to state *N*, would not incur any penalty, while the final transition of the diffusive regime would incur a penalty, since it now leads to the first state of the directed regime rather than to state *N*. In other words, the final transition of the diffusive regime is penalized, as are the first *M*_{i} − 1 transitions of the directed regime. It is now possible to understand why our numerical optimizations always yield solutions with directed-first, diffusive-second architectures. If the transition rate at the end of the diffusive regime λ_{M_r} (to use the notation introduced in section 4.2.1) is greater than the transition rate at the end of the directed regime λ, then more energy would be required for the diffusive-first architecture, which would penalize λ_{M_r}, than for the directed-first architecture, which does not. Numerically, λ_{M_r} is always seen to be greater than λ, and the following simple analysis also supports this idea.

If we assume that the directed regime consists of *M*_{i} perfectly irreversible transitions with backward rates of exactly zero, then the directed and diffusive subchains can be considered independently, and thus their variances can be added as var(*t*_{1N}) = *T*_{i}^{2}/*M*_{i} + *T*_{r}^{2}/ξ(*M*_{r}) (equation 4.8), where we have multiplied the expressions for the CV^{2} of an ideal linear chain (see equation 3.2) and of a zero energy, purely diffusive chain (see equation 4.6) by the squares of the mean processing times for each subchain (*T*_{i} and *T*_{r}) to get the variances. In order to find the relative rates between the directed and diffusive portions of the chain, we minimize equation 4.8 with respect to the subchain means subject to the constraint that the mean total time is *T*. This gives *T*_{i} = *T* *M*_{i}/(*M*_{i} + ξ(*M*_{r})) and *T*_{r} = *T* ξ(*M*_{r})/(*M*_{i} + ξ(*M*_{r})) (equations 4.9 and 4.10). The forward rate along the directed portion of the chain is thus *M*_{i}/*T*_{i} (equation 4.11), and the final rate along the diffusive portion follows from equation 4.5 (equation 4.12).

Therefore, for the case of a perfectly irreversible directed regime, the
final transition rate of the diffusive regime is always larger than the rate
of the directed regime as long as *M _{r}* > 1. Although this result does not necessarily hold for
real intermediate energy solutions (where the directed regime is not
perfectly irreversible), this analysis seems to explain the numerical result
that the directed-first architecture is optimal.
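The additivity of subchain variances invoked in equation 4.8 can be verified directly: when the directed regime is perfectly irreversible, the strong Markov property makes the directed and diffusive transit times independent, so both their means and their variances add exactly. A sketch under those assumptions (helper names are ours; the rates are illustrative):

```python
import numpy as np

def hitting_moments(Q):
    """Mean and variance of the hitting time from the first to the last state."""
    A = Q[:-1, :-1]
    m = np.linalg.solve(-A, np.ones(A.shape[0]))
    s = np.linalg.solve(-A, 2.0 * m)
    return m[0], s[0] - m[0] ** 2

def chain(Mi, Mr, fwd=4.0):
    """Mi perfectly irreversible steps followed by an Mr-step symmetric (diffusive) tail."""
    N = Mi + Mr + 1
    Q = np.zeros((N, N))
    for i in range(Mi):                 # directed regime: forward rates only
        Q[i, i + 1] = fwd
    for k in range(Mr):                 # diffusive tail: paired equal rates
        i = Mi + k
        lam = 4.0 * (k + 1) ** 2 - 1.0
        Q[i, i + 1] += lam
        Q[i + 1, i] += lam
    return Q - np.diag(Q.sum(axis=1))

Mi, Mr = 4, 3
mean_full, var_full = hitting_moments(chain(Mi, Mr))
mean_dir, var_dir = hitting_moments(chain(Mi, 0))   # directed subchain alone
mean_dif, var_dif = hitting_moments(chain(0, Mr))   # diffusive subchain alone
print(var_full, var_dir + var_dif)                  # equal: variances add
print(mean_full, mean_dir + mean_dif)               # means add as well
```

Because the junction between the two regimes is never recrossed, the full chain's hitting time is exactly the sum of the two independent subchain times.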

### 4.3. Energy Cost Function II: Constrain Incommunicability Between States.

Although the results from the previous section are revealing and provide insight
into why a physical system might be limited in the number of directed steps it
can maintain (as discussed in section 4.4, the diffusive tail found at intermediate values of *E*_{tot} in the previous section is essentially negligible in terms of
variability reduction), it is unclear whether the energy cost function given in
equation 4.3 is generally
applicable to an arbitrary multistate physical process. Therefore, as a test of
the robustness of our results, we defined an additional cost function to
determine the behavior of the optimal solution under a different set of
constraints. As shown in sections 4.4 and 4.5, the results given by our second
cost function, while superficially appearing to be quite different, are in fact
analogous to those given in the preceding section.

Consider the energy diagram for a pair of states *i* and *j* (see Figure 2). The energy *E*_{ji} can be thought of as the energy needed to permit the transition from *i* to *j*, and similarly for *E*_{ij}. Given our intuition regarding the relationship between energies and rates, from the diagram one expects that the rate λ_{ji} is faster than the rate λ_{ij}, since *E*_{ji} is less than *E*_{ij}. The total energy in the system *E*_{tot} can simply be defined as the sum of the energies associated with each transition in the system (equation 4.13). The energies of the transitions originating in state *N* are excluded from this sum, since their associated transition rates do not affect the hitting time *t*_{1N} (i.e., transitions away from state *N* are irrelevant).

For small transition rates, *E*_{ij} → ∞, and, for large transition rates, *E*_{ij} → 0, which corresponds to our intuition from the previous paragraph. The following definition for the transition energy, plotted in Figure 2d, meets these two conditions: *E*_{ij} ≡ ln(1 + 1/λ_{ij}) (equation 4.14). Therefore, the second energy cost function is given by equation 4.15. Our results are insensitive to the exact definition of the function *E*_{ij} as long as the asymptotic behaviors at transition rates of zero and infinity are retained.

_{ij}_{ij}to minimize the CV

^{2}for a five-state Markov chain are given in Figure 5a. For large values of

*E*

_{tot}, the optimal solution is represented by the ▪ trace. This solution asymptotes to , which is the theoretical minimum for

*N*= 5 (see equation 3.2). Thus, the optimized transition rate matrix looks essentially identical to the ideal, infinite energy solution: where

*T*≡ 〈

*t*

_{1N}〉. Since

*E*

_{tot}is finite, transition rates of exactly zero are not possible, but for large enough values of

*E*

_{tot}, the rates given as zero in equation 4.16 are in fact optimized to near-zero values. Note that this is a different behavior from that seen in the previous section, where reciprocal rates between nonadjacent states in the linear chain were optimized to exactly zero (or, at least, the machine limit) and only the forward and backward rates along the linear chain were affected by the amount of available energy. In this case, all of the rates in the matrix are affected by

*E*

_{tot}, and the degree to which the rates given as zero in equation 4.16 deviate from true zero depends on the energy. Thus, while in the previous section, the linear architecture is maintained at all values of

*E*

_{tot}, optimization under the second energy function would be expected to corrupt the linear structure, and, indeed, as

*E*

_{tot}is decreased, all of the near-zero rates, including those between nonadjacent states in the linear chain, deviate further from zero. Concomitantly, the minimum CV

^{2}, as shown in Figure 5a, is seen to rise as expected.

A second solution, represented by the ▴ trace, asymptotes to a higher value and thus, for large *E*_{tot}, is not globally optimal. Inspection of this solution reveals the transition rate matrix given in equation 4.17, where, in the third column, ∞^{2} denotes an infinity of a different order from the other infinities in the column (e.g., 10^{100} versus 10^{50}). This hierarchy of infinities is an artifact of the numerical optimization procedure, but the solution is nonetheless revealing. Essentially, states 3 and 4 are merged into a single state in this solution. Whenever the system is in state 3, it immediately transitions to state 4 because the infinity of the higher order (i.e., ∞^{2}) dominates. From state 4, the system immediately transitions back to state 3, and thus states 3 and 4 are equivalent. There is a single outflow available from this combined state to state 5 with rate 3/*T*. Furthermore, there are two sources of input into states 3 and 4, both from state 2, but since the states are combined, this is the same as a single source with a total rate also of 3/*T*. Finally, there is an irreversible transition from state 1 to 2 with rate 3/*T*. This, then, is exactly the optimal solution for a four-state Markov chain: for large values of *E*_{tot}, every forward rate is equal to 3/*T*, all other rates are near zero, and the CV^{2} asymptotes to the theoretical minimum of 1/3.

As *E*_{tot} is decreased further, the transition matrix approaches that given in equation 4.18, which is the optimal solution for a three-state Markov chain (i.e., the forward rates are 2/*T*, the other rates are zero, and the asymptotic CV^{2} is 1/2).

Finally, for a two-state system, the constraints on the mean hitting time and on *E*_{tot} cannot both be met for arbitrary values of the two variables. There is only one rate available to the optimization procedure in a two-state system, and thus, for mean *T*, the transition rate must be 1/*T*. Therefore, *E*_{tot} is not a free variable and is locked to ln(*T* + 1) by equation 4.15. This is another point of difference from the results in the preceding section, where the constraint on the mean could still be satisfied when the energy was zero (i.e., when all reciprocal pairs of rates were equal).

Analysis of the behavior of the solutions with greater than two states as the
total energy is decreased is revealing. In all cases, as expected, the minimum
values of the CV^{2} deviate from the infinite energy asymptotes, but,
more interestingly, the curves cross. At the point where the ▪ and
▴ traces cross in Figure 5a, for
example, the four-state system becomes the globally optimal solution despite the
fact that its theoretical minimum at infinite energies is higher than that of
the five-state system (i.e., 1/3 > 1/4). This can be understood by considering how the available
energy that constitutes a given value of *E*_{tot} is divided up among the rates of the system. The largest
penalties are being paid for the near-zero rates, and thus most of the available
energy is apportioned to maintain them. As *E*_{tot} is decreased, maintaining the near-zero rates becomes impossible,
and so the network topology begins to deviate significantly from the infinite
energy optimum with the CV^{2} growing accordingly. This deviation
occurs at higher values of *E*_{tot} for a five-state system than for a four-state system because
there are more near-zero rates to maintain for a larger value of *N*.

Thus, we can understand the trade-off imposed on the system by the energy function given in equation 4.15. The inability to maintain a long, irreversible linear chain at decreasing energy values drives the system to discard states and focus on maintaining a linear chain of a shorter length rather than a branching or loopy chain with more states. Figures 5b to 5d show the degree to which the transition rates deviate from their optimal values for Markov chains of five, four, and three states. The energy thresholds below which a four-state chain outperforms a five-state chain and a three-state chain outperforms a four-state chain are indicated in the figures. It is clear that at these crossing points, the linear structure of the shorter chain is essentially totally intact, while that of the longer chain has been significantly degraded.

### 4.4. Comparison of Energy Functions I and II at Finite Nonminimal Energies.

The results of the optimizations under the two energy functions in the preceding
sections are illuminating. From the theoretical development of the optimal
linear Markov chain topology (see section 3), we saw that the CV^{2} of the hitting time was equal to 1/(*N* − 1) (see equation 2.9), which suggests that a physical system can
arbitrarily improve its temporal reliability by increasing the number of states.
If, however, as in the second energy function (see equation 4.15), a cost is incurred by a system for
maintaining zero transition rates (which, functionally, results in
incommunicability between states), then, given a finite amount of available
energy, we see from section 4.3 that
there is some maximum number of states *N*_{max} achievable by the system regardless of the total allowable size
of the state space *N*. The CV^{2} is thus at best equal
to 1/*M*_{max}, where *M*_{max} ≡ *N*_{max} − 1 (i.e., assuming that the linear chain architecture
with *N*_{max} states is essentially fully intact; see Figure 5).

Under the first energy function (see equation 4.3), although increasing *N* will always result in a lower CV^{2}, a simple analysis reveals that an effective maximum number of states *N*_{eff} can be defined that is much less than *N* itself. If, as in the analysis in section 4.2.2, one assumes that the first *M*_{i} transitions form a perfect, irreversible linear chain and that the remainder of the system consists of *M*_{r} fully reversible transitions (where *N* = *M*_{i} + *M*_{r} + 1), then, by combining equations 4.8, 4.9, and 4.10, the CV^{2} is given as 1/(*M*_{i} + ξ(*M*_{r})) (equation 4.20). By comparing equation 4.20 with the CV^{2} equation for the ideal chain (see equation 3.2), we can equate the denominators and thus define an effective number of states as *N*_{eff} ≡ *M*_{i} + ξ(*M*_{r}) + 1 (equation 4.21). Since ξ(*M*) grows logarithmically, *N*_{eff} ≈ *M*_{i} unless the magnitude of *N* is on the order of *e*^{M_i} or greater. Furthermore, since the available energy dictates what fraction of the *N*-state chain can be irreversible and thus the value of *M*_{i}, then, in the absence of a massive state space, the energy is the primary determining factor in setting the temporal variability, while the value of *N* itself is secondary.

From this, it appears that the maximally reliable solutions at finite nonminimal
energies under either energy constraint are in fact quite similar. If
irreversibility is penalized, then, as long as *N* is limited
enough such that ξ(*M*_{r}) ≪ *M*_{i}, the available energy sets the number of states to *N*_{eff} (≈ *M*_{i}). If, rather, incommunicability is penalized, then, regardless of how large *N* is permitted to be, the available energy mandates that the number of states be limited to *N*_{max}. Furthermore, in both cases, the solutions are essentially irreversible linear chains. The only difference between the two solutions—the diffusive tail at the end of the chain optimized under the first energy function—has minimal impact on the behavior of the system.


Figure 6a shows the relationship between the
total allowable number of states *N* and the minimum achievable
CV^{2} under the two energy cost functions, where the available
energies have been tuned such that *M*_{max} and *M*_{i} are equal, finite, and nonzero. As is clear from Figure 6, although the variability does continue to decrease as *N* is increased past *M*_{i} for the solutions determined under the first energy function (▴), the difference compared to the variability resulting from the second energy function (★) is minimal. The values of the CV^{2} as functions of *N* are shown for two different settings of *M*_{max} and *M*_{i}, and although the domain of *N* stretches over several orders of magnitude in the figure, the primary determinants of the CV^{2} are the values of *M*_{max} and *M*_{i}, not *N*, for both settings.

### 4.5. Comparison of Energy Functions I and II at Minimal Energies.

Although optimization at finite nonminimal energy values under the two cost
functions results in similar solutions, at minimal energies, the solutions seem
quite different. Recall from section 4.2.1 that at *E*_{tot} = 0 (the minimal energy value under the first energy
function), the CV^{2} of the minimally variable solution is equal to 1/ξ(*M*) and thus decreases toward zero with increasing *N*. Under the second energy function, however, the
CV^{2} is always equal to 1 for all *N* at the
minimal energy value (recall from section 4.3 that with mean hitting time *T*, the minimal
energy value is ln(*T* + 1), at which point all of the states have
merged, leaving *N*_{max} equal to 2). These different behaviors as functions of *N* are shown in Figure 6b as the ▴ and ★ traces, respectively. Although 1/ξ(*M*) approaches zero much more slowly than the CV^{2} of
the infinite energy solution (1/*M*), it is still significant compared to the CV^{2} of
the minimal energy solution under the second cost function (i.e., 1). However,
as discussed in section 4.2.1, to
achieve a CV^{2} of 1/ξ(*M*), the transition rates near the end of the linear chain must be
on the order of *N*^{2} times larger than the values of the rates near the beginning of the
chain.

If the dynamic range of the rates is restricted so that all of the rates are equal, the zero energy solution instead asymptotes to a constant CV^{2} with increasing *N* (the • trace in Figure 6b), and thus it is clear that only with an unrestricted range of rates can the variability be driven arbitrarily close to zero by adding states. If an unrestricted range is not feasible, then even at minimal energy values, the solutions given by the two energy cost functions are not qualitatively different. That is, both result in constant values of the CV^{2} that are independent of *N* (compare the ★ and • traces in Figure 6b).

### 4.6. Reliability of Random Transition Rate Matrices.

In all cases, under either energy function at any amount of available energy from
the minimal possible value to infinity, the goal of the system is to reduce the
temporal variability within the given energy constraints, and, as has been shown
throughout this article, this is achieved by choosing the maximally reliable
network structure among the set of structures that meet the constraints. Thus,
it is reasonable to consider the value of choosing an explicit structure rather
than an arbitrary random connectivity between a set of states. In Figure 7a, we show the distribution of the
CV^{2} as a function of *N*, calculated empirically
from 2000 random transition matrices at each value of *N*, with
transition rates λ_{ij} drawn as independent and identically distributed (i.i.d.) samples from an
exponential distribution. As *N* increases, the distribution of
the CV^{2} quickly converges to a delta function centered at one. This
is the CV^{2} of a minimally reliable two-state system. Numerical
studies using other random rate distributions with support on the positive real
line (e.g., the uniform, gamma, log-normal) produced the same result.
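This convergence is easy to reproduce: draw i.i.d. exponential rates, compute the exact CV^{2} of the hitting time by linear algebra (no stochastic simulation required), and watch the values concentrate at 1 as *N* grows. A generic sketch (not the authors' code; sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def hitting_time_cv2(Q):
    """Exact CV^2 of the hitting time from the first to the last state."""
    A = Q[:-1, :-1]
    m = np.linalg.solve(-A, np.ones(A.shape[0]))
    s = np.linalg.solve(-A, 2.0 * m)
    return (s[0] - m[0] ** 2) / m[0] ** 2

def random_chain(N):
    """Fully connected network with i.i.d. unit-mean exponential rates."""
    Q = rng.exponential(1.0, size=(N, N))
    np.fill_diagonal(Q, 0.0)
    return Q - np.diag(Q.sum(axis=1))

for N in (5, 20, 100):
    samples = [hitting_time_cv2(random_chain(N)) for _ in range(200)]
    print(N, np.mean(samples), np.std(samples))  # mean -> 1, spread -> 0
```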

To understand this result, consider λ(*t*), the instantaneous transition rate to state *N*, for sample evolutions of random matrices with 10, 100, and 1000 states. If the state of the system at time *t* is given by *q*(*t*), then λ(*t*) equals λ_{Nq(t)}, the rate of transition from state *q*(*t*) to state *N*. As is clear from the figures, the correlation time of λ(*t*) goes to zero with increasing *N* (it can be shown to scale as 1/*N*), and so a law of large numbers averaging argument can be applied to replace λ(*t*) with its mean λ̄ (i.e., the mean of the distribution from which the transition rates are drawn). In particular, the time-rescaling theorem (Brown, Barbieri, Ventura, Kass, & Frank, 2002) establishes that the random variable *u*, defined as *u* ≡ ∫_{0}^{*t*_{1N}} λ(*s*) d*s*, is drawn from an exponential distribution with mean one. By the averaging argument, *u* reduces to λ̄ *t*_{1N}. Finally, since *u* is distributed exponentially with mean one, *t*_{1N} is distributed exponentially with mean 1/λ̄ and thus must have a CV^{2} of 1 (confirming the numerical results).

These results make clear the advantage of specific network structure over
arbitrary connectivity. A CV^{2} of 1 is the same as the reliability of
a two-state, one-step process. That is, a random network structure, regardless
of the size of *N*, is minimally reliable.

## 5. Summary

Many physical systems require reliability in signal generation or temporal
processing. We have shown that for systems that may be modeled reasonably well as
Markov chains, an irreversible linear chain architecture with the same transition
rate between all pairs of adjacent states (see Figure 1b) is uniquely optimal over the entire set of possible network
structures in terms of minimizing the variability of the hitting time *t*_{1N} (equivalently, the architecture that optimally minimizes the variability of
the total generated signal *F*_{1N} is a linear chain with transition rates between pairs of adjacent states that
are proportional to the state-specific signal accumulation rates). This result
suggests that a physical system could become perfectly reliable by increasing the
length of the chain, and so we have attempted to understand why perfect reliability
is not observed in natural systems by employing energy cost functions that,
depending on the amount of available energy, reduce the possible set of network
structures by some degree. Although the two functions are quite different, the
optimal network structures resulting from maximizing the system reliability under
the constraints of either function are in fact quite similar. In short, they are
irreversible linear chains with a fixed maximum length.

We would predict that natural systems for which temporal or total signal reliabilities are necessary features would be composed of linear chains of finite length, with the length determined by the specific constraints encountered by the system. This prediction has applications across many disciplines of biology. For example, it suggests both that the sequence of openings and closings in the assemblage of ion channels responsible for active membrane processes (i.e., action potentials) and the progression of a dynamical neural network through a set of intermediate attractor states during the cognitive task of estimating an interval of time should be irreversible linear processes. Our analysis is also useful in the event that some system for which signal reliability is important is found to have a branching or loopy structure. By setting the linear structure as the theoretical limit, deviations from this limit may offer insight into what other counteracting goals physical systems are attempting to meet.

## Appendix A: Proof of the Optimality of the Linear, Constant-Rate Architecture

The primary theoretical result of this article is that a linear Markov chain with the
same transition rate between all pairs of adjacent states is optimally reliable in
terms of having the lowest CV^{2} of the hitting time from state 1 to state *N* of any *N*-state Markov chain (see Figure 1b). To establish this result, we prove the
following two theorems.

*The equality holds if and only if states i and j are the first and
last states of an irreversible, N-state linear chain with the same
forward transition rate between all pairs of adjacent states and with
state j as a collecting state*.

We employ an inductive argument to prove these theorems. It is trivial to establish
the base case of *N* = 2. Since the random variables *t*_{12} and *t*_{21} are both exponentially distributed (i.e., with means
1/λ_{21} and 1/λ_{12}, respectively), and since
the CV^{2} for exponential distributions is known to be unity, theorem 1
holds. Furthermore, both *t*_{12} and *t*_{21} satisfy the conditions for theorem 2 (i.e., they are hitting times
between the first and last states of linear chains with the same transition rates
between all pairs of adjacent states), and both saturate the bound. As they are the
only hitting times in a two-state network, then theorem 2 also holds for *N* = 2. This establishes a base case and allows us to
employ an inductive argument to prove the general result (i.e., by assuming that
theorems 1 and 2 hold for networks of size *N* − 1 and proving
that they hold for networks of size *N*).

The strategy of the proof is to decompose the hitting time *t*_{ij} into a sum of simpler, independent random variables whose means and variances can be easily determined and to which some variance inequalities and the induction principle can be applied. Specifically, *t*_{ij} can be decomposed into the following sum: *t*_{ij} = *t*_{i+l} + *t*_{path} (equation A.2), where *t*_{i+l} is the total time the system spends in the start state *i* and during any loops back to state *i*, while *t*_{path} is the time required for the transit along the path through the network to state *j* after leaving state *i* for the final time. The important part of this decomposition is that *t*_{path} is a random variable over a reduced network of size *N* − 1 (excluding state *i*), which will allow us to apply the inductive principle. Since *t*_{i+l} and *t*_{path} are independent variables due to the Markov property, their means and variances simply add, and so we can write an expression for the CV^{2}_{ij} of the hitting time *t*_{ij} (equation A.3). To establish theorem 1, this quantity must not be less than 1/(*N* − 1) for any topology, and to establish theorem 2, it must equal 1/(*N* − 1) only for a hitting time *t*_{ij} where *i* is the first state and *j* the final state of a constant-rate, irreversible linear chain of length *N*. To analyze equation A.3 and thus establish these theorems, we need expressions for the means and variances of the total pre–final transit time *t*_{i+l} and of the final transit time *t*_{path}.

### A.1. Mean and Variance of the Pre–Final Transit Time *t*_{i+l}.

The mean and variance of *t*_{i+l} can be determined by first considering the conditional case where the number of return loops back to state *i* prior to hitting state *j* is assumed to be *R*. Then *t*_{i+l} (the sum of the total dwell time in start state *i* plus the total loop time) conditioned on *R* can be further decomposed as *t*_{i+l} | *R* = ∑_{r=1}^{R} (*w*_{i,r} + *t*_{loop,r}) + *w*_{i} (equation A.4), where *w*_{i,r} is the dwell time in state *i* at the beginning of the *r*th loop, *t*_{loop,r} is the time required to return to state *i* for the *r*th loop, and *w*_{i} is the dwell time in state *i* prior to the final transit to state *j*. The total hitting time *t*_{ij} conditioned on *R* loops is given as the sum of the right-hand side of equation A.4 and *t*_{path} (see Figure 8 for a schematic when *R* = 2).

The conditional variable *t*_{i+l} | *R* is simple to analyze since, due to the Markov principle, the random variables on the right-hand side of equation A.4 are all conditionally independent given *R*. Thus, we can calculate the conditional mean and variance as 〈*t*_{i+l} | *R*〉 = (*R* + 1)〈*w*_{i}〉 + *R*〈*t*_{loop}〉 (equation A.5) and var(*t*_{i+l} | *R*) = (*R* + 1)〈*w*_{i}〉^{2} + *R* var(*t*_{loop}) (equation A.6), where we have used the notation that 〈*f*(*x*)〉_{p(x)} and var(*f*(*x*))_{p(x)} are defined, respectively, as the mean and variance of *f*(*x*) over the distribution *p*(*x*). In equations A.5 and A.6, we are able to drop the *r* indices since we assume time homogeneity and thus that (1) the distribution over the dwell time in state *i* prior to loop *r* (*w*_{i,r}) is the same for every loop *r* and the same as the distribution over the dwell time in *i* prior to the final transit to *j* (*w*_{i}), and (2) the distribution over the time required to loop back to *i* for the *r*th loop (*t*_{loop,r}) is the same for each loop. Furthermore, in equation A.6, since the dwell time *w*_{i} is an exponentially distributed random variable and thus has a variance equal to the square of its mean, we have substituted var(*w*_{i}) with 〈*w*_{i}〉^{2}.

To obtain the marginal mean and variance of *t*_{i+l} from the conditional mean and variance (see equations A.5 and A.6), the following identities are useful: 〈*x*〉 = 〈〈*x* | *R*〉〉_{p(R)} (equation A.7) and var(*x*) = 〈var(*x* | *R*)〉_{p(R)} + var(〈*x* | *R*〉)_{p(R)} (equation A.8). Thus, for the marginal mean, we have 〈*t*_{i+l}〉 = (〈*R*〉 + 1)〈*w*_{i}〉 + 〈*R*〉〈*t*_{loop}〉. Similarly, for the marginal variance, we obtain equation A.11, where we have used the fact that the mean dwell time 〈*w*_{i}〉 and the mean loop time 〈*t*_{loop}〉 are both independent of *R*, and, in the final step, the fact that the number of loops *R* is given by a shifted geometric distribution and thus has a variance equal to 〈*R*〉(〈*R*〉 + 1).

### A.2. Variance of the Final Transit Time *t*_{path} in Terms of Hitting Times.

To express the variance of the final transit time *t*_{path} in terms of hitting times over reduced networks of size *N* − 1 (so that we can use induction), we apply the identity given in equation A.8 to decompose the variance of *t*_{path} as in equation A.12, where state *k* is the first state visited by the system after state *i* at the beginning of the final transit from *i* to *j*. The random variable *t*_{path} was originally defined as the time required for the final transit to state *j*, and so, given a specific start state, *t*_{path} | *k* is the time required for the path from state *k* to state *j*, which is exactly the definition of the hitting time *t*_{kj}. Substituting this equivalence into equation A.12, we obtain equation A.13, where we have replaced the variance of the hitting time *t*_{kj} with the product of the squares of the CV and the mean, an equivalent formulation.

### A.3. Establishing Theorem 1.

Starting from the definition of the CV^{2}_{ij} of the hitting time *t*_{ij} (see equation A.3), we can replace var(*t*_{i+l}) with the second term of equation A.11 and var(*t*_{path}) with the second term of equation A.13 (the first terms of equations A.11 and A.13 are nonnegative) to state the following bound:

CV^{2}_{ij} ⩾ (〈t_{i+l}〉^{2} + 〈CV^{2}_{kj}〈t_{kj}〉^{2}〉_{p(k)})/(〈t_{i+l}〉 + 〈t_{path}〉)^{2}. (A.14)

Next, we can employ the inductive step of the proof and assume that theorem 1 is true for the reduced networks represented by *t*_{kj} (recall that these subnetworks are of size *N* − 1 since state *i* is withheld by definition from *t*_{path} and thus from *t*_{kj}). This substitution yields the expression

CV^{2}_{ij} ⩾ (〈t_{i+l}〉^{2} + (1/(N − 2))〈〈t_{kj}〉^{2}〉_{p(k)})/(〈t_{i+l}〉 + 〈t_{path}〉)^{2}. (A.15)

We can also use the fact that the second moment of a random variable is not less than the square of its mean (i.e., 〈x^{2}〉 ⩾ 〈x〉^{2}) to perform an additional inequality step:

CV^{2}_{ij} ⩾ (〈t_{i+l}〉^{2} + (1/(N − 2))〈〈t_{kj}〉〉^{2}_{p(k)})/(〈t_{i+l}〉 + 〈t_{path}〉)^{2}. (A.16)

Finally, recalling that the hitting time mean 〈t_{kj}〉 is equivalent to 〈t_{path}|k〉, we can use the identity given in equation A.7 to replace 〈〈t_{kj}〉〉_{p(k)} with 〈t_{path}〉 and state the following final expression for the CV^{2}_{ij}:

CV^{2}_{ij} ⩾ (〈t_{i+l}〉^{2} + 〈t_{path}〉^{2}/(N − 2))/(〈t_{i+l}〉 + 〈t_{path}〉)^{2}, (A.17)

or

CV^{2}_{ij} ⩾ (L^{2} + P^{2}/(N − 2))/(L + P)^{2}, (A.18)

where, for notational simplicity, we have replaced 〈*t*_{i+l}〉 and 〈*t*_{path}〉 with *L* and *P*, respectively.

Equation A.18 states that the CV^{2}_{ij} is not less than (L^{2} + P^{2}/(N − 2))/(L + P)^{2} for all networks of size *N*. Since equation A.18 is true for all networks, if the minimum value of the ratio on the right-hand side of the inequality is greater than or equal to 1/(N − 1), the theorem is proved. The network topology affects this ratio through the values of *L* and *P*, and so we minimize with respect to these variables, ignoring whether the joint minimum of *L* and *P* corresponds to a realizable Markov chain (i.e., since the unconstrained minimum cannot be greater than any constrained minimum, if the inequality holds for the unconstrained minimum, it must hold for all network structures):

min_{L,P} (L^{2} + P^{2}/(N − 2))/(L + P)^{2}. (A.19)

Note that the ratio in equation A.19 is a Rayleigh quotient as a function of the vector (*L*, *P*)^{T} and thus has a known minimum solution (which we derive here for clarity) (Strang, 2003). Since the ratio is invariant to the scale of (*L*, *P*)^{T}, we can fix L + P = 1, and equation A.19 gives rise to a Lagrangian minimization as

𝓛 = L^{2} + P^{2}/(N − 2) − ϕ(L + P − 1),

with Lagrange multiplier ϕ. Differentiating with respect to *L* and *P* gives expressions for these variables in terms of ϕ as

L = ϕ/2

and

P = ϕ(N − 2)/2.

Substituting these expressions back into equation A.19 establishes the proof of the theorem:

CV^{2}_{ij} ⩾ ((ϕ/2)^{2} + (ϕ/2)^{2}(N − 2))/((ϕ/2) + (ϕ/2)(N − 2))^{2} = 1/(N − 1).
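As an independent numerical check of theorem 1 (this sketch is our own illustration, not part of the original derivation; the helper function names are ours), the moments of the hitting time can be computed by matrix inversion: for a generator **Q** in row convention with the last state collecting, the vector of mean hitting times is (−**T**)^{−1}**1** and the vector of second moments is 2(−**T**)^{−2}**1**, where **T** is **Q** restricted to the transient states.

```python
import numpy as np

rng = np.random.default_rng(0)

def hitting_cv2(Q):
    """CV^2 of the hitting time from the first to the last state of
    generator Q (row convention: Q[i, k] is the rate from i to k,
    rows sum to zero, last state collecting)."""
    n = Q.shape[0]
    Tinv = np.linalg.inv(Q[:n-1, :n-1])              # generator on transient states
    mean = (-Tinv @ np.ones(n-1))[0]                 # first moment from state 1
    second = (2.0 * Tinv @ Tinv @ np.ones(n-1))[0]   # second moment from state 1
    return (second - mean**2) / mean**2

def linear_chain(N, lam=1.0):
    """Irreversible linear chain with constant forward rate lam."""
    Q = np.zeros((N, N))
    for i in range(N - 1):
        Q[i, i+1] = lam
        Q[i, i] = -lam
    return Q

def random_chain(N):
    """Dense random chain with the last state made collecting."""
    Q = rng.uniform(0.1, 1.0, (N, N))
    np.fill_diagonal(Q, 0.0)
    Q[-1, :] = 0.0
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

N = 6
# the constant-rate irreversible chain saturates the bound 1/(N-1) ...
assert abs(hitting_cv2(linear_chain(N)) - 1.0/(N - 1)) < 1e-9
# ... and arbitrary topologies respect it
for _ in range(200):
    assert hitting_cv2(random_chain(N)) >= 1.0/(N - 1) - 1e-9
```

In many random trials, no topology falls below the bound, while the constant-rate irreversible chain meets it exactly.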

### A.4. Establishing Theorem 2.

In order to prove that an irreversible linear chain with the same forward
transition rate between all adjacent pairs of states is the unique topology that
saturates the bound on the CV^{2}_{ij} of the hitting time *t _{ij}* given by theorem 1, we follow a similar inductive approach as in
section A.3. To derive the inequality
expression for the CV^{2}_{ij} given by equation A.18, three successive inequality steps were employed. For equation A.18 to be an equality—a necessary condition for the bound in theorem 1 to also be an equality—each of those steps must be lossless. If the steps are lossless only for the linear chain architecture, then the theorem is proved.

Consider the second inequality step (the inductive step) in section A.3, which results in equation A.15. Recall that the hitting times *t*_{kj} represent subnetworks of size *N* − 1 starting in some set of states, each of which is reachable by a single transition from state *i*. For equation A.15 to be an equality, the CV^{2}_{kj} must be equal to 1/(N − 2) for every state *k* in this set. By assuming the inductive hypothesis that, for networks of size *N* − 1, constant-rate linear chains are the only topologies that saturate the bound, then, for equation A.15 to be an equality, all the states in the set must be start states of linear chains of length *N* − 1. This is clearly possible only if the set consists of a single state *k*.

This constraint, that there is a single state *k* reachable by direct transition from state *i* and that this state is the start state of a constant-rate linear chain of length *N* − 1, forces the other two inequality steps in section A.3 (see equations A.14 and A.16) to also be equalities. The mean number of loops 〈*R*〉 is zero since no loops are possible (i.e., after transitioning from state *i* to *k*, the system follows an irreversible linear path to *j*, which never returns to *i*), and so the first term of equation A.11 is zero. Furthermore, the variance of 〈*t*_{kj}〉 is zero since there is only one such state *k*, and so the first term of equation A.13 is also zero. Thus, the substitutions comprising the first inequality step (see equation A.14) are lossless. Similarly, since there is only one state *k*, the second moment of 〈*t*_{kj}〉 equals the square of its mean, which makes the substitution resulting in the final inequality (see equation A.16) also lossless.

Summarizing, if state *k* is the only state reachable from state *i* and state *k* is the start state of a constant-rate linear chain of length *N* − 1, then the following holds:

CV^{2}_{ij} = (〈w_{i}〉^{2} + var(t_{kj}))/(〈t_{i+l}〉 + 〈t_{path}〉)^{2}.

We can simplify this expression by noting (1) that *R* = 0 and so 〈*t*_{i+l}〉 = 〈*w*_{i}〉 = 1/λ_{ki}, where λ_{ki} is the transition rate from state *i* to state *k*, (2) that there is only one state *k*, and so 〈*t*_{path}〉 = 〈*t*_{kj}〉, and (3) that *t*_{kj} represents a constant-rate linear chain of length *N* − 1, and so its mean hitting time will be the number of transitions divided by the constant transition rate (i.e., 〈*t*_{kj}〉 = (N − 2)/λ for constant rate λ):

CV^{2}_{ij} = (1/λ_{ki}^{2} + (N − 2)/λ^{2})/(1/λ_{ki} + (N − 2)/λ)^{2}.

To find the rates λ_{ki} and λ that minimize the CV^{2}_{ij}, we can define the following Lagrangian:

𝓛 = 1/λ_{ki}^{2} + (N − 2)/λ^{2} − ϕ(1/λ_{ki} + (N − 2)/λ − 〈t_{ij}〉).

Differentiating by each variable and substituting out the Lagrange multiplier ϕ establishes the theorem:

λ_{ki} = λ = 2/ϕ.

All transition rates are equal, and so the uniqueness proof is complete. An *N*-state Markov chain saturates the bound given in theorem 1 if and only if it is an irreversible linear chain with the same forward transition rate between all pairs of adjacent states. Furthermore, the bound is saturated only for the hitting time from the first to the last state in the chain (see Figure 1b).
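The optimality of equal rates can also be seen directly (a sketch of ours, not from the original text): for an irreversible linear chain with *M* = *N* − 1 transitions, the hitting time is a sum of *M* independent exponentials, so CV^{2} = (Σ_{i} λ_{i}^{−2})/(Σ_{i} λ_{i}^{−1})^{2}, which is at least 1/*M* with equality exactly when all rates agree.

```python
import numpy as np

rng = np.random.default_rng(1)

M = 5  # number of irreversible transitions (N - 1)
for _ in range(1000):
    lam = rng.uniform(0.2, 2.0, M)       # random forward rates
    mean = np.sum(1.0/lam)               # sum of the stage means
    var = np.sum(1.0/lam**2)             # independent exponential stages
    assert var/mean**2 >= 1.0/M - 1e-12  # never below the bound

# equal rates saturate the bound exactly
lam = np.full(M, 3.0)
assert abs(np.sum(1.0/lam**2)/np.sum(1.0/lam)**2 - 1.0/M) < 1e-12
```

This mirrors the Lagrangian argument above: among chains with a fixed mean, the variance Σ λ_{i}^{−2} is minimized when all the stage means are equal.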

## Appendix B: The Moments of *t*_{ij}

In this appendix, we derive the *n*th moment of the hitting time *t*_{ij} from state *i* to *j* for the Markov chain given by the transition rate matrix **A**_{j}. The subscript *j* in **A**_{j} is used to denote the fact that the *j*th column of the matrix is a vector of all zeros (i.e., *j* is a collecting state). For the purposes of this derivation, we assume that the underlying transition rate matrix **A**, without the connections away from *j* removed, represents a Markov chain for which all states are reachable from all other states in a finite amount of time. In other words, **A** is assumed to be ergodic (although the resulting formulas still hold if this assumption is relaxed). Substituting *t* for *t*_{ij} to simplify notation and using the expression for the probability distribution of *t*_{ij} given in equation 2.6, we have

〈t^{n}〉 = ∫_{0}^{∞} t^{n} p(t) dt = ∫_{0}^{∞} t^{n} [**A**_{j}e^{**A**_{j}t}]_{ji} dt = [**A**_{j} ∫_{0}^{∞} t^{n} e^{**A**_{j}t} dt]_{ji},

where we have used the fact that a matrix commutes with the exponentiation of itself.

_{ij}**P**. If the eigenvalue decomposition of is

_{A}**RDL**(with ), then where

**P**is a diagonal matrix composed of the inverse eigenvalues of except for the

_{D}*j*th entry which is left at zero (the

*j*th eigenvalue of is zero). In matrix notation,

**P**is given as which gives

_{D}**P**as where, in the final step, we have used the fact that the

_{A}*j*th column of

**R**(the

*j*th right eigenvector of

**A**

_{j}) is (since the

*j*th column of

**A**

_{j}is ) and the fact that the

*j*th row of

**L**(the

*j*th left eigenvector) is (since the columns of

**A**

_{j}all sum to 0). Thus, and .

The product **DP**_{D} is an identity matrix except for a zero in the *j*th diagonal entry (due to the noninverted zero eigenvalue). Furthermore, it is trivial to show that (**DP**_{D})^{n} = **DP**_{D} for any positive integer *n* (since powers of a diagonal matrix act elementwise on its entries), and so we have derived the following expression for the identity matrix:

**I** = **R**(**DP**_{D} + **e**_{j}**e**_{j}^{T})**L** = **A**_{j}^{n}**P**_{A}^{n} + **e**_{j}**1**^{T}.

Finally, this allows us to restate the transition rate matrix as

**A**_{j} = **A**_{j}(**A**_{j}^{n}**P**_{A}^{n} + **e**_{j}**1**^{T}) = **A**_{j}^{n+1}**P**_{A}^{n},

where again we have used the fact that **e**_{j} is the eigenvector of **A**_{j} associated with the zero eigenvalue.
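The defining properties of **P**_{A} (including the failure of Moore-Penrose symmetry discussed in note 9) are easy to confirm numerically. The sketch below is our own illustration, using the column convention of this appendix (the *j*th column of **A**_{j} is zeroed so that *j* is collecting):

```python
import numpy as np

rng = np.random.default_rng(2)

N = 5
j = N - 1                           # collecting state (0-indexed)
A = rng.uniform(0.1, 1.0, (N, N))   # column convention: A[k, i] = rate from i to k
np.fill_diagonal(A, 0.0)
A[:, j] = 0.0                       # zero the j-th column: no transitions away from j
np.fill_diagonal(A, -A.sum(axis=0))

# P_A = R P_D L, inverting every eigenvalue except the zero one
evals, R = np.linalg.eig(A)
L = np.linalg.inv(R)
Pd = np.zeros(N, dtype=complex)
nonzero = np.abs(evals) > 1e-10
Pd[nonzero] = 1.0 / evals[nonzero]
Pa = ((R * Pd) @ L).real            # imaginary parts cancel for a real A

assert np.allclose(A @ Pa @ A, A)           # A_j P_A A_j = A_j
assert np.allclose(Pa @ A @ Pa, Pa)         # P_A A_j P_A = P_A
assert not np.allclose(A @ Pa, (A @ Pa).T)  # but the products are not symmetric
```

The two equalities hold even though **P**_{A} is not the Moore-Penrose pseudoinverse, exactly as note 9 states.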

Using the eigenvalue decomposition, the matrix ∫_{0}^{∞} t^{n} e^{**D**t} dt is diagonal. Its *j*th diagonal element need not be evaluated, as it is eliminated by the *j*th diagonal element of **D** (i.e., the zero eigenvalue of **A**_{j} associated with eigenvector **e**_{j}). The *k*th diagonal element of the integral for *k* ≠ *j* is given by ∫_{0}^{∞} t^{n} e^{η_{k}t} dt, where η_{k} is the *k*th eigenvalue. These integrals are analytically tractable:

∫_{0}^{∞} t^{n} e^{η_{k}t} dt = n!/(−η_{k})^{n+1},

where we have used the fact that all of the eigenvalues of **A**_{j} except the *j*th are strictly negative, which is a result of the following argument. Since **A**_{j} is a properly structured transition rate matrix for a continuous-time Markov chain, there exists a finite, positive *dt* such that **I** + **A**_{j}*dt* is a properly structured transition probability matrix for a discrete-time Markov chain. We can rewrite this probability matrix as **R**(**I** + **D***dt*)**L** and use the Perron-Frobenius theorem, which states that the eigenvalues of transition probability matrices (i.e., the entries of **I** + **D***dt*) are all less than or equal to one (Poole, 2006). Furthermore, since we assumed that the underlying Markov chain is ergodic, the Perron-Frobenius theorem asserts that exactly one of the eigenvalues is equal to one. Thus, it is clear that one of the entries of **D** is equal to zero and the rest are negative. Combining these results, the moments of the hitting time are given as

〈t^{n}〉 = [**RD**(∫_{0}^{∞} t^{n} e^{**D**t} dt)**L**]_{ji} = (−1)^{n+1}n![**RP**_{D}^{n}**L**]_{ji} = (−1)^{n+1}n![**P**_{A}^{n}]_{ji}. (B.10)

As an alternative to the preceding somewhat cumbersome algebra, it is also possible to use an intuitive argument to find the analytic expression for the first moment (mean) of the hitting time (Norris, 2004). With the expression for the first moment known, the higher-order moments can then be derived using Siegert's recursion (Siegert, 1951; Karlin & Taylor, 1981).

## Appendix C: The Gradients of the Mean, Variance, and Energy Cost Functions

From the expressions for the moments of the hitting time (see equation B.10), the gradients of the mean and variance of the hitting time *t*_{1N} are given as

∂〈t_{1N}〉/∂θ = ∂[**P**_{A}]_{N1}/∂θ

and

∂var(t_{1N})/∂θ = −2 ∂[**P**_{A}^{2}]_{N1}/∂θ − 2〈t_{1N}〉 ∂〈t_{1N}〉/∂θ.

The definition of **P**_{A} (see equation B.3) and the expression for the derivative of an inverse matrix (∂**M**^{−1}/∂θ = −**M**^{−1}(∂**M**/∂θ)**M**^{−1}) give the following:

∂**P**_{A}/∂θ = −**P**_{A}**A**′**P**_{A},

where we have defined **A**′ as the derivative of **A**_{N} with respect to the variable of interest, and have used the fact that **e**_{N} and **1**^{T} are the right and left eigenvectors of **A**_{N} associated with the zero eigenvalue and thus that both **A**′**e**_{N} and **1**^{T}**A**′ are zero. Therefore, the gradients of the mean and variance with respect to the optimization parameters θ_{ij} (where θ_{ij} ≡ −ln λ_{ij} for transition rate λ_{ij}) can be shown to be

∂〈t_{1N}〉/∂θ_{ij} = −[**P**_{A}**A**′**P**_{A}]_{N1}

and

∂var(t_{1N})/∂θ_{ij} = 2[**P**_{A}**A**′**P**_{A}^{2} + **P**_{A}^{2}**A**′**P**_{A}]_{N1} − 2〈t_{1N}〉 ∂〈t_{1N}〉/∂θ_{ij},

where the element in the *k*th row and *l*th column of the differential matrix **A**′ ≡ ∂**A**_{N}/∂θ_{ij} is given by [**A**′]_{kl} = λ_{ij} if *k* = *l* = *i*, [**A**′]_{kl} = −λ_{ij} if *k* = *j* and *l* = *i*, and [**A**′]_{kl} = 0 otherwise.
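Gradient expressions of this kind are readily validated against finite differences. The following sketch is our own illustration (the helper names are ours), using the column convention and the parameterization θ_{ij} ≡ −ln λ_{ij}; it checks the gradient of the mean for one parameter:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 4
lam = rng.uniform(0.5, 2.0, (N, N))  # lam[i, j]: rate from i to j (0-indexed)

def build_A(lam):
    """Column-convention generator with the last state collecting:
    A[k, i] is the rate from i to k; columns sum to zero."""
    A = lam.T.copy()
    np.fill_diagonal(A, 0.0)
    A[:, -1] = 0.0                   # no transitions away from the last state
    np.fill_diagonal(A, -A.sum(axis=0))
    return A

def PA(A):
    """P_A = (A + e_N 1^T)^(-1) - e_N 1^T."""
    e1T = np.zeros_like(A)
    e1T[-1, :] = 1.0                 # outer product e_N 1^T
    return np.linalg.inv(A + e1T) - e1T

A = build_A(lam)
P = PA(A)
mean = P[-1, 0]                      # <t_1N> = [P_A]_{N1}

# analytic gradient w.r.t. theta_ij = -ln(lam_ij), here (i, j) = (0, 1)
i, j = 0, 1
Aprime = np.zeros_like(A)
Aprime[i, i] = lam[i, j]             # +lam_ij at (i, i)
Aprime[j, i] = -lam[i, j]            # -lam_ij at (j, i)
grad = -(P @ Aprime @ P)[-1, 0]

# finite-difference check: theta_ij -> theta_ij + eps
eps = 1e-6
lam2 = lam.copy()
lam2[i, j] = np.exp(-(-np.log(lam[i, j]) + eps))
fd = (PA(build_A(lam2))[-1, 0] - mean) / eps
assert abs(grad - fd) < 1e-4
```

The analytic and finite-difference gradients agree to within the discretization error of the perturbation.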

## Appendix D: Derivation of Pure Diffusion Solution

In the case of zero total energy, *E*_{tot} = 0, under the energy function given in equation 4.3, it is possible to analytically solve for the transition rates that minimize the CV^{2} of the hitting time. First, note that all pairs of reciprocal rates must be equal (since the energy is zero), and, furthermore, that all pairs of rates between nonadjacent states are equal to zero. Thus, to simplify notation, we shall consider only the rates λ_{i} for *i* ∈ {1, …, *M*}, where λ_{i} ≡ λ_{i,i+1} = λ_{i+1,i} and *M* ≡ *N* − 1. This yields the following transition rate matrix: the tridiagonal matrix **A**_{N} with off-diagonal entries [**A**_{N}]_{i+1,i} = [**A**_{N}]_{i,i+1} = λ_{i}, diagonal entries [**A**_{N}]_{ii} = −(λ_{i−1} + λ_{i}) (taking λ_{0} ≡ 0), and the *N*th column set to **0** so that state *N* is collecting. (D.1)

Using the general expressions for the mean and variance of the hitting time *t*_{1N} for arbitrary Markov chains (see equation B.10), we have, respectively,

〈t_{1N}〉 = [**P**_{A}]_{N1}

and

var(t_{1N}) = −2[**P**_{A}^{2}]_{N1} − [**P**_{A}]_{N1}^{2},

where **P**_{A} ≡ (**A**_{N} + **e**_{N}**1**^{T})^{−1} − **e**_{N}**1**^{T} (see equation B.3) as before. For the specific tridiagonal matrix given in equation D.1, these expressions can be simplified, and, with a bit more manipulation, we can give expressions for the mean and variance as follows:

〈t_{1N}〉 = **z**^{T}**v** = Σ_{i=1}^{M} i/λ_{i} (D.5)

and

var(t_{1N}) = **v**^{T}**Zv** = Σ_{i=1}^{M} Σ_{j=1}^{M} min(i, j)^{2}/(λ_{i}λ_{j}), (D.6)

where we have defined the vectors **v** and **z** as *v*_{i} ≡ 1/λ_{i} and *z*_{i} ≡ *i*, respectively, and the matrix **Z** as *Z*_{ij} ≡ min(*i*, *j*)^{2}.

Finding the vector of rates—or, equivalently, the vector **v** with *v*_{i} ≡ 1/λ_{i}—that minimizes the CV^{2} of *t*_{1N} is equivalent to employing Lagrangian optimization to minimize the variance while holding the mean constant at 〈*t*_{1N}〉. This gives the following simple linear algebra problem:

**v*** = argmin_{**v**} **v**^{T}**Zv** subject to **z**^{T}**v** = 〈t_{1N}〉,

where **v*** is guaranteed to be the unique optimum since **Z** is positive definite (see section D.1) and the constraint is linear. Thus, the solution can be found by setting the gradient of the Lagrangian **v**^{T}**Zv** − α(**z**^{T}**v** − 〈t_{1N}〉) to zero:

2**Zv*** − α**z** = **0** ⟹ **v*** = (α/2)**Z**^{−1}**z**.

Note that we did not enforce that the elements of **v** (and thus the rates λ_{i}) be positive under this optimization. However, if the solution has all positive entries, as will be shown, then this additional constraint can be ignored.

Carrying out the matrix inversion, the elements of **v*** are

v*_{i} = α/((2i − 1)(2i + 1)) (D.8)

for *i* < *M*, and, for *i* = *M*,

v*_{M} = α/(2(2M − 1)).

For positive values of α—corresponding to positive values of 〈*t*_{1N}〉—all of the elements of **v*** are positive, and thus this solution is reasonable. Therefore, the following rates minimize the processing time variability for a zero-energy, purely diffusive system:

λ_{i} = (2i − 1)(2i + 1)/α for i < M, and λ_{M} = 2(2M − 1)/α.

Substituting equation D.8 into equation D.5, the mean becomes 〈t_{1N}〉 = (α/2)Σ_{i=1}^{M} 1/(2i − 1). This sum of reciprocal odd integers (summed over *M*) can be shown to have the following closed-form solution (Abramowitz & Stegun, 1964):

Σ_{i=1}^{M} 1/(2i − 1) = ½Ψ(M + ½) + γ/2 + ln 2, (D.17)

where Ψ(*x*) is the digamma function defined as the derivative of the logarithm of the gamma function (i.e., Ψ(x) ≡ d ln Γ(x)/dx) and γ is the Euler–Mascheroni constant.

Lastly, it is useful to express the CV^{2} in terms of *M* and 〈*t*_{1N}〉. From the derivation of equation D.8, we have

α = 2〈t_{1N}〉/(½Ψ(M + ½) + γ/2 + ln 2),

which can be substituted into equation D.6 to get

var(t_{1N}) = (α/2)〈t_{1N}〉 = 〈t_{1N}〉^{2}/(½Ψ(M + ½) + γ/2 + ln 2).

Finally, using the expressions for the mean and for α (see equations D.5 and D.17), we obtain the following result:

CV^{2} = [½Ψ(N − ½) + γ/2 + ln 2]^{−1},

where we revert to a notation using the number of states *N*.
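The zero-energy solution can be checked numerically (a sketch of ours, not from the original text; helper names are ours): with α = 1, the rates above should achieve CV^{2} = [Σ_{i=1}^{M} 1/(2i − 1)]^{−1}, and equal symmetric rates should do strictly worse.

```python
import numpy as np

def cv2(Q):
    """CV^2 of the hitting time from the first to the last state of
    generator Q (row convention, last state collecting)."""
    n = Q.shape[0]
    Tinv = np.linalg.inv(Q[:n-1, :n-1])
    m = (-Tinv @ np.ones(n-1))[0]
    s = (2.0 * Tinv @ Tinv @ np.ones(n-1))[0]
    return (s - m*m) / (m*m)

def diffusion_chain(lam):
    """Symmetric birth-death generator with reciprocal rates lam[i]
    between states i and i+1; last state collecting."""
    M = len(lam)
    Q = np.zeros((M+1, M+1))
    for i in range(M):
        Q[i, i+1] = lam[i]
        Q[i+1, i] = lam[i]
    Q[-1, :] = 0.0
    for i in range(M):
        Q[i, i] = -Q[i].sum()
    return Q

M = 10
# optimal zero-energy rates with alpha = 1: lam_i = (2i-1)(2i+1), lam_M = 2(2M-1)
lam = [(2*i - 1)*(2*i + 1) for i in range(1, M)] + [2*(2*M - 1)]
S = sum(1.0/(2*i - 1) for i in range(1, M + 1))  # sum of reciprocal odd integers
assert abs(cv2(diffusion_chain(np.array(lam, float))) - 1.0/S) < 1e-9
# equal symmetric rates are strictly worse
assert cv2(diffusion_chain(np.ones(M))) > 1.0/S
```

The matrix-based CV^{2} matches the closed-form value, supporting the derivation above.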

### D.1. The Matrix *Z*_{ij} ≡ min(*i*, *j*)^{2} Is Positive Definite.

In this section, we prove that the *M* × *M* matrix **Z**, where *Z*_{ij} ≡ min(*i*, *j*)^{2}, is positive definite. First, note the following identity:

min(i, j)^{2} = Σ_{k=1}^{min(i,j)} (2k − 1), (D.22)

which can be easily proven inductively. Now let us define a set of vectors **u**^{(i)} for *i* = 1, …, *M*, where vector **u**^{(i)} consists of *i* − 1 zeros followed by *M* − *i* + 1 elements all having the value √(2i − 1)—for example,

**u**^{(1)} = (1, 1, …, 1)^{T}

and

**u**^{(M)} = (0, …, 0, √(2M − 1))^{T}.

From the identity given in equation D.22, **Z** can be rewritten as the following sum of outer products:

**Z** = Σ_{i=1}^{M} **u**^{(i)}**u**^{(i)T}. (D.27)

Since the vectors **u**^{(i)} are linearly independent and there are *M* of them, they form a basis. Since the projection of an arbitrary nonzero vector **x** onto at least one basis vector must be nonzero, at least one of the terms in the sum in equation D.27 must be positive. Thus, we have **x**^{T}**Zx** = Σ_{i=1}^{M} (**u**^{(i)T}**x**)^{2} > 0, and so **Z** is positive definite.
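The outer-product decomposition, and hence positive definiteness, is simple to verify numerically (a sketch of ours; it relies on the identity Σ_{k=1}^{m} (2k − 1) = m^{2}, the sum of the first *m* odd integers):

```python
import numpy as np

M = 8
idx = np.arange(1, M + 1)
Z = np.minimum.outer(idx, idx).astype(float) ** 2  # Z_ij = min(i, j)^2

# Z = sum_i u_i u_i^T, where u_i has i-1 zeros then sqrt(2i-1) repeated
U = np.zeros((M, M))
for i in range(1, M + 1):
    U[i - 1, i - 1:] = np.sqrt(2*i - 1)
assert np.allclose(U.T @ U, Z)

# all eigenvalues strictly positive -> positive definite
assert np.all(np.linalg.eigvalsh(Z) > 0)
```

Stacking the vectors as the rows of **U** makes the sum of outer products a single product **U**^{T}**U**, which is a convenient way to check such decompositions.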

## Acknowledgments

S.E. acknowledges the N.I.H. Medical Scientist Training Program and the Columbia University M.D.-Ph.D. Program for supporting this research. L.P. is supported by an NSF CAREER award, an Alfred P. Sloan Research Fellowship, and a McKnight Scholar award. We thank B. Bialek, S. Ganguli, L. Abbott, T. Toyoizumi, X. Pitkow, and other members of the Center for Theoretical Neuroscience at Columbia University for many helpful discussions.

## Notes

^{1}

As we show, in the case that the state-specific signal accumulation rates are all unity, the generated signal is the system processing time itself.

^{2}

States 1 and *N* are arbitrary, albeit convenient, choices. The
labels of the states can always be permuted without changing the underlying
network topology.

^{3}

Whether *N* truly is collecting or not does not affect the hitting time *t*_{1N} since this random variable is independent of the behavior of the system after arrival at state *N*. Thus, setting the *N*th column of **A** to zero does not result in a loss of generality.

^{4}

In animal and human behavioral data, the variance of a measured interval of time
is proportional to the square of its mean (Weber's law). As discussed in section 2.2, all timing mechanisms that can
be modeled as Markov chains will have a constant CV^{2} (and will thus
exhibit Weber's law), but our proof shows that a constant-rate linear mechanism
is optimally reliable.

^{5}

Note that an arbitrary Markov chain is not conservative (i.e., the path integral
of the energy depends on the path), so, although for a pair of states *i* and *j* each state can be thought of as
having an associated energy, these associated energies may change when *i* and *j* are paired with other states in
the chain.

^{6}

It is not clear why the first rate pair has its own unique behavior. Attempts to analytically solve for the rate values at finite, nonzero energies were unsuccessful, but these numerical results were robust. Similarly, we were unable to determine expressions for the merge-point energies.

^{7}

Note that many more states may be in the diffusive tail than in the irreversible linear chain portion of the solution, but as long as ξ(*M*_{r}) ≪ *M*_{i}, these states do not appreciably change the reliability of the system.

^{8}

Note that hitting time variables (e.g., *t*_{ij}) refer to specific start and end states (*i* and *j*, in this case), while *t*_{loop} and *t*_{path} refer to specific end states (*i* and *j*, respectively), but not specific start states.

^{9}

Note that **P**_{A} defined in this manner does not meet all of the conditions of the unique Moore-Penrose pseudoinverse (Penrose & Todd, 1955). Though the equalities **A**_{j}**P**_{A}**A**_{j} = **A**_{j} and **P**_{A}**A**_{j}**P**_{A} = **P**_{A} hold (as long as **A**_{j} is an appropriately structured transition rate matrix with *j* as the unique collecting state), the products **A**_{j}**P**_{A} and **P**_{A}**A**_{j} are not symmetric matrices (as they are for the Moore-Penrose pseudoinverse).