Markov chains are a class of probabilistic models that have achieved widespread application in the quantitative sciences. This is in part due to their versatility, but is compounded by the ease with which they can be probed analytically. This tutorial provides an in-depth introduction to Markov chains and explores their connection to graphs and random walks. We use tools from linear algebra and graph theory to describe the transition matrices of different types of Markov chains, with a particular focus on exploring properties of the eigenvalues and eigenvectors corresponding to these matrices. The results presented are relevant to a number of methods in machine learning and data mining, which we describe at various stages. Rather than being a novel academic study in its own right, this text presents a collection of known results, together with some new concepts. Moreover, the tutorial focuses on offering intuition to readers rather than formal understanding and only assumes basic exposure to concepts from linear algebra and probability theory. It is therefore accessible to students and researchers from a wide variety of disciplines.

Markov chains are a versatile tool for modeling stochastic processes and have been applied in a wide variety of scientific disciplines, such as biology, computer science, and finance (Pardoux, 2010). This is unsurprising considering the number of practical advantages they offer: (1) they are easy to describe analytically, (2) in many domains they make complex computations tractable, and (3) they are a well-understood model type, meaning that they offer some level of interpretability when used as a component of an algorithm. Furthermore, as we show in this tutorial, Markov chains are temporal processes that take place on graphs. This makes them particularly suitable for modeling data-generating processes that underlie time series and graph data sets, both of which have received much attention in the fields of machine learning and data mining (Aggarwal, 2015).

The application of Markov chains requires the assumption that at least some aspect of the process being modeled has no memory. An important consequence of this assumption is that the process can be described in detail using a transition matrix. Furthermore, there exists a rich framework for describing distinct features of such processes based on the eigenvalues and eigenvectors of this matrix. This tutorial provides an in-depth exploration of this framework, making use of tools from probability theory, linear algebra, and graph theory. Since the work is intended for readers from diverse academic backgrounds, we concentrate on providing intuition for the tools used rather than strict mathematical formalism.

The material presented underlies multiple methods from different areas of machine learning, and instead of exploring these methods individually, we focus on the general properties that make Markov chains useful across these domains. Nonetheless, so that readers can appreciate the scope of the tutorial, we briefly summarize the methods that it is relevant to. In graph-based unsupervised learning, it is related to nonlinear dimensionality reduction techniques such as Laplacian eigenmaps (Belkin & Niyogi, 2001, 2003) and spectral clustering (Weiss, 1999; Ng et al., 2001; von Luxburg, 2007). These two closely related methods aim to represent data sets in a way that preserves local geometry and are traditionally formulated using graph Laplacians. However, one line of work on spectral clustering instead uses Markov chains (Meilă & Shi, 2000, 2001; Tishby & Slonim, 2001; Saerens et al., 2004; Liu, 2011; Meilă & Pentney, 2007; Huang et al., 2006; Weinan et al., 2008). Furthermore, the method of diffusion maps (Coifman et al., 2005a, 2005b; Coifman & Lafon, 2006) is a generalization of Laplacian eigenmaps that is based on Markov chains and can be tuned to different length scales in a graph, thereby allowing a multiscale geometric analysis of data sets. An in-depth survey of Laplacian eigenmaps, spectral clustering, diffusion maps, and other related methods can be found in Ghojogh et al. (2021). In the domain of time series analysis, the tutorial is relevant to slow feature analysis (SFA) (Wiskott & Sejnowski, 2002), a dimensionality-reduction technique that is based on the notion of temporal coherence and is conceptually related to Laplacian eigenmaps (Sprekeler, 2011). The ideas underlying Laplacian eigenmaps and spectral clustering have also been extended to classification problems, both for labeled (Kamvar et al., 2003) and partially labeled data sets (Szummer & Jaakkola, 2001; Joachims, 2003; Zhou et al., 2005). Finally, the material presented in this tutorial also forms the basis of various approaches to value function approximation in reinforcement learning, such as Mahadevan’s proto-value functions (Mahadevan, 2005; Mahadevan & Maggioni, 2007; Johns & Mahadevan, 2007), Stachenfeld’s work on successor representation (Stachenfeld et al., 2014, 2017), and other closely related methods (Petrik, 2007; Wu et al., 2019). Something common to many of the applications mentioned thus far is that they assume all underlying graphs to be undirected or, equivalently, that the corresponding Markov chain is reversible. This provides a number of guarantees that are crucial for these methods to work, and we explore these guarantees in depth in this tutorial. In most cases, the extension to the directed/nonreversible setting faces a number of challenges and is still actively researched. We discuss these challenges and present various solutions that have been suggested in the literature.

The rest of the text is organized as follows. In section 2, we give a general introduction to discrete-time, stationary Markov chains on finite state spaces and explore some specific types of chains in detail. Section 3 then gives a formal introduction to graphs in order to provide a more detailed description of Markov chains. In section 4, random walks are presented as a canonical transformation that turns any graph into a Markov chain, and the undirected and directed cases are considered separately to better understand the types of Markov chains that they typically give rise to. Finally, in section 5, we explore applications of the material in earlier sections in two areas of computer science literature.

2.1  Definition

Markov processes are an elementary family of stochastic models describing the temporal evolution of an infinite sequence of random variables X={Xt:tT}, defined on a state space S and indexed by a time set T. Such processes respect the Markov property, in which the future evolution is conditionally independent of the past, given the present state of the chain. In this tutorial, we focus on models for which time is discretized (i.e., T=N0), known as Markov chains. Furthermore, we restrict our consideration to Markov chains defined on finite state spaces with |S|=N states. In such settings, the Markov property can be formalized in terms of transition probabilities:
Pr(Xt+1=x|X1=x1,X2=x2,X3=x3,...,Xt=xt)=Pr(Xt+1=x|Xt=xt).
(2.1)
If these probabilities are themselves independent of time, the Markov chain is said to be homogeneous, and its evolution can be fully described by one-step transition probabilities between pairs of states in S: Pr(Xt + 1 = sj|Xt = si) = Pij. Collectively, these probabilities can be represented as an N × N matrix:
P=P11P12P1NP21P22P2NPN1PN2PNN,
(2.2)
which is called the transition matrix of the Markov chain. Note that the transition probabilities from each state si are organized along the rows of the matrix in equation 2.2 meaning that the rows sum to one: j=1NPij=1i. Another name for this type of matrix is a right stochastic matrix, and when the transition probabilities are instead organized along the columns, P is instead known as a left stochastic matrix and it has columns that sum to 1.

Markov chains can also be depicted visually in the form of a graph, with the state space S drawn as a collection of circles and labeled arrows between these circles representing the nonzero transition probabilities Pij. We call this diagram the transition graph of a Markov chain. A formal introduction to the mathematics of graphs is given in section 3, but until then, transition graphs are simply used as an illustrative tool.

As an example, imagine you are a PhD student who wants to evaluate how efficiently you work. In order to simplify your analysis, you posit that at any time of a working day, you are doing one of four activities: (1) studying, (2) speaking to your professor, (3) eating food, and (4) drinking coffee, which you denote as a set of states s1 = S, s2 = P, s3 = F, and s4 = C, respectively. As a further simplification, you assume that transitions between these activities are Markovian. After monitoring your activities for a few days, you come up with a set of empirical transition probabilities that you use to construct a transition graph, shown in Figure 1, and the following transition matrix:
SPFCP=SPFC0.50.10.20.21000000.50.51000.
(2.3)
With either of these two representations, it is straightforward to generate a realization of this Markov chain. To do this, we first need to pick a starting activity. Suppose that you are studying at time t = 0; then all activities are possible at t = 1. In order to choose from these possibilities, one must sample from a probability vector equal to the first row of P, that is, Pr(X1|X0 = S) = (0.5, 0.1, 0.2, 0.2). If our sample yields X1 = C, then this becomes the current state and we repeat the process. Doing this iteratively can generate sequences of arbitrary length, for example,
S-C-S-F-F-C-S-P-S-F-...,
which we refer to as a trajectory in the state space S. As is often the case when studying a stochastic process, generating single trajectories is rather uninformative since it provides no collective description of how the process tends to evolve. Naively, one way we could try to achieve such a description would be to perform a type of Monte Carlo sampling (see section 5.1) by generating several trajectories from the same starting state and summarizing the frequency with which future activities occur.
Figure 1:

The transition graph of a Markov chain describing the activities of a PhD student.

Figure 1:

The transition graph of a Markov chain describing the activities of a PhD student.

Close modal

In Figures 2a to 2d we do this for n = 10 trajectories of length 4, each starting with X0 = S. Each trajectory is depicted in a specific color, and consists of points in a transition graph plotted across four time points, so that the position of each point indicates a state that one of the trajectories is in at time t. Other than a slight bias toward studying (S) at each time point, it is hard to pick out any clear patterns using only these 10 trajectories. Figures 2e to 2h show similar plots for n = 100, but with all points colored black and the relative occupation of each state for t > 0 indicated by a percentage. Finally, we increase the number of trajectories to n = 1000 in Figures 2i to 2l. Percentages are again used to indicate the relative state occupations, but instead of representing the trajectories with dots, we color each state in gray-scale based on the percentage values. Comparing all the plots in this figure, one can note that in the first row, it is possible to track each of the individual trajectories, whereas in the second and third rows, the focus is instead on approximating the relative probability of doing each activity at each time.

Figure 2:

Simulations of the chain in Figure 1, each of length 4 and with starting state s1 = S. (a–d) n = 10 simulations of the chain, with colors indicating each trajectory. (e–h) n = 100 simulations, this time all depicted in black. (i–l) The outcome of n = 1000 simulations via gray-scale shadings. In panels e–h and i–l, percentage values indicate the relative population of each state at each time point.

Figure 2:

Simulations of the chain in Figure 1, each of length 4 and with starting state s1 = S. (a–d) n = 10 simulations of the chain, with colors indicating each trajectory. (e–h) n = 100 simulations, this time all depicted in black. (i–l) The outcome of n = 1000 simulations via gray-scale shadings. In panels e–h and i–l, percentage values indicate the relative population of each state at each time point.

Close modal
The plots of Figure 2 represent collections of random experiments, and therefore running them again would not yield exactly the same outcomes. However, taking a frequentist interpretation of probability, we can ask: In the limit n → ∞, how often is each activity done at time t? The answer to this question for t = 1 is given by the vector Pr(X1|X0 = S) = (0.5, 0.1, 0.2, 0.2), from which we sample at each time point in a trajectory. For t = 2, a distribution vector can be calculated by evaluating all the possible trajectories that lead up to each state after two steps. For example, first consider the probability with which we are drinking coffee at t = 2. Clearly, there are two ways this can happen: (1) SSC (i.e., s1s1s4) and (2) SFC (i.e., s1s3s4). By combining the corresponding probabilities, we get
Pr(X2=s4|X0=s1)=j=14Pr(X1=sj|X0=s1)Pr(X2=s4|X1=sj)
(2.4)
=j=14P1jPj4
(2.5)
=P11P14+P12P24+P13P34+P14P44
(2.6)
=0.1+0+0.1+0
(2.7)
=0.2.
(2.8)
By performing a similar calculation for the other activities at t = 2, we get the following probability vector: Pr(X2|X0 = S) = (0.55, 0.05, 0.2, 0.2). A number of comments can be made at this stage. First, while we can extend this type of calculation to t > 2, this quickly becomes infeasible to do by hand as the number of steps increases. It turns out that there is a simple mathematical formalism that makes these computations both more efficient and more interpretable. We introduce this formalism in the following section. Second, we can also apply the distribution picture at t = 0. In the example we gave, we always started in the same state, but we can easily generalize this to the case where X0 is not fully determined. For example, Pr(X0) = (0.5, 0, 0, 0.5) indicates that studying and drinking coffee both occur at t = 0 with probability 0.5. Finally, while the trajectories of a Markov chain were random, the two distributions we obtained for t = 1 and t = 2 were fully determined by our initial condition of X0 = s1 = S. Thus, moving from the trajectory perspective to the distribution perspective rather interestingly makes the evolution of our Markov chain look deterministic.

2.2  Evolution via Matrix-Vector Multiplication

The structure and action of P, just like any other matrix, can be evaluated using tools from linear algebra. In particular, P can either multiply column vectors from the left or row vectors from the right. While typical conventions formulate matrix multiplication in the former way, the latter is more common for right stochastic matrices due to its semantic interpretation. Nonetheless, as both operations offer their own insight into the descriptive capacity of Markov chains, we outline both in this section.

For any vector xRN, we see that by multiplying it in row form with P from the right gives a new vector zRN:
zT=xTP,
(2.9)
where zj=i=1NxiPij. By summing over the elements of z, we see that
j=1Nzj=j=1Ni=1NxiPij=i=1Nxij=1NPij=1=i=1Nxi,
(2.10)
meaning that multiplying with P from the right preserves the sum over vector entries. Furthermore, note that since all entries of P are nonnegative, if x is nonnegative, then so too is z. These two properties are particularly useful when dealing with probability vectors, which by definition both sum to 1 and have nonnegative entries, since it means that a probability vector x is mapped into another probability vector when multiplied from the right by P.
As we have seen, probability vectors are a suitable representation for tracking the evolution of a Markov chain, and as a shorthand, we describe the distribution of a chain at time t by μ(t)=(μ1(t),μ2(t),...,μN(t))T. Such a distribution can easily be evolved into another distribution describing the chain at time t + 1. To see this, consider the probability of being in some state sj at time t + 1, that is, μj(t + 1). This depends both on the probability of being in each state si at time t (i.e., μi(t)), as well the probability of making a transition from each state si to sj (i.e., Pij). By summing over all possible states si, we therefore see that
μj(t+1)=i=1Nμi(t)Pij.
(2.11)
Using these probabilities, we can now form the probability vector μ(t+1)=(μ1(t+1),μ2(t+1),...,μN(t+1))T, which is described by the vector-matrix analogue of equation 2.11:
μ(t+1)T=μ(t)TP.
(2.12)
Thus, multiplying probability vectors with P from the right represents the one-step evolution of a Markov chain. Furthermore, this operation can be extended to multiple steps of evolution by raising the power of P:
μ(t+k)T=μ(t+k-1)TP
(2.13)
=μ(t+k-2)TPP
(2.14)
(2.15)
=μ(t)TPPktimes
(2.16)
=μ(t)TPk,
(2.17)
where it is straightforward to establish that Pk is itself stochastic. The fact that the k-step evolution of a chain is represented simply by Pk is a result known as the Chapman-Kolmogorov equation (Stewart, 1994), and it tells us that the transition matrix is the only thing needed to evolve a starting distribution of a Markov chain arbitrarily far into the future.

A particularly special type of distribution for a Markov chain is one that is invariant under its evolution:

Definition 1
(Stationary Distribution). A distribution π that is invariant when multiplied by P from the right,
πT=πTP,
(2.18)
is said to be a stationary distribution of the Markov chain, and for finite state spaces, it is guaranteed that at least one such distribution exists.
The set of N equations implied by equation 2.18 are often referred to as the equations of global balance. To understand why, consider the jth equation in this set:
πj=i=1NπiPij.
(2.19)
The right-hand term in equation 2.19 represents the total flow of probability mass into state sj from all other states (including sj itself when Pjj ≠ 0). Since πj remains invariant under this flow, then an equal amount of probability mass must be flowing from sj to all other states in S, hence the name global balance. An important implication of equations 2.18 and 2.19 is that when a chain is in a stationary distribution at time t, it remains there for all future time steps. We can therefore interpret such distributions as a type of steady state of the underlying process.

We can also ask how much probability mass moves from one state to another in a given stationary distribution π. This is described by the following matrix:

Definition 2
(Flow Matrix). Given that a Markov chain is in one of its stationary distributions π, the corresponding flow matrix is defined as
Fπ:=ΠP,
(2.20)
where Π:=diag(π) is a diagonal matrix consisting of the entries of π, and (Fπ)ij=πiPij is the stationary flow of probability mass from one state si to another state sj.
So far we have considered only row vectors and their interpretation as probability vectors was fitting due to the way they transform when multiplied by P. For reasons that we outline below, this interpretation does not apply in the case of column vectors, which are multiplied by P from the left. To show this, we first assume that x=(x1,x2,...,xN)T is any vector in RN. Since this vector assigns a value to each state, we can interpret it as a function on S. If the chain is described at time t by μ(t), we can use this distribution to calculate the expected value of x:
Eμ(t)(x)=iμi(t)xi
(2.21)
=μ(t)Tx.
(2.22)
If we want to look k time steps into the future, then expected values are calculated in the same way but with the additional evolution of μ(t) in accordance with equation 2.17:
Eμ(t)k(x)=Eμ(t+k)(x)
(2.23)
=μ(t+k)Tx
(2.24)
=μ(t)TPkx
(2.25)
=Eμ(t)(Pkx).
(2.26)
Hence, given any starting distribution μ(t), the transition matrix can produce expected values of any function on the state space arbitrarily far into the future.
Often the initial condition of a Markov chain has zero uncertainty, with all probability occupying a single state si. In such settings, the starting distribution is a row vector with one component equal to one and all others zero, known as a one-hot vector and denoted by ei. If we evaluate the k-step expectation in equation 2.25 with ei as a starting distribution, we get
Eeik(x)=eiTPkx
(2.27)
=j(Pk)ijxj
(2.28)
=(Pkx)i.
(2.29)
This tells us how a vector x is transformed when multiplied by Pk from the left: it produces a new vector x'=Pkx, whose ith element is the expected value of x after k steps, conditioned on the starting state being si. Formally, this is a conditional expectation,
(Pkx)i=Ek(x|Xt=si),
(2.30)
which is the main operation underlying value functions in reinforcement learning (Sutton & Barto, 2018) (see section 5.2).
Summing over the elements of x' gives
i=1Nxi'=i=1Nj=1N(Pk)ijxj=j=1Nxji=1N(Pk)ij1j=1Nxj
(2.31)
since the columns of Pk, unlike its rows, do not sum to one. Hence, multiplying column vectors with Pk does not preserve the vector sum, which is why they are best interpreted as functions rather than distributions.

2.3  Eigenvalues and Eigenvectors

Every real matrix ARN×N can be interpreted as a linear transformation. A central task of linear algebra is to shed light on the relationship between the numerical properties of a matrix and various aspects of the transformation that it represents. Often a single matrix represents a combination of several distinct transformations; for example, an object in two or more dimensions can simultaneously be rotated and stretched. Finding the eigenvalues and eigenvectors of a matrix is one way to partition a linear transformation into its component parts and reveal their relative magnitudes. In this section, we give a brief and informal summary of how this works and apply this to transition matrices.

An eigenvector of a matrix A is a vector that is only multiplied by some number λ when multiplied by A. Like all other vectors, they can either be rows, that is, lTA=λlT, or columns, that is, Ar=λr, which are known as left eigenvectors and right eigenvectors, respectively. In both cases, the number λ is called the eigenvalue of the respective eigenvector. It is worth noting that real matrices, including transition matrices, can have complex eigenvalues, λC, in which case the corresponding eigenvector is also complex, vCN. However, such solutions can occur only in complex conjugate pairs, meaning that λ* and v* are also guaranteed to be an eigenvalue-eigenvector pair, where * denotes the complex conjugate.

A quick inspection of equation 2.18 reveals that we have already encountered an example of an eigenvector in the case of a transition matrix: a stationary distribution π is a left eigenvector of P with eigenvalue λ = 1, normalized such that it sums to 1. Similarly, it is straightforward to see that η=(1,1,...,1)T is a right eigenvector with eigenvalue λ = 1,
(Pη)i=j=1NPijηj=j=1NPij=1,
(2.32)
so that Pη=η.
We can similarly describe the action of a matrix A on a set of k eigenvectors. Let YRRN×k be a matrix whose columns are equal to k right eigenvectors A:
YR=r1r2rk.
(2.33)
Then the action of A on this matrix is simply
AYR=Ar1Ar2Ark
(2.34)
=λ1r1λ2r2λkrk
(2.35)
=r1r2rkλ1000λ200000λN
(2.36)
=YRΔ.
(2.37)
If we had alternatively used a matrix YL with rows equal to left eigenvectors of A, then an equivalent argument can show that YLA=ΔYL.
In linear algebra, one is often interested in finding linearly independent sets of eigenvectors. Without this condition, a set of eigenvectors can contain a lot of redundancy. For example, if r is an eigenvector of A with eigenvalue λ, then so too is the vector cr for any cC. Therefore, it is possible to construct an arbitrarily large set of eigenvectors using r alone, and the set of all possible vectors that can be formed in this way is known as the eigenspace of A corresponding to λ. If we instead look for a set of linearly independent eigenvectors of A, there can be at most N. When a set of N linearly independent eigenvectors exists, they form a basis for RN (the domain on which A acts). In this case, the set of eigenvectors is called an eigenbasis, and the matrix A is said to be diagonalizable. To justify the use of this latter term, imagine that the matrix YR contains N linearly independent right eigenvectors. Then, by definition, it is full rank, meaning that its inverse YR-1 exists. If we then multiply both sides of equation 2.37 by this inverse from the left, we get
YR-1AYR=Δ.
(2.38)
Before interpreting this expression, it is worth reflecting on its general form. If two matrices B and C can be related via U-1BU=C for some invertible matrix U, then they are said to be similar. Alternatively, C is said to be a similarity transformation on B, and since we could instead write B=UCU-1, the converse also holds. In words, similar matrices represent the same linear transformation but expressed in two different bases, where the matrix U is known as the change-of-basis matrix. In the case of equation 2.38, this means that A behaves like a diagonal matrix when acting on vectors that are expressed in its eigenbasis YR, hence the term diagonalizable.
Equation 2.38 can easily be rearranged to get the following expression for A
A=YRΔYR-1,
(2.39)
and since this only involves the eigenvalue and eigenvector matrices Δ and YR, it is known as the eigendecomposition of A. Furthermore, multiplying equation 2.39 from the left with YR-1 yields YR-1A=ΔYR-1, meaning that the rows of YR-1 are a set of N linearly independent left eigenvectors of A. Therefore:
A=r1r2rNλ1000λ200000λNl1Tl2TlNT
(2.40)
=λ1r1λ2r2λNrNl1Tl2TlNT
(2.41)
=ω=1NλωrωlωT,
(2.42)
where each term rωlωT in equation 2.42 is a matrix of rank 1 and is known as the outer product between rω and lω.1 Writing the matrix A in this way therefore gives good insight into how a diagonalizable matrix can be partitioned into distinct modes.
Finally, if lω and rγ are left and right eigenvectors of A with distinct eigenvalues, that is, λω ≠ λγ, then
lωTArγ-lωTArγ=0,
(2.43)
λωlωTrγ-λγlωTArγ=0,
(2.44)
(λω-λγ)lωTrγ=0.
(2.45)
Since (λω − λγ) ≠ 0 by assumption, this means that lω and rγ must be orthogonal. Thus, when all eigenvalues are distinct and the corresponding eigenvectors are normalized to have unit Euclidean norm, the following relation between the columns of YR and the rows of YR-1 holds:
lωTrγ=δωγ=1ifω=γ0otherwise.
(2.46)
Because of this, YR-1 is sometimes called the dual basis of YR, and the two sets of vectors are collectively referred to as a biorthogonal system. It is worth noting that when a diagonalizable matrix has repeated eigenvalues, there is extra freedom in the choice of left and right eigenvectors. Consequently, for such matrices, equation 2.46 is not guaranteed to hold; however, there exist certain choices of bases for which it does (Denton et al., 2021).
In the case of a Markov chain, these concepts can be applied when the transition matrix P is diagonalizable. In this case, any probability distribution μ(t) can be expressed in terms of the eigenbasis of P. Since we are treating distributions as row vectors, we do this using the left eigenvectors of P,
μ(t)T=ω=1NcωlωT,
(2.47)
where the components cω are the coordinates of μ(t) in this basis. We can then use this to reexpress the one-step evolution of the chain as
μ(t+1)T=(2.12)μ(t)TP
(2.48)
=(2.47)ω=1NcωlωTP
(2.49)
=ω=1NcωlωTP
(2.50)
=ω=1NcωλωlωT,
(2.51)
which can easily be extended to multiple steps of evolution:
μ(t+k)T=(2.17)μ(t)TPk
(2.52)
=ω=1NcωλωklωT,
(2.53)
Comparing equations 2.17 and 2.53, one can see that by working in the eigenbasis of P, we have transformed the evolution of the Markov chain from a matrix multiplication to a scalar multiplication along each basis vector lω by λωk. We have therefore improved our computational complexity from O(N2) to O(N).
This type of manipulation can be done for any diagonalizable matrix and does not depend on P being stochastic. However, one important result we present in section 3.5 is that all eigenvalues of a transition matrix have absolute value less than or equal to one. Using this fact, we can order the eigenvalues of P by their absolute value and separate the terms in equation 2.53 corresponding to eigenvalues with |λω| = 1, and those corresponding to |λω| < 1:
μ(t+k)T=ω=1mcωλωklωTpersistent+ω'=m+1Ncω'λω'klω'Ttransient,
(2.54)

where m is the number of eigenvalues of P with |λω| = 1. In the long time limit (k → ∞), the terms with |λω| = 1 survive, whereas those with |λω| < 1 die off, with |λ| measuring the rate of decay in the latter case. This allows us to interpret the first and second sums in equation 2.54 to represent the persistent and transient behavior of the Markov chain, respectively.

In the persistent case, we can partition the terms into three types based on the eigenvalues: (i) λ = 1, (ii) λ = −1, and (iii) λC, |λ| = 1. We have already seen that a stationary distribution π and the vector η are left and right eigenvectors of type i, and in both cases, these eigenvectors represent fixed structures on the state space that persist when acted on by P. To capture this property, we call such eigenvectors persistent structures. In case ii, eigenvectors flip their sign when acted on by the transition matrix, yTP=-yT, and return after two steps, yTP2=yT. Eigenvectors of this type therefore correspond to permanent oscillations of probability mass between states in S, and we therefore refer to them as persistent oscillations. Eigenvalues of type iii are explored in more depth in section 3.5, where we show that they are always complex roots of unity, λk = 1 for some k > 2, which occur at equally spaced locations on the unit circle. Therefore, when their corresponding eigenvectors are acted on repeatedly by P, they are returned to after k steps, yTPk=λkyT=yT. Thus, in analogy to case ii, they describe permanent cycles of probability mass through the state space S, and we therefore call such eigenvectors persistent cycles.

In the transient case, an analogous categorization based on the eigenvalues can be applied, consisting of the following three types: (i’) λ ∈ [0, 1), (ii’) λ ∈ (− 1, 0), and (iii’) λC, |λ| < 1. For type i’, the corresponding eigenvectors represent perturbations to the persistent behavior that decay over time, yTP=λyT where |λy|=|λ||y|<|y|. We therefore call such eigenvectors transient structures. When |λj| ≈ 1 these structures describe sets of states that, on average, a chain spends a long time in before converging; these are known as metastable sets (Conrad et al., 2016). Furthermore, λ = 0 can be thought of as a limiting case of type i’ in the sense that any corresponding eigenvector decays infinitely quickly (i.e., Py=0) and does not exhibit oscillatory or cyclic behavior. Eigenvalues of type ii’ also decay over time, but like case ii, their negative sign means that the corresponding eigenvectors exhibit oscillatory behavior when acted on by P, that is, yTP=-λyT where |λy|=|λ||y|<|y|. We refer to eigenvectors of this type as transient oscillations. Eigenvalues of type iii’ generalize those of type iii to the transient case since they are also complex and represent cycles of probability mass. In contrast to type iii, though, they do not occur on the unit circle and need not be equally spaced. Since |λ| < 1, these cycles decay over time, and we therefore call them transient cycles. When |λ| ≈ 1, these cycles can persist for a long time and have been referred to as dominant cycles (Conrad et al., 2016).

The above categorizations are summarized in Table 1, where structures, oscillations, and cycles are colored in green, blue, and red, respectively, and the persistent and transient cases are shaded bright and pale, respectively. In Figure 3, we visualize the six different types of eigenvalue using this color scheme by shading the respective regions of the unit circle in which they can occur.

Table 1:

Eigenvalues of Type (i-iii) and (i’-iii’) Organized and Colored Based on Whether They Are Persistent or Transient, and Whether They Are Structures, Oscillations, or Cycles.

graphic
 
graphic
 
Figure 3:

Eigenvalues of type i to iii and i’ to iii’ plotted based on where they lie on the unit circle, with the same color coding as in Table 1.

Figure 3:

Eigenvalues of type i to iii and i’ to iii’ plotted based on where they lie on the unit circle, with the same color coding as in Table 1.

Close modal

Before moving on, a couple of details are worth pointing out. First, the above analysis is possible only if the transition matrix P is diagonalizable. One case in which this is guaranteed to hold is for symmetric matrices. However, given that transition probabilities are rarely pairwise symmetric (i.e., Pij = Pji), this restriction is clearly too strong. Thus, a more detailed investigation is needed in order to identify the conditions under which the above decomposition can be made, which we do in sections 3 and 4. For a more in-depth account of the theory of diagonalizable matrices, we recommend Meyer (2000). Second, a quick check reveals that all terms in equation 2.54 are dependent on the components wj that describe the starting distribution (see equation 2.47). Because of this, in the most general case, both the persistent and transient behavior of a Markov chain can be sensitive to initial conditions. In a later section, we consider a particular type of Markov chain for which this is not the case, and provide a simplified analysis of their evolution over time.

2.4  Classification of States

In the following sections, we explore three types of Markov chains. In order to describe each type in detail, we first define various properties that apply to individual states, or sets thereof, in a state space S. This is the focus of this section.

2.4.1  Communicating Classes

We start by making the following definitions related to how states in S are connected:

Definition 3

(Accessibility). For two states si, sjS, we say that sj is accessible from si, denoted sisj, when it is possible to reach sj from si in k ≥ 0 steps, k:(Pk)ij>0.

Definition 4

(Communication). Two states si, sjS, are said to communicate if sisj and sjsi, which is denoted sisj.

Communication is a useful property for describing states in a Markov chain, as exemplified by the following result:

Proposition 1

(Communicating Class). Communication is an equivalence relation, meaning that:

  • siS, sisi, since by definition each state can reach itself in 0 steps, (P0)ii=(1)ii=1>0,

  • if sisj, then sjsi,

  • if sisj and sjsk, then sisk,

and the state space S can be partitioned into communicating classes, each containing states that all communicate with one another.

Furthermore, we can make a useful categorization of Markov chains based on the number of communicating classes they have:

Definition 5

(Number of Communicating Classes). Let n be the number of communicating classes of a Markov chain. When n = 1, the chain is said to be irreducible; otherwise it is reducible.

In words, an irreducible Markov chain is one in which for any pair of states, there exists a connecting path in both directions. In Figure 4, three example Markov chains are shown, with the communicating classes indicated by the dashed boxes. Take a moment to double-check why each state belongs to its communicating class, and verify that the examples in Figures 4a and 4b are reducible, whereas the one in Figure 4c is irreducible. Finally, observe from Figure 4b that it is possible for states in one communicating class to be accessible from states in another class (e.g., s3s4).

Figure 4:

Three Markov chains with differing connectivity, each with N = 6 states.

Figure 4:

Three Markov chains with differing connectivity, each with N = 6 states.

Close modal

Irreducible Markov chains feature in a number of subsequent sections of this tutorial due to the following result:

Theorem 1.

A Markov chain is irreducible if and only if it has a unique stationary distribution π. Furthermore, this distribution has strictly positive elements, i.e. πi>0siS.

This result can be applied to the example of Figure 4c by using the observation of the previous section that stationary distributions of a given Markov chain are eigenvectors of the associated transition matrix with eigenvalue 1. If we were to find the transition matrix of the chain in Figure 4c and compute its eigenvectors, we would indeed find a single left eigenspace of P corresponding to eigenvalue 1, with strictly positive elements. When normalized such that the row sum is 1, the resulting vector is π=(π1,π2,π3,π4,π5,π6)T=(0.09,0.09,0.07,0.31,0.18,0.26)T, which we encourage readers to check themselves. For reducible Markov chains, the guarantees of uniqueness and positivity of the stationary probabilities no longer hold. In order to explore the stationary distributions of such chains, we introduce some new concepts for describing states.

2.4.2  Recurrence and Transience

Each state siS can be categorized based on how likely it is to be revisited, given that it is currently occupied. This is formalized by the following definition:

Definition 6
(Recurrence/Transience). Given that state si is initially occupied, the probability of returning to si is defined as
fi:= Pr (Xk=siforsomek1|X0=si).
(2.55)
States for which fi = 1 are called recurrent, and those for which fi < 1 are called transient.

This is a useful way to characterize states in S, since it generalizes to all states within a communicating class, that is, for any communicating class CS: either all states in C are recurrent or all states in C are transient. Transience and recurrence are therefore examples of class properties, and we henceforth use the terms recurrent class and transient class for communicating classes that contain recurrent and transient states, respectively. Furthermore, for finite state spaces, a Markov chain is guaranteed to have at least one recurrent class. Building on this, we can then apply the notion of recurrence to a Markov chain as a whole:

Definition 7

(Recurrent Chains). A Markov chain that contains only recurrent classes is called a recurrent Markov chain.

As an illustration, we apply these concepts to the reducible chains depicted in Figure 4. In Figure 4a, there are two communicating classes, and in each one, there is no possibility to exit. This means that if a state in one of these classes is occupied, it is guaranteed to be revisited at some future time step, that is, fi = 1 for all states in each class. Therefore, both classes are recurrent and the chain as a whole is recurrent. In Figure 4b, the main difference is that there is now the possibility to exit the blue class without returning. For example, assuming s1 is occupied, although there is a possibility that s1 can be visited again later (e.g., s1s2s1), as soon as a transition s1s4 takes place, this will no longer be possible. This is why fi < 1 for states in the blue class. Hence, while the red class is recurrent, the blue class is transient, and because of this, the chain as a whole is not recurrent.

For an irreducible Markov chain, such as the example shown in Figure 4c, every state is guaranteed to be recurrent, leading to the following proposition:

Proposition 2

(Recurrence of Irreducible Markov Chains). All irreducible Markov chains are recurrent.

However, from the example in Figure 4a, it is clear that the converse does not hold, since it is also possible for a reducible chain to be recurrent. In fact, whether a reducible chain is recurrent or not determines certain features of the stationary distributions belonging to the chain. This is outlined by the following proposition:

Proposition 3

(Stationary Distributions of Reducible Chains). For a reducible Markov chain with r recurrent classes and t transient classes:

  • Any stationary distribution has probability zero for states belonging to a transient class.

  • For the kth recurrent class, there exists a unique stationary distribution πk with nonzero probabilities only for states in that class.

  • When the number of recurrent classes r is bigger than 1, stationary distributions can be formed via convex combinations of each distribution πk,
    π combo =k=1rαkπk, with αk0 and k=1rαk=1
    (2.56)
    meaning that there is an infinite number of stationary distributions.
  • Furthermore, when the number of transient classes t is zero or, in other words when the chain is recurrent, performing the procedure above with nonzero coefficients always yields distributions that are strictly positive.

We can apply these results to the two reducible chains in Figure 4. In the example of Figure 4a, the stationary distribution associated with the blue class is π1=(0.35,0.36,0.29,0,0,0)T and the one associated to the red class is π2=(0,0,0,0.39,0.26,0.35)T. We can then generate an arbitrary number of extra stationary distributions by taking convex combinations of π1 and π2 with coefficients α1 and α2. For example, with α1 = 0.25 and α2 = 0.75, we get π=(0.09,0.09,0.07,0.29,0.2,0.26)T. Furthermore, the last bullet point in proposition 3 tells us that since this chain is recurrent, we can be sure that any convex combination with positive coefficients yields a distribution with strictly positive entries. This property of recurrent chains is particularly relevant to our treatment of both reversible chains in section 2.6 and random walks in section 4, and we henceforth use π>0 to denote a stationary distribution with this property. In the example of Figure 4b, the red class is the only recurrent class, meaning that the stationary distribution associated with this class is the only stationary distribution of the chain. Since the transition probabilities for states in this class are the same as in the example of Figure 4a, this stationary distribution is π=π2. Furthermore, in agreement with proposition 3 we see that the transient states in this chain have a stationary probability of 0.

2.4.3  Periodicity

The notion that states can be revisited is also meaningful in another sense. We define the following quantity, which describes how frequently such revisits can take place.

Definition 8
(Periodic Chains). For each state siS, the period is defined as
di:=gcd{k1:(Pk)ii>0},
(2.57)
where gcd indicates the greatest common divisor. Then, equation 2.57 says that when starting in state si, it is possible to return to si only in multiples of the period di. States for which di > 1 are called periodic and those for which di = 1 are aperiodic.

Like transience and recurrence, period is also a class property, and we use d to refer to the period of a whole class. This in turn allows us to define the period of a Markov chain:

Definition 9

(Periodicity). When all communicating classes in S have period d > 1, the Markov chain is said to be periodic, with period d.

Definition 10

(Aperiodicity). When all communicating classes in S are aperiodic, the Markov chain is said to be aperiodic.

Definition 11

(Mixed Periodicity). If S contains communicating classes with different periods, the Markov chain is said to have mixed periodicity.

Consider the example shown in Figure 5a. This chain has two communicating classes, the red one being a transient class with d = 2 and the blue one being a recurrent class with d = 1, meaning that this chain has mixed periodicity. In Figures 5b5d we show three irreducible Markov chains, with panels b and c having period 3 and 2, respectively, and panel d being aperiodic.

Figure 5:

Four Markov chains with differing periodicity, each with N = 3 states.

Figure 5:

Four Markov chains with differing periodicity, each with N = 3 states.

Close modal

Markov processes are a broad class of models, and even under the restricted settings considered in this tutorial (discrete time, homogeneous, and finite state spaces), there are many distinct types of chains. In the following sections, we concentrate on three particular types that are relevant in applied domains.

2.5  Ergodic Chains

When modeling a system that evolves over time, it is important to ask what can be said, if anything, about its long-term behavior. For a Markov chain, this question can be phrased in two ways. On one hand, we can sample a single trajectory starting from some initial state and ask what the average behavior is over time: How often is it found in each state siS for a trajectory of length t? On the other hand, we can describe our starting conditions as a distribution μ(0) and ask what this evolves to in the future: What is the probability of being in each state si at a later time t? We refer to these two notions of long-term behavior as the trajectory and distribution perspectives, respectively. While the analyses given so far predominantly use the latter perspective, we remind readers that in section 2.1, we introduced the idea of a distribution over S by taking the limit of an infinitely large ensemble of trajectories, meaning that the two concepts are closely related.

This two-way view originates from the field of statistical physics, where physical processes can either be analyzed with temporal averages (i.e., the trajectory perspective) or ensemble averages (i.e., the distribution perspective). One class of systems that has received a lot of study in this field are those for which these two types of averaging yield the same result as t → ∞. Such systems are known as ergodic systems, and this equivalence means that a statistical description of their long-term behavior can be described simply by a single, sufficiently long sample. One implication of this is that initial conditions are forgotten over time, which makes ergodic systems particularly attractive from a simulation or modeling perspective. Finally, with this in mind, we can define an ergodic Markov chain as follows:

Definition 12

(Ergodic Markov Chain). An ergodic Markov chain is one that is guaranteed to converge to a unique stationary distribution.

Clearly, in order for a chain to be ergodic, it must have a unique stationary distribution. Therefore, by virtue of theorem 1, a necessary condition for a chain to be ergodic is that it is irreducible. However, there is no guarantee that an irreducible chain converges, which is the second condition of definition 12. The convergence of a Markov chain is related to its periodicity, as explained by the following result:

Theorem 2
(Convergence of a Markov Chain). The evolution of a Markov chain with period d can lead to a permanent, repeating sequence of d distributions,
μ(k)μ(k+1)...μ(k+d-1)μ(k),
(2.58)
which for d = 2 and d > 2 correspond to persistent oscillations and persistent cycles, respectively. In the case of an aperiodic Markov chain (d = 1), only sequences of length 1 are allowed, which means that one of the stationary distributions is guaranteed to be reached.

We can apply this result to the irreducible Markov chain in Figure 5c. This chain has a unique stationary distribution π=(0.35,0.15,0.5)T and a period of d = 2. Therefore, there is no guarantee that the chain converges to π since it can get trapped in persistent oscillations. To observe this, we can try out different initial conditions and iteratively apply the update rule in equation 2.12. For example, starting with the distribution μ(0)=(0.25,0.5,0.25)T, we get a persistent oscillation between the following two distributions: μ(k)=(0.175,0.075,0.75)T and μ(k+1)=(0.525,0.225,0.25)T. However, this is not the only persistent oscillation possible for this chain, which can be observed by trying out different initial conditions. Finally, in section 3.5, we gain more insight on theorem 2 by using tools from graph theory to describe the eigenvectors of P with |λ| = 1.

A key insight from theorem 2 is that a Markov chain is guaranteed to converge only if it is aperiodic. Together with irreducibility, this therefore provides the conditions under which a chain is ergodic:

Theorem 3

(Conditions for Ergodicity). A Markov chain is ergodic if and only if it is both irreducible and aperiodic, which respectively ensure that there is a unique distribution π and the chain always converges to this distribution. Furthermore, the distribution π is said to be the limiting distribution of the chain.

Ergodic Markov chains have a number of beneficial properties. First, as with any ergodic system, the initial conditions are eventually forgotten, which means that when studying the statistical behavior of an ergodic chain, it is not necessary to explore different starting states. This advantageous property underlies a number of methods in machine learning and beyond, with Markov chain Monte Carlo methods being a particularly well-known example (Richey, 2010) (see section 5.1). Second, the equivalence between the trajectory and distribution perspectives allows us to interpret the limiting probabilities πi as the long-run fraction of steps that the chain spends in each state si (Porod, 2021). Finally, analyzing the evolution of a Markov chain in terms of the eigenvectors of its transition matrix becomes somewhat simpler in the case of an ergodic chain. We have already seen that when the transition matrix of a Markov chain is diagonalizable, we can vastly simplify the computation of the k-step evolution of the Markov chain (see equation 2.54). When the Markov chain is also ergodic, we know that its persistent behavior is fully described by l1=π, meaning that this expression simplifies to
μ(t+k)T=πT+ω=2NcωλωklωT
(2.59)

2.6  Reversible Chains

One of the defining characteristics of Markov chains is that the future (X) is conditionally independent of the past (Z), given the present (Y). A simple calculation demonstrates that the relationship of conditional independence is symmetric. Assuming that Pr(X|Y, Z) = Pr(X|Y), we find that
Pr(Z|Y,X)=Pr(X,Y,Z)Pr(Y,X)
(2.60)
=Pr(X|Y,Z)Pr(Y,Z)Pr(X|Y)Pr(Y)
(2.61)
=Pr(X|Y)Pr(Y,Z)Pr(X|Y)Pr(Y)(byassumption)
(2.62)
=Pr(Y,Z)Pr(Y)
(2.63)
=Pr(Z|Y).
(2.64)
Assuming that we have an initial Markov chain X described by a transition matrix P, this symmetry tells us that reversing the direction of time produces a new process that itself satisfies the Markov property. Hence, we can define this new Markov chain as the time-reversal of X, which we denote as X˜. A natural question we can ask is how the transition matrix of X˜ is related to P. In order to answer this, we make the assumption that at time t, the chain is described by one of its stationary distributions π. Therefore:
Pr(Xt=si)=πisiSandt0.
(2.65)
We can make use of this, together with Bayes’ theorem, to work out the transition probabilities (Prev)ij of X˜ (Brémaud, 1999):
(Prev)ij=Pr(Xt=sj|Xt+1=si)=Pr(Xt+1=si|Xt=sj)Pr(Xt=sj)Pr(Xt+1=si)=Pjiπjπi.
(2.66)
A couple of reflections can be made about the result above. First, since we initialized X to one of its stationary distributions, the time-reversal relationship between X and X˜ holds only once they have converged. Second, since we have πi in the denominator of equation 2.66, the time reversal of a Markov chain is only valid for a starting distribution with πi>0siS. In section 2.4.1, we have established that only recurrent chains have stationary distributions of this type, meaning that the time reversal of a Markov chain is well defined only if the chain is recurrent. Finally, using equation 2.66, it is possible to define Prev in matrix notation as follows:
Definition 13
(Time Reversal). Let X be a recurrent Markov chain with transition matrix P and π>0 one of its stationary distributions. Then the transition matrix of its time reversal X˜ is given by
P rev :=Π-1PTΠ,
(2.67)

It is worth emphasizing that since no assumption is made in definition 13 about the number of communicating classes, it also applies in the case of reducible recurrent chains where there is an infinite number of distributions π>0 to choose from. However, in such cases, the choice of π makes no difference:

Proposition 4

(Time Reversal of Reducible Chains). For a reducible recurrent Markov chain X, the time reversal X˜ is uniquely defined, with P rev being independent of which stationary distribution π>0 is used (for the proof, see the appendix).

Moreover, the set of stationary distributions π>0 belonging to a recurrent Markov chain is the same as the set belonging to the corresponding time reversal:

Proposition 5

(Stationary Distributions of Time Reversal). Let X be a recurrent Markov chain. Then π>0 is a stationary distribution of X˜ if and only if it is a stationary distribution of X.2 (For the proof, see the appendix.)

We now consider the special case where a Markov chain X is indistinguishable from its time reversal X˜. Such processes are known as reversible Markov chains, since in any stationary distribution, the forward and backward dynamics of the chain are statistically equivalent: any trajectory X1, X2, . . . , Xk − 1, Xk occurs with equal probability as the corresponding reversed trajectory Xk, Xk − 1, . . . , X2, X1. The stationary dynamics of such chains therefore has no inherent arrow of time. Furthermore, since X and X˜ are indistinguishable, they have the same transition matrix, P=Prev, and for any stationary distribution π>0, the forward and backward flow matrices are the same. Therefore
Fπ=Frevπ,
(2.68)
ΠP=ΠPrev,
(2.69)
ΠP=(2.67)PTΠ.
(2.70)
By expressing equation 2.70 in component form, we arrive at the following theorem, which we use throughout the rest of the tutorial:
Theorem 4
(Detailed Balance). A recurrent Markov chain is reversible if and only if for any stationary distribution π>0:
πiPij=πjPjisi,sjS,
(2.71)
which are known as the equations of detailed balance.

A number of observations can be made about this definition. First, the left (right) terms represent the flow of probability from si to sj (sj to si), given that the chain is described by distribution π. Thus, for a reversible Markov chain in one of its stationary distributions, the flow from one state to another is completely balanced by the flow in the reverse direction, meaning that the flow matrix Fπ is always symmetric for such chains. By comparison with equation 2.19, we see that detailed balance is a stronger condition than global balance, since in the latter case, there is only an equivalence between the total flow in and out of each state. Second, since πi and πj are nonzero, it follows that Pij ≠ 0 if and only if Pji ≠ 0. Thus, the transition structure of a reversible Markov chain always permits the return to the previous state, and because of this, the period of a reversible Markov chain can be at most 2. Third, while some sources assume irreducibility as a precondition of reversibility, we instead base our definition on the weaker condition of recurrence (Porod, 2021). This is due to the fact that we only need recurrence in order to define the time reversal of a Markov chain. Furthermore, with this convention, theorem 4 applies more broadly to reducible Markov chains, which lets us make a closer comparison between reversible Markov chains and undirected graphs in section 4. Finally, theorem 4 implies that there are two distinct ways in which Markov chains can be nonreversible: (1) they can be recurrent without satisfying detailed balance, or (2) they can be nonrecurrent. In case 1, ΠPPTΠ for any distribution π>0, meaning that the flow matrix Fπ is asymmetric, and in case 2, no positive stationary distribution exists. Finally, note that for a nonrecurrent chain, there exists the possibility that Fπ is symmetric for all stationary distributions despite none of those distributions being strictly positive. For such chains, removing all transient states from S produces a reversible chain. We therefore refer to such chains as semireversible.

To illustrate some of these points, we consider four Markov chains in Figure 6, with panels a, d, g, and j showing the transition graphs of each example, and panels b, e, h, and k showing the respective transition matrix, stationary distribution, and flow matrix (for simplicity, each example has a single recurrent class, so that both the stationary distribution and the associated flow matrix are unique). As a visual illustration of the pairwise stationary flow between states in each example, in panels c, f, i, and l, the stationary distribution is represented as a bar plot, with the portions of probability mass flowing in (left) and out (right) of each state shown as portions of each bar. The example in panel a is reversible, as can be seen by the symmetry of F1π in panel b, or equivalently by the matching between left and right portions of all bars in panel c. The example in panel d is almost equivalent to the one in panel a, except that the outgoing transition probabilities from s1 have been slightly modified (indicated by the colored arrows in panels a and d and the colored entries of P1 and P2 in panels b and e). This modification is enough to violate detailed balance, as can be seen by the asymmetry of F2π or the bar plot in panel f. Finally, the chains depicted in panels g and j are nonrecurrent since in both cases, state s3 has only outgoing transitions. Therefore, π3 = 0, and both examples are nonreversible. However, a quick check of F3π or the bar plot in panel i reveals that the example in panel g is semireversible. Conversely, the example in panel h is identical to the one in panel g except for the outgoing transitions from s4 (again indicated by the colored arrows in panels g and j and the colored entries of P3 and P4 in panels h and k), which leads to an asymmetric stationary flow between the recurrent states.

Figure 6:

Three example Markov chains: (a) reversible, (d) recurrent and nonreversible, (g) semireversible, and (j) nonrecurrent and nonreversible. In panels b, e, h, and k, the transition matrices, stationary distributions, and flow matrices of each example are shown, with numerical values rounded to three decimal places. In panels c, f, i, and l, the stationary distributions are visualized by a bar plot, with states represented along the x-axis. The right (left) portion of each bar represents the outgoing (incoming) flow of probability mass to (from) all other states. The color of each segment indicates which state the probability mass is flowing to or from, with s1, s2, s3, and s4 depicted in blue, orange, green, and red, respectively.

Figure 6:

Three example Markov chains: (a) reversible, (d) recurrent and nonreversible, (g) semireversible, and (j) nonrecurrent and nonreversible. In panels b, e, h, and k, the transition matrices, stationary distributions, and flow matrices of each example are shown, with numerical values rounded to three decimal places. In panels c, f, i, and l, the stationary distributions are visualized by a bar plot, with states represented along the x-axis. The right (left) portion of each bar represents the outgoing (incoming) flow of probability mass to (from) all other states. The color of each segment indicates which state the probability mass is flowing to or from, with s1, s2, s3, and s4 depicted in blue, orange, green, and red, respectively.

Close modal

It is worth pointing out that in our analysis above, we check for reversibility by inspecting the stationary distributions and the corresponding flow matrices of each example. However, since reversibility is a property associated with Markov chains and not with distributions, one might wonder whether there is an alternative way to formalize it based purely on the transition probabilities Pij. Clearly, equation 2.71 prohibits one-way transitions (i.e., Pij > 0 and Pji = 0), but this is only a necessary condition of reversibility. Can we offer anything more precise? Fortunately, the answer is yes, and it is given by Kolmogorov’s criterion (Kolmogoroff, 1936):

Theorem 5
(Kolmogorov’s Criterion). A recurrent Markov chain is reversible if and only if the product of one-step transition probabilities along any finite closed path of length more than two is the same as the product of one-step transition probabilities along the reversed path. In other words:
Pi1,i2Pi2,i3Pin-1,inPin,i1=Pi1,inPin,in-1Pi3,i2Pi2,i1
(2.72)
for any n ≥ 2 and any sequence of states si1, si2, si3, . . . , sin-1, sinS.

One way to understand this theorem is that for reversible Markov chains, the probability of traversing any closed path in the state space S is independent of the direction of traversal. Hence, reversible Markov chains can be thought of as having zero net circulation. By contrast, recurrent Markov chains that are nonreversible have at least one path that violates equation 2.72, over which there is a higher probability to traverse in one direction than the other. For the example in Figure 6a, the relevant closed paths are (up to a cyclic permutation): (1) s1s2s4s3s1, (2) s1s2s4s1, and (3) s1s4s3s1. In any of these cases, going around clockwise is equally probable as going around counterclockwise, which is to be expected since this Markov chain is reversible. The example in Figure 6d has the same closed paths available, except that the outgoing transition probabilities from s1 have been changed. This small adjustment is enough to introduce circulation on all the closed paths: for both paths 1 and 3 the counterclockwise direction is more probable since P13P34P42P21 > P12P24P43P31 and P13P34P41 > P14P43P31, respectively, and for path 2, the clockwise direction is more probable since P12P24P41 > P14P42P21. Therefore, by virtue of having at least one path with net circulation, equation 2.72 confirms that this chain is indeed nonreversible.

These analyses illustrate how theorems 4 and 5 provide two alternative but equivalent definitions of reversibility. Something common to both of these interpretations is that reversible Markov chains satisfy a type of equilibrium, either between the exchange of probability mass between pairs of states or the circulation along closed paths, respectively. In fact, the concept of detailed balance stems from early work in the field of statistical mechanics aimed at formalizing the notion of thermodynamic equilibrium on a microscopic level (Gorban, 2014). More recently, Markov chain Monte Carlo methods, which are predominantly based on reversible ergodic chains (see section 5.1), have received widespread application in the natural sciences as a way to model systems that are in thermodynamic equilibrium (Richey, 2010). Conversely, Markov chains that violate detailed balance, or equivalently those with net circulation, have been applied to the less-well-understood case of systems which are out of equilibrium (Jiang et al., 2004; Zhang et al., 2012; Ge et al., 2012). Furthermore, their stationary distributions have been referred to as nonequilibrium steady states (NESS) (Jiang et al., 2004; Zhang et al., 2012; Ge et al., 2012; Conrad et al., 2016; Witzig et al., 2018), which reflects the fact that such distributions are kept fixed over time via unequal flows of probability mass between states (π2 is an example of a NESS, as can be seen in the bar plot in Figure 6f).

Reversible Markov chains are significantly easier to treat both analytically and numerically than nonreversible chains. Because of this, there exist various procedures for modifying a nonreversible Markov chain so that it becomes reversible, which is sometimes referred to as reversibilization (Fill, 1991; Brémaud, 1999). For a recurrent chain, this can be done by taking an average of the forward and backward transition probabilities, P and Prev, that describe the chain and its time reversal, respectively. This averaging process can be either additive or multiplicative, leading to the following two definitions:

Definition 14
(Additive Reversibilization). Let X be a recurrent nonreversible Markov chain with transition matrix P and a stationary distribution π>0. Then the additive reversibilization of X is a chain with the following transition matrix:
PA:=P+P rev 2
(2.73)
=P+Π-1PTΠ2,
(2.74)
and for which π is also a stationary distribution.
Definition 15
(Multiplicative Reversibilization). For a nonreversible Markov chain with transition matrix P and a strictly positive stationary distribution π, the multiplicative reversibilization produces a Markov chain described by the following transition matrix:
PM:=PP rev
(2.75)
=PΠ-1PTΠ,
(2.76)
and for which π is also a stationary distribution.
Since both definitions produce chains that have the same set of stationary distributions as the starting chain, a simple way to verify their reversibility is to calculate the flow matrix for the distribution π. For the additive reversibilization this gives
FAπ=ΠPA
(2.77)
=ΠP+PTΠ2
(2.78)
=ΠP+(ΠP)T2
(2.79)
=Fπ+(Fπ)T2
(2.80)
and for the multiplicative reversibilization,
FMπ=ΠPM
(2.81)
=ΠPΠ-1PTΠ
(2.82)
=(ΠPΠ-12)(Π-12PTΠ)
(2.83)
=(ΠPΠ-12)(ΠPΠ-12)T,
(2.84)
both of which are by definition symmetric. While equation 2.84 does not admit a simple interpretation, equation 2.80 says that for the additive reversibilization, the flow matrix is symmetric because it corresponds to an average of the forward flow, Fπ, and backward flow, (Fπ)T, of the starting chain X. This interpretation is used when we consider random walks on directed graphs in section 4.3.

2.7  Absorbing Chains

Finally, one concept in the theory of Markov chains that is particularly relevant to applied domains is absorption. A state siS is called absorbing if it is possible to transition into the state but not out of it, meaning that Pii = 1 and the chain stays in si for all future time steps. An absorbing Markov chain is one for which from every state siS, there exists some path to an absorbing state. Since it is possible to start in a nonabsorbing state and never return, all nonabsorbing states are transient, and the presence of such states means that absorbing chains can be neither reversible nor ergodic. Absorbing chains often occur in Markov decision processses (MDPs), which are central to the field of reinforcement learning (Sutton & Barto, 2018).

The possible transitions in an absorbing chain can be partitioned into three types: (1) transient → transient, (2) transient → absorbing, and (3) absorbing → absorbing. Although the assignment of indices to states in S is arbitrary, an assignment based on this partitioning simplifies the analysis of absorbing Markov chains.

Definition 16
(Canonical Form). For an absorbing Markov chain with r absorbing and t transient states, the transition matrix P can be arranged to have the following block structure, known as the canonical form:
P=QR01,
(2.85)
where QRt×t, RRt×r, and 1Rr×r describe transitions of types 1, 2, and 3, respectively, and 0Rr×t is a matrix of zeros. Thus, matrix Q is what remains of P when we remove any absorbing states from S.

We depict this partitioning of transition probabilities in Figure 7a. An absorbing chain with one absorbing state is shown, with the transitions belonging to Q, R, and 1 colored black, red, and blue, respectively. Furthermore, in Figure 7b, we show the matrices Q and R.

Figure 7:

(a) The transition graph of an absorbing Markov chain with four states, where s4 is an absorbing state and (b) the matrices Q and R associated with this chain.

Figure 7:

(a) The transition graph of an absorbing Markov chain with four states, where s4 is an absorbing state and (b) the matrices Q and R associated with this chain.

Close modal

Since any transient state can reach an absorbing state in a finite number of steps, the probability that the chain ends up in an absorbing state at some future time is 1. For this reason, in the infinite time limit, we can expect to see no transitions taking place between transient states, that is, limnQn=0. This is an advantageous property, since it means that if we sum up all powers of Q, known as the Neumann series of Q, then the contributions for larger powers get progressively smaller and the sum converges to (1-Q)-1 (see Meyer, 2000, p. 618). Calculating this sum for Q leads to the following useful quantity, which relates transient states in S (Porod, 2021):

Definition 17
(Fundamental Matrix). For any absorbing Markov chain, the Neumann series of matrix Q is given by
N:=k=0Qk
(2.86)
=1+Q+Q2+Q3+...
(2.87)
=(1-Q)-1
(2.88)
and is known as the fundamental matrix of the Markov chain. The elements of this matrix Nij give the expected number of times the chain visits a transient state sj before absorption, given that the chain started in a transient state si.

When analyzing an absorbing chain, it is very handy to have access to the fundamental matrix. By taking into account all nonnegative powers of Q, it contains information about all possible paths available between pairs of transient states. Because of this, it is a useful predictive tool that allows several properties of the Markov chain to be deduced (Porod, 2021). Furthermore, in the field of reinforcement learning, it is closely related to the successor representation (Dayan, 1993) (see section 5.2).

2.8  Summary

This concludes our exploration of different types of Markov chains. In Figure 8, we provide a summary of the material presented in this section in the form of a Venn diagram. In this diagram, each type of Markov chain is drawn as a circle or ellipse, with defining properties and results listed in each case. Take a moment to look at this image, and pay attention to the overlapping regions, which indicate how different types of chains are related. For a more in-depth presentation of the material in this section, we recommend Porod (2021) and Brémaud (1999).

Figure 8:

A Venn diagram composed of the different types of Markov chains introduced in this section.

Figure 8:

A Venn diagram composed of the different types of Markov chains introduced in this section.

Close modal

In the next section, we introduce graphs as an alternative way to describe Markov chains and summarize insights that emerge from this description. Then, in section 4, we explore the connection between graphs and Markov chains in more depth using the notion of random walks, which allows various relationships to be made between specific types of graphs and some types of Markov chains introduced in this section.

So far, we have implicitly been interpreting Markov chains as graphs whenever we draw a transition graph. In this section, we formally introduce the concept of graphs, which provides a foundation to the material on random walks in section 4. Readers should note that definitions in graph theory often vary among different sources. Here we use a convention that can encompass a wider variety of graphs, thereby offering greater generality.

3.1  Definition

A graph G = (V, E) is a set of N vertices V = {v1, v2, . . . , vN} together with an edge set E containing pairs of vertices in V. Conceptually, V might represent a collection of objects and E a specification of how some pairs in this collection are related to one another. A natural way to categorize graphs is based on the way in which edges are defined. For instance, in an undirected graph, each edge has no direction and is typically denoted as (vi, vj) ∈ E, whereas in a directed graph, each edge has a specified starting and ending vertex and is usually denoted as (vivj) ∈ E. Examples of undirected and directed graphs can be seen in the first and second rows of Figure 9, respectively. Unless otherwise stated, we depict undirected edges as straight lines and directed edges as curved lines with arrowheads indicating the direction. A second distinction we can make is between unweighted graphs, in which one only cares about whether two vertices are related or not, and weighted graphs, in which each edge has a positive weight wij describing the strength of the relationship.3 In Figure 9, the examples in the first column are unweighted and all other graphs are weighted, with weights indicated by numbers next to each edge. The type of edges that a graph has is often chosen based on the type of relationship that one wants to describe. For example, assume that we have a graph G where vertices represent PhD students. Then, if we want to represent the relationship of being in the same research group, undirected, unweighted edges are a natural choice (such as Figure 9a). Conversely, if we want edges to describe whether one student has participated on a main project of another student, then this clearly requires directed unweighted edges (such as Figure 9d). If we now consider variants of the first and second examples, instead focusing on how similar the research topics of two students are or how much work one student has contributed to another student’s project, then we now need undirected weighted and directed weighted edges, respectively (such as Figures 9b and 9c and Figures 9e and 9f). It is worth noting that in order to assign weights to edges, one needs to specify a scale on which to measure the strength of relationship between vertices.

Figure 9:

Six examples of graphs and their corresponding matrices. (a–c) Undirected. (d–f) Directed. (a, d) Unweighted. (b,c,e,f) Weighted.

Figure 9:

Six examples of graphs and their corresponding matrices. (a–c) Undirected. (d–f) Directed. (a, d) Unweighted. (b,c,e,f) Weighted.

Close modal

One can also describe graphs based on their connectivity. In an undirected graph, if there exists a path between each pair of vertices, then the graph is said to be connected; otherwise it is disconnected. The notion of connectivity can generalize to disconnected graphs if we instead consider subsets of vertices in G, which are known as subgraphs. Any subgraph that is connected but is not part of any larger connected subgraph is called a connected component. Both of the undirected graphs in Figures 9a and 9b are connected, whereas Figure 9c shows an example that is disconnected, with two connected components. In particular, this latter example even has a vertex that does not have any edges at all; it is known as an isolated vertex. For a directed graph, if there are directed paths running from vi to vj and from vj to vi for all pairs of vertices vi, vjV, then the graph is said to be strongly connected. Alternatively, a directed graph is weakly connected if for all pairs of vertices vi, vjV, it is possible to get from vi to vj and from vj to vi by any path, ignoring the direction of the edges. Clearly, a directed graph is weakly connected if it is strongly connected, but not vice versa. Furthermore, strongly or weakly connected subgraphs that are not part of any larger such subgraphs are referred to as strongly or weakly connected components, respectively. The directed graphs in Figures 9d and 9e are strongly connected, whereas the one in Figure 9f is only weakly connected and has two strongly connected components (take a moment to verify this).

3.2  Matrix Representation

A natural way to numerically represent graphs with |V| = N vertices is with an N × N matrix. In the unweighted case, this matrix is a binary matrix A, with entries
Aij:=1if(vivj)E,0otherwise
(3.1)
where (vivj) = (vi, vj) when G is undirected and (vivj) = (vivj) when it is directed. Furthermore, the matrix A is usually referred to as the adjacency matrix of G. This extends easily to the weighted case, where instead we have a nonnegative matrix W, with entries
Wij:=wijif(vivj)E,0otherwise
(3.2)
which is often referred to as the weight matrix of G. In Figure 9, the relevant matrix is shown below each graph, and we encourage readers to verify that they match in each case. Furthermore, something worth pointing out is that undirected graphs always have corresponding matrices that are symmetric, as can be seen in Figures 9a9c.

For the remainder of this tutorial, we assume that the graphs we deal with are both weighted and directed. The reason we choose this convention is that it is more general. On the one hand, any unweighted graph can be considered as a special case of a weighted graph where the weights are all set to 1. Thus, we henceforth only talk about weight matrices as opposed to adjacency matrices when describing graphs numerically. On the other hand, there is one sense in which directed graphs can be thought of as a generalization of undirected graphs. If vi and vj are two distinct vertices that share an edge in an undirected graph, then the weight of this edge is guaranteed to appear twice in the weight matrix W by virtue of being symmetric. If instead these vertices belong to a directed graph and there is an edge (vivj), then this edge appears only once in W. Therefore, it is possible to interpret an undirected edge between vi and vj as being equivalent to a pair of directed edges of the same weight, with one connecting vi to vj and the other connecting vj to vi. This is the interpretation we use throughout the rest of the tutorial whenever we refer to undirected graphs. As an example, in Figure 10, we show two equivalent depictions of an undirected graph, with the top image drawn in the usual way and the bottom image drawn using pairs of directed edges. Below these two drawings, the weight matrix of this graph is shown. One must note that interpreting undirected graphs in this way is somewhat atypical; however, it allows us a greater level of generality when dealing with different types of graphs in section 4. Furthermore, this interpretation only applies to edges between distinct vertices; edges that connect vertices to themselves are discussed in section 3.4.

Figure 10:

An undirected graph represented with both undirected edges (top) and pairs of directed edges (middle), as well as its weight matrix (bottom).

Figure 10:

An undirected graph represented with both undirected edges (top) and pairs of directed edges (middle), as well as its weight matrix (bottom).

Close modal

3.3  Vertex Degrees

Once the weight matrix of a graph is known, it is easy to calculate the total weight coming in and out of each vertex. The total incoming weight of a vertex vi can be found by summing over the ith column of W and is known as the in-degree of vi: di-:=j=1NWji. Conversely, the total outgoing weight of vi is calculated by the sum over the ith row of W, di+:=j=1NWij, and is known as the out-degree of vi. Since undirected graphs always have bidirectional edges and symmetric weight matrices, the in- and out-degrees of such graphs are always equal and are simply referred to as vertex degrees, denoted by di.4 For example, in the graph of Figure 10 the degrees are d1 = 3, d2 = 4, and d3 = 1. In the more general case of directed graphs, there is no guarantee that di+=di-. However, summing over all in- or out- degrees for any graph always produces the same number, vol(G)=i=1Ndi+=i=1Ndi-=i=1Nj=1NWij, which is sometimes referred to as the volume of G. As an example, consider the directed graph in Figure 9e: each vertex has different in- and out-degrees, i.e. d1-=1, d1+=0.5, d2-=2.5, d2+=3, d3-=0.8, d3+=1, d4-=2, and d4+=1.8, but summing over either of the degree types yields vol(G) = 6.3. Nonetheless, some directed graphs can have di+=di- for each vertex, and such cases are known as balanced graphs (Banderier & Dobrow, 2000; Aldous & Fill, 2002). In keeping with the notation of undirected graphs, we denote the vertex degrees of a balanced graph as di=di+=di-. An example of a balanced graph along with its corresponding weight matrix is shown in Figure 11, and a quick check reveals that summing over the rows or columns of W indeed yields the same values. Just as a balanced graph is a special case of a directed graph, we can similarly say that an undirected graph is a special case of a balanced graph, and this interpretation is important in section 4.

Figure 11:

A balanced graph (top) and its weight matrix (bottom).

Figure 11:

A balanced graph (top) and its weight matrix (bottom).

Close modal

3.4  Self-Loops

In the examples considered so far, all edges connect pairs of vertices that are distinct, that is, vivj. While this is sometimes enforced as a rule, some conventions also allow edges to connect vertices to themselves, which are known as self-loops. For undirected graphs, the standard convention is that a self-loop at vertex vi counts doubly to the vertex degree di, while other edges only count singly, that is, di = 2 × Wii + ∑jiWij. This somewhat counterintuitive property is typically demonstrated using the degree sum formula (West, 2001). For undirected graphs, this states that each edge contributes twice its weight to the volume. Since self-loops only involve a single vertex, the only way that this rule can be respected is if they count twice as much to the vertex degrees as other edges. A property that we require when dealing with undirected graphs in section 4 is that the vertex degrees are calculated by the row sums of W. Clearly, this property is violated by the factor of 2 that applies to undirected self-loops. As a result, in this tutorial we assume that self-loops are always directed, regardless of whether they occur in undirected or directed graphs. This is an atypical definition, since undirected graphs typically are not allowed to have directed edges. However, as can be seen from the examples in Figure 12, this preserves the fact that undirected graphs have symmetric weight matrices, whereas directed graphs have nonsymmetric weight matrices, which is sufficient for the scope of this tutorial.

Figure 12:

(a) An undirected graph containing self-loops and its weight matrix (bottom). (b) A directed graph containing self-loops (top) and its weight matrix (bottom).

Figure 12:

(a) An undirected graph containing self-loops and its weight matrix (bottom). (b) A directed graph containing self-loops (top) and its weight matrix (bottom).

Close modal

We close this section by noting some similarities between our definitions of Markov chains and graphs. First, the transition matrices of Markov chains, like the weight matrices of graphs, are nonnegative. Second, in a directed graph, any entry Wij ≠ 0 of the weight matrix describes an outgoing edge from vertex vi to vj, and analogously, any entry Pij ≠ 0 of a transition matrix describes an outgoing transition probability from si to sj. Putting these together, we see that in the most general sense, any Markov chain can be thought of as a directed graph, with P being the associated weight matrix. Indeed, this interpretation is precisely what justifies us in visualizing a Markov chain by its transition graph. In the next section, we present some useful results that emerge as a result of this way of thinking about a Markov chain. Finally, for a comprehensive text on graph theory that covers much of the material in this section, we recommend West (2001).

3.5  Eigenspaces of Transition Matrices

Nonnegative matrices have received widespread attention in mathematics, and in particular their eigenvalues and eigenvectors are the focus of spectral graph theory (Chung, 1997). In this section, we apply some results from this field to transition matrices, considering first irreducible chains and subsequently exploring the generalization to reducible chains.

3.5.1  Irreducible Chains

A fundamental result used in spectral graph theory is the Perron-Frobenius theorem, and while a full treatment of it is beyond the scope of this tutorial, we now summarize its key implications for transition matrices of irreducible Markov chains.

Theorem 6

(Perron Frobenius Theorem for Irreducible Markov Chains). If P is the transition matrix of an irreducible Markov chain, then:

  • λ = 1 is guaranteed to be an eigenvalue.

  • λ = 1 is a simple eigenvalue, meaning that it occurs only once.

  • Upon suitable normalization, the eigenvalue λ = 1 has a left eigenvector equal to the unique stationary distribution π=(π1,π2,...,πN)T and a right eigenvector equal to η=(1,1,...,1)T.

  • All other eigenvalues have |λ| ≤ 1, where |·| is the complex modulus, meaning that the spectral radius of P is 1.5

To illustrate the above theorem, in Figures 13a to 13c, we show the transition graphs and eigenvalue plots of three irreducible Markov chains. The first observation to make is that, in agreement with theorem 6, λ = 1 is an eigenvalue in each case and occurs only once. Furthermore, as a quick exercise, we encourage readers to find the eigenvectors of λ = 1 for each example and normalize them to obtain π and η. Finally, the eigenvalue plots show that in each example, all eigenvalues indeed lie either on the unit circle (|λ| = 1) or within it (|λ| < 1).

Figure 13:

Four Markov chains, together with plots showing the eigenvalues of their transition matrices.

Figure 13:

Four Markov chains, together with plots showing the eigenvalues of their transition matrices.

Close modal
Other than λ = 1, theorem 6 implies that the following two types of eigenvalues are also possible: those that lie within the unit circle (i.e., |λ| < 1, transient) and those that lie on other points of the unit circle (i.e., |λ| = 1, persistent). In either case, equation 2.45 tells us that if lω is a corresponding left eigenvector, then it must be orthogonal to the unique right eigenvector with λ = 1, which is η. Therefore,
lωTη=i=1Nlω,i=0,
(3.3)
where lω, i denotes the ith component of lω. Consequently, left eigenvectors with λ ≠ 1 sum to zero, meaning that unlike stationary distributions they are not probability vectors.

Using our terminology from section 2.3, eigenvalues within the unit circle represent transient structures, transient oscillations, and transient cycles. Of the irreducible chains in Figure 13, only panels a and b have eigenvalues of this type, and in both cases, they are complex conjugate pairs describing transient cycles. Looking at the transition probabilities in each example, it is clear that these transient cycles flow clockwise around the state space.

On the other hand, eigenvalues on the unit circle other than λ=1 represent persistent oscillations and persistent cycles. Theorem 2 tells us that these are possible only when a chain is periodic. The following result sheds light on this by relating the eigenvalues with |λ| = 1 to the period of a chain (Gebali, 2008):

Proposition 6.
If P is the transition matrix of an irreducible Markov chain with period d, then there are d distinct eigenvalues with modulus 1, given by
λ1=1,λ2=φ,λ3=φ2,...,λd=φd-1,
(3.4)
where φ=e2πid.

In simple terms, proposition 6 says that the eigenvalues of P with modulus 1 are always dth roots of unity. We can verify this by checking the periodic examples in Figures 13b and 13c. In both cases, the number of eigenvalues on the unit circle is indeed equal to the period of the chain and they are also equally spaced. Furthermore, proposition 6 offers an alternative perspective on how the periodicity affects the persistent behavior of a Markov chain. For example, the chain in Figure 13a has only a single eigenvalue on the unit circle, corresponding to its unique stationary distribution. It is therefore guaranteed to end up in this distribution since all other eigenvalues have |λ| < 1. This is equivalent to the statement that this chain is ergodic, which a quick check of the transition graph confirms. Conversely, the chain in Figure 13b has an additional eigenvalue λ = eπi = −1 on the unit circle by virtue of the fact that it has period 2. Therefore, its persistent behavior can only be fully described using both the unique stationary distribution π and the eigenvector y associated with λ = −1. For example, we know from theorem 2 that such a chain can get trapped in a persistent oscillation (i.e., μ1μ2μ1μ2...). For any such oscillation, μ1 and μ2 can always be expressed as a linear combination of π and y, meaning that this sequence indeed oscillates between two points in the space spanned by these eigenvectors. While this example only involves real eigenvalues and therefore only real eigenvectors, the interpretation extends to d > 2, for which proposition 6 tells us that there must be complex eigenvalues with |λ| = 1. For example, the chain in Figure 13c has period d = 3, and it has the following three eigenvalues on the unit circle: λ1 = 1, λ2=e2πi3, and λ3=e4πi3. Analogous to the d = 2 case, any persistent cycle of this chain can be expressed using the three corresponding eigenvectors, which in the case of λ2 and λ3 must have complex entries. Rather interestingly, this means that for chains with period d > 2, persistent cycles are cycles in a complex space despite being sequences of real valued distributions.

3.5.2  Reducible Chains

Applying spectral graph theory to the transition matrices of reducible Markov chains produces a weaker set of results. For example, the generalization of theorem 6 to the reducible chains is the following:

Theorem 7

(Perron Frobenius Theorem for Reducible Markov Chains). If P is the transition matrix of a reducible Markov chain, then:

  • λ = 1 is guaranteed to be an eigenvalue.

  • The number of linearly independent eigenvectors with λ = 1 is equal to the number r of recurrent communicating classes in the Markov chain.

  • There are many choices of left and right eigenvectors for λ = 1. However, a convenient choice that mirrors the irreducible case is to choose πk and ηk as a pair of left and right eigenvectors for each of the recurrent communicating classes, with πk being the unique stationary distribution associated with each class and ηk being an indicator vector with entry 1 for states in this class and zeros elsewhere.

  • All other eigenvalues have |λ| ≤ 1, where |·| is the complex modulus, meaning that the spectral radius of P is 1.

To understand theorem 7 in more depth, consider the example in Figure 13d. This Markov chain has two recurrent communicating classes, which means that λ = 1 has a multiplicity of 2. We indicate this on the eigenvalue plot by a larger size circle. Furthermore, we color this circle half red and half blue to reflect the fact that we can choose the eigenvectors for λ = 1 based on the two recurrent communicating classes, for example, π1=(0.58,0.42,0,0)T and η1=(1,1,0,0)T as a pair of left and right eigenvectors for the red class and π2=(0,0,0.43,0.57)T and η2=(0,0,1,1)T as a pair of left and right eigenvectors for the blue class. Then, π1 and π2 together span the left eigenspaces of λ = 1 (including all possible stationary distributions), whereas η1 and η2 span the right eigenspaces of λ = 1. It is worth emphasizing that while there is an infinite number of other ways to choose the basis vectors for λ = 1, this is a convenient choice since it is the only one for which all basis vectors have strictly nonnegative entries. For example, consider the choice of π˜1=13π1+23π2=(0.19,0.14,0.29,0.38)T and η˜1=η=(1,1,1,1)T as a first pair of eigenvectors. If we then choose a second pair π˜2 and η˜2 that satisfy biorthogonality (see equation 2.46) within this space, they are guaranteed to contain negative entries, for example, π˜2=π1-π2=(0.58,0.42,-0.43,-0.57)T and η˜2=3η1-32η2=(3,3,-32,-32)T. Thus, the choice of eigenvectors stated in theorem 7 is in some sense special since it is the only one that preserves our intuition that left eigenvectors with λ = 1 correspond to distributions over the state space S. For this reason, we henceforth assume this convention when referring to eigenvectors with λ = 1.

Unfortunately, there is no general analogue of equation 3.3 for reducible chains. One reason for this is that like the λ = 1 case, there are many choices of eigenvectors for λ ≠ 1. However, for a recurrent reducible chain, the transition matrix can be written in block diagonal form (see the proof of proposition 4), which means we can mirror the λ = 1 case by choosing eigenvectors with λ ≠ 1 to have nonzero entries only in a single recurrent class. If the kth recurrent class has n states, then the corresponding block is an n × n matrix, meaning that there are n pairs of left and right eigenvectors with nonzero entries for states in this class—one pair with λ = 1 (πk and ηk) and another n − 1 pairs with λ ≠ 1. We can therefore apply an equivalent argument to equation 3.3 for the kth class, but instead using the vector ηk. Thus, for each recurrent class, the left eigenvectors with λ ≠ 1 can also be chosen such that they all sum to zero. Conversely, nonrecurrent chains cannot be written in block diagonal form, which means that this argument cannot be applied. Therefore, some eigenvectors will not sum to zero for such chains, although to our knowledge, this case has not received significant attention so far in the literature.

Looking at the eigenvalue plot of Figure 13d, we see that there are two eigenvalues with |λ| < 1, both of which are real and negative. Using the terminology from section 2.3, they therefore correspond to transient oscillations of the chain. Furthermore, since the chain is recurrent, we can apply the procedure described above and choose one eigenvector to have nonzero entries only in the red class and the other eigenvector to have nonzero entries only in the blue class. With this choice, we see that each transient oscillation takes place on a distinct communicating class, which we indicate on the plot by coloring the eigenvalues red and blue.

In the case of proposition 6, the extension to reducible chains is straightforward since one can simply apply this result individually to each recurrent communicating class of a reducible chain. Therefore, for each class of period d, there are d eigenvalues of modulus 1 that satisfy the same properties as in the irreducible case.

Finally, a couple of similarities between theorems 7 and 6 can be pointed out. First, in both theorems, λ = 1 is guaranteed to be an eigenvalue. Since every eigenvalue has at least one eigenvector, this means that we can always find a left eigenvector with λ = 1. Provided that we choose an eigenvector with nonnegative entries and normalize it to one, it is a stationary distribution of the chain. This therefore justifies our claim from section 2.2 that every finite Markov chain has at least one stationary distribution. Second, in both theorems, the eigenvalues cannot have absolute value greater than one, which is one of the assumptions we made when studying the evolution of a chain in terms of its eigenvectors and eigenvalues in section 2.3, and which justified our partitioning of equation 2.54 into persistent and transient terms.

The results of this section emerge by treating Markov chains as graphs. However, in most graphs, the outgoing edges from each vertex do not sum up to one, meaning that they cannot be interpreted as transition probabilities. Because of this, Markov chains can be more precisely interpreted as a type of normalized graph. This idea is formalized in the next section, where we introduce a well-known method for transforming any graph G into a Markov chain.

4.1  Definition

Suppose we have a graph G with weight matrix W that we want to normalize such that the outgoing weights from each vertex vi sum up to 1.6 The most obvious way to do this is to divide all entries in the ith row of W by di+. By scaling each edge weight Wij by the out-degree of the starting vertex vi, we obtain the transition probabilities Pij=Wijdi+. In order to write this in matrix notation, we first define a degree matrixD whose elements are given by
Dij:=di+ifi=j0otherwise,
(4.1)
Since di+>0, we can invert D to produce a diagonal matrix with the entries 1di+ on the diagonals. Using this inverse, W can be row-normalized by multiplying with D-1 from the left:
P:=D-1W.
(4.2)
The Markov chain represented by equation 4.2 is called the random walk on G (Göbel & Jagers, 1974; Lovász, 1993; Vempala, 2005). This is a fitting name, since if we imagine an agent walking between vertices in G and randomly choosing where to go at each time step based on the weights of outgoing edges, then equation 4.2 would describe the resulting Markov chain.

Qualitatively, we can say that the transformation in equation 4.2 is useful when we have a starting graph G that we would like to describe in probabilistic and/or temporal terms. Conversely, if we have a starting Markov chain and transition matrix P, knowing some matrix W for which equation 4.2 holds can offer insight into the type of relationships between states that give rise to the chain. However, the latter of these two perspectives is partially complicated by the fact that the mapping from P to W is one-to-many, and there are in fact an infinite number of different graphs G that produce the same Markov chain. As an example, in Figure 14, we show two distinct weight matrices W1 and W2 that get transformed to the same transition matrix P using equation 4.2. How can we describe the infinite set of graphs corresponding to a single Markov chain? In principle, it involves undoing the row normalization of equation 4.2. Thus, for a given Markov chain X with transition matrix P, we consider all possible scalings of the rows of P by positive constants. Every such scaling can be described by a diagonal matrix AR>0N×N that multiplies P from the left to produce a single corresponding weight matrix, W=AP. As an illustration, in Figures 14d and 14e, we show the two scaling matrices that undo the row normalization of the transition matrix in Figure 14c and transform it back into the weight matrices W1 and W2, respectively. The following definition generalizes this to the set of all such weight matrices that can be realized in this way:

Figure 14:

(a,b) The weight matrices of two graphs G1 and G2 that have edges connecting the same pairs of states but with different weights, (c) the transition matrix of the Markov chain produced via a random walk on either of these graphs, and (d,e) the scaling matrices that transform P into W1 and W2, respectively.

Figure 14:

(a,b) The weight matrices of two graphs G1 and G2 that have edges connecting the same pairs of states but with different weights, (c) the transition matrix of the Markov chain produced via a random walk on either of these graphs, and (d,e) the scaling matrices that transform P into W1 and W2, respectively.

Close modal
Definition 18.
For a Markov chain X defined on a state space with |S|=N states and with a transition matrix P, the following defines the set of all graphs from which the chain can be generated via a random walk:
RW(X)={G:W=AP, where AR>0N×N is a positive diagonal matrix }
(4.3)

which we call the random walk set of X.

A few details are worth noting about definition 18. First, since the trivial scaling A=1 is allowed, PRW(X) for any Markov chain. Second, the fact that AR>0N×N means that each of these matrices is invertible, with A-1 also being diagonal and having entries equal to the reciprocals of the diagonals of A. Consequently, if W1=A1P and W2=A2P are two weight matrices in the random walk set of a given Markov chain, we can always write W1=A1A2-1W2=A3W2, meaning that W1 and W2 are also related simply by a row scaling with positive constants, with A3=A1A2-1 describing this scaling. Therefore, the row scaling defined in equation 4.3 effectively partitions the set of all nonnegative matrices into equivalence classes. Finally, since definition 18 allows any nonzero scaling of the rows of P, the random walk set of any Markov chain predominantly consists of graphs that are neither undirected nor balanced. In fact, only certain types of Markov chains have random walk sets that contain undirected or balanced graphs, as explained by the following two results:

Theorem 8.

A Markov chain X is recurrent if and only if RW(X) contains balanced graphs (for the proof, see the appendix).

Theorem 9.

A recurrent Markov chain X is reversible if and only if the balanced graphs in RW(X) are undirected (for the proof, see the appendix).

As an illustration, in Figures 15a15c, we show the random walk sets for the three Markov chains that were studied in Figures 6a, 6d, and 6j of section 2.6.7 In each case, the Markov chain is located in the center and is colored in black. Other graphs in the random walk sets are colored based on the graph type (undirected = green, balanced directed = blue, unbalanced = red), and are depicted as miniature graphs without edge weights (except for one representative example of each type). The Markov chains of Figures 15a and 15b are both recurrent, and we indeed see that their random walk sets contain balanced graphs (see theorem 9). In both figures, notice that more unbalanced graphs are drawn to reflect the fact that they are more numerous than the balanced cases. Furthermore, the chain in Figure 15a is reversible, whereas the chain in Figure 15b is nonreversible, meaning that in the former case, all balanced graphs in RW(X) are undirected, and in the latter case, they are directed (see theorem 9). The Markov chain in Figure 15c is nonrecurrent, and in agreement with theorem 8, it contains only unbalanced graphs. Finally, note that we do not include a corresponding diagram for the semireversible example in Figure 6g. However, since such chains can be made reversible by removing nonrecurrent states, a simple extension of theorem 9 ensures that for such chains, there exist graphs in RW(X) for which the edges between recurrent states are undirected.

Figure 15:

(a–c) The random walk sets of the three Markov chains in Figures 6a, 6d, and 6j, including undirected (green), balanced directed (blue), and unbalanced (red) graphs, as well as the chains themselves (black), which are reversible (a), recurrent and nonreversible (b), and nonrecurrent (c). Venn diagrams in panels d and e illustrate the relationships between the three types of graph, and the three types of Markov chain, respectively. In panel d the colors indicate the type of graph, whereas in panel e, they indicate the types of graphs that exist in the random walk sets of each type of Markov chain. For both Venn diagrams, the color scheme is in accordance with panels a to c.

Figure 15:

(a–c) The random walk sets of the three Markov chains in Figures 6a, 6d, and 6j, including undirected (green), balanced directed (blue), and unbalanced (red) graphs, as well as the chains themselves (black), which are reversible (a), recurrent and nonreversible (b), and nonrecurrent (c). Venn diagrams in panels d and e illustrate the relationships between the three types of graph, and the three types of Markov chain, respectively. In panel d the colors indicate the type of graph, whereas in panel e, they indicate the types of graphs that exist in the random walk sets of each type of Markov chain. For both Venn diagrams, the color scheme is in accordance with panels a to c.

Close modal
Figure 15:

Continued.

To summarize these observations, in Figures 15d and 15e we show two Venn diagrams that illustrate the relationships between the different types of Markov chains and graphs considered. In Figure 15d, graphs are shown as an outer circle, with balanced graphs as a particular case, and undirected graphs as a special type of balanced graph, graphs ⊃ balanced graphs ⊃ undirected graphs (colored red, blue and green, respectively). In Figure 15e, Markov chains are organized in a similar way, Markov chains ⊃ recurrent chains ⊃ reversible chains. Moreover, the colors in the Markov chain diagram are based on the types of graphs allowed in RW(X) for each type of chain. For example, reversible chains are shaded in red and green since they correspond to random walks on either undirected or unbalanced graphs.

The balanced graphs belonging to a recurrent Markov chain’s random walk set are in some sense special, since the vertex degrees have a simple relationship to the stationary probabilities:

Proposition 7.
Let X be a recurrent Markov chain and G one of the balanced graphs in RW(X). Then, the degrees of this graph are related to one of the stationary distributions π>0 of X, via
diz=πi,
(4.4)
where z = ∑idi is the volume of G (for the proof, see the appendix).

For example, evaluating diz for v1 in the balanced graphs in Figures 15a and 15b yields 20480.417 and 2907140.406, respectively, which are indeed the stationary probabilities of state s1 for each of the corresponding Markov chains (see Figures 6b and 6e). This highlights something useful about balanced graphs, which is that the weight matrix W allows direct calculation of one of the stationary distributions without having to simulate the random walk. Conversely, for an unbalanced graph, there is no universally valid expression relating stationary probabilities of the random walk to the vertex degrees.

Perhaps the most important conclusion to draw from this section is that one can always describe a reversible Markov chain as a random walk on some undirected graph. Since undirected graphs have symmetric weight matrices and since matrices of this type have received a large amount of study in mathematics, this interpretation provides a number of tools for describing reversible chains in more detail. This is the focus of the next section. Directed graphs are as of yet far less understood, meaning that the same level of description for nonreversible chains is not possible. However, in section 4.3, we explore some cases where concepts can be extended to the directed/non-reversible case. Since balanced directed graphs are less common objects in graph theory, we do not dedicate a section to them and instead consider them briefly as a special case in section 4.3.

4.2  Random Walks on Undirected Graphs

4.2.1  Relationship to Symmetric Matrices

In this section, we explore in more depth the connections between real symmetric matrices and the transition matrices of reversible chains. We start by providing the following two results for real symmetric matrices (Meyer, 2000):

Theorem 10.
A real matrix ARn×n is symmetric if and only if
Ax,x'=x,Ax'x,x'Rn,
(4.5)
where 〈 ·, ·〉 denotes the standard Euclidean inner product.
Theorem 11.
A real matrix ARn×n is symmetric if and only if it is orthogonally diagonalizable, meaning that there exists an orthogonal matrix Y for which
Δ=Y-1AY
(4.6)
=YTAY,
(4.7)
where Δ is a diagonal matrix.

A couple of details can be pointed out about theorem 11. First, by comparing equation 4.6 to our analysis of section 2.3, we see that the columns of Y are a basis of right eigenvectors of A and the rows of YT are the corresponding dual basis of left eigenvectors. Second, both of these bases are orthonormal since Y is orthogonal. Third, the matrix Δ contains the eigenvalues of A, which are guaranteed to be real since all other matrices in equations 4.6 and 4.7 are real. Finally, it is worth emphasizing the existential condition of theorem 11, since not all choices of eigenvectors of a symmetric matrix obey this result. On one hand, equation 4.6 requires that the sets of left and right eigenvectors are chosen together to be a biorthogonal system. On the other hand, even if we assume this property, when a symmetric matrix has repeated eigenvalues, the corresponding eigenvectors can be nonorthogonal. However, even in this case, it is always possible to apply the Gramm-Schmidt procedure to make them orthonormal, and provided that we find such a basis, we can relate the left and right eigenvectors in the following way:

Proposition 8.

Let A be a real symmetric matrix. Then, if Y is an orthonormal basis of right eigenvectors of A and YT is the corresponding dual basis, for each eigenvalue λω the left and right eigenvectors are equal: lω=rω.

Clearly, for any Markov chain, reversible or otherwise, the transition matrix is itself rarely symmetric. Therefore, the results above do not directly apply to P. However, reversible Markov chains can be related in a number of ways to symmetric matrices, and by virtue of these relations, they satisfy variants of the results above (Brémaud, 1999). For example, from section 2.6, we know that any flow matrix of a reversible Markov is symmetric. This fact allows us to establish the following analogue of theorem 10:

Theorem 12.
Let X be a Markov chain with transition matrix P. Then X is reversible if and only if there exists a stationary distribution π>0, such that
x,Px'Π=Px,x'Πx,x'Rn,
(4.8)
where
a,bΠ:=aTΠb=iπiaibi
(4.9)
defines an inner product normalized by the stationary probabilities πi (for a proof, see the appendix).

Furthermore, theorem 9 tells us that there exist certain row scalings that transform the transition matrix P of a reversible Markov chain into symmetric matrices. In fact, this is not the only scaling operation that transforms P into a matrix of this type. The next result shows that such a matrix is also formed when we scale both the rows and columns of P as follows:

Theorem 13.
Let X be a Markov chain with transition matrix P. Then X is reversible if and only if for any stationary distribution π>0,
K:=Π12PΠ-12
(4.10)
is a symmetric matrix. Furthermore, regardless of which stationary distribution is used, the matrix K is unique (for the proof, see the appendix).

The first thing to note about theorem 13 is that the output of the scaling operation is somewhat different from that in theorem 9, since in the former case, there is only a single symmetric matrix, whereas in the latter there are an infinite number. Additionally, equation 4.10 implies that P and K are similar matrices (section 2.3). Since similar matrices have the same eigenvalues and related sets of eigenvectors, we can use this to establish the following generalizations of theorem 11 and proposition 8:

Theorem 14.

Let X be a reversible Markov chain with transition matrix P. Then X is reversible if and only if it is diagonalizable with real eigenvalues, and there exists a basis of right eigenvectors that are orthogonal with regard to ·,·Π and a corresponding dual basis of left eigenvectors that are orthogonal with regard to ·,·Π-1, where π>0 is a stationary distribution of the chain (for the proof, see the appendix).

Proposition 9.

Let X be a reversible Markov chain with transition matrix P and π>0 one of its stationary distributions. Furthermore, let YR be a set of right eigenvectors of P and YL its dual basis, both of which obey the orthogonality relations of theorem 14. Then, if rω and lω are right and left eigenvectors of the same eigenvalue λω, they are related via lω=Πrω (for the proof, see the appendix).

Theorem 14 is particularly important for practical reasons. First, the diagonalizability of P implies a full set of linearly independent eigenvectors. Linearly independent feature spaces are often desirable from a computational perspective since they (1) reduce the overall redundancy, (2) can express any function in RN, and (3) ensure that certain matrix operations are well defined. Furthermore, as already explained in section 2, having a diagonalizable transition matrix means that evolving a Markov chain becomes computationally cheaper. Second, since P is a real matrix with real eigenvalues, we can always choose an eigenbasis consisting only of real valued vectors. This property is useful because real spaces are often more intuitive to deal with than complex spaces. Furthermore, in many applications of Markov chains, the underlying vector space is required to be real, either due to the semantic nature of the problem or because the algorithm being used is not suited to complex spaces.

While a linearly independent set of eigenvectors is useful, having a basis that is pairwise orthogonal with regard to the standard Euclidean product offers further analytical and numerical benefits. From theorem 14, it is clear that this is not the case for transition matrices of reversible Markov chains. However, the matrix K is a normalized symmetric version of P, and from the proof of theorem 14, we know that the eigenvalues of these two matrices are the same and their eigenvectors are related simply by a multiplication of Π±12. Thus, they contain similar information regarding the relationship between pairs of states in S, and so K can be used as a surrogate for P in situations where orthogonal eigenvectors are required.

When using the matrix K, it is sometimes useful to express it purely in terms related to a graph G. Starting with equation 4.10, this can be done by swapping stationary probabilities for degree vertices:
K=(4.10,4.4)D12z-12Pz12D-12
(4.11)
=D12PD-12
(4.12)
=(4.2)D-12WD-12.
(4.13)

This expression appears in the next section, where we use K to define a positive semidefinite matrix that has the same eigenvectors.

Collectively, the results of this section demonstrate that transition matrices of reversible chains satisfy similar properties to symmetric matrices, but subject to a different type of normalization. This alternative normalization is important for the next section, and in order to make our analyses there more concise, we end this section by defining the following two coordinate transformations (Liu, 2011):

Definition 19
(Left and Right Coordinate Transformations). Let G be an undirected graph with N vertices and degree matrix D. If xRN is a vector defined over these vertices, we define the following related vectors:
xL=D12x,
(4.14)
xR=D-12x,
(4.15)
which we call the left and right coordinate transformations of x, respectively.

4.2.2  Normalized Graph Laplacian

A symmetric matrix ARN×N is positive semidefinite if xTAx0xRN, or equivalently if all its eigenvalues are nonnegative. Such matrices have a number of numerical properties that make them useful for solving optimization problems, and because of this, they often appear in computational applications. The matrix K is not positive-semidefinite, since it has the same eigenvalues as P (which can be negative). However, it is straightforward to define a variant of K that does have this property, while also having the same eigenvectors as K:

Definition 20.
For an undirected graph G with weight matrix W, the normalized graph Laplacian is given by
L:=1-K
(4.16)
=(4.12)1-D12PD-12
(4.17)
=(4.13)1-D-12WD-12
(4.18)
with elements equal to
Lij=1-Wiidiifi=j-Wijdidjifijand(vi,vj)E0otherwise
(4.19)
and for which the following properties hold:
  • L is symmetric and positive semidefinite, with eigenvalues in the interval [0, 2].

  • λ = 0 is guaranteed to be an eigenvalue, and its multiplicity is equal to the number of connected components in G.

  • If C is a connected component of G, then there is an eigenvector yC with eigenvalue λ = 0 whose entries are di12 for viC and 0 otherwise, or, equivalently, yC=D121C=1C,L, where 1C is an indicator vector for vertices in C.

  • Therefore, for fully connected graphs, y=D121=1L is the unique eigenvector with λ = 0, where 1 is a vector of ones.

There exist other types of graph Laplacians for undirected graphs, such as the unnormalized Laplacian L:=D-W and the random walk Laplacian LRW:=1-D-1W. Many useful properties of a graph G can be obtained from graph Laplacians, and they form the basis of spectral graph theory (Chung, 1997). Due to its close relationship to P, in this tutorial, we predominantly focus on L. In the case of L, there is a weaker connection to P, which we only discuss briefly. Furthermore, to avoid redundancy, we do not discuss LRW at all, since it is the same as -P but shifted by the identity. For a concise review of each of these objects, see Wiskott and Schönfeld (2019).

As the name suggests, graph Laplacians are analogues of the Laplace operator. In particular, a number of studies have found links between graph Laplacians and a variant of the Laplace operator on manifolds, known as the Laplace-Beltrami operator (Belkin & Niyogi, 2008; Hein et al., 2007). Broadly speaking, Laplace operators measure how much the value of a function at a point varies from its local average, which is informally related to the notion of mean curvature (Reilly, 1982). Furthermore, the eigenvalues of this operator are nonnegative real numbers and are related to how much the corresponding eigenfunctions vary over the domain of the function (Grebenkov & Nguyen, 2013). All of these properties have analogues in the case of L, which justifies the term given to this object. To see this, we start with the following result (von Luxburg, 2007):

Theorem 15.
Let G be an undirected graph with weight matrix W and normalized Laplacian L. Then for any vector x defining values on the vertices of G,
xTLx=12i,j=1NWijxidi12-xjdj122,
(4.20)

which is guaranteed to produce a real nonnegative number since L is positive semidefinite (for the proof, see the appendix).

In equation 4.20, the quadratic term measures the difference between the entries of a vector that is related to x by normalization with D-12, that is, the vector xR. Since this is minimized when the entries of this vector are similar for pairs of vertices with large edge weight Wij, it can be interpreted to describe the smoothness of xR with respect to the connectivity of G. Furthermore, since this measure describes the smoothness of xR as opposed to that of x, it is useful for dealing with graphs that have a very heterogeneous degree structure.

If we evaluate the left-hand side of theorem 15 using a normalized eigenvector yω of L, we get
yωTLyω=λωyω,yω
(4.21)
=λω
(4.22)
meaning that λω describes precisely this notion of smoothness for yω. Now assume that we have an orthonormal set of eigenvectors of L. From definition 20, we know that the corresponding eigenvalues are real numbers in the interval [0, 2] and that there is guaranteed to be at least one eigenvalue λ = 0. Thus, they can be ordered as follows:
0=λ1λ2λN2.
(4.23)
Putting these findings together, we can say that given an orthonormal set of eigenvectors of L, there is a natural way to order them based on their smoothness across G, according to equation 4.20. For example, taking an eigenvector with λ = 0 (see definition 20) and plugging it into equation 4.20 indeed yields zero. These are therefore the eigenvectors of L that have the lowest smoothness score in equation 4.20. Similarly, the eigenvectors of L with the second smallest eigenvalue are those that, subject to orthonormality, score the second lowest in equation 4.20.
As indicated by equation 4.23, we can apply this ordering to all eigenvalues of L, meaning that the resulting set of eigenvectors gets progressively less smooth, or equivalently more oscillatory, as λω increases. Since this basis is orthonormal, it can express any signal xRN on G, with the resulting components describing the projection of x onto each of these oscillatory modes. Due to its correspondence with Fourier analysis, this decomposition has been referred to as the graph Fourier transform of x (Shuman et al., 2013). Moreover, a useful result in spectral graph theory is that a basis of eigenvectors corresponding to the k leading eigenvalues of L is one that, subject to orthonormality, has the lowest possible total score in equation 4.20. In other words, if YRN×k represents such a basis of right eigenvectors, it solves the following optimization problem (Belkin & Niyogi, 2001, 2003; Wiskott & Schönfeld, 2019):
minimizetr(YTLY)
(4.24)
subjecttoYTY=1.
(4.25)
This property underlies several Laplacian-based methods in unsupervised learning (Ghojogh et al., 2021; Sprekeler, 2011), where typically the eigenvectors of L with the smallest eigenvalues are those that capture the most important information relating to the learning objective, and so discarding eigenvectors with higher eigenvalues can provide a description of a data set that has lower dimensionality while also being more interpretable.

These observations can be related to the random walk on G. Since L shares an eigenbasis with K and the eigenbasis of K is related to that of P by theorem 13, we can relate the eigenvectors of L to those of P as follows:

Proposition 10.

If yω is an eigenvector of L with eigenvalue λω, then lω=D12yω=yω,L and rω=D-12yω=yω,R are left and right eigenvectors of P, respectively, with eigenvalue 1 − λω.

Because of this, the eigenvectors of P have a similar form to those of L, but subject to the coordinate transformations of definition 19. Therefore, they can also be interpreted using the notion of smoothness, with λ = 1 in some sense being the smoothest case.

To illustrate the relationship between the eigenvectors of L and P, we consider a simple Markov chain consisting of a linear arrangement of 100 states, where only nearest neighbor transitions are allowed (see Figure 16a). Transitions to the left and right occur with probability 0.48 and 0.52, respectively, and the self loops at s1 and s100 mean that staying fixed is allowed in these states. Take a moment to verify that this chain is both ergodic and reversible (hint: in the latter case, try theorem 5). Once we know the stationary probabilities πi of this chain, we can calculate the corresponding matrices K and L by using equation 4.10 and then equation 4.16. In Figure 16b, we plot the left and right eigenvector of P with λ = 1, which are the stationary distribution π and η=(1,1,...,1)T, respectively. The stationary probabilities are larger for states near the right end of the chain by virtue of the tendency for rightward transitions in this chain. Additionally, we plot the corresponding eigenvector of L with λ = 0, which is y=(π112,π212,...,π10012)T. This vector is in some sense an intermediary between π and η since, in agreement with proposition 10, π=Π12y=D12y=yL and η=Π-12y=D-12y=yR (as indicated by the black arrows in Figure 16b).

Figure 16:

(a) A Markov chain with N = 100 states arranged in a line, with left and right transitions happening with probability 0.48 and 0.52, respectively. (b) The eigenvector of L with λ = 0 (blue), as well as the left (red) and right (green) eigenvectors of P for this model. (c) The next four eigenvectors of L, with the corresponding eigenvectors indicated in the legend, as well as similar plots for the next four left (d) and right (e) eigenvectors of P. All eigenvalues are rounded to 4 decimal points.

Figure 16:

(a) A Markov chain with N = 100 states arranged in a line, with left and right transitions happening with probability 0.48 and 0.52, respectively. (b) The eigenvector of L with λ = 0 (blue), as well as the left (red) and right (green) eigenvectors of P for this model. (c) The next four eigenvectors of L, with the corresponding eigenvectors indicated in the legend, as well as similar plots for the next four left (d) and right (e) eigenvectors of P. All eigenvalues are rounded to 4 decimal points.

Close modal

In Figure 16c, we show four more eigenvectors of L having eigenvalues closest to 0. In agreement with our discussion, the eigenvectors get less smooth as λ increases, becoming more oscillatory and resembling trigonometric functions over the state space. In Figures 16d and 16e, we do the same for the left and right eigenvectors of P, respectively. The smoothness of these eigenvectors also depends on the size of λ; however, this time we are interested in those with eigenvalues closest to 1, and the smoothness decreases as λ decreases. Furthermore, in comparison to Figure 16c, these eigenvectors have an additional weighting effect across the state space. The left eigenvectors have larger amplitudes for states on the right-hand side, which is intuitive since these states have higher stationary probabilities πi, and the eigenvectors can be obtained from those of L by the left coordinate transformation of definition 19. For the right eigenvectors, the weighting is the opposite, with states on the left-hand side having larger amplitudes. This is explained by an equivalent argument, except that this time, we apply the right coordinate transformation to the eigenvectors of L. It should be noted that for the purpose of visualization, all eigenvectors in Figures 16c to 16e are normalized to have Euclidean norm 1, even though for the eigenvectors of P, this is not the natural normalization (see theorem 14).

The example chain in Figure 16a is somewhat artificial since the transition probabilities are homogeneous, which is why the stationary distribution in Figure 16b appears to be so smooth visually. This in turn means that the underlying graph G has a degree structure that varies somewhat smoothly from left to right, which is partly responsible for the other eigenfunctions also being so smooth. To see how this generalizes as we make the transition probabilities less homogeneous, we consider a modified chain, which we show in Figure 17a. In this model, left transitions from a state si happen with probability 0.52 + ξi and right transitions with probability 0.48 − ξi, where ξi is a number sampled from a uniform distribution on the interval [−0.1, 0.1]. Therefore, this chain is like the previous example, but with a random perturbation at each state. In Figures 17b to 17e, we show the corresponding versions of Figures 16b to 16e. As one can see, the eigenvectors of L as well as the left eigenvectors of P have similar shapes to the previous case, but they appear more rugged due to the inhomogeneity of the transition probabilities. Given that the perturbations made to the previous model are small, this is an illustration of the fact that these sets of eigenvectors are very sensitive to local information. Rather interestingly, the same is not true for the right eigenvectors of P: in Figure 17b, the right eigenvector with λ = 1 is the same as in the previous example, and in Figure 17e, the other right eigenvectors are partially distorted but still appear somewhat smooth visually. This can be understood in the following way. Since the right eigenvectors of P are the right transformations of the eigenvectors of L, we can reformulate the optimization problem in equations 4.24 and 4.25 in terms of the basis YR=D-12Y:
minimizetr(YRTD12LD12YR)=tr(YRTLYR)
(4.26)
subjecttoYRTD12D12YR=YRTDYR=1,
(4.27)
where we have used the fact that D12LD12=D12(1-D-12WD-12)D12=D-W=L. Therefore, for right eigenvectors of P, the relevant smoothness measure is related to L as opposed to L. Furthermore, writing equation 4.20 in terms of L and xR gives
xRTLxR=12i,j=1NWij(xR,i-xR,j)2,
(4.28)
which we can interpret to mean that the vectors that optimize this measure—the right eigenvectors of P with eigenvalue close to 1—have entries that are close for vertices with large Wij. In the context of the example in Figure 17, this means that such eigenvectors have similar values for neighboring vertices, and a quick check reveals that this is indeed the case in Figure 17d. Conversely, for the eigenvectors in Figures 17b and 17c the values on average vary more between neighboring vertices.
Figure 17:

(a) A modified version of the chain in Figure 16a, whereby the transition probabilities of 0.48 and 0.52 are perturbed by random numbers ξi ∈ [−0.1, 0.1]. (c–e) The equivalent of the plots in Figures 16c to 16e, where again all eigenvalues are rounded to 4 decimal points.

Figure 17:

(a) A modified version of the chain in Figure 16a, whereby the transition probabilities of 0.48 and 0.52 are perturbed by random numbers ξi ∈ [−0.1, 0.1]. (c–e) The equivalent of the plots in Figures 16c to 16e, where again all eigenvalues are rounded to 4 decimal points.

Close modal

Finally, we note that the ordering of eigenvalues used in this section is somewhat different from that in section 2.3. In this section, the eigenvalues of L are ordered from 0 up to 2, which corresponds to the eigenvalues of P being ordered from 1 to −1. To reflect our interpretation, we call this ordering by smoothness. In section 2, however, the eigenvalues of a transition matrix are ordered by their absolute value, which describes how long the contribution from the corresponding eigenvector persists as the chain evolves. We therefore call this choice ordering by persistence. The suitability of either of these types of ordering depends on the specific problem domain in which a Markov chain is being used. For example, in the case of spectral clustering, they correspond to distinct objectives, as demonstrated in Liu (2011). Furthermore, these two types of ordering appear in Creutzig and Sprekeler (2008), in which the authors compare slowness and predictability as objectives for dimensionality reduction of time series data.

This concludes our treatment of random walks on undirected graphs. As a summary of the results given, in Table 2 we compare the mathematical properties and relationships of L, K, and P. The material presented in this section is particularly important for applications in machine learning and data mining in which the data set can be formulated as a graph. In particular, it underlies work that has been done on problems such as spectral clustering (Meilă & Shi, 2000, 2001; Tishby & Slonim, 2001; Saerens et al., 2004; Liu, 2011; Weinan et al., 2008), manifold learning/graph embedding (Coifman et al., 2005a; Coifman & Lafon, 2006), graph-based classification (Kamvar et al., 2003; Szummer & Jaakkola, 2001; Joachims, 2003), and value function approximation in reinforcement learning (Mahadevan, 2005; Mahadevan & Maggioni, 2007; Petrik, 2007; Stachenfeld et al., 2014, 2017). In the next section, we consider how, if at all, the material presented in this section generalizes to directed graphs.

Table 2:

A Summary of the Properties Established for the Three Matrices Associated to Random Walks on Undirected Graphs: P, K, and L.

PKL
Relationship to W D-1W D-12WD-12 1-D-12WD-12 
Diagonalizable ✔ ✔ ✔ 
Symmetric ✗ ✔ ✔ 
Positive semidefinite ✗ ✗ ✔ 
Eigenvalues λω ∈ [−1, 1] λω ∈ [−1, 1] λω ∈ [0, 2] 
Eigenvectors lin. indep. orthogonal orthogonal 
Left eigenvectors yω,L yω yω 
Right eigenvectors yω,R yω yω 
PKL
Relationship to W D-1W D-12WD-12 1-D-12WD-12 
Diagonalizable ✔ ✔ ✔ 
Symmetric ✗ ✔ ✔ 
Positive semidefinite ✗ ✗ ✔ 
Eigenvalues λω ∈ [−1, 1] λω ∈ [−1, 1] λω ∈ [0, 2] 
Eigenvectors lin. indep. orthogonal orthogonal 
Left eigenvectors yω,L yω yω 
Right eigenvectors yω,R yω yω 

4.3  Random Walks on Directed Graphs

Broadening our consideration to directed graphs is necessary if we want to describe nonreversible Markov chains. However, since many of the guarantees established in section 4.2 do not hold for directed graphs, this case is a lot harder to treat analytically. In section 4.3.1 we explore the main challenges that occur when applying spectral graph theory to the transition matrices of nonreversible Markov chains, and in section 4.3.2 we describe methods for circumventing these issues. Finally, in section 4.3.3 we define a generalization of L to directed graphs, and in section 4.3.4 we present a method for enforcing ergodicity on random walks of directed graphs.

4.3.1  Key Difficulties

Since many of the guarantees established in section 4.2 do not hold for directed graphs, they are a lot harder to treat analytically. Perhaps most important, the transition matrices of nonreversible chains are neither guaranteed to be diagonalizable nor to have real eigenvalues (Weber, 2017), which can be observed even for simple cases such as those shown in Figures 18a18c. There are nonetheless some cases for which both properties hold (Weber, 2017) (as shown by the example in Figure 18d), but it is still not fully understood to what degree, if at all, the transition structure of a nonreversible chain determines either the diagonalizability of P or whether its eigenvalues are real or complex.8 When P is nondiagonalizable, it does not have a set of N linearly independent eigenvectors, which can cause numerical issues since in this case, some matrix operations are computationally more expensive or not well defined. Moreover, if the eigenvalues of P are complex, then so are its eigenvectors. As in the real case, we can choose to order these eigenvectors based on persistence or smoothness. In the former case, the generalization is somewhat straightforward since |λ| still describes how long each eigenvector typically persists. In the latter case, however, the question of how to generalize the concept of smoothness to complex eigenvectors is nontrivial and is still an actively researched topic in the literature (Sevi et al., 2023; Marques et al., 2020). These factors make analyzing the transition matrices of nonreversible Markov chains more challenging than in the reversible case.

Figure 18:

(a–d) Four irreducible nonreversible Markov chains (top) that have transition matrices (bottom) with different properties.

Figure 18:

(a–d) Four irreducible nonreversible Markov chains (top) that have transition matrices (bottom) with different properties.

Close modal

4.3.2  Alternative Methods

One general technique for treating a nondiagonalizable matrix X is to add a perturbation so that it becomes diagonalizable. This is based on the notion that diagonalizable matrices densely fill the set of all matrices (Golub & Van Loan, 2013), meaning that it is always possible to find some nearby matrix X' that is diagonalizable. Pauwelyn and Guerry (2021) develop a method along these lines for dealing with nondiagonalizable transition matrices. In particular, for a starting transition matrix P, a perturbation matrix E is found such that P'=P+E preserves a number of the spectral properties of P and is diagonalizable. However, two limitations of this method are that it has computational complexity O(N8) for N × N matrices and the resulting transition matrix can still have complex eigenvalues.

Other lines of work have attempted to circumvent these issues by using alternative matrix decompositions, with a prominent example being the real Schur decomposition (Stewart, 1994; Conrad et al., 2016; Weber, 2017; Fackeldey et al., 2018; Ghosh & Bellemare, 2020).9 This decomposition provides a set of real orthogonal basis vectors, known as Schur vectors, that spans the eigenspaces of X (Golub & Van Loan, 2013). This basis is not unique and corresponds to some ordering of the eigenvalues of X, with the first k Schur vectors spanning the eigenspaces of the first k eigenvectors in this ordering. Therefore, given some Schur decomposition of X, if Uk=(u1,u2,...,uk) is the set of first k Schur vectors, then for any linear combination u˜=γ=1kcγuγ, it is guaranteed that Xu˜span(Uk). For this reason, Uk is said to be an invariant subspace of X (Golub & Van Loan, 2013), and when k < < N it provides a low-dimensional description of the transformation that X represents. The real Schur decomposition is therefore a useful alternative to the eigendecomposition. However, so that the basis captures the most important information about X, a reordering algorithm is needed to specify which eigenspaces of X it should span, and various methods for this have been developed (Ng & Parlett, 1987; Dongarra et al., 1992; Bai & Demmel, 1993; Granat et al., 2009; Brandts, 2002). Furthermore, it is worth emphasizing that in contrast to the eigendecomposition, the real Schur decomposition is guaranteed to exist for any real square matrix X, meaning that it sidesteps the issues of nondiagonalizability and complex feature spaces that can occur with transition matrices of nonreversible Markov chains. In the field of machine learning, the real Schur decomposition of transition matrices has been used as a tool for clustering (Fackeldey et al., 2018) as well as for building state representations in reinforcement learning (Ghosh & Bellemare, 2020).

4.3.3  Directed Normalized Graph Laplacian

In section 4.2.2, the normalized graph Laplacian L was introduced as a way to get a more precise description of the left and right eigenvectors belonging to transition matrices of reversible Markov chains. Generalizing L to directed graphs is challenging since two of its defining features are that it is symmetric and positive semidefinite, neither of which can be satisfied by equation 4.18 if W is nonsymmetric. However, various definitions for directed graphs exist, and while some loosen the constraint that L should be positive semidefinite (Agaev & Chebotarev, 2005; Caughman & Veerman, 2006; Li & Zhang, 2012; Singh et al., 2016), others strictly enforce this via a type of symmetrization (Chung, 2005). We here focus on the latter type and demonstrate connections that this has to some of the material in section 2.

Perhaps the simplest method along these lines is to symmetrize the weight matrix W of a directed graph G to get an alternative weight matrix W sym , for example W sym =W+WT or W sym =WTW, and then use the regular definition of the normalized Laplacian using this new matrix: 1-D-12W sym D-12. Since W sym describes an undirected graph, the resulting object can be interpreted in the same way as section 4.2.2. However, a major drawback of this approach is that the graphs described by W and W sym can have very different structural properties. For instance, there is no guarantee that the random walks on these two graphs have stationary distributions that bear any resemblance to one another. Indeed, various studies in machine learning have indicated that symmetrizing W leads to a significant erasure of structural information from a directed graph (Pentney & Meilă, 2005; Mahadevan et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007).

A more principled approach is the one presented in Chung (2005), in which the directed normalized graph Laplacian is defined as
Ldir:=1-Π12PΠ-12+Π-12PTΠ122,
(4.29)
where P and Π are defined as usual. In the original paper, restrictions are placed on the underlying graph so that P is ergodic. Among other things, this means that the stationary probabilities are all nonzero, which is needed for Π-12 to be well defined. In the next section, we present a method for enforcing this property for any directed graph G, but for now, we simply assume that π>0. Equation 4.29 can be simplified as
Ldir=1-Π12PΠ-12+Π12(Π-1PTΠ)Π-122
(4.30)
=1-Π12P+Π-1PTΠ2=PAΠ-12,
(4.31)
where PA is the additive reversibilization of the random walk (see definition 14). Thus, by comparing equation 4.31 to equation 4.17, we see that Ldir is a variant of L but where P has been exchanged for PA. Thus, this corresponds to a symmetrization of the stationary flow between states (see equation 2.80). While this transformation still destroys some information about the underlying graph G, it has been observed empirically that this effect is less severe in comparison to methods that symmetrize W itself (Pentney & Meilă, 2005; Mahadevan et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007). For example, note that P and PA describe random walks that have the same stationary distributions (see definition 14).
The only time that symmetrizing W and using PA yields the same result is in the special case of a balanced directed graph. In this case, Π12=z-12D12 (see proposition 7), which allows us to simplify equation 4.31:
Ldir=1-z-12D12D-1W+D-1(D-1W)TD2z12D-12
(4.32)
=1-D12D-1W+D-1WTD-1D2D-12
(4.33)
=1-D12D-1W+D-1WT2D-12
(4.34)
=1-D-12W+WT2D-12,
(4.35)
meaning that Ldir corresponds to an additive symmetrization of W. However, in most practical applications, directed graphs are not balanced, and it is in these cases that using PA preserves more information about a directed graph than simply symmetrizing W.

The directed normalized Laplacian has been used in various contexts of machine learning as a way to generalize methods that are restricted to undirected graphs. It has been applied to problems such as spectral clustering (Meilă & Pentney, 2007; Huang et al., 2006; Liu, 2011), graph embedding (Chen et al., 2007; Perrault-Joncas & Meilă, 2011), graph-based classification (Zhou et al., 2005), and value function approximation in reinforcement learning (Johns & Mahadevan, 2007). In most of these applications, ergodicity is enforced on the random walk described by P, and in the next section, we introduce the standard method for doing this.

4.3.4  Random Surfer Model

As explained in section 2.5, ergodic Markov chains have the useful property that they are guaranteed to converge to a unique stationary distribution, and various reasons were given for why this is desirable in a general context. For directed graphs, it is a particularly beneficial property, since without it, a random walk can get trapped in a small cluster of states, or even a single absorbing state. Because of this, sometimes ergodicity is enforced for random walks on directed graphs (Page et al., 1999; Zhou et al., 2005; Huang et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007). We remind readers that one effect this has is that π>0, meaning that the directed normalized Laplacian is well defined.

The typical method used for enforcing ergodicity can be thought of as a variation on the random walk process. Given a directed graph G with weight matrix W, at each time step, there are two possible outcomes: either a regular random walk is performed with probability α, or the process teleports randomly to any vertex with probability 1 − α. This is known as a random surfer model or teleporting random walk, and it appeared for the first time in the PageRank algorithm (Page et al., 1999). If P and Ptel represent the transition probabilities of the two possible outcomes at each time point, then the overall process is described by the following transition matrix,
P':=αP+(1-α)Ptel,
(4.36)
which is sometimes referred to as a Google matrix (Franceschet, 2011).

It is worth noting that there exist many variants of the random surfer model that differ in the assumptions they make about Ptel (Berkhin, 2005). For example, teleporting transitions can either be uniformly random or biased toward certain vertices through a set of weights. Furthermore, the parameter α ∈ [0, 1] is known as the damping factor and determines how close the process is to a regular random walk. Typically, it is set close to 1, so that the process still accurately reflects the structure of the underlying graph G.

In the case of uniform teleportation, it is straightforward to specify the teleportation probabilities since they are all equal to 1N. Therefore, equation 4.36 becomes
P'=αP+(1-α)1N,
(4.37)
where 1RN×N denotes a matrix of ones. While P represents the random walk on G, we can interpret 1N as the transition matrix of a random walk on a graph GC that has the same number of vertices as G but where each vertex vi is connected to all others, including itself.10 An example is shown in Figure 19, with panel a showing the starting graph G and panel b showing GC. In panels c and d, we show the transition matrices P and Ptel of the random walks on G and GC, respectively. Finally, in panel e, the transition matrix of the overall teleporting random walk with α = 0.85 is shown.
Figure 19:

(a) A directed graph G, (b) the graph GC, (c and d) the corresponding transition matrices P and Ptel, respectively, and (e) the random surfer transition matrix P' for α = 0.85, with transition probabilities accurate to 2 decimal places.

Figure 19:

(a) A directed graph G, (b) the graph GC, (c and d) the corresponding transition matrices P and Ptel, respectively, and (e) the random surfer transition matrix P' for α = 0.85, with transition probabilities accurate to 2 decimal places.

Close modal

Due to the teleportation term Ptel, from any given state si there is always a nonzero probability to access any other state or to stay in the same state. By virtue of this, such processes are guaranteed to be both irreducible and aperiodic, and therefore ergodic.

4.4  Summary

This concludes our treatment of random walks. The material presented in this section forms a useful framework that connects Markov chains and graphs. On the one hand, describing a Markov chain as a process taking place on a graph is a useful interpretation since it provides intuition about underlying relationships between states. Furthermore, it allows one to apply the tool kit of spectral graph theory to Markov chains. On the other hand, graphs by themselves represent only static relationships between entities, and performing a random walk is one way to describe a graph in terms that are dynamic or temporal. Moreover, the fact that a transition matrix can be easily exponentiated (i.e Pk) means that a random walk provides information about a graph G at multiple timescales, which is a property that has been exploited in the field of manifold learning (Coifman et al., 2005a, 2005b; Coifman & Lafon, 2006). In this section, many concepts and results from linear algebra are required, for which we recommend Meyer (2000) as a general resource and Stewart (1994) as a more specific summary of the application to transition matrices. Moreover, we recommend Spielman (2019) as a text on spectral graph theory, where readers can find a more in-depth exploration of graph Laplacians.

While this tutorial focuses on mathematical concepts, the material has a number of applications in machine learning, as well as computer science more generally. A number of these are mentioned in section 1 and various parts of the main text. In order to provide some concrete examples, this section explores two of the most actively researched areas of application.

5.1  Markov Chain Monte Carlo

Monte Carlo (MC) methods are a broad class of computational techniques that use random sampling as a way to make numerical approximations. Research on these methods goes as far back as the 1940s (Bielajew, 2012), and since then, they have been widely used throughout the physical sciences (Amar, 2006), with applications still remaining useful today (Kroese et al., 2014). In Halton (1970), MC methods are generally defined as “representing the solution of a problem as a parameter of a hypothetical population, and using a random sequence of numbers to construct a sample of the population, from which statistical estimates of the parameter can be obtained.” Here the term hypothetical population simply means some distribution μ(x) and the term parameter can be interpreted as the expectation of a well-behaved function ϕ(x) computed over this distribution. Thus, if q is a quantity of interest, then MC methods start by formalizing it as such an expectation (Cemgil, 2014),
q=Eμ[φ(x)],
(5.1)
and then generate repeated samples from the distribution μ(x) (i.e. x1, x2, . . . , xk) with which this expectation can be evaluated empirically. The resulting approximation of the quantity q is then:
q^=φ(x1)+φ(x2)+...+φ(xk)k.
(5.2)
The utility of these methods thus hinges on whether q^ converges to q. Fortunately, for independent and identically distributed samples, the law of large numbers (LLN) guarantees that
Pr(limk|q-q^|=0)=1,
(5.3)
which is known as almost sure convergence (Cemgil, 2014). Moreover, as one approaches this limit, the central limit theorem (CLT) guarantees that the distribution of q^ approaches a normal distribution (Cemgil, 2014),
q^N(q,σ2/k),
(5.4)

where σ2 is the variance of ϕ(x). Therefore, Monte Carlo methods provide an unbiased estimate of q with a standard deviation that scales like O(k-12).

A canonical and simple example of the Monte Carlo method outlined above is the approximation of the constant π = 3.1415. . . . To start, imagine that a unit circle sits inside a square with side length 2 (see Figure 20a) and that rain falls on these shapes, with the position of each drop being determined by a uniform distribution u(x) defined across the total surface. Clearly, the probability of a single raindrop falling inside the circle is given by
pcircle=areaofcircleareaofsquare=π·122·2=π4,
(5.5)
meaning that our quantity of interest can indeed be formulated in a manner akin to equation 5.1,
π=4·pcircle=4·Eu[circ(x)],
(5.6)
where circ(x) is a function that is equal to one when a raindrop falls inside the circle and 0 otherwise. Our next step is to construct an empirical approximation for π in the form of equation 5.2. If we sample k raindrops from the distribution u(x), this approximation reads
π^=4·ncirclek,
(5.7)

where ncircle is the number of raindrops observed to fall inside the circle. Clearly, generating truly uniform rainfall in a physical context is not feasible. However, such a model can be easily simulated with the help of pseudo-random number generators. Figure 20a depicts the result of such a simulation for k = 100, for which the resulting Monte Carlo approximation is π^3.24. While this is not very accurate, equations 5.3 and 5.4 tell us that the approximation should improve on average as k increases. We visualize this in Figure 20b, where k ranges from 1 to 500, and for each value of k, the approximation π^ is carried out 100 times. The blue line shows the mean value of π^ found at k, which gets closer to the true value as k increases, and the gray area shows a single standard deviation, which indeed appears to scale like O(k-12) and indicates that fluctuations from the true value get smaller on average as k increases.

Figure 20:

(a) k = 100 points sampled uniformly across a square area of side length 2 with the unit circle shaded in gray and (b) a line plot showing how the approximation π^ develops as k increases. For each value of k, the mean and standard deviation for 100 trials are shown in blue and gray, respectively, and the true value is shown in red.

Figure 20:

(a) k = 100 points sampled uniformly across a square area of side length 2 with the unit circle shaded in gray and (b) a line plot showing how the approximation π^ develops as k increases. For each value of k, the mean and standard deviation for 100 trials are shown in blue and gray, respectively, and the true value is shown in red.

Close modal

In the example, the Monte Carlo approximation is particularly straightforward because the distribution we needed to sample from was very simple. However, often this is not the case. Indeed, a large portion of Monte Carlo techniques are specifically designed to deal with situations in which sampling directly from the target distribution is difficult or impossible.

In particular, Markov chain Monte Carlo (MCMC) involves the construction of a Markov chain, defined over the sample space of the problem being studied, which is guaranteed to converge to the target distribution. Thus, by initializing such a chain and waiting until it converges, one can eventually generate samples to use for a Monte Carlo approximation. We have already seen that in order to have guarantees of convergence to a unique stationary distribution, a chain needs to be ergodic, and so the goal of MCMC is to find such an ergodic chain for a given stationary distribution π. A wide variety of MCMC methods exist, but by far the most famous is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). Below we summarize this algorithm, and to maintain consistency with the rest of the tutorial, we focus on the case where the relevant state space is discrete. However, the ideas presented can be generalized to the continuous case, and indeed, most interesting applications of the algorithm involve continuous state spaces.

At its core, the MH algorithm involves creating a Markov chain based on proposed transitions that either get accepted or rejected and requires the following two ingredients. A proposal distribution for generating transitions must be known a priori, which we denote Pijprop=p(sj|si), and for reasons that become clear below, we assume that the chain corresponding to these transition probabilities is irreducible. The generated transitions are then accepted with the following probability:
aij=min1,πjPjipropπiPijprop.
(5.8)

Two points can be made about the expression above. First, although equation 5.8 involves stationary probabilities, the MH algorithm does not require that π is known explicitly. Instead, it is only assumed that some positive function11 proportional to π is known, that is, π=fZ, where Z is a normalizing constant that is typically unknown or otherwise intractable. This means that ratios of stationary probabilities in equation 5.8 can be evaluated without needing to take care of Z: πjπi=fj/Zfi/Z=fjfi. Second, the denominator on the right-hand side of equation 5.8 is always positive because Pijprop>0 for any proposed transition and πifi > 0 ∀i, meaning that the fraction is always well defined.

In combination, Pijprop and aij induce a resulting Markov chain with the following transition probabilities:
Pijres=Pijpropaij+δijri,
(5.9)
where the second term measures the accumulated probability of remaining in the same state due to rejections, with ri=j'(1-aij')Pij'prop. In order for π to be a stationary distribution of the resulting chain, global balance must hold (see equation 2.18). However, verifying this in most practical settings is prohibitively costly or even impossible. An elegant property of the MH algorithm is that this issue is effectively circumvented, as the resulting chain is guaranteed to satisfy the stronger condition of detailed balance (see equation 2.71). We show this below, considering only the case of ij since detailed balance is guaranteed for i = j:
πiPijres=πiPijpropaij
(5.10)
=πiPijprop·min1,πjPjipropπiPijprop
(5.11)
=minπiPijprop,πjPjiprop
(5.12)
=πjPjiprop·min1,πiPijpropπjPjiprop
(5.13)
=πjPjipropaji
(5.14)
=πjPjires.
(5.15)
Thus, π is guaranteed to satisfy detailed balance and therefore must be a stationary distribution of the chain corresponding to Pres. Moreover, this chain is also ergodic: aperiodicity is enforced by the acceptance rule in equation 5.8, and irreducibility follows since the chain corresponding to Pprop is assumed to be irreducible. Thus, π is the unique stationary distribution that the Markov chain generated by MH is guaranteed to converge to. The beauty of this result is that π need not even be a stationary distribution of the Markov chain corresponding to Pprop, but by following the MH procedure, one is guaranteed to realize samples of π as t → ∞.

Two key challenges facing the application of this algorithm, as well as other MCMC methods more generally, are the following (Johansen & Ludger, 2007). First, the time required for a Markov chain to get close to its stationary distribution, known as the mixing time, can be very long. Because of this, MCMC methods typically involve an initial burn-in period in which samples are discarded. One issue with this procedure is that it is rarely straightforward to judge how long is sufficient for a given target distribution, proposal chain, and initialization. However, in the case of an ergodic and reversible chain, such as in the MH algorithm, some guidance on this can be taken from a well-known result in Markov chain theory, which says that the mixing time for such chains is upper- and lower-bounded by values related to the spectral gap γ = 1 − λ2, where λ2 is the second largest eigenvalue of the associated transition matrix (Levin et al., 2009). Second, samples from a Markov chain that occur close together in time are often highly autocorrelated, which clearly violates the i.i.d. property and reduces the effective sample size. A typical method for treating this, known as thinning, is to use only every mth sample generated from the chain. Something common to both of the issues described above is that they often make MCMC methods very slow in practice. Because of this, techniques for acceleration have received a lot of attention in the literature and are still an active area of research (Robert et al., 2018).

5.2  Reinforcement Learning

Reinforcement learning (RL) is a framework for studying how agents can learn behaviors that maximize reward signals by interacting with their environment. The canonical paradigm for this type of learning are Markov decision processes (MDP), which are based on Markov chains. In this section, we introduce MDPs and outline how the material presented so far can be applied to these models.

MDPs are stochastic control processes, whereby an agent is in a state s at each time point t and chooses an action a from those that are available in s, then finds itself in a state s at time t + 1 and receives a scalar reward rt + 1. Formally, this can be defined as a 5-tuple M=(S,A,p,r,γ), where S is a state space, A is an action space, p(s|s, a) is a transition model describing the probability of moving to s when taking action a in state s, r(s, a) is a reward function describing the instantaneous reward received when taking action a in state s, and γ ∈ [0, 1) a discount factor (Sutton & Barto, 2018). Together p and r define the dynamics of the MDP, and both can be deterministic or stochastic, as long as they respect the Markov property, which requires that rt + 1 and st + 1 depend only on st and at. Furthermore, while in the most general case S, A, and t can be either discrete or continuous, in order to maintain consistency, we here consider the simplest setting where all are discrete.

In RL, an agent’s primary objective is to learn behaviors that maximize reward signals in the environment. The behavior of the agent is described by a policy μ(at|st) that specifies how likely the agent is to make a given action in a given state. The general goal of RL involves interleaving the following: evaluating how well a given policy μ optimizes reward, known as the prediction problem, and finding a new policy μ that improves on μ, known as the control problem. In order to solve either of these problems, one must first define a meaningful measure of reward for a policy μ. Since RL is concerned with sequential forms of behavior, cumulative rewards are more relevant than the instantaneous rewards received at each time point. Therefore, the typical way to measure how much reward a policy μ receives is to consider each state in sS and calculate the cumulative future reward that an agent can expect when it is in state s at time t and following policy μ. This is known as the state-value function and can be written as
vμ(s)=Eμk=0γkrt+k+1|st=s,
(5.16)
where the discount factor γ scales the importance of proximal versus distal future rewards. Equation 5.16 can be applied to all states in S and the values can be combined into a vector vμ, which then measures how well the policy μ optimizes reward over the whole state space. A simple calculation can show that this vector satisfies the following self-consistency equation (Sutton & Barto, 2018),
vμ=rμ+γPμvμ,
(5.17)
which is known as the Bellman equation. The matrix Pμ is the result of combining the policy μ and the environment’s transition model p by marginalizing over actions, Pijμ=aμ(a|si)p(sj|si,a), and describes the Markov chain induced across S when the agent behaves according to μ. The vector rμ involves the same type of marginalization, riμ=aμ(a|si)r(si,a), and its entries describe the instantaneous reward expected in each state s.
Since vμ appears on both sides of equation 5.17, it can further be simplified to
(1-γPμ)vμ=rμ
(5.18)
vμ=(1-γPμ)-1rμ(since(1-γPμ)isinvertible)
(5.19)
=k=0(γPμ)krμ(sinceρ(γPμ)<1)
(5.20)
:=Mμrμ.
(5.21)
The following details are worth pointing out about the equations above. First, since Pμ is a transition matrix, all its eigenvalues have absolute value less than or equal to one, and since γ < 1, this means that all eigenvalues of (1-γPμ) are nonzero, which in turn guarantees that the matrix inverse in equation 5.19 is well defined (see Meyer, 2000, pp. 116, 490). Second, since we can also be sure that all eigenvalues of γPμ are strictly less than one, that is, ρ(γPμ)<1, this inverse can be expressed using the Neumann series in equation 5.20 (see Meyer, 2000, p. 618). Third, an illuminating fact about equation 5.21 is that vμ is now decomposed into two factors relating to the policy μ: one based on the transitions realized and another representing instantaneous rewards. The former term Mμ is known as the successor representation (SR) (Dayan, 1993), and a quick check of definition 17 reveals that it resembles the fundamental matrix N of an absorbing Markov chain, with the major differences being that Mμ takes into account all transitions rather than only those between transient states, and it has an additional discount factor γ.

In this section, we outline a few of the key ways in which Markov chains are important in RL, in particular focusing on the relationships between value functions and transition matrices. One of the assumptions underlying our analysis is that the environment’s transition probabilities and reward function are known a priori. However, in virtually all practical applications, this will not be the case, meaning that vμ must be computed using sampled interactions with the environment. Even in these settings, it is very common that concepts and results from Markov chain theory are relevant, and we recommend Sutton and Barto (2018) as a general text explore this further.

The key motivation of this tutorial is to provide a single introductory text on the spectral theory of Markov chains. By bringing together concepts and results from different areas of mathematics, we hope this work is a useful resource for readers aiming to gain a broad, yet concise, overview of the topic. Our presentation involves two different paradigms for interpreting and analyzing Markov chains. Section 2 presents a categorization based on the transition structure and asymptotic behavior, and section 3 formalizes Markov chains as a type of graph. In section 4, these two perspectives are connected by introducing the idea of a random walk, which provides a number of parallels between some categories of Markov chains and certain types of graphs. In particular, one theme that aligns the two perspectives is the distinction between reversible and nonreversible Markov chains on the one hand, and undirected and directed graphs on the other hand, where in both cases, the former option is easier to treat than the latter. With the additional use of results from linear algebra, we arrive at an in-depth description of the eigenvalues and eigenvectors of transition matrices in the reversible case. Furthermore, we discuss various attempts that have been made to generalize spectral methods to the nonreversible case. Finally, section 5 explores two areas of computer science literature in which various concepts and results from the foregoing sections are used. Although the material mostly consists of known results, two novel contributions are the categorization of eigenvalues given in Table 1 and Figure 3, as well as the notion of random walk sets (see definition 18). Since we only assume minimal exposure to concepts from linear algebra and probability theory, and since focus is placed on providing intuition rather than rigorous results, the material of this tutorial is accessible to researchers and students in a variety of quantitative disciplines. For those working in fields related to machine learning and data mining, this work is particularly relevant due to the applications discussed at various points.

Proof of Proposition 4.
Assume that P is the transition matrix of a reducible recurrent Markov chain with r communicating classes. Therefore, for a suitable indexing of states in S, the transition matrix of this chain has a block diagonal form:
P=P1000P2000Pr,
(A.1)
where Pk is a transition matrix composed of the transition probabilities for states in the kth class. Furthermore, if π>0 is one of its stationary distributions, it can be written as π=k=1rαkπk, where each πk has nonzero entries only in the kth class and αk > 0 (see proposition 3). Thus, given the same indexing of states, the matrices Π and Π-1 have the following forms:
Π=α1Π1000α2Π2000αrΠr
(A.2)
Π-1=(α1Π1)-1000(α2Π2)-1000(α1Πr)-1
(A.3)
where Πk is the diagonal block matrix formed from the nonzero entries of πk. Hence, because P rev is a product of three block diagonal matrices, it too has the same form:
P rev =Π-1PTΠ
(A.4)
=P rev ,1000P rev ,2000P rev ,r
(A.5)
where P rev ,k contains the transition probabilities of states in the kth class for the time-reversed Markov chain. Furthermore, we can easily evaluate these block matrices:
P rev ,k=(αkΠk)-1(Pk)T(αkΠk)
(A.6)
=αkαk(Πk)-1(Pk)T(Πk)
(A.7)
=(Πk)-1(Pk)T(Πk).
(A.8)
Thus, the matrix P rev does not depend on the αk terms that parameterize the stationary distribution, meaning that the time-reversed Markov chain is the same regardless of which distribution is considered.
Proof of Proposition 5.
Assume that P is the transition matrix of a recurrent Markov chain and π>0 is one of its stationary distributions. First note that summing over the jth row or column of the flow matrix Fπ gives the same result:
i=1N(Fπ)ji=i=1NπjPji=πji=1NPji=πj
(A.9)
i=1N(Fπ)ij=i=1NπiPij=(2.19)πj,
(A.10)
0
(A.11)
which in vector notation can be written as πT=1TFπ=1T(Fπ)T, where the symbol 1 denotes a vector of ones. Using this, we see that
πT=πTP,
(A.12)
=1TΠP,
(A.13)
=1TFπ,
(A.14)
=1T(Fπ)T,
(A.15)
=1TPTΠ,
(A.16)
=1TΠΠ-1PTΠ,
(A.17)
=πTΠ-1PTΠ,
(A.18)
=πTPrev,
(A.19)
Therefore, π is a stationary distribution of X if and only if it is a stationary distribution of X˜.
Proof of Theorem 8.

(⇒) Assume that P is the transition matrix of a recurrent Markov chain. Then, for any stationary distribution π>0, the flow matrix Fπ=ΠP corresponds to one of the allowed graphs in RW(X). Therefore, using the same argument given in the proof of proposition 5, the row and column sums of Fπ are the same, meaning that this matrix describes a balanced graph.

(⇐) We adapt this direction of the proof from Banderier and Dobrow (2000), where the unweighted case is considered. Assume that G is a balanced graph with weight matrix W and that X is the Markov chain realized by a random walk on this graph. Now consider the distribution π=1z(d1,d2,...,dN)T, where z=j=1Ndj=vol(G) is needed in order for π to sum to 1. We can then easily verify that the equations of global balance hold for this distribution (see equation 2.19):
i=1NπiPij=(4.2)i=1NπiWijdi=i=1NdizWijdi=i=1NWijz=1zi=1NWij=dj-=djz=πj,
(A.20)

and so πTP=πT, which means that π is a stationary distribution of the chain. Note that the summation in the fourth expression defines the in-degree of vertex vj, but since the graph G is balanced, we only have one degree dj associated with each vertex. Finally, since isolated vertices are not allowed, each degree must be bigger than zero, meaning similarly that πi > 0 siS. Hence, there are no transient states, and X is recurrent.

Proof of Theorem 9.
Assume that X is a recurrent Markov chain and G is one of the balanced graphs in RW(X). Using the same argument from the second part of the proof of theorem 8, the degrees of G are related to a stationary distribution π>0 of the chain via πi=diz. From theorem 4, we know that X is reversible if and only if the flow matrix associated π is symmetric, and it is straightforward to show that this equivalent to G being undirected:
Fπ=(Fπ)T
(A.21)
ΠP=PTΠ
(A.22)
ΠD-1W=WTD-1Π
(A.23)
1zDD-1W=WTD-11zD
(A.24)
W=WT
(A.25)
and since we have chosen G arbitrarily, this applies to any balanced graph in RW(X).
Proof of Proposition 7.

See the second part of the proof of theorem 8.

Proof of Theorem 12.
(⇒) Let x,x'Rn be two arbitrary vectors. Then, if P is the transition matrix of a reversible chain and π>0 is one of its stationary distributions, it is guaranteed that ΠP=PTΠ (see theorem 4). Using this, we can show that P satisfies equation 4.8:
x,Px'Π=(4.9)xTΠPx'
(A.26)
=xTPTΠx'
(A.27)
=(Px)TΠx'
(A.28)
=(4.9)Px,x'Π.
(A.29)
(⇐) Let P be the transition matrix of a reversible Markov chain and π>0 one of its stationary distributions. Equation 4.8 can be written as
xTΠPx'=xTPTΠxx,x'Rn.
(A.30)
If we choose x=ei to be a vector with ith entry equal to 1 and zeros elsewhere, this reduces to
j'πiPij'xj''=j''(PT)ij''πj''xj''',
(A.31)
πij'Pij'xj''=j''Pj''iπj''xj'''.
(A.32)
If we then choose x'=ej to be a vector with jth entry equal to one and zeros elsewhere, we get
πiPij=πjPji.
(A.33)
Since the indices i and j have been chosen arbitrarily, equation A.33 holds between any pair of states. Therefore, the chain is reversible (see theorem 4).
Proof of Theorem 13.
Let X be a reversible Markov chain with transition matrix P and a stationary distribution π>0, and let K=Π12PΠ-12. Then the (i, j)th element of K is
Kij=πiPij1πj
(A.34)
=(2.71)πiPjiπjπi1πj
(A.35)
=πjPji1πi
(A.36)
=Kji,
(A.37)
meaning that K is a symmetric matrix. Note that in the third line, we make use of detailed balance, which holds for π>0 if and only if X is reversible. Therefore, this concludes both directions of the proof.
In order to establish the uniqueness of K, we make use of the fact that a reversible Markov chain must be recurrent, meaning that any stationary distribution π>0 is of the form π=k=1rαkπk with αk > 0 (see proposition 3). Since all entries of π are greater than zero, Kij=πiPij1πj0 iff Pij ≠ 0. For a recurrent chain, the latter can only be true for pairs of states in the same communicating class. If si and sj belong to the kth communicating class, then
Kij=πiPij1πj
(A.38)
=αkπk,iPij1αkπk,j
(A.39)
=πk,iPij1πk,j,
(A.40)
where πk, i denotes the ith component of the stationary distribution associated with the kth class. Hence, the nonzero entries of K do not depend on the values of αk, meaning that they are irrespective of the stationary distribution used.
Proof of Theorem 14.
Let X be a reversible Markov chain with transition matrix P. The proof of theorem 13 establishes that this is equivalent to K=Π12PΠ-12 being symmetric, where π>0 is a stationary distribution of the chain. By virtue of theorem 11, this in turn is equivalent to the existence of an orthogonal matrix Y for which
D=YTKY
(A.41)
=YTΠ12PΠ-12Y.
(A.42)
Equation A.42 says that P has the same eigenvalues as K, which are real. It also tells us that P is diagonalizable by sets of right and left eigenvectors that are related to the eigenvectors of K by Π-12 and Π12, respectively. Hence, if yω is an eigenvector of K with eigenvalue λω, then rω=Π-12yω and lω=Π12yω are a pair of corresponding right and left eigenvectors of P with the same eigenvalue. Using this, we see that if rω and rγ are a pair of right eigenvectors of P, then
rω,rγΠ=Π-12yω,Π-12yγΠ=yωTΠ-12ΠΠ-12=1yγ=yωTyγ=δωγ,
(A.43)
where we have used the fact that the basis Y is orthonormal. Similarly, if lω and lγ are a pair of left eigenvectors of P, then
lω,lγΠ-1=Π12yω,Π12yγΠ-1=yωTΠ12Π-1Π12=1yγ=δωγ.
Proof of Proposition 9.
From the proof of theorem 14, we know that if Y is an orthonormal basis of K and yω is a basis vector with eigenvalue λω, then rω=Π-12yω and lω=Π12yω are right and left eigenvectors of P, respectively, with the same eigenvalue. Therefore:
lω=Π12(Π12rω)=Πrω.
(A.44)
Proof of Theorem 15.
Assume that L is the normalized Laplacian of an undirected graph G and xRN. Then:
xTLx=(4.18)xTx-xTD-12WD-12x
(A.45)
=i=1Nxi2-i,j=1NWijxixjdi12dj12
(A.46)
=12i=1Nxi2+12j=1Nxj2-2i,j=1NWijxixjdi12dj12
(A.47)
=12i=1Nxi2didi=1+12j=1Nxj2djdj=1-2i,j=1NWijxixjdi12dj12
(A.48)
=12i=1Nxi2j=1NWijdi+12j=1Nxj2i=1NWjidj-2i,j=1NWijxixjdi12dj12[-5pt]
(A.49)
=12i,j=1Nxi2Wijdi+12i,j=1Nxj2Wjidj-2i,j=1NWijxixjdi12dj12
(A.50)
=12i,j=1NxiWijdi12-xjWjidj122
(A.51)
=12i,j=1NWijxidi12-xjdj122(Wij=WjisinceGisundirected)[-7pt]
(A.52)

We thank Jonathan Hermon for useful discussions on Markov chain theory, as well as Josué Tonelli-Cueto for his insight into the Perron-Frobenius theorem and various concepts in linear algebra.

1

In component notation, the outer product between two vectors x=(x1,x2,...,xN)T and y=(y1,y2,...,yN)T is xyT:=x1x2xNy1y2yN=x1y1x1y2x1yNx2y1x2y2x2yNxNy1xNy2xNyN.

2

For continuous time Markov chains, this result is known as Kelly’s lemma.

3

We restrict edge weights to be positive in order to maintain this notion of strength; however, it is worth noting that some conventions in graph theory allow negative weights.

4

Since in graph theory unweighted graphs are more commonly studied than weighted graphs, the degree quantities we have defined are sometimes referred to as weighted degrees (Chapman & Mesbahi, 2011), but for simplicity we just use the term degree.

5

We remind readers that the spectral radius of a square matrix X is its largest eigenvalue in absolute value and is denoted ρ(X). The name relates to the fact that all eigenvalues are contained within a disk of radius ρ(X) centered at the origin of the complex plane.

6

For this definition to work, we require that all vertices have di+>0, meaning that isolated vertices or vertices with only incoming edges are forbidden.

7

While the examples here each have a single communicating class, the analysis would apply equally to chains with multiple classes.

8

Although some studies have considered complex eigenspaces in specific settings (Meyn et al., 2008; Andrieux, 2011; Mieghem, 2018; Conrad et al., 2016).

9

It should be noted that most of these studies primarily consider nonreversible chains that are ergodic.

10

Sometimes objects like GC are referred to as complete graphs or fully connected networks. However, such terms typically do not include the possibility of self-loops, which we by definition need since we consider uniform teleportation.

11

Note that some references make the milder assumption of a nonnegative function, but that this can be made positive by removing states for which πi = 0.

Agaev
,
R.
, &
Chebotarev
,
P.
(
2005
).
On the spectra of nonsymmetric Laplacian matrices
.
Linear Algebra and Its Applications
,
399
,
157
168
.
Aggarwal
,
C. C.
(
2015
).
Data mining
.
Springer
.
Aldous
,
D.
, &
Fill
,
J. A.
(
2002
).
Reversible Markov chains and random walks on graphs
.
Unfinished monograph
.
Amar
,
J.
(
2006
).
The Monte Carlo method in science and engineering
.
Computing in Science and Engineering
,
8
(
2
),
9
19
.
Andrieux
,
D.
(
2011
).
Spectral signature of nonequilibrium conditions
. .
Bai
,
Z.
, &
Demmel
,
J. W.
(
1993
).
On swapping diagonal blocks in real Schur form
.
Linear Algebra and Its Applications
,
186
,
75
95
.
Banderier
,
C.
, &
Dobrow
,
R. P.
(
2000
).
A generalized cover time for random walks on graphs
. In
D.
Krob
,
A. A.
Mikhalev
, &
A. V.
Mikhalev
(Eds.),
Formal power series and algebraic combinatorics
(pp.
113
124
).
Springer
.
Belkin
,
M.
, &
Niyogi
,
P.
(
2001
).
Laplacian eigenmaps and spectral techniques for embedding and clustering
. In
T. G.
Dietterich
,
S.
Becker
, &
Z.
Ghahramani
(Eds.),
Advances in neural information processing systems 14
(pp.
585
591
).
MIT Press
.
Belkin
,
M.
, &
Niyogi
,
P.
(
2003
).
Laplacian eigenmaps for dimensionality reduction and data representation
.
Neural Computation
,
15
(
6
),
1373
1396
.
Belkin
,
M.
, &
Niyogi
,
P.
(
2008
).
Towards a theoretical foundation for Laplacian-based manifold methods
.
Journal of Computer and System Sciences
,
74
(
8
),
1289
1308
.
Berkhin
,
P.
(
2005
).
A survey on PageRank computing
.
Internet Mathematics
,
2
(
1
),
73
120
.
Bielajew
,
A. F.
(
2012
).
History of Monte Carlo
. In
F.
Verhaegan
&
J.
Seco
(Eds.),
Monte Carlo techniques in radiation therapy
.
CRC Press
.
Brandts
,
J. H.
(
2002
).
Matlab code for sorting real Schur forms
.
Numerical Linear Algebra with Applications
,
9
(
3
),
249
261
.
Brémaud
,
P.
(
1999
).
Markov chains: Gibbs fields, Monte Carlo simulation and queues
.
Springer
.
Caughman
,
J. S.
, &
Veerman
,
J. J. P.
(
2006
).
Kernels of directed graph Laplacians
.
Electronic Journal of Combinatorics
,
13
(
1
), R39.
Cemgil
,
A. T.
(
2014
).
A tutorial introduction to Monte Carlo methods, Markov chain Monte Carlo and particle filtering
. In
S.
Theodoridis
,
P.
Diniz
,
R.
Chellappa
,
P.
Naylor
, &
J.
Suykens
(Eds.),
Academic Press library in signal processing
(1:
1065
1114
).
Elsevier
.
Chapman
,
A.
, &
Mesbahi
,
M.
(
2011
).
Advection on graphs
. In
Proceedings of the IEEE Conference on Decision and Control and European Control Conference
(pp.
1461
1466
).
Chen
,
M.
,
Yang
,
Q.
, &
Tang
,
X.
(
2007
).
Directed graph embedding
. In
Proceedings of the International Joint Conference on Artificial Intelligence
(pp.
2707
2712
).
Chung
,
F.
(
2005
).
Laplacians and the Cheeger inequality for directed graphs
.
Annals of Combinatorics
,
9
(
1
),
1
19
.
Chung
,
F. R. K.
(
1997
).
Spectral graph theory
.
American Mathematical Society
.
Coifman
,
R. R.
, &
Lafon
,
S.
(
2006
).
Diffusion maps
.
Applied and Computational Harmonic Analysis
,
21
(
1
),
5
30
.
Coifman
,
R. R.
,
Lafon
,
S.
,
Lee
,
A. B.
,
Maggioni
,
M.
,
Nadler
,
B.
,
Warner
,
F.
, &
Zucker
,
S. W.
(
2005a
).
Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps
. In
Proceedings of the National Academy of Sciences
,
102
(
21
),
7426
7431
.
Coifman
,
R. R.
,
Lafon
,
S.
,
Lee
,
A. B.
,
Maggioni
,
M.
,
Nadler
,
B.
,
Warner
,
F.
, &
Zucker
,
S. W.
(
2005b
).
Geometric diffusions as a tool for harmonic analysis and structure definition of data: Multiscale methods
. In
Proceedings of the National Academy of Sciences
,
102
(
21
),
7432
7437
.
Conrad
,
N. D.
,
Weber
,
M.
, &
Schütte
,
C.
(
2016
).
Finding dominant structures of nonreversible Markov processes
.
Multiscale Modeling and Simulation
,
14
(
4
),
1319
1340
.
Creutzig
,
F.
, &
Sprekeler
,
H.
(
2008
).
Predictive coding and the slowness principle: An information-theoretic approach
.
Neural Computation
,
20
(
4
),
1026
1041
.
Dayan
,
P.
(
1993
).
Improving generalization for temporal difference learning: The successor representation
.
Neural Computation
,
5
(
4
),
613
624
.
Denton
,
P. B.
,
Parke
,
S. J.
,
Tao
,
T.
, &
Zhang
,
X.
(
2021
).
Eigenvectors from eigenvalues: A survey of a basic identity in linear algebra
.
Bulletin of the American Mathematical Society
,
59
(
1
),
31
58
.
Dongarra
,
J. J.
,
Hammarling
,
S.
, &
Wilkinson
,
J. H.
(
1992
).
Numerical considerations in computing invariant subspaces
.
SIAM Journal on Matrix Analysis and Applications
,
13
(
1
),
145
161
.
Fackeldey
,
K.
,
Sikorski
,
A.
, &
Weber
,
M.
(
2018
).
Spectral clustering for non-reversible Markov chains
.
Computational and Applied Mathematics
,
37
(
5
),
6376
6391
.
Fill
,
J. A.
(
1991
).
Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process
.
Annals of Applied Probability
,
1
(
1
).
Franceschet
,
M.
(
2011
).
PageRank: Standing on the shoulders of giants
.
Communications of the ACM
,
54
(
6
),
92
101
.
Ge
,
H.
,
Qian
,
M.
, &
Qian
,
H.
(
2012
).
Stochastic theory of nonequilibrium steady states. Part II: Applications in chemical biophysics
.
Physics Reports
,
510
(
3
),
87
118
.
Gebali
,
F.
(
2008
).
Periodic Markov chains
.
Springer US
.
Ghojogh
,
B.
,
Ghodsi
,
A.
,
Karray
,
F.
, &
Crowley
,
M.
(
2021
).
Laplacian-based dimensionality reduction including spectral clustering, Laplacian eigenmap, locality preserving projection, graph embedding, and diffusion map: Tutorial and survey.
.
Ghosh
,
D.
, &
Bellemare
,
M. G.
(
2020
).
Representations for stable off-policy reinforcement learning
. In
H.
Daumé
&
A.
Singh
(Eds.),
Proceedings of the 37th International Conference on Machine Learning
(pp.
3556
3565
).
Göbel
,
F.
, &
Jagers
,
A.
(
1974
).
Random walks on graphs
.
Stochastic Processes and Their Applications
,
2
(
4
),
311
336
.
Golub
,
G. H.
, &
Van Loan
,
C. F.
(
2013
).
Matrix computations
(4th ed.).
Johns Hopkins University Press
.
Gorban
,
A.
(
2014
).
Detailed balance in micro- and macrokinetics and micro-distinguishability of macro-processes
.
Results in Physics
,
4
,
142
147
.
Granat
,
R.
,
Kågström
,
B.
, &
Kressner
,
D.
(
2009
).
Parallel eigenvalue reordering in real Schur forms
.
Concurrency and Computation: Practice and Experience
,
21
(
9
),
1225
1250
.
Grebenkov
,
D. S.
, &
Nguyen
,
B.-T.
(
2013
).
Geometrical structure of Laplacian eigenfunctions
.
SIAM Review
,
55
(
4
),
601
667
.
Halton
,
J. H.
(
1970
).
A retrospective and prospective survey of the Monte Carlo method
.
SIAM Review
,
12
(
1
),
1
63
.
Hastings
,
W. K.
(
1970
).
Monte Carlo sampling methods using Markov chains and their applications
.
Biometrika
,
57
(
1
),
97
109
.
Hein
,
M.
,
Audibert
,
J.-Y.
, &
von Luxburg
,
U.
(
2007
).
Graph Laplacians and their convergence on random neighborhood graphs
.
Journal of Machine Learning Research
,
8
,
1325
1370
.
Huang
,
J.
,
Zhu
,
T.
, &
Schuurmans
,
D.
(
2006
).
Web communities identification from random walks
. In
J.
Fürnkranz
,
T.
Scheffer
, &
M.
Spiliopoulou
(Eds.),
Knowledge discovery in databases: PKDD 2006
(pp.
187
198
).
Springer
.
Jiang
,
D.-Q.
,
Qian
,
M.
, &
Qian
,
M.-P.
(
2004
).
Mathematical theory of nonequilibrium steady states
.
Springer
.
Joachims
,
T.
(
2003
).
Transductive learning via spectral graph partitioning
. In
Proceedings of the Twentieth International Conference on International Conference on Machine Learning
(pp.
290
297
).
Johns
,
J.
, &
Mahadevan
,
S.
(
2007
).
Constructing basis functions from directed graphs for value function approximation
. In
Proceedings of the 24th International Conference on Machine Learning
(pp.
385
392
).
Kamvar
,
S. D.
,
Klein
,
D.
, &
Manning
,
C. D.
(
2003
).
Spectral learning
. In
Proceedings of the 18th International Joint Conference on Artificial Intelligence
(pp.
561
566
).
Morgan Kaufmann
.
Kolmogoroff
,
A.
(
1936
).
Zur Theorie der Markoffschen Ketten
.
Mathematische Annalen
,
112
(
1
),
155
160
.
Kroese
,
D. P.
,
Brereton
,
T.
,
Taimre
,
T.
, &
Botev
,
Z. I.
(
2014
).
Why the Monte Carlo method is so important today
.
WIREs Computational Statistics
,
6
(
6
),
386
392
.
Levin
,
D. A.
,
Peres
,
Y.
, &
Wilmer
,
E. L.
(
2009
).
Markov chains and mixing times
.
American Mathematical Society
.
Li
,
Y.
, &
Zhang
,
Z.-L.
(
2012
).
Digraph Laplacian and the degree of asymmetry
.
Internet Mathematics
,
8
(
4
),
381
401
.
Liu
,
N.
(
2011
).
Markov chains and spectral clustering
. In
K.
Hummel
,
H.
Hlavacs
, &
W.
Gansterer
(Eds.),
Performance evaluation of computer and communication systems: Milestones and future challenges
(pp.
87
98
).
Springer
.
Lovász
,
L.
(
1993
).
Random walks on graphs: A survey
.
Open Journal of Discrete Mathematics
.
Mahadevan
,
S.
(
2005
).
Proto-value functions: Developmental reinforcement learning
. In
L. D.
Raedt
&
S.
Wrobel
(Ed.),
Machine learning: Proceedings of the Twenty-Second International Conference
(pp.
553
560
).
Mahadevan
,
S.
, &
Maggioni
,
M.
(
2007
).
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes
.
Journal of Machine Learning Research
,
8
,
2169
2231
.
Mahadevan
,
S.
,
Maggioni
,
M.
,
Ferguson
,
K.
, &
Osentoski
,
S.
(
2006
).
Learning representation and control in continuous Markov decision processes
. In
Proceedings of the 21st National Conference on Artificial Intelligence
(
2
:
1194
1199
).
Marques
,
A. G.
,
Segarra
,
S.
, &
Mateos
,
G.
(
2020
).
Signal processing on directed graphs: The role of edge directionality when processing and learning from network data
.
IEEE Signal Processing Magazine
,
37
(
6
),
99
116
.
Meilă
,
M.
, &
Pentney
,
W.
(
2007
).
Clustering by weighted cuts in directed graphs
. In
Proceedings of the 2007 SIAM International Conference on Data Mining
(pp.
135
144
).
Meilă
,
M.
, &
Shi
,
J.
(
2000
).
Learning segmentation by random walks
. In
T.
Leen
,
T.
Dietterich
, &
V.
Tresp
(Eds.),
Advances in neural information processing systems
,
13
.
MIT Press
.
Meilă
,
M.
, &
Shi
,
J.
(
2001
).
A random walks view of spectral segmentation
. In
T. S.
Richardson
&
T. S.
Jaakkola
(Eds.),
Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics
(pp.
203
208
).
Metropolis
,
N.
,
Rosenbluth
,
A. W.
,
Rosenbluth
,
M. N.
,
Teller
,
A. H.
, &
Teller
,
E.
(
1953
).
Equation of state calculations by fast computing machines
.
Journal of Chemical Physics
,
21
(
6
),
1087
1092
.
Meyer
,
C. D.
(
2000
).
Matrix analysis and applied linear algebra
.
Society for Industrial and Applied Mathematics
.
Meyn
,
S.
,
Hagen
,
G.
,
Mathew
,
G.
, &
Banasuk
,
A.
(
2008
).
On complex spectra and metastability of Markov models
. In
Proceedings of the 47th IEEE Conference on Decision and Control
(pp.
3835
3839
).
Mieghem
,
P. V.
(
2018
).
Directed graphs and mysterious complex eigenvalues
.
Delft University of Technology
.
Ng
,
A. Y.
,
Jordan
,
M. I.
, &
Weiss
,
Y.
(
2001
).
On spectral clustering: Analysis and an algorithm
. In
T. G.
Dietterich
,
S.
Becker
, &
Z.
Ghahramani
(Eds.),
Advances in neural information processing systems 14
(pp.
849
856
).
MIT Press
.
Ng
,
K. C.
, &
Parlett
,
B. N.
(
1987
).
Programs to swap diagonal blocks.
University of California, Berkeley
,
Center for Pure and Applied Mathematics
.
Page
,
L.
,
Brin
,
S.
,
Motwani
,
R.
, &
Winograd
,
T.
(
1999
).
The PageRank citation ranking: Bringing order to the web
. In
Proceedings of the International World Wide Conference
.
Pardoux
,
E.
(
2010
).
Markov processes and applications: Algorithms, networks, genome and finance
.
Wiley
.
Pauwelyn
,
P. J.
, &
Guerry
,
M. A.
(
2021
).
Perturbations of non-diagonalizable stochastic matrices with preservation of spectral properties
.
Linear and Multilinear Algebra
,
70
,
1
31
.
Pentney
,
W.
, &
Meilă
,
M.
(
2005
).
Spectral clustering of biological sequence data
. In
Proceedings of the 20th National Conference on Artificial Intelligence
(
2
:
845
850
).
AAAI Press
.
Perrault-Joncas
,
D. C.
, &
Meilă
,
M.
(
2011
).
Directed graph embedding: An algorithm based on continuous limits of Laplacian-type operators
. In
J.
Shawe-Taylor
,
R.
Zemel
,
P.
Bartlett
,
F.
Pereira
, &
K. Q.
Weinberger
(Eds.),
Advances in neural information processing systems
,
24
(pp.
990
998
).
Curran
.
Petrik
,
M.
(
2007
).
An analysis of Laplacian methods for value function approximation in MDPs
. In
Proceedings of the 20th International Joint Conference on Artificial Intelligence
(pp.
2574
2579
).
Porod
,
U.
(
2021
).
Dynamics of Markov chains for undergraduates
. https://www.math.northwestern.edu/documents/book-markov-chains.pdf.
Reilly
,
R. C.
(
1982
).
Mean curvature, the Laplacian, and soap bubbles
.
American Mathematical Monthly
,
89
(
3
),
180
198
.
Richey
,
M.
(
2010
).
The evolution of Markov chain Monte Carlo methods
.
American Mathematical Monthly
,
117
(
5
), 383.
Robert
,
C. P.
,
Elvira
,
V.
,
Tawn
,
N.
, &
Wu
,
C.
(
2018
).
Accelerating MCMC algorithms
.
WIREs Computational Statistics
,
10
(
5
).
Saerens
,
M.
,
Fouss
,
F.
,
Yen
,
L.
, &
Dupont
,
P.
(
2004
).
The principal components analysis of a graph, and its relationships to spectral clustering
. In
J.-F.
Boulicaut
,
F.
Esposito
,
F.
Giannotti
, &
D.
Pedreschi
(Eds.),
Machine learning: ECML 2004
(pp.
371
383
).
Springer
.
Sevi
,
H.
,
Rilling
,
G.
, &
Borgnat
,
P.
(
2023
).
Harmonic analysis on directed graphs and applications: From Fourier analysis to wavelets
.
Applied and Computational Harmonic Analysis
,
62
,
390
440
.
Shuman
,
D. I.
,
Narang
,
S. K.
,
Frossard
,
P.
,
Ortega
,
A.
, &
Vandergheynst
,
P.
(
2013
).
The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains
.
IEEE Signal Processing Magazine
,
30
(
3
),
83
98
.
Singh
,
R.
,
Chakraborty
,
A.
, &
Manoj
,
B. S.
(
2016
).
Graph Fourier transform based on directed Laplacian
. In
Proceedings of the 2016 International Conference on Signal Processing and Communications
(pp.
1
5
).
Spielman
,
D.
(
2019
).
Spectral and algebraic graph theory
. http://cs-www.cs.yale.edu/homes/spielman/sagt/.
Sprekeler
,
H.
(
2011
).
On the relation of slow feature analysis and Laplacian eigenmaps
.
Neural Computation
,
23
(
12
),
3287
3302
.
Stachenfeld
,
K. L.
,
Botvinick
,
M.
, &
Gershman
,
S. J.
(
2014
).
Design principles of the hippocampal cognitive map
. In
Z.
Ghahramani
,
M.
Welling
,
C.
Cortes
,
N.
Lawrence
, &
K. Q.
Weinberger
(Eds.),
Advances in neural information processing systems
,
27
.
Curran
.
Stachenfeld
,
K. L.
,
Botvinick
,
M. M.
, &
Gershman
,
S. J.
(
2017
).
The hippocampus as a predictive map
.
Nature Neuroscience
,
20
(
11
),
1643
1653
.
Stewart
,
W. J.
(
1994
).
Introduction to the numerical solution of Markov chains
.
Princeton University Press
.
Sutton
,
R. S.
, &
Barto
,
A. G.
(
2018
).
Reinforcement learning: An introduction
(2nd ed.).
MIT Press
.
Szummer
,
M.
, &
Jaakkola
,
T.
(
2001
).
Partially labeled classification with Markov random walks
. In
T. G.
Dietterich
,
S.
Becker
, &
Z.
Ghahramani
(Eds.),
Advances in neural information processing systems
,
14
(pp.
945
952
).
MIT Press
.
Tishby
,
N.
, &
Slonim
,
N.
(
2001
).
Data clustering by Markovian relaxation and the information bottleneck method
. In
T.
Leen
,
T.
Dietterich
, and
V.
Tresp
(Eds.),
Advances in neural information processing systems
,
13
.
MIT Press
.
Vempala
,
S.
(
2005
).
Geometric random walks: A survey
.
Combinatorial and Computational Geometry
,
52
.
von Luxburg
,
U.
(
2007
).
A tutorial on spectral clustering
.
Statistics and Computing
,
17
(
4
),
395
416
.
Weber
,
M.
(
2017
).
Eigenvalues of non-reversible Markov chains: A case study.
Technical report
[PubMed]
,
ZIB
.
Weinan
,
E.
,
Tiejun
,
L.
, &
Vanden-Eijnden
,
E.
(
2008
).
Optimal partition and effective dynamics of complex networks
. In
Proceedings of the National Academy of Sciences
,
105
(
23
),
7907
7912
.
Weiss
,
Y.
(
1999
).
Segmentation using eigenvectors: A unifying view
. In
Proceedings of the Seventh IEEE International Conference on Computer Vision
(
2
:
975
982
).
West
,
D. B.
(
2001
).
Introduction to graph theory
(2nd ed.).
Prentice Hall
.
Wiskott
,
L.
, &
Schönfeld
,
F.
(
2019
).
Laplacian matrix for dimensionality reduction and clustering
. .
Wiskott
,
L.
, &
Sejnowski
,
T. J.
(
2002
).
Slow feature analysis: Unsupervised learning of invariances
.
Neural Computation
,
14
(
4
),
715
770
.
Witzig
,
J.
,
Beckenbach
,
I.
,
Eifler
,
L.
,
Fackeldey
,
K.
,
Gleixner
,
A.
,
Grever
,
A.
, &
Weber
,
M.
(
2018
).
Mixed-integer programming for cycle detection in nonreversible Markov processes
.
Multiscale Modeling and Simulation
,
16
(
1
),
248
265
.
Wu
,
Y.
,
Tucker
,
G.
, &
Nachum
,
O.
(
2019
).
The Laplacian in RL: Learning representations with efficient approximations
. In
Proceedings of the 7th International Conference on Learning Representations
.
Zhang
,
X.-J.
,
Qian
,
H.
, &
Qian
,
M.
(
2012
).
Stochastic theory of nonequilibrium steady states and its applications. Part I
.
Physics Reports
,
510
(
1–2
),
1
86
.
Zhou
,
D.
,
Huang
,
J.
, &
Schölkopf
,
B.
(
2005
).
Learning from labeled and unlabeled data on a directed graph
. In
Proceedings of the 22nd International Conference on Machine Learning
(pp.
1036
1043
).

Author notes

Note: This is a corrected article. See attached erratum.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode