## Abstract

Markov chains are a class of probabilistic models that have achieved widespread application in the quantitative sciences. This is in part due to their versatility, but is compounded by the ease with which they can be probed analytically. This tutorial provides an in-depth introduction to Markov chains and explores their connection to graphs and random walks. We use tools from linear algebra and graph theory to describe the transition matrices of different types of Markov chains, with a particular focus on exploring properties of the eigenvalues and eigenvectors corresponding to these matrices. The results presented are relevant to a number of methods in machine learning and data mining, which we describe at various stages. Rather than being a novel academic study in its own right, this text presents a collection of known results, together with some new concepts. Moreover, the tutorial focuses on offering intuition to readers rather than formal understanding and only assumes basic exposure to concepts from linear algebra and probability theory. It is therefore accessible to students and researchers from a wide variety of disciplines.

## 1 Introduction

Markov chains are a versatile tool for modeling stochastic processes and have been applied in a wide variety of scientific disciplines, such as biology, computer science, and finance (Pardoux, 2010). This is unsurprising considering the number of practical advantages they offer: (1) they are easy to describe analytically, (2) in many domains they make complex computations tractable, and (3) they are a well-understood model type, meaning that they offer some level of interpretability when used as a component of an algorithm. Furthermore, as we show in this tutorial, Markov chains are temporal processes that take place on graphs. This makes them particularly suitable for modeling data-generating processes that underlie time series and graph data sets, both of which have received much attention in the fields of machine learning and data mining (Aggarwal, 2015).

The application of Markov chains requires the assumption that at least some aspect of the process being modeled has no memory. An important consequence of this assumption is that the process can be described in detail using a transition matrix. Furthermore, there exists a rich framework for describing distinct features of such processes based on the eigenvalues and eigenvectors of this matrix. This tutorial provides an in-depth exploration of this framework, making use of tools from probability theory, linear algebra, and graph theory. Since the work is intended for readers from diverse academic backgrounds, we concentrate on providing intuition for the tools used rather than strict mathematical formalism.

The material presented underlies multiple methods from different areas of machine learning, and instead of exploring these methods individually, we focus on the general properties that make Markov chains useful across these domains. Nonetheless, so that readers can appreciate the scope of the tutorial, we briefly summarize the methods that it is relevant to. In graph-based unsupervised learning, it is related to nonlinear dimensionality reduction techniques such as Laplacian eigenmaps (Belkin & Niyogi, 2001, 2003) and spectral clustering (Weiss, 1999; Ng et al., 2001; von Luxburg, 2007). These two closely related methods aim to represent data sets in a way that preserves local geometry and are traditionally formulated using graph Laplacians. However, one line of work on spectral clustering instead uses Markov chains (Meilă & Shi, 2000, 2001; Tishby & Slonim, 2001; Saerens et al., 2004; Liu, 2011; Meilă & Pentney, 2007; Huang et al., 2006; Weinan et al., 2008). Furthermore, the method of diffusion maps (Coifman et al., 2005a, 2005b; Coifman & Lafon, 2006) is a generalization of Laplacian eigenmaps that is based on Markov chains and can be tuned to different length scales in a graph, thereby allowing a multiscale geometric analysis of data sets. An in-depth survey of Laplacian eigenmaps, spectral clustering, diffusion maps, and other related methods can be found in Ghojogh et al. (2021). In the domain of time series analysis, the tutorial is relevant to slow feature analysis (SFA) (Wiskott & Sejnowski, 2002), a dimensionality-reduction technique that is based on the notion of temporal coherence and is conceptually related to Laplacian eigenmaps (Sprekeler, 2011). The ideas underlying Laplacian eigenmaps and spectral clustering have also been extended to classification problems, both for labeled (Kamvar et al., 2003) and partially labeled data sets (Szummer & Jaakkola, 2001; Joachims, 2003; Zhou et al., 2005). 
Finally, the material presented in this tutorial also forms the basis of various approaches to value function approximation in reinforcement learning, such as Mahadevan’s proto-value functions (Mahadevan, 2005; Mahadevan & Maggioni, 2007; Johns & Mahadevan, 2007), Stachenfeld’s work on successor representation (Stachenfeld et al., 2014, 2017), and other closely related methods (Petrik, 2007; Wu et al., 2019). Something common to many of the applications mentioned thus far is that they assume all underlying graphs to be undirected or, equivalently, that the corresponding Markov chain is reversible. This provides a number of guarantees that are crucial for these methods to work, and we explore these guarantees in depth in this tutorial. In most cases, the extension to the directed/nonreversible setting faces a number of challenges and is still actively researched. We discuss these challenges and present various solutions that have been suggested in the literature.

The rest of the text is organized as follows. In section 2, we give a general introduction to discrete-time, stationary Markov chains on finite state spaces and explore some specific types of chains in detail. Section 3 then gives a formal introduction to graphs in order to provide a more detailed description of Markov chains. In section 4, random walks are presented as a canonical transformation that turns any graph into a Markov chain, and the undirected and directed cases are considered separately to better understand the types of Markov chains that they typically give rise to. Finally, in section 5, we explore applications of the material in earlier sections in two areas of computer science literature.

## 2 Markov Chains

### 2.1 Definition

*Markov processes* are an elementary family of stochastic models describing the temporal evolution of an infinite sequence of random variables $X = \{X_t : t \in T\}$, defined on a state space $S$ and indexed by a time set $T$. Such processes respect the *Markov property*, in which the future evolution is conditionally independent of the past, given the present state of the chain. In this tutorial, we focus on models for which time is discretized (i.e., $T = \mathbb{N}_0$), known as *Markov chains*. Furthermore, we restrict our consideration to Markov chains defined on finite state spaces with $|S| = N$ states. In such settings, the Markov property can be formalized in terms of transition probabilities:

$$\Pr(X_{t+1} = s_j \mid X_t = s_i, X_{t-1} = s_{i_{t-1}}, \ldots, X_0 = s_{i_0}) = \Pr(X_{t+1} = s_j \mid X_t = s_i). \tag{2.1}$$

When these transition probabilities do not depend on the time $t$, the chain is called *homogeneous*, and its evolution can be fully described by one-step transition probabilities between pairs of states in $S$: $\Pr(X_{t+1} = s_j \mid X_t = s_i) = P_{ij}$. Collectively, these probabilities can be represented as an $N \times N$ matrix:

$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1N} \\ P_{21} & P_{22} & \cdots & P_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ P_{N1} & P_{N2} & \cdots & P_{NN} \end{pmatrix}, \tag{2.2}$$

which is known as the *transition matrix* of the Markov chain. Note that the transition probabilities from each state $s_i$ are organized along the rows of the matrix in equation 2.2, meaning that the rows sum to one: $\sum_{j=1}^{N} P_{ij} = 1$ $\forall i$. Another name for this type of matrix is a right stochastic matrix. When the transition probabilities are instead organized along the columns, $P$ is known as a left stochastic matrix, and its columns sum to one.
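The row-sum condition is easy to check numerically. The following is a minimal sketch rather than part of the original text: the 3-state matrix and the helper name `is_right_stochastic` are illustrative assumptions.

```python
import math

def is_right_stochastic(P, tol=1e-12):
    """Return True if P is square, nonnegative, and every row sums to one."""
    n = len(P)
    for row in P:
        if len(row) != n:
            return False
        if any(p < 0 for p in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True

# Hypothetical 3-state chain; P[i][j] plays the role of Pr(X_{t+1} = s_j | X_t = s_i).
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
]

# Transposing P organizes the probabilities along columns (left stochastic).
P_T = [list(col) for col in zip(*P)]

print(is_right_stochastic(P))    # the rows of P sum to one
print(is_right_stochastic(P_T))  # the rows of the transpose need not
```

The transpose fails the check because its rows hold the original columns, which is exactly the difference between the right and left stochastic conventions.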

Markov chains can also be depicted visually in the form of a graph, with the state space $S$ drawn as a collection of circles and labeled arrows between these circles representing the nonzero transition probabilities *P*_{ij}. We call this diagram the *transition graph* of a Markov chain. A formal introduction to the mathematics of graphs is given in section 3, but until then, transition graphs are simply used as an illustrative tool.

In this example, the state space consists of four activities, including studying ($S$) and drinking coffee ($C$), which are represented by the states $s_1 = S$, $s_2 = P$, $s_3 = F$, and $s_4 = C$. As a further simplification, you assume that transitions between these activities are Markovian. After monitoring your activities for a few days, you come up with a set of empirical transition probabilities that you use to construct a transition graph, shown in Figure 1, and a corresponding transition matrix $P$.

Suppose that you start by studying at $t = 0$; then all activities are possible at $t = 1$. In order to choose from these possibilities, one must sample from a probability vector equal to the first row of $P$, that is, $\Pr(X_1 \mid X_0 = S) = (0.5, 0.1, 0.2, 0.2)$. If our sample yields $X_1 = C$, then this becomes the current state and we repeat the process. Doing this iteratively can generate sequences of arbitrary length, each of which is known as a *trajectory* in the state space $S$. As is often the case when studying a stochastic process, generating single trajectories is rather uninformative since it provides no collective description of how the process tends to evolve. Naively, one way we could try to achieve such a description would be to perform a type of Monte Carlo sampling (see section 5.1) by generating several trajectories from the same starting state and summarizing the frequency with which future activities occur.

In Figures 2a to 2d we do this for *n* = 10 trajectories of length 4, each starting with *X*_{0} = *S*. Each trajectory is depicted in a specific color, and consists of points in a transition graph plotted across four time points, so that the position of each point indicates a state that one of the trajectories is in at time *t*. Other than a slight bias toward studying (*S*) at each time point, it is hard to pick out any clear patterns using only these 10 trajectories. Figures 2e to 2h show similar plots for *n* = 100, but with all points colored black and the relative occupation of each state for *t* > 0 indicated by a percentage. Finally, we increase the number of trajectories to *n* = 1000 in Figures 2i to 2l. Percentages are again used to indicate the relative state occupations, but instead of representing the trajectories with dots, we color each state in gray-scale based on the percentage values. Comparing all the plots in this figure, one can note that in the first row, it is possible to track each of the individual trajectories, whereas in the second and third rows, the focus is instead on approximating the relative probability of doing each activity at each time.
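The sampling procedure behind these plots can be sketched in a few lines. This is an illustration under stated assumptions, not the code used for Figure 2: only the row for $S$ is quoted in the text, the other three rows are placeholders chosen so that the two-step distribution quoted in the text is reproduced, and the function names are hypothetical.

```python
import random

STATES = ["S", "P", "F", "C"]

# Row "S" is the first row quoted in the text. The remaining rows are
# assumptions made for this sketch, not the matrix from the source.
P = {
    "S": [0.5, 0.1, 0.2, 0.2],
    "P": [1.0, 0.0, 0.0, 0.0],
    "F": [0.5, 0.0, 0.0, 0.5],
    "C": [0.5, 0.0, 0.5, 0.0],
}

def sample_trajectory(start, length, rng):
    """Sample X_0, ..., X_length by repeatedly drawing from the current row."""
    trajectory = [start]
    for _ in range(length):
        current = trajectory[-1]
        trajectory.append(rng.choices(STATES, weights=P[current])[0])
    return trajectory

def occupation(n, length, rng):
    """Fraction of n trajectories found in each state at t = 0, ..., length."""
    counts = [{s: 0 for s in STATES} for _ in range(length + 1)]
    for _ in range(n):
        for t, state in enumerate(sample_trajectory("S", length, rng)):
            counts[t][state] += 1
    return [{s: c[s] / n for s in STATES} for c in counts]

rng = random.Random(0)  # fixed seed so that repeated runs agree
for n in (10, 100, 1000):
    freqs = occupation(n, 4, rng)
    print(n, freqs[1])  # empirical estimate of Pr(X_1 | X_0 = S)
```

As `n` grows, the printed frequencies at $t = 1$ should settle near the first row of the matrix, which is the behavior the three rows of the figure are meant to convey.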

This raises the following question: in the limit $n \to \infty$, how often is each activity done at time $t$? The answer to this question for $t = 1$ is given by the vector $\Pr(X_1 \mid X_0 = S) = (0.5, 0.1, 0.2, 0.2)$, from which we sample the first step of each trajectory. For $t = 2$, a distribution vector can be calculated by evaluating all the possible trajectories that lead up to each state after two steps. For example, first consider the probability with which we are drinking coffee at $t = 2$. Clearly, there are two ways this can happen: (1) $S - S - C$ (i.e., $s_1 - s_1 - s_4$) and (2) $S - F - C$ (i.e., $s_1 - s_3 - s_4$). By combining the corresponding probabilities, we get

$$\Pr(X_2 = C \mid X_0 = S) = P_{11} P_{14} + P_{13} P_{34} = 0.2.$$

Applying the same reasoning to the remaining states at $t = 2$, we get the following probability vector: $\Pr(X_2 \mid X_0 = S) = (0.55, 0.05, 0.2, 0.2)$. A number of comments can be made at this stage. First, while we can extend this type of calculation to $t > 2$, this quickly becomes infeasible to do by hand as the number of steps increases. It turns out that there is a simple mathematical formalism that makes these computations both more efficient and more interpretable. We introduce this formalism in the following section. Second, we can also apply the distribution picture at $t = 0$. In the example we gave, we always started in the same state, but we can easily generalize this to the case where $X_0$ is not fully determined. For example, $\Pr(X_0) = (0.5, 0, 0, 0.5)$ indicates that studying and drinking coffee both occur at $t = 0$ with probability 0.5. Finally, while the trajectories of a Markov chain are random, the two distributions we obtained for $t = 1$ and $t = 2$ were fully determined by our initial condition of $X_0 = s_1 = S$. Thus, moving from the trajectory perspective to the distribution perspective rather interestingly makes the evolution of our Markov chain look deterministic.
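The path-counting argument above can be written out as a short sketch. Only the first row of the matrix is quoted in the text; rows 2 to 4 below are assumptions chosen so that the quoted two-step distribution is reproduced, and `distribution_by_paths` is a hypothetical helper name.

```python
from itertools import product

# Row 0 (S) is quoted in the text; rows 1 to 3 (P, F, C) are assumptions
# made for this sketch and are not taken from the source.
P = [
    [0.5, 0.1, 0.2, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]

def distribution_by_paths(start, steps):
    """Pr(X_steps = s_j | X_0 = start), summing over every path explicitly."""
    n = len(P)
    dist = [0.0] * n
    for path in product(range(n), repeat=steps):
        prob, current = 1.0, start
        for nxt in path:
            prob *= P[current][nxt]  # multiply probabilities along the path
            current = nxt
        dist[current] += prob        # add the path to its final state
    return dist

one_step = distribution_by_paths(0, 1)
two_step = distribution_by_paths(0, 2)
print(one_step)
print(two_step)
```

The loop visits $N^{\text{steps}}$ paths, which makes concrete why this approach becomes infeasible as the number of steps increases. Calling the function twice with the same arguments always gives the same vector, reflecting the deterministic look of the distribution perspective.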

### 2.2 Evolution via Matrix-Vector Multiplication

The structure and action of $P$, just like any other matrix, can be evaluated using tools from linear algebra. In particular, $P$ can either multiply column vectors from the left or row vectors from the right. While typical conventions formulate matrix multiplication in the former way, the latter is more common for right stochastic matrices due to its semantic interpretation. Nonetheless, as both operations offer their own insight into the descriptive capacity of Markov chains, we outline both in this section.

We denote the probability distribution over states at time $t$ by $\mu(t) = (\mu_1(t), \mu_2(t), \ldots, \mu_N(t))^T$. Such a distribution can easily be evolved into another distribution describing the chain at time $t + 1$. To see this, consider the probability of being in some state $s_j$ at time $t + 1$, that is, $\mu_j(t + 1)$. This depends both on the probability of being in each state $s_i$ at time $t$ (i.e., $\mu_i(t)$), as well as the probability of making a transition from each state $s_i$ to $s_j$ (i.e., $P_{ij}$). By summing over all possible states $s_i$, we therefore see that

$$\mu_j(t+1) = \sum_{i=1}^{N} \mu_i(t) P_{ij},$$

or, in vector form, $\mu(t+1)^T = \mu(t)^T P$. Applying this update $k$ times gives

$$\mu(t+k)^T = \mu(t)^T P^k. \tag{2.17}$$

The fact that the $k$-step evolution of a chain is represented simply by $P^k$ is a result known as the Chapman-Kolmogorov equation (Stewart, 1994), and it tells us that the transition matrix is the only thing needed to evolve a starting distribution of a Markov chain arbitrarily far into the future.
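As a sketch of this evolution rule, the update $\mu(t+1)^T = \mu(t)^T P$ can be applied repeatedly with plain lists. The matrix reuses the first row quoted in the earlier example; its other rows and the function name `evolve` are assumptions made for illustration.

```python
# Row 0 is quoted in the text's example; rows 1 to 3 are assumptions.
P = [
    [0.5, 0.1, 0.2, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]

def evolve(mu, P, k=1):
    """Return mu^T P^k, applying one vector-matrix product per time step."""
    n = len(P)
    for _ in range(k):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

start = [1.0, 0.0, 0.0, 0.0]   # the chain starts in s_1 with certainty
mixed = [0.5, 0.0, 0.0, 0.5]   # s_1 and s_4 are equally likely at t = 0

print(evolve(start, P, 1))
print(evolve(start, P, 2))
print(evolve(mixed, P, 1))

# Evolving by 2 steps and then by 3 more is the same as evolving by 5.
print(evolve(evolve(start, P, 2), P, 3))
print(evolve(start, P, 5))
```

Each step costs on the order of $N^2$ operations, in contrast to enumerating all $N^k$ paths of length $k$.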

A particularly important type of distribution for a Markov chain is one that is invariant under its evolution:

$$\pi^T = \pi^T P. \tag{2.18}$$

A distribution $\pi$ that satisfies this condition is called a *stationary distribution*. The $N$ equations implied by equation 2.18 are often referred to as the equations of *global balance*. To understand why, consider the $j$th equation in this set:

$$\pi_j = \sum_{i=1}^{N} \pi_i P_{ij}. \tag{2.19}$$

The right-hand side describes the *flow* of probability mass into state $s_j$ from all other states (including $s_j$ itself when $P_{jj} \neq 0$). Since $\pi_j$ remains invariant under this flow, an equal amount of probability mass must be flowing from $s_j$ to all other states in $S$, hence the name *global balance*. An important implication of equations 2.18 and 2.19 is that when a chain is in a stationary distribution at time $t$, it remains there for all future time steps. We can therefore interpret such distributions as a type of *steady state* of the underlying process.
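One simple way to find such a steady state numerically is to evolve an arbitrary starting distribution until it stops changing. The sketch below assumes a hypothetical 3-state chain for which this iteration converges. It is an illustration, not a general-purpose solver, and the function names are assumptions.

```python
def step(mu, P):
    """One step of the evolution: returns mu^T P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary_by_iteration(P, tol=1e-12, max_iter=10_000):
    """Iterate mu <- mu^T P from the uniform distribution until it settles."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(mu, P)
        if max(abs(x - y) for x, y in zip(mu, nxt)) < tol:
            return nxt
        mu = nxt
    raise RuntimeError("no convergence within max_iter (the chain may be periodic)")

# Hypothetical 3-state chain used only for this example.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
]

pi = stationary_by_iteration(P)
inflow = step(pi, P)  # total probability mass flowing into each state

print(pi)
print(inflow)  # global balance: the inflow reproduces pi
```

For this particular matrix, the balance conditions can also be solved by hand, giving $\pi = (8/15, 4/15, 1/5)$, which the iteration approaches.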

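We can also ask how much probability mass moves from one state to another in a given stationary distribution $\pi$. This is described by the following matrix:

$$F = \operatorname{diag}(\pi)\, P, \qquad F_{ij} = \pi_i P_{ij},$$

where $F_{ij}$ is the probability mass that flows in one time step from one state $s_i$ to another state $s_j$.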
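To make this concrete, the sketch below builds the matrix with entries $\pi_i P_{ij}$ for a hypothetical two-state chain whose stationary distribution is known in closed form. The chain, the parameter names `a` and `b`, and the variable names are assumptions made for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical two-state chain: a = Pr(s_1 -> s_2), b = Pr(s_2 -> s_1).
a, b = Fraction(1, 4), Fraction(1, 2)
P = [[1 - a, a],
     [b, 1 - b]]

# For a two-state chain, the stationary distribution is (b, a) / (a + b).
pi = [b / (a + b), a / (a + b)]

# flow[i][j] = pi_i * P_ij: mass moving from s_i to s_j in one time step.
flow = [[pi[i] * P[i][j] for j in range(2)] for i in range(2)]

outflow = [sum(row) for row in flow]                            # leaves each state
inflow = [sum(flow[i][j] for i in range(2)) for j in range(2)]  # enters each state

print(flow)
print(outflow == pi, inflow == pi)
```

The row sums give the mass leaving each state and the column sums give the mass entering it. Both equal $\pi$, which is the global balance condition expressed in terms of flows.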

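The transition matrix can also act on column vectors from the left. Consider a real-valued function on the state space, represented by a vector $x \in \mathbb{R}^N$ whose $i$th element is the value assigned to state $s_i$. Denoting the distribution of the chain at time $t$ by $\mu(t)$, we can use this distribution to calculate the expected value of $x$:

$$\mathbb{E}_{\mu(t)}[x] = \mu(t)^T x = \sum_{i=1}^{N} \mu_i(t)\, x_i.$$

If we instead look $k$ time steps into the future, then expected values are calculated in the same way but with the additional evolution of $\mu(t)$ in accordance with equation 2.17:

$$\mathbb{E}_{\mu(t+k)}[x] = \mu(t)^T P^k x. \tag{2.25}$$

A special case arises when the chain is known to start in a single state $s_i$. In such settings, the starting distribution is a row vector with one component equal to one and all others zero, known as a *one-hot vector* and denoted by $e_i^T$. If we evaluate the $k$-step expectation in equation 2.25 with $e_i^T$ as a starting distribution, we get

$$e_i^T P^k x = (P^k x)_i.$$

Thus, $P^k x$ is a vector whose $i$th element is the expected value of $x$ after $k$ steps, conditioned on the starting state being $s_i$. Formally, this is a conditional expectation,

$$(P^k x)_i = \mathbb{E}\left[x_{X_{t+k}} \mid X_t = s_i\right].$$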
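The two readings of $P^k x$ can be compared in a small sketch. The matrix reuses the first row quoted in the earlier example; the remaining rows, the choice of $x$ as an indicator of the first state, and the function names are assumptions made for illustration.

```python
# Row 0 is quoted in the text's example; rows 1 to 3 are assumptions.
P = [
    [0.5, 0.1, 0.2, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]

def mat_vec(M, x):
    """Matrix times column vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def conditional_expectations(P, x, k):
    """P^k x: element i is the expected value of x after k steps from s_i."""
    for _ in range(k):
        x = mat_vec(P, x)
    return x

def expectation(mu, P, x, k):
    """mu^T P^k x: expected value of x after k steps, starting from mu."""
    return sum(m * c for m, c in zip(mu, conditional_expectations(P, x, k)))

x = [1.0, 0.0, 0.0, 0.0]  # indicator function of state s_1
cond = conditional_expectations(P, x, 2)

print(cond)                                    # one expectation per start state
print(expectation([1.0, 0.0, 0.0, 0.0], P, x, 2))  # one-hot start picks cond[0]
print(expectation([0.5, 0.0, 0.0, 0.5], P, x, 2))  # average of cond[0] and cond[3]
```

Because $x$ is an indicator function here, each expectation is simply the probability of occupying $s_1$ after two steps.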

### 2.3 Eigenvalues and Eigenvectors

Every real matrix $A\u2208RN\xd7N$ can be interpreted as a linear transformation. A central task of linear algebra is to shed light on the relationship between the numerical properties of a matrix and various aspects of the transformation that it represents. Often a single matrix represents a combination of several distinct transformations; for example, an object in two or more dimensions can simultaneously be rotated and stretched. Finding the eigenvalues and eigenvectors of a matrix is one way to partition a linear transformation into its component parts and reveal their relative magnitudes. In this section, we give a brief and informal summary of how this works and apply this to transition matrices.

An eigenvector of a matrix $A$ is a nonzero vector that is merely scaled by some number $\lambda$ when multiplied by $A$. Like all other vectors, eigenvectors can be either rows, that is, $l^T A = \lambda l^T$, or columns, that is, $A r = \lambda r$; these are known as *left eigenvectors* and *right eigenvectors*, respectively. In both cases, the number $\lambda$ is called the *eigenvalue* of the respective eigenvector. It is worth noting that real matrices, including transition matrices, can have complex eigenvalues, $\lambda \in \mathbb{C}$, in which case the corresponding eigenvector is also complex, $v \in \mathbb{C}^N$. However, such solutions can occur only in complex conjugate pairs, meaning that $\lambda^*$ and $v^*$ are also guaranteed to be an eigenvalue-eigenvector pair, where $^*$ denotes the complex conjugate.
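For a two-state chain, the left and right eigenvectors can be written down by hand and verified exactly. The sketch below assumes a hypothetical chain with switching probabilities `a` and `b`; the closed-form eigenpairs in the code are standard for this case, and all names are illustrative.

```python
from fractions import Fraction

# Hypothetical two-state chain: a = Pr(s_1 -> s_2), b = Pr(s_2 -> s_1).
a, b = Fraction(1, 4), Fraction(1, 2)
P = [[1 - a, a],
     [b, 1 - b]]

def mat_vec(M, v):
    """M v for a column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vec_mat(v, M):
    """v^T M for a row vector v^T."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def scale(c, v):
    return [c * x for x in v]

# (eigenvalue, left eigenvector, right eigenvector)
eigenpairs = [
    (Fraction(1), [b / (a + b), a / (a + b)], [Fraction(1), Fraction(1)]),
    (1 - a - b, [Fraction(1), Fraction(-1)], [a, -b]),
]

for lam, left, right in eigenpairs:
    left_ok = vec_mat(left, P) == scale(lam, left)     # l^T P = lambda l^T
    right_ok = mat_vec(P, right) == scale(lam, right)  # P r = lambda r
    print(lam, left_ok, right_ok)
```

Note that the left and right eigenvectors of the same eigenvalue are different vectors, which is typical for matrices that are not symmetric.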

The eigenvector equation can be written compactly for a set of $k$ eigenvectors. Let $Y_R \in \mathbb{R}^{N \times k}$ be a matrix whose columns are equal to $k$ right eigenvectors of $A$:

$$A Y_R = Y_R \Delta, \tag{2.37}$$

where $\Delta$ is a $k \times k$ diagonal matrix containing the corresponding eigenvalues. Of particular interest are *linearly independent* sets of eigenvectors. Without this condition, a set of eigenvectors can contain a lot of redundancy. For example, if $r$ is an eigenvector of $A$ with eigenvalue $\lambda$, then so too is the vector $cr$ for any nonzero $c \in \mathbb{C}$. Therefore, it is possible to construct an arbitrarily large set of eigenvectors using $r$ alone, and the set of all possible vectors that can be formed in this way is known as the *eigenspace* of $A$ corresponding to $\lambda$. If we instead look for a set of linearly independent eigenvectors of $A$, there can be at most $N$. When a set of $N$ linearly independent eigenvectors exists, they form a basis for $\mathbb{R}^N$ (the domain on which $A$ acts). In this case, the set of eigenvectors is called an *eigenbasis*, and the matrix $A$ is said to be *diagonalizable*. To justify the use of this latter term, imagine that the matrix $Y_R$ contains $N$ linearly independent right eigenvectors. Then, by definition, it is full rank, meaning that its inverse $Y_R^{-1}$ exists. If we then multiply both sides of equation 2.37 by this inverse from the left, we get

$$Y_R^{-1} A Y_R = \Delta. \tag{2.38}$$

In general, two matrices $B$ and $C$ that are related by $C = U^{-1} B U$ for some invertible matrix $U$ are said to be *similar*. Alternatively, $C$ is said to be a *similarity transformation* on $B$, and since we could instead write $B = U C U^{-1}$, the converse also holds. In words, similar matrices represent the same linear transformation but expressed in two different bases, where the matrix $U$ is known as the *change-of-basis* matrix. In the case of equation 2.38, this means that $A$ behaves like a diagonal matrix when acting on vectors that are expressed in its eigenbasis $Y_R$, hence the term *diagonalizable*.

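Multiplying equation 2.38 from the left by $Y_R$ and from the right by $Y_R^{-1}$ gives

$$A = Y_R \Delta Y_R^{-1}, \tag{2.39}$$

which is known as the *eigendecomposition* of $A$. Furthermore, multiplying equation 2.39 from the left with $Y_R^{-1}$ yields $Y_R^{-1} A = \Delta Y_R^{-1}$, meaning that the rows of $Y_R^{-1}$ are a set of $N$ linearly independent left eigenvectors of $A$. Therefore:

$$A = \sum_{\omega=1}^{N} \lambda_\omega\, r_\omega l_\omega^T,$$

where $r_\omega$ is the $\omega$th column of $Y_R$, $l_\omega^T$ is the $\omega$th row of $Y_R^{-1}$, and each term $r_\omega l_\omega^T$ is the *outer product* between $r_\omega$ and $l_\omega$. Writing the matrix $A$ in this way therefore gives good insight into how a diagonalizable matrix can be partitioned into distinct modes.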
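The decomposition into modes can be checked exactly on a small example. The sketch below assumes a hypothetical two-state chain with hand-derived eigenvectors, where each left eigenvector is scaled so that its product with the matching right eigenvector equals one. All names are illustrative.

```python
from fractions import Fraction

a, b = Fraction(1, 4), Fraction(1, 2)   # hypothetical switching probabilities
P = [[1 - a, a],
     [b, 1 - b]]

lams = [Fraction(1), 1 - a - b]
rights = [[Fraction(1), Fraction(1)], [a, -b]]                   # columns of Y_R
lefts = [[b / (a + b), a / (a + b)], [1 / (a + b), -1 / (a + b)]]  # rows of Y_R^{-1}

def outer(r, l):
    """Outer product r l^T."""
    return [[ri * lj for lj in l] for ri in r]

def mode_sum(k):
    """Sum over modes of lambda^k * r l^T, which should equal P^k."""
    total = [[Fraction(0)] * 2 for _ in range(2)]
    for lam, r, l in zip(lams, rights, lefts):
        term = outer(r, l)
        for i in range(2):
            for j in range(2):
                total[i][j] += lam ** k * term[i][j]
    return total

def mat_mul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(M, k):
    n = len(M)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, M)
    return result

print(mode_sum(1) == P)              # the modes rebuild the matrix
print(mode_sum(3) == mat_pow(P, 3))  # powers only rescale each mode
```

Raising the matrix to a power leaves the outer products untouched and only changes the weights $\lambda_\omega^k$, which is why the eigenvalues control the long-term behavior.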

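To see how the left and right eigenvectors are related, consider a left eigenvector $l_\omega$ and a right eigenvector $r_\gamma$ of $A$, with eigenvalues $\lambda_\omega$ and $\lambda_\gamma$. Evaluating $l_\omega^T A r_\gamma$ in two ways gives $\lambda_\omega l_\omega^T r_\gamma = \lambda_\gamma l_\omega^T r_\gamma$. If $\lambda_\omega \neq \lambda_\gamma$, then

$$(\lambda_\omega - \lambda_\gamma)\, l_\omega^T r_\gamma = 0,$$

and since $(\lambda_\omega - \lambda_\gamma) \neq 0$ by assumption, this means that $l_\omega$ and $r_\gamma$ must be orthogonal. Thus, when all eigenvalues are distinct and the corresponding eigenvectors are normalized such that $l_\omega^T r_\omega = 1$, the following relation between the columns of $Y_R$ and the rows of $Y_R^{-1}$ holds:

$$l_\omega^T r_\gamma = \delta_{\omega\gamma}. \tag{2.46}$$

For this reason, the rows of $Y_R^{-1}$ are known as the *dual basis* of $Y_R$, and the two sets of vectors are collectively referred to as a *biorthogonal system*. It is worth noting that when a diagonalizable matrix has repeated eigenvalues, there is extra freedom in the choice of left and right eigenvectors. Consequently, for such matrices, equation 2.46 is not guaranteed to hold; however, there exist certain choices of bases for which it does (Denton et al., 2021).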

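When $P$ is diagonalizable, its left eigenvectors form a basis in which any distribution can be expanded:

$$\mu(t)^T = \sum_{\omega=1}^{N} c_\omega\, l_\omega^T, \tag{2.47}$$

where $c_\omega$ are the coordinates of $\mu(t)$ in this basis. We can then use this to reexpress the one-step evolution of the chain as

$$\mu(t+1)^T = \mu(t)^T P = \sum_{\omega=1}^{N} c_\omega \lambda_\omega\, l_\omega^T,$$

and, by repeated application, the $k$-step evolution as $\mu(t+k)^T = \sum_{\omega=1}^{N} c_\omega \lambda_\omega^k\, l_\omega^T$. Since the eigenvalues of a transition matrix satisfy $|\lambda_\omega| \leq 1$, the terms in this sum can be split into those corresponding to $|\lambda_\omega| = 1$, and those corresponding to $|\lambda_\omega| < 1$:

$$\mu(t+k)^T = \sum_{\omega=1}^{m} c_\omega \lambda_\omega^k\, l_\omega^T + \sum_{\omega=m+1}^{N} c_\omega \lambda_\omega^k\, l_\omega^T, \tag{2.54}$$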

where *m* is the number of eigenvalues of $P$ with |λ_{ω}| = 1. In the long time limit (*k* → ∞), the terms with |λ_{ω}| = 1 survive, whereas those with |λ_{ω}| < 1 die off, with |λ| measuring the rate of decay in the latter case. This allows us to interpret the first and second sums in equation 2.54 to represent the *persistent* and *transient* behavior of the Markov chain, respectively.
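The decay of the transient part can be observed directly. The sketch below assumes a hypothetical two-state chain with eigenvalues $1$ and $1 - a - b$, and tracks the deviation of the distribution from the stationary distribution. All names are illustrative.

```python
from fractions import Fraction

a, b = Fraction(1, 4), Fraction(1, 2)   # hypothetical switching probabilities
P = [[1 - a, a],
     [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]         # persistent part (eigenvalue 1)
lam_2 = 1 - a - b                       # transient eigenvalue, here 1/4

def step(mu):
    """One step of the evolution: returns mu^T P."""
    return [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]

mu = [Fraction(1), Fraction(0)]         # start in s_1 with certainty
deviations = []
for k in range(6):
    deviations.append([m - p for m, p in zip(mu, pi)])
    mu = step(mu)

# The deviation from pi is the transient term; it shrinks by lam_2 per step.
ratios = [deviations[k + 1][0] / deviations[k][0] for k in range(5)]
print(ratios)
print(all(r == lam_2 for r in ratios))
```

With a single transient eigenvalue, the deviation shrinks by exactly that factor at every step, so an eigenvalue closer to one would mean slower convergence.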

In the persistent case, we can partition the terms into three types based on the eigenvalues: (i) $\lambda = 1$, (ii) $\lambda = -1$, and (iii) $\lambda \in \mathbb{C}$, $|\lambda| = 1$. We have already seen that a stationary distribution $\pi$ and the vector $\eta$ are left and right eigenvectors of type i, and in both cases, these eigenvectors represent fixed structures on the state space that persist when acted on by $P$. To capture this property, we call such eigenvectors *persistent structures*. In case ii, eigenvectors flip their sign when acted on by the transition matrix, $y^T P = -y^T$, and return after two steps, $y^T P^2 = y^T$. Eigenvectors of this type therefore correspond to permanent oscillations of probability mass between states in $S$, and we refer to them as *persistent oscillations*. Eigenvalues of type iii are explored in more depth in section 3.5, where we show that they are always complex roots of unity, $\lambda^k = 1$ for some $k > 2$, which occur at equally spaced locations on the unit circle. Therefore, when their corresponding eigenvectors are acted on repeatedly by $P$, they return to their original form after $k$ steps, $y^T P^k = \lambda^k y^T = y^T$. Thus, in analogy to case ii, they describe permanent cycles of probability mass through the state space $S$, and we call such eigenvectors *persistent cycles*.

In the transient case, an analogous categorization based on the eigenvalues can be applied, consisting of the following three types: (i’) $\lambda \in [0, 1)$, (ii’) $\lambda \in (-1, 0)$, and (iii’) $\lambda \in \mathbb{C}$, $|\lambda| < 1$. For type i’, the corresponding eigenvectors represent perturbations to the persistent behavior that decay over time, $y^T P = \lambda y^T$ where $|\lambda y| = |\lambda||y| < |y|$. We therefore call such eigenvectors *transient structures*. When $|\lambda| \approx 1$, these structures describe sets of states that, on average, a chain spends a long time in before converging; these are known as *metastable sets* (Conrad et al., 2016). Furthermore, $\lambda = 0$ can be thought of as a limiting case of type i’ in the sense that any corresponding eigenvector decays infinitely quickly (i.e., $Py = 0$) and does not exhibit oscillatory or cyclic behavior. Eigenvectors of type ii’ also decay over time, but like case ii, the negative sign of the eigenvalue means that they exhibit oscillatory behavior when acted on by $P$, that is, $y^T P = -|\lambda| y^T$ where $|\lambda y| = |\lambda||y| < |y|$. We refer to eigenvectors of this type as *transient oscillations*. Eigenvalues of type iii’ generalize those of type iii to the transient case since they are also complex and represent cycles of probability mass. In contrast to type iii, though, they do not occur on the unit circle and need not be equally spaced. Since $|\lambda| < 1$, these cycles decay over time, and we therefore call them *transient cycles*. When $|\lambda| \approx 1$, these cycles can persist for a long time and have been referred to as *dominant cycles* (Conrad et al., 2016).

The above categorizations are summarized in Table 1, where structures, oscillations, and cycles are colored in green, blue, and red, respectively, and the persistent and transient cases are shaded bright and pale, respectively. In Figure 3, we visualize the six different types of eigenvalue using this color scheme by shading the respective regions of the unit circle in which they can occur.
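The six categories can be expressed as a small classification rule. The following sketch is an illustration of the categorization described above; the function name `classify_eigenvalue` and the numerical tolerance are assumptions.

```python
def classify_eigenvalue(lam, tol=1e-9):
    """Name the category of an eigenvalue of a transition matrix."""
    lam = complex(lam)
    modulus = abs(lam)
    if modulus > 1 + tol:
        raise ValueError("expected an eigenvalue with modulus at most one")

    # Persistent terms sit on the unit circle; transient ones lie inside it.
    kind = "persistent" if modulus > 1 - tol else "transient"

    if abs(lam.imag) > tol:
        shape = "cycle"          # types iii and iii': non-real eigenvalues
    elif lam.real < -tol:
        shape = "oscillation"    # types ii and ii': negative real eigenvalues
    else:
        shape = "structure"      # types i and i': nonnegative real eigenvalues
    return f"{kind} {shape}"

examples = [1, -1, complex(-0.5, 3 ** 0.5 / 2), 0.25, -0.3, 0.2 + 0.3j, 0]
for lam in examples:
    print(lam, "->", classify_eigenvalue(lam))
```

The third example is a cube root of unity, so it lies on the unit circle, and $\lambda = 0$ is grouped with the transient structures as the limiting case of type i’.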

Before moving on, a couple of details are worth pointing out. First, the above analysis is possible only if the transition matrix $P$ is diagonalizable. One case in which this is guaranteed to hold is for symmetric matrices. However, given that transition probabilities are rarely pairwise symmetric (i.e., $P_{ij} = P_{ji}$), this restriction is clearly too strong. Thus, a more detailed investigation is needed in order to identify the conditions under which the above decomposition can be made, which we do in sections 3 and 4. For a more in-depth account of the theory of diagonalizable matrices, we recommend Meyer (2000). Second, a quick check reveals that all terms in equation 2.54 are dependent on the coordinates $c_\omega$ that describe the starting distribution (see equation 2.47). Because of this, in the most general case, both the persistent and transient behavior of a Markov chain can be sensitive to initial conditions. In a later section, we consider a particular type of Markov chain for which this is not the case, and provide a simplified analysis of its evolution over time.

### 2.4 Classification of States

In the following sections, we explore three types of Markov chains. In order to describe each type in detail, we first define various properties that apply to individual states, or sets thereof, in a state space $S$. This is the focus of this section.

#### 2.4.1 Communicating Classes

We start by making the following definitions related to how states in $S$ are connected:

(Accessibility). For two states $s_i, s_j \in S$, we say that $s_j$ is accessible from $s_i$, denoted $s_i \to s_j$, when it is possible to reach $s_j$ from $s_i$ in $k \geq 0$ steps, that is, $\exists k \geq 0 : (P^k)_{ij} > 0$.

(Communication). Two states $s_i, s_j \in S$ are said to communicate if $s_i \to s_j$ and $s_j \to s_i$, which is denoted $s_i \leftrightarrow s_j$.

Communication is a useful property for describing states in a Markov chain, as exemplified by the following result:

(Communicating Class). Communication is an equivalence relation, meaning that:

- $\forall s_i \in S$, $s_i \leftrightarrow s_i$, since by definition each state can reach itself in 0 steps, $(P^0)_{ii} = (\mathbb{1})_{ii} = 1 > 0$, where $\mathbb{1}$ is the identity matrix,
- if $s_i \leftrightarrow s_j$, then $s_j \leftrightarrow s_i$,
- if $s_i \leftrightarrow s_j$ and $s_j \leftrightarrow s_k$, then $s_i \leftrightarrow s_k$.

The resulting equivalence classes partition $S$ and are called *communicating classes*.

Furthermore, we can make a useful categorization of Markov chains based on the number of communicating classes they have:

(Number of Communicating Classes). Let *n* be the number of communicating classes of a Markov chain. When *n* = 1, the chain is said to be irreducible; otherwise it is reducible.

In words, an irreducible Markov chain is one in which for any pair of states, there exists a connecting path in both directions. In Figure 4, three example Markov chains are shown, with the communicating classes indicated by the dashed boxes. Take a moment to double-check why each state belongs to its communicating class, and verify that the examples in Figures 4a and 4b are reducible, whereas the one in Figure 4c is irreducible. Finally, observe from Figure 4b that it is possible for states in one communicating class to be accessible from states in another class (e.g., *s*_{3} → *s*_{4}).
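Accessibility and communication translate directly into a graph search over the nonzero entries of the transition matrix. The sketch below is a minimal illustration with a hypothetical 4-state chain; the function names are assumptions.

```python
def reachable_from(P, i):
    """Indices of all states accessible from state i in k >= 0 steps."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def communicating_classes(P):
    """Group states that are mutually accessible."""
    n = len(P)
    reach = [reachable_from(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        cls = {j for j in reach[i] if i in reach[j]}  # i -> j and j -> i
        classes.append(sorted(cls))
        assigned |= cls
    return classes

def is_irreducible(P):
    return len(communicating_classes(P)) == 1

# Hypothetical chain: states 0 and 1 communicate and can leak into {2, 3},
# from which there is no way back.
P = [
    [0.5, 0.4, 0.1, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.7],
    [0.0, 0.0, 0.6, 0.4],
]

print(communicating_classes(P))
print(is_irreducible(P))
```

In this example, state 2 is accessible from state 0, but the two states do not communicate, so they end up in different classes and the chain is reducible.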

Irreducible Markov chains feature in a number of subsequent sections of this tutorial due to the following result:

A Markov chain is irreducible if and only if it has a unique stationary distribution $\pi$ with strictly positive elements, that is, $\pi_i > 0$ $\forall s_i \in S$.

This result can be applied to the example of Figure 4c by using the observation of the previous section that stationary distributions of a given Markov chain are left eigenvectors of the associated transition matrix with eigenvalue 1. If we were to find the transition matrix of the chain in Figure 4c and compute its eigenvectors, we would indeed find that the left eigenspace of $P$ corresponding to eigenvalue 1 is one-dimensional and is spanned by a vector with strictly positive elements. When normalized such that its elements sum to 1, the resulting vector is $\pi = (\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6)^T = (0.09, 0.09, 0.07, 0.31, 0.18, 0.26)^T$, which we encourage readers to check themselves. For reducible Markov chains, the guarantees of uniqueness and positivity of the stationary probabilities no longer hold. In order to explore the stationary distributions of such chains, we introduce some new concepts for describing states.

#### 2.4.2 Recurrence and Transience

Each state $s_i \in S$ can be categorized based on how likely it is to be revisited, given that it is currently occupied. This is formalized by the following definition:

Given that a state $s_i$ is initially occupied, the probability of returning to $s_i$ is defined as

$$f_i = \Pr(X_t = s_i \text{ for some } t \geq 1 \mid X_0 = s_i).$$

States for which $f_i = 1$ are called recurrent, and those for which $f_i < 1$ are called transient.

This is a useful way to characterize states in $S$, since it generalizes to all states within a communicating class, that is, for any communicating class $C \subseteq S$: either all states in $C$ are recurrent or all states in $C$ are transient. Transience and recurrence are therefore examples of *class properties*, and we henceforth use the terms *recurrent class* and *transient class* for communicating classes that contain recurrent and transient states, respectively. Furthermore, for finite state spaces, a Markov chain is guaranteed to have at least one recurrent class. Building on this, we can then apply the notion of recurrence to a Markov chain as a whole:

(Recurrent Chains). A Markov chain that contains only recurrent classes is called a recurrent Markov chain.

As an illustration, we apply these concepts to the reducible chains depicted in Figure 4. In Figure 4a, there are two communicating classes, and in each one, there is no possibility to exit. This means that if a state in one of these classes is occupied, it is guaranteed to be revisited at some future time step, that is, *f*_{i} = 1 for all states in each class. Therefore, both classes are recurrent and the chain as a whole is recurrent. In Figure 4b, the main difference is that there is now the possibility to exit the blue class without returning. For example, assuming *s*_{1} is occupied, although there is a possibility that *s*_{1} can be visited again later (e.g., *s*_{1} → *s*_{2} → *s*_{1}), as soon as a transition *s*_{1} → *s*_{4} takes place, this will no longer be possible. This is why *f*_{i} < 1 for states in the blue class. Hence, while the red class is recurrent, the blue class is transient, and because of this, the chain as a whole is not recurrent.
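The reasoning used in these examples, namely that a class is recurrent when it cannot be exited, can be sketched in code. For finite chains, a communicating class is recurrent exactly when no transition leaves it, and the sketch relies on this fact. The 4-state matrix and the function names are assumptions made for illustration.

```python
def reachable_from(P, i):
    """Indices of all states accessible from state i."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def communicating_classes(P):
    n = len(P)
    reach = [reachable_from(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            cls = {j for j in reach[i] if i in reach[j]}
            classes.append(sorted(cls))
            assigned |= cls
    return classes

def is_recurrent_class(P, cls):
    """A class of a finite chain is recurrent if no probability leaves it."""
    members = set(cls)
    return all(P[i][j] == 0
               for i in cls for j in range(len(P)) if j not in members)

def is_recurrent_chain(P):
    return all(is_recurrent_class(P, c) for c in communicating_classes(P))

# Hypothetical chain: the class {0, 1} can be exited via 0 -> 2,
# whereas the class {2, 3} cannot be exited.
P = [
    [0.5, 0.4, 0.1, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.7],
    [0.0, 0.0, 0.6, 0.4],
]

for cls in communicating_classes(P):
    label = "recurrent" if is_recurrent_class(P, cls) else "transient"
    print(cls, label)
print(is_recurrent_chain(P))
```

Setting the leaking entry to zero (and renormalizing that row) would turn both classes into recurrent classes, so the chain as a whole would become recurrent while remaining reducible.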

For an irreducible Markov chain, such as the example shown in Figure 4c, every state is guaranteed to be recurrent, leading to the following proposition:

(Recurrence of Irreducible Markov Chains). All irreducible Markov chains are recurrent.

However, from the example in Figure 4a, it is clear that the converse does not hold, since it is also possible for a reducible chain to be recurrent. In fact, whether a reducible chain is recurrent or not determines certain features of the stationary distributions belonging to the chain. This is outlined by the following proposition:

(Stationary Distributions of Reducible Chains). For a reducible Markov chain with $r$ recurrent classes and $t$ transient classes:

- Any stationary distribution has probability zero for states belonging to a transient class.
- For the $k$th recurrent class, there exists a unique stationary distribution $\pi_k$ with nonzero probabilities only for states in that class.
- When the number of recurrent classes $r$ is bigger than 1, stationary distributions can be formed via convex combinations of each distribution $\pi_k$,

  $$\pi_{\text{combo}} = \sum_{k=1}^{r} \alpha_k \pi_k, \quad \text{with } \alpha_k \geq 0 \text{ and } \sum_{k=1}^{r} \alpha_k = 1, \tag{2.56}$$

  meaning that there is an infinite number of stationary distributions.
- Furthermore, when the number of transient classes $t$ is zero or, in other words, when the chain is recurrent, performing the procedure above with nonzero coefficients always yields distributions that are strictly positive.

We can apply these results to the two reducible chains in Figure 4. In the example of Figure 4a, the stationary distribution associated with the blue class is $\pi_1 = (0.35, 0.36, 0.29, 0, 0, 0)^T$ and the one associated with the red class is $\pi_2 = (0, 0, 0, 0.39, 0.26, 0.35)^T$. We can then generate an arbitrary number of extra stationary distributions by taking convex combinations of $\pi_1$ and $\pi_2$ with coefficients α_{1} and α_{2}. For example, with α_{1} = 0.25 and α_{2} = 0.75, we get $\pi = (0.09, 0.09, 0.07, 0.29, 0.20, 0.26)^T$ after rounding to two decimal places. Furthermore, the last bullet point in proposition 3 tells us that since this chain is recurrent, we can be sure that any convex combination with positive coefficients yields a distribution with strictly positive entries. This property of recurrent chains is particularly relevant to our treatment of both reversible chains in section 2.6 and random walks in section 4, and we henceforth use $\pi > 0$ to denote a stationary distribution with this property. In the example of Figure 4b, the red class is the only recurrent class, meaning that the stationary distribution associated with this class is the only stationary distribution of the chain. Since the transition probabilities for states in this class are the same as in the example of Figure 4a, this stationary distribution is $\pi = \pi_2$. Furthermore, in agreement with proposition 3, we see that the transient states in this chain have a stationary probability of 0.
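The convex combination in this example can be reproduced with a few lines of plain Python. This is only a sketch: the two vectors are the rounded values quoted in the text, and the helper name `convex_combination` is ours.

```python
# Stationary distributions of the two recurrent classes, as quoted for Figure 4a.
pi_1 = [0.35, 0.36, 0.29, 0.0, 0.0, 0.0]
pi_2 = [0.0, 0.0, 0.0, 0.39, 0.26, 0.35]


def convex_combination(distributions, alphas):
    """Return sum_k alphas[k] * distributions[k] for coefficients on the simplex."""
    if any(a < 0 for a in alphas) or abs(sum(alphas) - 1.0) > 1e-12:
        raise ValueError("coefficients must be nonnegative and sum to 1")
    n = len(distributions[0])
    return [sum(a * d[i] for a, d in zip(alphas, distributions)) for i in range(n)]


pi_combo = convex_combination([pi_1, pi_2], [0.25, 0.75])
print([round(p, 4) for p in pi_combo])
```

Because both coefficients are positive and the chain has no transient states, every entry of `pi_combo` is strictly positive, in line with the last bullet point of the proposition.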

#### 2.4.3 Periodicity

The notion that states can be revisited is also meaningful in another sense. We define the following quantity, which describes how frequently such revisits can take place.

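The *period* of a state *s*_{i} is $d_i := \gcd\{n \geq 1 : (P^n)_{ii} > 0\}$, the greatest common divisor of all numbers of steps in which a return to *s*_{i} is possible. In other words, starting in *s*_{i}, it is possible to return to *s*_{i} only in multiples of the period *d*_{i}. States for which *d*_{i} > 1 are called *periodic* and those for which *d*_{i} = 1 are *aperiodic*.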

Like transience and recurrence, period is also a class property, and we use *d* to refer to the period of a whole class. This in turn allows us to define the period of a Markov chain:

(Periodicity). When all communicating classes in $S$ have period *d* > 1, the Markov chain is said to be periodic, with period *d*.

(Aperiodicity). When all communicating classes in $S$ are aperiodic, the Markov chain is said to be aperiodic.

(Mixed Periodicity). If $S$ contains communicating classes with different periods, the Markov chain is said to have mixed periodicity.

Consider the example shown in Figure 5a. This chain has two communicating classes, the red one being a transient class with *d* = 2 and the blue one being a recurrent class with *d* = 1, meaning that this chain has mixed periodicity. In Figures 5b–5d we show three irreducible Markov chains, with panels b and c having period 3 and 2, respectively, and panel d being aperiodic.
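As a rough sketch of how the period of a state could be computed, the snippet below takes the greatest common divisor of all return times found within a cutoff of 3*N* steps. The cutoff is our own conservative choice for a finite chain, and the three small chains are hypothetical examples rather than the ones drawn in Figure 5.

```python
from math import gcd


def period_of_state(P, i):
    """Return gcd{n >= 1 : (P^n)_ii > 0}, scanning n up to 3 * N steps."""
    n_states = len(P)
    # reachable[j] is True if s_j can be reached from s_i in exactly n steps.
    reachable = [j == i for j in range(n_states)]
    d = 0
    for n in range(1, 3 * n_states + 1):
        reachable = [
            any(reachable[k] and P[k][j] > 0 for k in range(n_states))
            for j in range(n_states)
        ]
        if reachable[i]:
            d = gcd(d, n)  # gcd(0, n) == n, so the first return initialises d
    return d  # 0 means that no return was possible within the cutoff


# Hypothetical chains (0-based indices in code).
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]          # deterministic 3-cycle
two_phase = [[0, 0, 1], [0, 0, 1], [0.7, 0.3, 0]]  # alternates between two groups
with_self_loop = [[0.5, 0.5], [1, 0]]              # self-loop at the first state

periods = [period_of_state(M, 0) for M in (cycle, two_phase, with_self_loop)]
print(periods)
```

Only the pattern of nonzero entries matters here, which reflects the fact that the period depends on which transitions are possible and not on their exact probabilities.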

Markov processes are a broad class of models, and even under the restricted settings considered in this tutorial (discrete time, homogeneous, and finite state spaces), there are many distinct types of chains. In the following sections, we concentrate on three particular types that are relevant in applied domains.

### 2.5 Ergodic Chains

When modeling a system that evolves over time, it is important to ask what can be said, if anything, about its long-term behavior. For a Markov chain, this question can be phrased in two ways. On the one hand, we can sample a single trajectory starting from some initial state and ask what the average behavior is over time: How *often* is the chain found in each state $s_i \in S$ for a trajectory of length *t*? On the other hand, we can describe our starting conditions as a distribution $\mu(0)$ and ask what this evolves to in the future: What is the probability of being in each state *s*_{i} at a later time *t*? We refer to these two notions of long-term behavior as the *trajectory* and *distribution* perspectives, respectively. While the analyses given so far predominantly use the latter perspective, we remind readers that in section 2.1, we introduced the idea of a distribution over $S$ by taking the limit of an infinitely large ensemble of trajectories, meaning that the two concepts are closely related.

This two-way view originates from the field of statistical physics, where physical processes can either be analyzed with temporal averages (i.e., the trajectory perspective) or ensemble averages (i.e., the distribution perspective). One class of systems that has received a lot of study in this field are those for which these two types of averaging yield the same result as *t* → ∞. Such systems are known as *ergodic systems*, and this equivalence means that a statistical description of their long-term behavior can be described simply by a single, sufficiently long sample. One implication of this is that initial conditions are forgotten over time, which makes ergodic systems particularly attractive from a simulation or modeling perspective. Finally, with this in mind, we can define an ergodic Markov chain as follows:

(Ergodic Markov Chain). An ergodic Markov chain is one that is guaranteed to converge to a unique stationary distribution.

Clearly, in order for a chain to be ergodic, it must have a unique stationary distribution. Therefore, by virtue of theorem 1, a necessary condition for a chain to be ergodic is that it is irreducible. However, there is no guarantee that an irreducible chain converges, which is the second condition of definition 12. The convergence of a Markov chain is related to its periodicity, as explained by the following result:

For a Markov chain with period *d*, the evolution of a distribution over $S$ can lead to a permanent, repeating sequence of *d* distributions, where *d* = 2 and *d* > 2 correspond to persistent oscillations and persistent cycles, respectively. In the case of an aperiodic Markov chain (*d* = 1), only sequences of length 1 are allowed, which means that one of the stationary distributions is guaranteed to be reached.

We can apply this result to the irreducible Markov chain in Figure 5c. This chain has a unique stationary distribution $\pi = (0.35, 0.15, 0.5)^T$ and a period of *d* = 2. Therefore, there is no guarantee that the chain converges to $\pi$ since it can get trapped in persistent oscillations. To observe this, we can try out different initial conditions and iteratively apply the update rule in equation 2.12. For example, starting with the distribution $\mu(0) = (0.25, 0.5, 0.25)^T$, we get a persistent oscillation between the following two distributions: $\mu(k) = (0.175, 0.075, 0.75)^T$ and $\mu(k+1) = (0.525, 0.225, 0.25)^T$. However, this is not the only persistent oscillation possible for this chain, which can be observed by trying out different initial conditions. Finally, in section 3.5, we gain more insight into theorem 2 by using tools from graph theory to describe the eigenvectors of $P$ with |λ| = 1.
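The oscillation can be checked numerically. Since Figure 5c is not reproduced here, the transition matrix below is an assumption: it is one period-2 chain that is consistent with the stationary distribution and the two oscillating distributions quoted above. The update is written for row distributions, which we assume matches equation 2.12 up to transposition.

```python
# Assumed transition matrix: s_1 and s_2 always move to s_3, and s_3 moves
# to s_1 or s_2 with probabilities 0.7 and 0.3.
P = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.3, 0.0],
]


def step(mu, P):
    """One update of the distribution: mu(t+1)_j = sum_i mu(t)_i * P_ij."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


history = [[0.25, 0.5, 0.25]]
for _ in range(4):
    history.append(step(history[-1], P))

for t, mu in enumerate(history):
    print(t, [round(x, 3) for x in mu])
```

Starting instead from `[0.35, 0.15, 0.5]` leaves the distribution unchanged, which illustrates that the stationary distribution exists even though other starting points never reach it.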

A key insight from theorem 2 is that a Markov chain is guaranteed to converge only if it is aperiodic. Together with irreducibility, this therefore provides the conditions under which a chain is ergodic:

(Conditions for Ergodicity). A Markov chain is ergodic if and only if it is both irreducible and aperiodic, which respectively ensure that there is a unique distribution $\pi $ and the chain always converges to this distribution. Furthermore, the distribution $\pi $ is said to be the limiting distribution of the chain.

In terms of the trajectory perspective, ergodicity allows us to interpret each entry π_{i} of the limiting distribution as the long-run fraction of steps that the chain spends in each state *s*_{i} (Porod, 2021). Finally, analyzing the evolution of a Markov chain in terms of the eigenvectors of its transition matrix becomes somewhat simpler in the case of an ergodic chain. We have already seen that when the transition matrix of a Markov chain is diagonalizable, we can vastly simplify the computation of the *k*-step evolution of the Markov chain (see equation 2.54). When the Markov chain is also ergodic, we know that its persistent behavior is fully described by $l_1 = \pi$, which simplifies this expression further.
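To make the agreement between the trajectory and distribution perspectives concrete, the following sketch compares both on a small ergodic chain. The chain is a hypothetical example of our own (its stationary distribution works out to (1/3, 4/9, 2/9)), and the number of steps and the random seed are arbitrary choices.

```python
import random

# Hypothetical irreducible and aperiodic chain.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]


def evolve(mu, P, n_steps):
    """Distribution perspective: apply mu_j <- sum_i mu_i * P_ij repeatedly."""
    n = len(P)
    for _ in range(n_steps):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def visit_frequencies(P, start, n_steps, rng):
    """Trajectory perspective: fraction of steps one sampled path spends per state."""
    counts = [0] * len(P)
    state = start
    for _ in range(n_steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        counts[state] += 1
    return [c / n_steps for c in counts]


limit_a = evolve([1.0, 0.0, 0.0], P, 200)  # start with certainty in the first state
limit_b = evolve([0.0, 0.0, 1.0], P, 200)  # start with certainty in the last state
freqs = visit_frequencies(P, start=0, n_steps=50_000, rng=random.Random(0))

print([round(x, 4) for x in limit_a])
print([round(x, 4) for x in limit_b])
print([round(x, 3) for x in freqs])
```

Both initial distributions end up at the same limit, and the visit frequencies of a single long trajectory approximate that limit up to sampling noise.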

### 2.6 Reversible Chains

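The Markov property states that the future (*X*) is conditionally independent of the past (*Z*), given the present (*Y*). A simple calculation demonstrates that the relationship of conditional independence is symmetric. Assuming that Pr(*X* | *Y*, *Z*) = Pr(*X* | *Y*), we find that

$$\Pr(Z \mid Y, X) = \frac{\Pr(X, Y, Z)}{\Pr(X, Y)} = \frac{\Pr(X \mid Y, Z)\,\Pr(Y, Z)}{\Pr(X \mid Y)\,\Pr(Y)} = \frac{\Pr(Y, Z)}{\Pr(Y)} = \Pr(Z \mid Y).$$

Hence, the past is also conditionally independent of the future, given the present, so a Markov chain $X$ that is run backward in time is again a Markov chain. This chain is known as the *time-reversal* of $X$, which we denote as $\tilde{X}$. A natural question we can ask is how the transition matrix of $\tilde{X}$ is related to $P$. In order to answer this, we make the assumption that at time *t*, the chain is described by one of its stationary distributions $\pi$. Therefore:

$$(P_{\mathrm{rev}})_{ij} = \Pr(X_t = s_j \mid X_{t+1} = s_i) = \frac{\Pr(X_{t+1} = s_i \mid X_t = s_j)\,\Pr(X_t = s_j)}{\Pr(X_{t+1} = s_i)} = \frac{\pi_j P_{ji}}{\pi_i}. \tag{2.66}$$

Because of the π_{i} in the denominator of equation 2.66, the time reversal of a Markov chain is only valid for a starting distribution with $\pi_i > 0 \;\forall s_i \in S$. In section 2.4.1, we have established that only recurrent chains have stationary distributions of this type, meaning that the time reversal of a Markov chain is well defined only if the chain is recurrent. Finally, using equation 2.66, it is possible to define $P_{\mathrm{rev}}$ in matrix notation as follows:

$$P_{\mathrm{rev}} := \Pi^{-1} P^T \Pi, \quad \text{where } \Pi := \operatorname{diag}(\pi).$$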

It is worth emphasizing that since no assumption is made in definition 13 about the number of communicating classes, it also applies in the case of reducible recurrent chains where there is an infinite number of distributions $\pi >0$ to choose from. However, in such cases, the choice of $\pi $ makes no difference:

(Time Reversal of Reducible Chains). For a reducible recurrent Markov chain $X$, the time reversal $\tilde{X}$ is uniquely defined, with $P_{\mathrm{rev}}$ being independent of which stationary distribution $\pi > 0$ is used (for the proof, see the appendix).

Moreover, the set of stationary distributions $\pi >0$ belonging to a recurrent Markov chain is the same as the set belonging to the corresponding time reversal:

(Stationary Distributions of Time Reversal). Let $X$ be a recurrent Markov chain. Then $\pi > 0$ is a stationary distribution of $\tilde{X}$ if and only if it is a stationary distribution of $X$.^{2} (For the proof, see the appendix.)
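A minimal sketch of the time reversal follows, assuming the standard entrywise expression π_{j}*P*_{ji}/π_{i} for the reversed transition probabilities. The chain is a hypothetical ergodic example, and its stationary distribution is obtained by simple power iteration, which relies on that ergodicity.

```python
def stationary_distribution(P, n_steps=500):
    """Power iteration; assumes an ergodic chain so that the iteration converges."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(n_steps):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def time_reversal(P, pi):
    """Return P_rev with entries pi_j * P_ji / pi_i (needs pi_i > 0 for all i)."""
    if any(p <= 0 for p in pi):
        raise ValueError("time reversal requires a strictly positive distribution")
    n = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]


# Hypothetical ergodic chain with a one-way transition from the last to the first state.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
pi = stationary_distribution(P)  # approximately (1/3, 4/9, 2/9)
P_rev = time_reversal(P, pi)

for row in P_rev:
    print([round(x, 3) for x in row])
```

The rows of `P_rev` sum to 1, and `pi` is also stationary for `P_rev`, in agreement with the statement about stationary distributions of the time reversal.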

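Chains that are identical to their own time reversal are called *reversible Markov chains*, since in any stationary distribution, the forward and backward dynamics of the chain are statistically equivalent: any trajectory *X*_{1}, *X*_{2}, . . . , *X*_{k − 1}, *X*_{k} is as probable as the corresponding reversed trajectory *X*_{k}, *X*_{k − 1}, . . . , *X*_{2}, *X*_{1}. The stationary dynamics of such chains therefore have no inherent arrow of time. Furthermore, since $X$ and $\tilde{X}$ are indistinguishable, they have the same transition matrix, $P = P_{\mathrm{rev}}$, and for any stationary distribution $\pi > 0$, the forward and backward flow matrices are the same. Therefore, a recurrent Markov chain is reversible if and only if it satisfies the *detailed balance* condition

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \forall s_i, s_j \in S, \tag{2.71}$$

or, in matrix notation, $\Pi P = P^T \Pi$, for any stationary distribution $\pi > 0$.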

A number of observations can be made about this definition. First, the left (right) terms represent the flow of probability from *s*_{i} to *s*_{j} (*s*_{j} to *s*_{i}), given that the chain is described by distribution $\pi$. Thus, for a reversible Markov chain in one of its stationary distributions, the flow from one state to another is completely balanced by the flow in the reverse direction, meaning that the flow matrix $F^{\pi}$ is always symmetric for such chains. By comparison with equation 2.19, we see that detailed balance is a stronger condition than global balance, since in the latter case, there is only an equivalence between the *total* flow in and out of each state. Second, since π_{i} and π_{j} are nonzero, it follows that *P*_{ij} ≠ 0 if and only if *P*_{ji} ≠ 0. Thus, the transition structure of a reversible Markov chain always permits the return to the previous state, and because of this, the period of a reversible Markov chain can be at most 2. Third, while some sources assume irreducibility as a precondition of reversibility, we instead base our definition on the weaker condition of recurrence (Porod, 2021). This is due to the fact that we only need recurrence in order to define the time reversal of a Markov chain. Furthermore, with this convention, theorem 4 applies more broadly to reducible Markov chains, which lets us make a closer comparison between reversible Markov chains and undirected graphs in section 4. Finally, theorem 4 implies that there are two distinct ways in which Markov chains can be nonreversible: (1) they can be recurrent without satisfying detailed balance, or (2) they can be nonrecurrent. In case 1, $\Pi P \neq P^T \Pi$ for any distribution $\pi > 0$, meaning that the flow matrix $F^{\pi}$ is asymmetric, and in case 2, no positive stationary distribution exists.

Note also that for a nonrecurrent chain, there exists the possibility that $F^{\pi}$ is symmetric for all stationary distributions despite none of those distributions being strictly positive. For such chains, removing all transient states from $S$ produces a reversible chain. We therefore refer to such chains as *semireversible*.
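The difference between detailed and global balance can be illustrated with the flow matrix, taken here to have entries π_{i}*P*_{ij}. Both chains below are hypothetical examples with stationary distributions worked out by hand, and the function names are ours.

```python
def flow_matrix(P, pi):
    """F_ij = pi_i * P_ij: stationary probability flowing from s_i to s_j per step."""
    n = len(P)
    return [[pi[i] * P[i][j] for j in range(n)] for i in range(n)]


def satisfies_detailed_balance(P, pi, tol=1e-12):
    """True if pi_i * P_ij equals pi_j * P_ji for all pairs, i.e. F is symmetric."""
    F = flow_matrix(P, pi)
    n = len(P)
    return all(abs(F[i][j] - F[j][i]) <= tol for i in range(n) for j in range(n))


def satisfies_global_balance(P, pi, tol=1e-12):
    """True if the total flow out of each state equals the total flow into it."""
    F = flow_matrix(P, pi)
    n = len(P)
    return all(
        abs(sum(F[i][j] for j in range(n)) - sum(F[j][i] for j in range(n))) <= tol
        for i in range(n)
    )


# Hypothetical birth-death chain (reversible).
P_a = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
pi_a = [0.25, 0.5, 0.25]

# Hypothetical chain with a one-way transition (recurrent but nonreversible).
P_b = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
pi_b = [1 / 3, 4 / 9, 2 / 9]

print(satisfies_detailed_balance(P_a, pi_a), satisfies_global_balance(P_a, pi_a))
print(satisfies_detailed_balance(P_b, pi_b), satisfies_global_balance(P_b, pi_b))
```

The second chain satisfies global balance, as any chain does in a stationary distribution, but its flow matrix is asymmetric, which corresponds to case 1 above.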

To illustrate some of these points, we consider four Markov chains in Figure 6, with panels a, d, g, and j showing the transition graphs of each example, and panels b, e, h, and k showing the respective transition matrix, stationary distribution, and flow matrix (for simplicity, each example has a single recurrent class, so that both the stationary distribution and the associated flow matrix are unique). As a visual illustration of the pairwise stationary flow between states in each example, in panels c, f, i, and l, the stationary distribution is represented as a bar plot, with the portions of probability mass flowing in (left) and out (right) of each state shown as portions of each bar. The example in panel a is reversible, as can be seen by the symmetry of $F_1^{\pi}$ in panel b, or equivalently by the matching between left and right portions of all bars in panel c. The example in panel d is almost equivalent to the one in panel a, except that the outgoing transition probabilities from *s*_{1} have been slightly modified (indicated by the colored arrows in panels a and d and the colored entries of $P_1$ and $P_2$ in panels b and e). This modification is enough to violate detailed balance, as can be seen by the asymmetry of $F_2^{\pi}$ or the bar plot in panel f. Finally, the chains depicted in panels g and j are nonrecurrent since in both cases, state *s*_{3} has only outgoing transitions. Therefore, π_{3} = 0, and both examples are nonreversible. However, a quick check of $F_3^{\pi}$ or the bar plot in panel i reveals that the example in panel g is semireversible. Conversely, the example in panel j is identical to the one in panel g except for the outgoing transitions from *s*_{4} (again indicated by the colored arrows in panels g and j and the colored entries of $P_3$ and $P_4$ in panels h and k), which leads to an asymmetric stationary flow between the recurrent states.

It is worth pointing out that in our analysis above, we check for reversibility by inspecting the stationary distributions and the corresponding flow matrices of each example. However, since reversibility is a property associated with Markov chains and not with distributions, one might wonder whether there is an alternative way to formalize it based purely on the transition probabilities *P*_{ij}. Clearly, equation 2.71 prohibits one-way transitions (i.e., *P*_{ij} > 0 and *P*_{ji} = 0), but this is only a necessary condition of reversibility. Can we offer anything more precise? Fortunately, the answer is yes, and it is given by Kolmogorov’s criterion (Kolmogoroff, 1936):

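(Kolmogorov's Criterion). A recurrent Markov chain is reversible if and only if

$$P_{i_1 i_2} P_{i_2 i_3} \cdots P_{i_{n-1} i_n} P_{i_n i_1} = P_{i_1 i_n} P_{i_n i_{n-1}} \cdots P_{i_3 i_2} P_{i_2 i_1} \tag{2.72}$$

for any *n* ≥ 2 and any sequence of states $s_{i_1}$, $s_{i_2}$, $s_{i_3}$, . . . , $s_{i_{n-1}}$, $s_{i_n} \in S$.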

One way to understand this theorem is that for reversible Markov chains, the probability of traversing any closed path in the state space $S$ is independent of the direction of traversal. Hence, reversible Markov chains can be thought of as having *zero net circulation*. By contrast, recurrent Markov chains that are nonreversible have at least one path that violates equation 2.72, over which there is a higher probability to traverse in one direction than the other. For the example in Figure 6a, the relevant closed paths are (up to a cyclic permutation): (1) *s*_{1} ↔ *s*_{2} ↔ *s*_{4} ↔ *s*_{3} ↔ *s*_{1}, (2) *s*_{1} ↔ *s*_{2} ↔ *s*_{4} ↔ *s*_{1}, and (3) *s*_{1} ↔ *s*_{4} ↔ *s*_{3} ↔ *s*_{1}. In any of these cases, going around clockwise is equally probable as going around counterclockwise, which is to be expected since this Markov chain is reversible. The example in Figure 6d has the same closed paths available, except that the outgoing transition probabilities from *s*_{1} have been changed. This small adjustment is enough to introduce circulation on all the closed paths: for both paths 1 and 3 the counterclockwise direction is more probable since *P*_{13}*P*_{34}*P*_{42}*P*_{21} > *P*_{12}*P*_{24}*P*_{43}*P*_{31} and *P*_{13}*P*_{34}*P*_{41} > *P*_{14}*P*_{43}*P*_{31}, respectively, and for path 2, the clockwise direction is more probable since *P*_{12}*P*_{24}*P*_{41} > *P*_{14}*P*_{42}*P*_{21}. Therefore, by virtue of having at least one path with net circulation, equation 2.72 confirms that this chain is indeed nonreversible.
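The idea of net circulation can be sketched in code by comparing the probability of a closed path with that of its reversal. The two chains are hypothetical (they are not the examples of Figure 6), and the helper names are ours.

```python
def path_probability(P, cycle):
    """Probability of the closed path cycle[0] -> cycle[1] -> ... -> cycle[0]."""
    prob = 1.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        prob *= P[a][b]
    return prob


def net_circulation(P, cycle):
    """Forward minus backward traversal probability of a closed path."""
    return path_probability(P, cycle) - path_probability(P, cycle[::-1])


# Hypothetical reversible chain: a symmetric transition matrix.
symmetric = [
    [0.2, 0.3, 0.5],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]

# Hypothetical nonreversible chain: a cycle that is strongly biased in one direction.
biased = [
    [0.0, 0.9, 0.1],
    [0.1, 0.0, 0.9],
    [0.9, 0.1, 0.0],
]

loop = [0, 1, 2]  # closed path through all three states (0-based indices)
print(net_circulation(symmetric, loop))
print(net_circulation(biased, loop))
```

For the biased chain, the forward direction has probability 0.9³ = 0.729 and the backward direction 0.1³ = 0.001, so this single closed path is already enough to rule out reversibility.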

These analyses illustrate how theorems 4 and 5 provide two alternative but equivalent definitions of reversibility. Something common to both of these interpretations is that reversible Markov chains satisfy a type of equilibrium, either between the exchange of probability mass between pairs of states or the circulation along closed paths, respectively. In fact, the concept of detailed balance stems from early work in the field of statistical mechanics aimed at formalizing the notion of thermodynamic equilibrium on a microscopic level (Gorban, 2014). More recently, Markov chain Monte Carlo methods, which are predominantly based on reversible ergodic chains (see section 5.1), have received widespread application in the natural sciences as a way to model systems that are in thermodynamic equilibrium (Richey, 2010). Conversely, Markov chains that violate detailed balance, or equivalently those with net circulation, have been applied to the less-well-understood case of systems that are out of equilibrium (Jiang et al., 2004; Zhang et al., 2012; Ge et al., 2012). Furthermore, their stationary distributions have been referred to as *nonequilibrium steady states* (NESS) (Jiang et al., 2004; Zhang et al., 2012; Ge et al., 2012; Conrad et al., 2016; Witzig et al., 2018), which reflects the fact that such distributions are kept fixed over time via unequal flows of probability mass between states ($\pi_2$ is an example of a NESS, as can be seen in the bar plot in Figure 6f).

Reversible Markov chains are significantly easier to treat both analytically and numerically than nonreversible chains. Because of this, there exist various procedures for modifying a nonreversible Markov chain so that it becomes reversible, which is sometimes referred to as *reversibilization* (Fill, 1991; Brémaud, 1999). For a recurrent chain, this can be done by taking an average of the forward and backward transition probabilities, $P$ and $P_{\mathrm{rev}}$, that describe the chain and its time reversal, respectively. This averaging process can be either additive or multiplicative, leading to the following two definitions: the *additive reversibilization* $(P + P_{\mathrm{rev}})/2$ and the *multiplicative reversibilization* $P P_{\mathrm{rev}}$.
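As an illustration, the sketch below applies both kinds of averaging to a small nonreversible chain. It assumes the commonly used forms, namely the entrywise mean of $P$ and its time reversal for the additive case and their matrix product for the multiplicative case. The chain and all function names are hypothetical.

```python
def time_reversal(P, pi):
    """Entries pi_j * P_ji / pi_i, for a strictly positive stationary pi."""
    n = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]


def mat_mul(A, B):
    """Plain matrix product of two nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def additive_reversibilization(P, pi):
    """Entrywise mean of the forward and backward transition matrices."""
    P_rev = time_reversal(P, pi)
    n = len(P)
    return [[0.5 * (P[i][j] + P_rev[i][j]) for j in range(n)] for i in range(n)]


def multiplicative_reversibilization(P, pi):
    """Matrix product P P_rev: one forward step followed by one backward step."""
    return mat_mul(P, time_reversal(P, pi))


def max_flow_asymmetry(P, pi):
    """Largest |pi_i P_ij - pi_j P_ji|; zero exactly when detailed balance holds."""
    n = len(P)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) for i in range(n) for j in range(n))


# Hypothetical nonreversible chain with stationary distribution (1/3, 4/9, 2/9).
P = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
pi = [1 / 3, 4 / 9, 2 / 9]

P_add = additive_reversibilization(P, pi)
P_mult = multiplicative_reversibilization(P, pi)
for name, M in (("original", P), ("additive", P_add), ("multiplicative", P_mult)):
    print(name, round(max_flow_asymmetry(M, pi), 12))
```

In both cases the resulting matrix is again stochastic and satisfies detailed balance with respect to the same distribution, whereas the original chain does not.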

### 2.7 Absorbing Chains

Finally, one concept in the theory of Markov chains that is particularly relevant to applied domains is absorption. A state $s_i \in S$ is called *absorbing* if it is possible to transition into the state but not out of it, meaning that *P*_{ii} = 1 and the chain stays in *s*_{i} for all future time steps. An *absorbing Markov chain* is one for which from every state $s_i \in S$, there exists some path to an absorbing state. Since it is possible to start in a nonabsorbing state and never return, all nonabsorbing states are transient, and the presence of such states means that absorbing chains can be neither reversible nor ergodic. Absorbing chains often occur in Markov decision processes (MDPs), which are central to the field of reinforcement learning (Sutton & Barto, 2018).

The possible transitions in an absorbing chain can be partitioned into three types: (1) transient → transient, (2) transient → absorbing, and (3) absorbing → absorbing. Although the assignment of indices to states in $S$ is arbitrary, an assignment based on this partitioning simplifies the analysis of absorbing Markov chains.

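For an absorbing Markov chain with *r* absorbing and *t* transient states, the transition matrix $P$ can be arranged to have the following block structure, known as the *canonical form*:

$$P = \begin{pmatrix} Q & R \\ 0 & 1 \end{pmatrix},$$

where $Q$ is the *t* × *t* matrix of transitions between transient states, $R$ is the *t* × *r* matrix of transitions from transient to absorbing states, $0$ is an *r* × *t* matrix of zeros, and $1$ is the *r* × *r* identity matrix.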

We depict this partitioning of transition probabilities in Figure 7a. An absorbing chain with one absorbing state is shown, with the transitions belonging to $Q$, $R$, and $1$ colored black, red, and blue, respectively. Furthermore, in Figure 7b, we show the matrices $Q$ and $R$.

Since any transient state can reach an absorbing state in a finite number of steps, the probability that the chain ends up in an absorbing state at some future time is 1. For this reason, in the infinite time limit, we can expect to see no transitions taking place between transient states, that is, $\lim_{n \to \infty} Q^n = 0$. This is an advantageous property, since it means that if we sum up all powers of $Q$, known as the Neumann series of $Q$, then the contributions for larger powers get progressively smaller and the sum converges to $(1 - Q)^{-1}$ (see Meyer, 2000, p. 618). Calculating this sum for $Q$ leads to the following useful quantity, which relates transient states in $S$ (Porod, 2021):

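The *fundamental matrix* of an absorbing Markov chain is $N := \sum_{k=0}^{\infty} Q^k = (1 - Q)^{-1}$. Its entries *N*_{ij} give the expected number of times the chain visits a transient state *s*_{j} before absorption, given that the chain started in a transient state *s*_{i}.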

When analyzing an absorbing chain, it is very handy to have access to the fundamental matrix. By taking into account all nonnegative powers of $Q$, it contains information about all possible paths available between pairs of transient states. Because of this, it is a useful predictive tool that allows several properties of the Markov chain to be deduced (Porod, 2021). Furthermore, in the field of reinforcement learning, it is closely related to the successor representation (Dayan, 1993) (see section 5.2).
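A small numerical sketch may help here. It approximates the fundamental matrix, assumed to be the sum of all nonnegative powers of $Q$, by truncating the Neumann series once the terms become negligible. The example is a hypothetical symmetric random walk on four positions with absorbing end points, so that only the two inner positions are transient. The tolerance and function names are our own choices.

```python
def mat_mul(A, B):
    """Plain matrix product of two nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def fundamental_matrix(Q, tol=1e-14, max_terms=100_000):
    """Approximate N = 1 + Q + Q^2 + ... by summing until the terms are negligible."""
    t = len(Q)
    term = [[1.0 if i == j else 0.0 for j in range(t)] for i in range(t)]  # Q^0
    total = [[0.0] * t for _ in range(t)]
    for _ in range(max_terms):
        total = [[total[i][j] + term[i][j] for j in range(t)] for i in range(t)]
        term = mat_mul(term, Q)
        if max(abs(x) for row in term for x in row) < tol:
            break
    return total


# Transitions between the two transient (inner) positions of the hypothetical walk.
Q = [
    [0.0, 0.5],
    [0.5, 0.0],
]
N = fundamental_matrix(Q)
expected_steps = [sum(row) for row in N]  # expected time spent in transient states

for row in N:
    print([round(x, 4) for x in row])
print([round(x, 4) for x in expected_steps])
```

Summing a row of the fundamental matrix adds up the expected visits to all transient states, which gives the expected number of steps before absorption from the corresponding starting state. For larger chains, solving the linear system with $(1 - Q)$ directly would usually be preferable to summing the series.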

### 2.8 Summary

This concludes our exploration of different types of Markov chains. In Figure 8, we provide a summary of the material presented in this section in the form of a Venn diagram. In this diagram, each type of Markov chain is drawn as a circle or ellipse, with defining properties and results listed in each case. Take a moment to look at this image, and pay attention to the overlapping regions, which indicate how different types of chains are related. For a more in-depth presentation of the material in this section, we recommend Porod (2021) and Brémaud (1999).

In the next section, we introduce graphs as an alternative way to describe Markov chains and summarize insights that emerge from this description. Then, in section 4, we explore the connection between graphs and Markov chains in more depth using the notion of random walks, which allows various relationships to be made between specific types of graphs and some types of Markov chains introduced in this section.

## 3 Graphs

So far, we have implicitly been interpreting Markov chains as graphs whenever we draw a transition graph. In this section, we formally introduce the concept of graphs, which provides a foundation to the material on random walks in section 4. Readers should note that definitions in graph theory often vary among different sources. Here we use a convention that can encompass a wider variety of graphs, thereby offering greater generality.

### 3.1 Definition

A graph *G* = (*V*, *E*) is a set of *N* vertices *V* = {*v*_{1}, *v*_{2}, . . . , *v*_{N}} together with an edge set *E* containing pairs of vertices in *V*. Conceptually, *V* might represent a collection of objects and *E* a specification of how some pairs in this collection are related to one another. A natural way to categorize graphs is based on the way in which edges are defined. For instance, in an *undirected graph*, each edge has no direction and is typically denoted as (*v*_{i}, *v*_{j}) ∈ *E*, whereas in a *directed graph*, each edge has a specified starting and ending vertex and is usually denoted as (*v*_{i} → *v*_{j}) ∈ *E*. Examples of undirected and directed graphs can be seen in the first and second rows of Figure 9, respectively. Unless otherwise stated, we depict undirected edges as straight lines and directed edges as curved lines with arrowheads indicating the direction. A second distinction we can make is between *unweighted graphs*, in which one only cares about whether two vertices are related or not, and *weighted graphs*, in which each edge has a positive weight *w*_{ij} describing the strength of the relationship.^{3} In Figure 9, the examples in the first column are unweighted and all other graphs are weighted, with weights indicated by numbers next to each edge. The type of edges that a graph has is often chosen based on the type of relationship that one wants to describe. For example, assume that we have a graph *G* where vertices represent PhD students. Then, if we want to represent the relationship of being in the same research group, undirected, unweighted edges are a natural choice (such as Figure 9a). Conversely, if we want edges to describe whether one student has participated on a main project of another student, then this clearly requires directed unweighted edges (such as Figure 9d). 
If we now consider variants of the first and second examples, instead focusing on how similar the research topics of two students are or how much work one student has contributed to another student’s project, then we now need undirected weighted and directed weighted edges, respectively (such as Figures 9b and 9c and Figures 9e and 9f). It is worth noting that in order to assign weights to edges, one needs to specify a scale on which to measure the strength of relationship between vertices.

One can also describe graphs based on their connectivity. In an undirected graph, if there exists a path between each pair of vertices, then the graph is said to be *connected*; otherwise it is *disconnected*. The notion of connectivity can generalize to disconnected graphs if we instead consider subsets of vertices in *G*, which are known as *subgraphs*. Any subgraph that is connected but is not part of any larger connected subgraph is called a *connected component*. Both of the undirected graphs in Figures 9a and 9b are connected, whereas Figure 9c shows an example that is disconnected, with two connected components. In particular, this latter example even has a vertex that does not have any edges at all; it is known as an *isolated vertex*. For a directed graph, if there are directed paths running from *v*_{i} to *v*_{j} and from *v*_{j} to *v*_{i} for all pairs of vertices *v*_{i}, *v*_{j} ∈ *V*, then the graph is said to be *strongly connected*. Alternatively, a directed graph is *weakly connected* if for all pairs of vertices *v*_{i}, *v*_{j} ∈ *V*, it is possible to get from *v*_{i} to *v*_{j} and from *v*_{j} to *v*_{i} by any path, ignoring the direction of the edges. Clearly, a directed graph is weakly connected if it is strongly connected, but not vice versa. Furthermore, strongly or weakly connected subgraphs that are not part of any larger such subgraphs are referred to as *strongly* or *weakly connected components*, respectively. The directed graphs in Figures 9d and 9e are strongly connected, whereas the one in Figure 9f is only weakly connected and has two strongly connected components (take a moment to verify this).
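The two notions of connectivity for directed graphs can be checked with a simple reachability search. The sketch below represents a graph by its number of vertices and a list of directed edges. The example graph is hypothetical, and vertices are indexed from 0 in code.

```python
def reachable_from(n_vertices, edges, start):
    """Vertices reachable from `start` by following directed edges (a, b)."""
    neighbors = {v: [] for v in range(n_vertices)}
    for a, b in edges:
        neighbors[a].append(b)
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def is_strongly_connected(n_vertices, edges):
    """True if every vertex can reach every other vertex along directed paths."""
    return all(
        len(reachable_from(n_vertices, edges, v)) == n_vertices
        for v in range(n_vertices)
    )


def is_weakly_connected(n_vertices, edges):
    """True if the graph is connected once edge directions are ignored."""
    undirected = edges + [(b, a) for a, b in edges]
    return len(reachable_from(n_vertices, undirected, 0)) == n_vertices


# Hypothetical graph: two vertices joined in both directions, plus a one-way edge.
edges = [(0, 1), (1, 0), (1, 2)]
print(is_strongly_connected(3, edges), is_weakly_connected(3, edges))

# Adding an edge back from the last vertex makes the graph strongly connected.
print(is_strongly_connected(3, edges + [(2, 0)]))
```

In the first graph the last vertex has no outgoing edges, so it forms a strongly connected component on its own, while the other two vertices form a second one.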

### 3.2 Matrix Representation

A convenient way to represent a graph with |*V*| = *N* vertices is with an *N* × *N* matrix. In the unweighted case, this matrix is a binary matrix $A$, with entries

$$A_{ij} = \begin{cases} 1 & \text{if } (v_i \sim v_j) \in E, \\ 0 & \text{otherwise,} \end{cases}$$

where (*v*_{i} ∼ *v*_{j}) = (*v*_{i}, *v*_{j}) when *G* is undirected and (*v*_{i} ∼ *v*_{j}) = (*v*_{i} → *v*_{j}) when it is directed. Furthermore, the matrix $A$ is usually referred to as the *adjacency matrix* of *G*. This extends easily to the weighted case, where instead we have a nonnegative matrix $W$, with entries

$$W_{ij} = \begin{cases} w_{ij} & \text{if } (v_i \sim v_j) \in E, \\ 0 & \text{otherwise,} \end{cases}$$

which is referred to as the *weight matrix* of *G*. In Figure 9, the relevant matrix is shown below each graph, and we encourage readers to verify that they match in each case. Furthermore, something worth pointing out is that undirected graphs always have corresponding matrices that are symmetric, as can be seen in Figures 9a–9c.

For the remainder of this tutorial, we assume that the graphs we deal with are both weighted and directed. The reason we choose this convention is that it is more general. On the one hand, any unweighted graph can be considered as a special case of a weighted graph where the weights are all set to 1. Thus, we henceforth only talk about weight matrices as opposed to adjacency matrices when describing graphs numerically. On the other hand, there is one sense in which directed graphs can be thought of as a generalization of undirected graphs. If *v*_{i} and *v*_{j} are two distinct vertices that share an edge in an undirected graph, then the weight of this edge is guaranteed to appear twice in the weight matrix $W$ by virtue of being symmetric. If instead these vertices belong to a directed graph and there is an edge (*v*_{i} → *v*_{j}), then this edge appears only once in $W$. Therefore, it is possible to interpret an undirected edge between *v*_{i} and *v*_{j} as being equivalent to a pair of directed edges of the same weight, with one connecting *v*_{i} to *v*_{j} and the other connecting *v*_{j} to *v*_{i}. This is the interpretation we use throughout the rest of the tutorial whenever we refer to undirected graphs. As an example, in Figure 10, we show two equivalent depictions of an undirected graph, with the top image drawn in the usual way and the bottom image drawn using pairs of directed edges. Below these two drawings, the weight matrix of this graph is shown. One must note that interpreting undirected graphs in this way is somewhat atypical; however, it allows us a greater level of generality when dealing with different types of graphs in section 4. Furthermore, this interpretation only applies to edges between distinct vertices; edges that connect vertices to themselves are discussed in section 3.4.
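The interpretation of undirected edges as pairs of directed edges translates directly into code. In this sketch, edges are given as (source, target, weight) triples. The example graph and the function names are hypothetical, and vertices are indexed from 0.

```python
def weight_matrix(n_vertices, directed_edges):
    """Build W with W[i][j] = weight of the directed edge v_i -> v_j (0 if absent)."""
    W = [[0.0] * n_vertices for _ in range(n_vertices)]
    for i, j, w in directed_edges:
        W[i][j] = w
    return W


def as_directed_pairs(undirected_edges):
    """Replace each undirected edge between distinct vertices by two directed edges."""
    pairs = []
    for i, j, w in undirected_edges:
        pairs.append((i, j, w))
        pairs.append((j, i, w))
    return pairs


# Hypothetical undirected graph on three vertices.
undirected_edges = [(0, 1, 3.0), (1, 2, 1.0)]
W = weight_matrix(3, as_directed_pairs(undirected_edges))
is_symmetric = all(W[i][j] == W[j][i] for i in range(3) for j in range(3))

for row in W:
    print(row)
print(is_symmetric)
```

Each undirected weight appears twice in `W`, once on either side of the diagonal, which is why the weight matrix of an undirected graph is symmetric under this convention.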

### 3.3 Vertex Degrees

Once the weight matrix of a graph is known, it is easy to calculate the total weight coming in and out of each vertex. The total incoming weight of a vertex *v*_{i} can be found by summing over the *i*th column of $W$ and is known as the *in-degree* of *v*_{i}: $d_i^- := \sum_{j=1}^{N} W_{ji}$. Conversely, the total outgoing weight of *v*_{i} is calculated by the sum over the *i*th row of $W$, $d_i^+ := \sum_{j=1}^{N} W_{ij}$, and is known as the *out-degree* of *v*_{i}. Since undirected graphs always have bidirectional edges and symmetric weight matrices, the in- and out-degrees of such graphs are always equal and are simply referred to as vertex *degrees*, denoted by *d*_{i}.^{4} For example, in the graph of Figure 10 the degrees are *d*_{1} = 3, *d*_{2} = 4, and *d*_{3} = 1. In the more general case of directed graphs, there is no guarantee that $d_i^+ = d_i^-$. However, summing over all in- or out-degrees for any graph always produces the same number, $\mathrm{vol}(G) = \sum_{i=1}^{N} d_i^+ = \sum_{i=1}^{N} d_i^- = \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij}$, which is sometimes referred to as the *volume* of *G*. As an example, consider the directed graph in Figure 9e: each vertex has different in- and out-degrees, i.e., $d_1^- = 1$, $d_1^+ = 0.5$, $d_2^- = 2.5$, $d_2^+ = 3$, $d_3^- = 0.8$, $d_3^+ = 1$, $d_4^- = 2$, and $d_4^+ = 1.8$, but summing over either of the degree types yields vol(*G*) = 6.3. Nonetheless, some directed graphs can have $d_i^+ = d_i^-$ for each vertex, and such cases are known as *balanced graphs* (Banderier & Dobrow, 2000; Aldous & Fill, 2002). In keeping with the notation of undirected graphs, we denote the vertex degrees of a balanced graph as $d_i = d_i^+ = d_i^-$. An example of a balanced graph along with its corresponding weight matrix is shown in Figure 11, and a quick check reveals that summing over the rows or columns of $W$ indeed yields the same values. Just as a balanced graph is a special case of a directed graph, we can similarly say that an undirected graph is a special case of a balanced graph, and this interpretation is important in section 4.
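The degree calculations amount to row and column sums of the weight matrix, as the following sketch shows. The first matrix is inferred from the degrees quoted for Figure 10 under the assumption that the graph has no self-loops. The two directed examples are hypothetical.

```python
def out_degrees(W):
    """Row sums of W: total outgoing weight of each vertex."""
    return [sum(row) for row in W]


def in_degrees(W):
    """Column sums of W: total incoming weight of each vertex."""
    n = len(W)
    return [sum(W[j][i] for j in range(n)) for i in range(n)]


def volume(W):
    """Sum of all entries of W, equal to the sum of in- or out-degrees."""
    return sum(sum(row) for row in W)


def is_balanced(W, tol=1e-12):
    """True if every vertex has equal in- and out-degree."""
    return all(abs(a - b) <= tol for a, b in zip(in_degrees(W), out_degrees(W)))


# Undirected graph with degrees 3, 4, 1 (inferred, assuming no self-loops).
W_undirected = [[0, 3, 0], [3, 0, 1], [0, 1, 0]]
# Hypothetical directed cycle with unequal weights: not balanced.
W_directed = [[0, 2, 0], [0, 0, 1], [0.5, 0, 0]]
# Hypothetical directed cycle with equal weights: balanced but not symmetric.
W_balanced = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

for W in (W_undirected, W_directed, W_balanced):
    print(in_degrees(W), out_degrees(W), volume(W), is_balanced(W))
```

The last example shows that a balanced graph need not be undirected, whereas every symmetric weight matrix is automatically balanced.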

### 3.4 Self-Loops

In the examples considered so far, all edges connect pairs of vertices that are distinct, that is, *v*_{i} ≠ *v*_{j}. While this is sometimes enforced as a rule, some conventions also allow edges to connect vertices to themselves, which are known as *self-loops*. For undirected graphs, the standard convention is that a self-loop at vertex *v*_{i} counts doubly to the vertex degree *d*_{i}, while other edges only count singly, that is, *d*_{i} = 2 × *W*_{ii} + ∑_{j ≠ i}*W*_{ij}. This somewhat counterintuitive property is typically demonstrated using the *degree sum formula* (West, 2001). For undirected graphs, this states that each edge contributes twice its weight to the volume. Since self-loops only involve a single vertex, the only way that this rule can be respected is if they count twice as much to the vertex degrees as other edges. A property that we require when dealing with undirected graphs in section 4 is that the vertex degrees are calculated by the row sums of $W$. Clearly, this property is violated by the factor of 2 that applies to undirected self-loops. As a result, in this tutorial we assume that self-loops are always directed, regardless of whether they occur in undirected or directed graphs. This is an atypical definition, since undirected graphs typically are not allowed to have directed edges. However, as can be seen from the examples in Figure 12, this preserves the fact that undirected graphs have symmetric weight matrices, whereas directed graphs have nonsymmetric weight matrices, which is sufficient for the scope of this tutorial.

We close this section by noting some similarities between our definitions of Markov chains and graphs. First, the transition matrices of Markov chains, like the weight matrices of graphs, are nonnegative. Second, in a directed graph, any entry *W*_{ij} ≠ 0 of the weight matrix describes an outgoing edge from vertex *v*_{i} to *v*_{j}, and analogously, any entry *P*_{ij} ≠ 0 of a transition matrix describes an outgoing transition probability from *s*_{i} to *s*_{j}. Putting these together, we see that in the most general sense, any Markov chain can be thought of as a directed graph, with $P$ being the associated weight matrix. Indeed, this interpretation is precisely what justifies us in visualizing a Markov chain by its transition graph. In the next section, we present some useful results that emerge as a result of this way of thinking about a Markov chain. Finally, for a comprehensive text on graph theory that covers much of the material in this section, we recommend West (2001).

### 3.5 Eigenspaces of Transition Matrices

Nonnegative matrices have received widespread attention in mathematics, and in particular their eigenvalues and eigenvectors are the focus of *spectral graph theory* (Chung, 1997). In this section, we apply some results from this field to transition matrices, considering first irreducible chains and subsequently exploring the generalization to reducible chains.

#### 3.5.1 Irreducible Chains

A fundamental result used in spectral graph theory is the *Perron-Frobenius theorem*, and while a full treatment of it is beyond the scope of this tutorial, we now summarize its key implications for transition matrices of irreducible Markov chains.

Theorem 6 (Perron-Frobenius Theorem for Irreducible Markov Chains). If $P$ is the transition matrix of an irreducible Markov chain, then:

1. λ = 1 is guaranteed to be an eigenvalue.
2. λ = 1 is a simple eigenvalue, meaning that it occurs only once.
3. Upon suitable normalization, the eigenvalue λ = 1 has a left eigenvector equal to the unique stationary distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_N)^T$ and a right eigenvector equal to $\eta = (1, 1, \ldots, 1)^T$.
4. All other eigenvalues have |λ| ≤ 1, where |·| is the complex modulus, meaning that the spectral radius of $P$ is 1.^{5}

To illustrate the above theorem, in Figures 13a to 13c, we show the transition graphs and eigenvalue plots of three irreducible Markov chains. The first observation to make is that, in agreement with theorem 6, λ = 1 is an eigenvalue in each case and occurs only once. Furthermore, as a quick exercise, we encourage readers to find the eigenvectors of λ = 1 for each example and normalize them to obtain $\pi $ and $\eta $. Finally, the eigenvalue plots show that in each example, all eigenvalues indeed lie either on the unit circle (|λ| = 1) or within it (|λ| < 1).

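Another consequence of theorem 6 concerns the left eigenvectors with λ ≠ 1. Since $\eta$ is a right eigenvector with λ = 1, biorthogonality of left and right eigenvectors (see equation 2.46) implies that any left eigenvector $l_\omega$ with eigenvalue λ_{ω} ≠ 1 satisfies

$$l_\omega^T \eta = \sum_{i=1}^N l_{\omega,i} = 0, \tag{3.3}$$

where $l_{\omega,i}$ denotes the *i*th component of $l_\omega$. Consequently, left eigenvectors with λ ≠ 1 sum to zero, meaning that unlike stationary distributions they are not probability vectors.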
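The statements of theorem 6 can be checked numerically. The sketch below does so for a small irreducible, aperiodic chain whose transition matrix is a hypothetical example (not one of the chains in Figure 13), using NumPy's general eigenvalue solver.

```python
import numpy as np

# Hypothetical irreducible, aperiodic chain on three states (rows sum to 1).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# Left eigenvectors of P are right eigenvectors of its transpose.
eigvals, left_vecs = np.linalg.eig(P.T)

# Locate the eigenvalue closest to 1 and normalize its eigenvector to sum to 1.
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(left_vecs[:, idx])
pi = pi / pi.sum()

# The vector of ones is a right eigenvector with eigenvalue 1.
eta = np.ones(3)

print("eigenvalues:", np.round(np.sort(np.real(eigvals)), 6))
print("stationary distribution:", np.round(pi, 6))
print("pi P == pi:  ", np.allclose(pi @ P, pi))
print("P eta == eta:", np.allclose(P @ eta, eta))

# Left eigenvectors with eigenvalue != 1 are orthogonal to eta, so they sum to zero.
for k in range(3):
    if k != idx:
        print("sum of left eigenvector:", np.round(np.real(left_vecs[:, k].sum()), 6))
```

Dividing by the sum both normalizes the eigenvector and removes the arbitrary sign returned by the solver.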

Using our terminology from section 2.3, eigenvalues within the unit circle represent transient structures, transient oscillations, and transient cycles. Of the irreducible chains in Figure 13, only panels a and b have eigenvalues of this type, and in both cases, they are complex conjugate pairs describing transient cycles. Looking at the transition probabilities in each example, it is clear that these transient cycles flow clockwise around the state space.

On the other hand, eigenvalues on the unit circle other than $\lambda =1$ represent persistent oscillations and persistent cycles. Theorem 2 tells us that these are possible only when a chain is periodic. The following result sheds light on this by relating the eigenvalues with |λ| = 1 to the period of a chain (Gebali, 2008):

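Proposition 6. If $P$ is the transition matrix of an irreducible Markov chain with period *d*, then there are *d* distinct eigenvalues with modulus 1, given by

$$\lambda_k = e^{2\pi i k / d}, \qquad k = 0, 1, \ldots, d - 1.$$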

In simple terms, proposition 6 says that the eigenvalues of $P$ with modulus 1 are always *d*th roots of unity. We can verify this by checking the periodic examples in Figures 13b and 13c. In both cases, the number of eigenvalues on the unit circle is indeed equal to the period of the chain and they are also equally spaced. Furthermore, proposition 6 offers an alternative perspective on how the periodicity affects the persistent behavior of a Markov chain. For example, the chain in Figure 13a has only a single eigenvalue on the unit circle, corresponding to its unique stationary distribution. It is therefore guaranteed to end up in this distribution since all other eigenvalues have |λ| < 1. This is equivalent to the statement that this chain is ergodic, which a quick check of the transition graph confirms. Conversely, the chain in Figure 13b has an additional eigenvalue λ = *e*^{πi} = −1 on the unit circle by virtue of the fact that it has period 2. Therefore, its persistent behavior can only be fully described using both the unique stationary distribution $\pi$ and the eigenvector $y$ associated with λ = −1. For example, we know from theorem 2 that such a chain can get trapped in a persistent oscillation (i.e., $\mu_1 \to \mu_2 \to \mu_1 \to \mu_2 \to \ldots$). For any such oscillation, $\mu_1$ and $\mu_2$ can always be expressed as a linear combination of $\pi$ and $y$, meaning that this sequence indeed *oscillates* between two points in the space spanned by these eigenvectors. While this example only involves real eigenvalues and therefore only real eigenvectors, the interpretation extends to *d* > 2, for which proposition 6 tells us that there must be complex eigenvalues with |λ| = 1. For example, the chain in Figure 13c has period *d* = 3, and it has the following three eigenvalues on the unit circle: λ_{1} = 1, $\lambda_2 = e^{2\pi i/3}$, and $\lambda_3 = e^{4\pi i/3}$. Analogous to the *d* = 2 case, any persistent cycle of this chain can be expressed using the three corresponding eigenvectors, which in the case of λ_{2} and λ_{3} must have complex entries. Rather interestingly, this means that for chains with period *d* > 2, persistent cycles are cycles in a complex space despite being sequences of real valued distributions.
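The relationship between the period and the eigenvalues on the unit circle can be illustrated numerically. The following sketch uses a hypothetical four-state chain with period 3 (it is not the chain of Figure 13c) and compares its unit-modulus eigenvalues with the third roots of unity.

```python
import numpy as np

# Hypothetical irreducible chain with period d = 3: the state space splits into
# the groups {0}, {1, 2} and {3}, which are visited in a fixed cyclic order.
P = np.array([[0.0, 0.3, 0.7, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])
d = 3

eigvals = np.linalg.eigvals(P)

# Eigenvalues on the unit circle (modulus 1 up to numerical tolerance).
on_circle = eigvals[np.isclose(np.abs(eigvals), 1.0)]

# The d-th roots of unity, exp(2*pi*i*k/d) for k = 0, ..., d-1.
roots_of_unity = np.exp(2j * np.pi * np.arange(d) / d)

# Every root of unity should coincide with one eigenvalue on the unit circle.
all_matched = all(np.min(np.abs(on_circle - root)) < 1e-9 for root in roots_of_unity)

print("eigenvalues on the unit circle:", np.round(on_circle, 6))
print("count equals period:", len(on_circle) == d)
print("all d-th roots of unity found:", all_matched)
```

The remaining eigenvalue of this example is zero, which lies strictly inside the unit circle and therefore only affects the transient behavior.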

#### 3.5.2 Reducible Chains

Applying spectral graph theory to the transition matrices of reducible Markov chains produces a weaker set of results. For example, the generalization of theorem 6 to reducible chains is the following:

Theorem 7 (Perron-Frobenius Theorem for Reducible Markov Chains). If $P$ is the transition matrix of a reducible Markov chain, then:

1. λ = 1 is guaranteed to be an eigenvalue.
2. The number of linearly independent eigenvectors with λ = 1 is equal to the number *r* of recurrent communicating classes in the Markov chain.
3. There are many choices of left and right eigenvectors for λ = 1. However, a convenient choice that mirrors the irreducible case is to choose $\pi_k$ and $\eta_k$ as a pair of left and right eigenvectors for each of the recurrent communicating classes, with $\pi_k$ being the unique stationary distribution associated with each class and $\eta_k$ being an indicator vector with entry 1 for states in this class and zeros elsewhere.
4. All other eigenvalues have |λ| ≤ 1, where |·| is the complex modulus, meaning that the spectral radius of $P$ is 1.

To understand theorem 7 in more depth, consider the example in Figure 13d. This Markov chain has two recurrent communicating classes, which means that λ = 1 has a multiplicity of 2. We indicate this on the eigenvalue plot by a larger size circle. Furthermore, we color this circle half red and half blue to reflect the fact that we can choose the eigenvectors for λ = 1 based on the two recurrent communicating classes, for example, $\pi_1 = (0.58, 0.42, 0, 0)^T$ and $\eta_1 = (1, 1, 0, 0)^T$ as a pair of left and right eigenvectors for the red class and $\pi_2 = (0, 0, 0.43, 0.57)^T$ and $\eta_2 = (0, 0, 1, 1)^T$ as a pair of left and right eigenvectors for the blue class. Then, $\pi_1$ and $\pi_2$ together span the left eigenspaces of λ = 1 (including all possible stationary distributions), whereas $\eta_1$ and $\eta_2$ span the right eigenspaces of λ = 1. It is worth emphasizing that while there is an infinite number of other ways to choose the basis vectors for λ = 1, this is a convenient choice since it is the only one for which all basis vectors have strictly nonnegative entries. For example, consider the choice of $\tilde{\pi}_1 = \tfrac{1}{3}\pi_1 + \tfrac{2}{3}\pi_2 = (0.19, 0.14, 0.29, 0.38)^T$ and $\tilde{\eta}_1 = \eta = (1, 1, 1, 1)^T$ as a first pair of eigenvectors. If we then choose a second pair $\tilde{\pi}_2$ and $\tilde{\eta}_2$ that satisfy biorthogonality (see equation 2.46) within this space, they are guaranteed to contain negative entries, for example, $\tilde{\pi}_2 = \pi_1 - \pi_2 = (0.58, 0.42, -0.43, -0.57)^T$ and $\tilde{\eta}_2 = 3\eta_1 - \tfrac{3}{2}\eta_2 = (3, 3, -\tfrac{3}{2}, -\tfrac{3}{2})^T$. Thus, the choice of eigenvectors stated in theorem 7 is in some sense special since it is the only one that preserves our intuition that left eigenvectors with λ = 1 correspond to distributions over the state space $S$. For this reason, we henceforth assume this convention when referring to eigenvectors with λ = 1.
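The class-wise choice of eigenvectors can be made concrete with a short numerical check. The sketch below uses a hypothetical reducible chain with two recurrent classes (its entries are illustrative and do not reproduce Figure 13d) and computes the multiplicity of λ = 1 as the dimension of the null space of $P$ minus the identity.

```python
import numpy as np

# Hypothetical reducible chain with two recurrent classes, {0, 1} and {2, 3}.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.4, 0.6, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.8],
              [0.0, 0.0, 0.6, 0.4]])
N = P.shape[0]

# Number of linearly independent eigenvectors with eigenvalue 1.
multiplicity = N - np.linalg.matrix_rank(P - np.eye(N))

# Class-wise left eigenvectors (stationary distributions of each class) ...
pi_1 = np.array([4 / 9, 5 / 9, 0.0, 0.0])
pi_2 = np.array([0.0, 0.0, 3 / 7, 4 / 7])
# ... and class-wise right eigenvectors (indicator vectors of each class).
eta_1 = np.array([1.0, 1.0, 0.0, 0.0])
eta_2 = np.array([0.0, 0.0, 1.0, 1.0])

# Any convex combination of pi_1 and pi_2 is again a stationary distribution.
pi_mix = pi_1 / 3 + 2 * pi_2 / 3

print("multiplicity of eigenvalue 1:", multiplicity)
for name, vec in [("pi_1", pi_1), ("pi_2", pi_2), ("pi_mix", pi_mix)]:
    print(name, "is stationary:", np.allclose(vec @ P, vec))
for name, vec in [("eta_1", eta_1), ("eta_2", eta_2)]:
    print(name, "is a right eigenvector:", np.allclose(P @ vec, vec))
print("biorthogonality:", pi_1 @ eta_1, pi_1 @ eta_2, pi_2 @ eta_1, pi_2 @ eta_2)
```

With this choice all four vectors are nonnegative, and each left eigenvector pairs to one with its own indicator vector and to zero with the other.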

Unfortunately, there is no general analogue of equation 3.3 for reducible chains. One reason for this is that like the λ = 1 case, there are many choices of eigenvectors for λ ≠ 1. However, for a recurrent reducible chain, the transition matrix can be written in block diagonal form (see the proof of proposition 4), which means we can mirror the λ = 1 case by choosing eigenvectors with λ ≠ 1 to have nonzero entries only in a single recurrent class. If the *k*th recurrent class has *n* states, then the corresponding block is an *n* × *n* matrix, meaning that there are *n* pairs of left and right eigenvectors with nonzero entries for states in this class: one pair with λ = 1 ($\pi_k$ and $\eta_k$) and another *n* − 1 pairs with λ ≠ 1. We can therefore apply an equivalent argument to equation 3.3 for the *k*th class, but instead using the vector $\eta_k$. Thus, for each recurrent class, the left eigenvectors with λ ≠ 1 can also be chosen such that they all sum to zero. Conversely, nonrecurrent chains cannot be written in block diagonal form, which means that this argument cannot be applied. Therefore, some eigenvectors will not sum to zero for such chains, although to our knowledge, this case has not received significant attention so far in the literature.

Looking at the eigenvalue plot of Figure 13d, we see that there are two eigenvalues with |λ| < 1, both of which are real and negative. Using the terminology from section 2.3, they therefore correspond to transient oscillations of the chain. Furthermore, since the chain is recurrent, we can apply the procedure described above and choose one eigenvector to have nonzero entries only in the red class and the other eigenvector to have nonzero entries only in the blue class. With this choice, we see that each transient oscillation takes place on a distinct communicating class, which we indicate on the plot by coloring the eigenvalues red and blue.

In the case of proposition 6, the extension to reducible chains is straightforward since one can simply apply this result individually to each recurrent communicating class of a reducible chain. Therefore, for each class of period *d*, there are *d* eigenvalues of modulus 1 that satisfy the same properties as in the irreducible case.

Finally, a couple of similarities between theorems 7 and 6 can be pointed out. First, in both theorems, λ = 1 is guaranteed to be an eigenvalue. Since every eigenvalue has at least one eigenvector, this means that we can always find a left eigenvector with λ = 1. Provided that we choose an eigenvector with nonnegative entries and normalize it to one, it is a stationary distribution of the chain. This therefore justifies our claim from section 2.2 that every finite Markov chain has at least one stationary distribution. Second, in both theorems, the eigenvalues cannot have absolute value greater than one, which is one of the assumptions we made when studying the evolution of a chain in terms of its eigenvectors and eigenvalues in section 2.3, and which justified our partitioning of equation 2.54 into persistent and transient terms.

The results of this section emerge by treating Markov chains as graphs. However, in most graphs, the outgoing edges from each vertex do not sum up to one, meaning that they cannot be interpreted as transition probabilities. Because of this, Markov chains can be more precisely interpreted as a type of normalized graph. This idea is formalized in the next section, where we introduce a well-known method for transforming any graph *G* into a Markov chain.

## 4 Random Walks

### 4.1 Definition

Consider a graph *G* with weight matrix $W$ that we want to normalize such that the outgoing weights from each vertex *v*_{i} sum up to 1.^{6} The most obvious way to do this is to divide all entries in the *i*th row of $W$ by $d_i^+$. By scaling each edge weight *W*_{ij} by the out-degree of the starting vertex *v*_{i}, we obtain the transition probabilities $P_{ij} = \frac{W_{ij}}{d_i^+}$. In order to write this in matrix notation, we first define a *degree matrix* $D$ whose elements are given by

$$D_{ij} := \begin{cases} d_i^+ & \text{if } i = j, \\ 0 & \text{otherwise}, \end{cases} \tag{4.1}$$

which allows the normalization to be written as

$$P = D^{-1} W. \tag{4.2}$$

The Markov chain described by this transition matrix is known as a *random walk* on *G* (Göbel & Jagers, 1974; Lovász, 1993; Vempala, 2005). This is a fitting name, since if we imagine an agent walking between vertices in *G* and randomly choosing where to go at each time step based on the weights of outgoing edges, then equation 4.2 would describe the resulting Markov chain.
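The row normalization that turns a weight matrix into the transition matrix of a random walk can be sketched in a few lines of Python. The helper name and the example weights are assumptions made for illustration, and exact fractions are used only so that the resulting probabilities are easy to read.

```python
from fractions import Fraction

def random_walk_matrix(W):
    """Divide each row of W by its out-degree, giving P[i][j] = W[i][j] / d_i^+.

    Illustrative helper (not from the text); it assumes every vertex has at
    least one outgoing edge, so that all out-degrees are positive.
    """
    P = []
    for row in W:
        out_degree = sum(row)
        if out_degree <= 0:
            raise ValueError("every vertex needs positive out-degree")
        P.append([Fraction(w) / out_degree for w in row])
    return P

# Hypothetical directed graph on three vertices.
W = [[0, 2, 0],
     [1, 0, 3],
     [0, 1, 0]]

P = random_walk_matrix(W)
for row in P:
    print([str(p) for p in row], "row sum:", sum(row))
```

Each row of the result sums to one, so the rows can be read as the distributions from which the walking agent draws its next vertex.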

Qualitatively, we can say that the transformation in equation 4.2 is useful when we have a starting graph *G* that we would like to describe in probabilistic and/or temporal terms. Conversely, if we have a starting Markov chain and transition matrix $P$, knowing some matrix $W$ for which equation 4.2 holds can offer insight into the type of relationships between states that give rise to the chain. However, the latter of these two perspectives is partially complicated by the fact that the mapping from $P$ to $W$ is one-to-many, and there are in fact an infinite number of different graphs *G* that produce the same Markov chain. As an example, in Figure 14, we show two distinct weight matrices $W_1$ and $W_2$ that get transformed to the same transition matrix $P$ using equation 4.2. How can we describe the infinite set of graphs corresponding to a single Markov chain? In principle, it involves undoing the row normalization of equation 4.2. Thus, for a given Markov chain $X$ with transition matrix $P$, we consider all possible scalings of the rows of $P$ by positive constants. Every such scaling can be described by a diagonal matrix $A \in \mathbb{R}_{>0}^{N \times N}$ that multiplies $P$ from the left to produce a single corresponding weight matrix, $W = AP$. As an illustration, in Figures 14d and 14e, we show the two scaling matrices that undo the row normalization of the transition matrix in Figure 14c and transform it back into the weight matrices $W_1$ and $W_2$, respectively. The following definition generalizes this to the set of all such weight matrices that can be realized in this way:

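Definition 18. Let $X$ be a Markov chain with transition matrix $P$. The set of all weight matrices obtained by scaling the rows of $P$ with positive constants is

$$RW(X) := \left\{ AP \;\middle|\; A \in \mathbb{R}_{>0}^{N \times N} \text{ diagonal} \right\}, \tag{4.3}$$

which we call the *random walk set* of $X$.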

A few details are worth noting about definition 18. First, since the trivial scaling $A = \mathbb{1}$ (the identity matrix) is allowed, $P \in RW(X)$ for any Markov chain. Second, the fact that $A \in \mathbb{R}_{>0}^{N \times N}$ means that each of these matrices is invertible, with $A^{-1}$ also being diagonal and having entries equal to the reciprocals of the diagonals of $A$. Consequently, if $W_1 = A_1 P$ and $W_2 = A_2 P$ are two weight matrices in the random walk set of a given Markov chain, we can always write $W_1 = A_1 A_2^{-1} W_2 = A_3 W_2$, meaning that $W_1$ and $W_2$ are also related simply by a row scaling with positive constants, with $A_3 = A_1 A_2^{-1}$ describing this scaling. Therefore, the row scaling defined in equation 4.3 effectively partitions the set of all nonnegative matrices into equivalence classes. Finally, since definition 18 allows any positive scaling of the rows of $P$, the random walk set of any Markov chain predominantly consists of graphs that are neither undirected nor balanced. In fact, only certain types of Markov chains have random walk sets that *contain* undirected or balanced graphs, as explained by the following two results:

Theorem 8. A Markov chain $X$ is recurrent if and only if $RW(X)$ contains balanced graphs (for the proof, see the appendix).

Theorem 9. A recurrent Markov chain $X$ is reversible if and only if the balanced graphs in $RW(X)$ are undirected (for the proof, see the appendix).
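The structure of a random walk set can be explored with a small experiment. In the sketch below, the transition matrix, its stationary distribution, and the two row scalings are hypothetical values chosen for illustration. The last step uses the fact that scaling the rows of $P$ by stationary probabilities yields row sums and column sums that are both equal to those probabilities, which gives one concrete balanced member of the set.

```python
from fractions import Fraction

def scale_rows(a, M):
    """Compute diag(a) @ M: multiply row i of M by the constant a[i]."""
    return [[a_i * m for m in row] for a_i, row in zip(a, M)]

def row_normalize(W):
    """Divide each row of W by its sum (the out-degree of that vertex)."""
    return [[Fraction(w) / sum(row) for w in row] for row in W]

def is_balanced(W):
    """In-degree (column sum) equals out-degree (row sum) at every vertex."""
    return [sum(col) for col in zip(*W)] == [sum(row) for row in W]

def is_symmetric(W):
    """True if W equals its transpose, i.e. the graph is undirected."""
    return [list(col) for col in zip(*W)] == [list(row) for row in W]

# Hypothetical recurrent, nonreversible chain and its stationary distribution.
P = [[Fraction(0), Fraction(1), Fraction(0)],
     [Fraction(1, 3), Fraction(0), Fraction(2, 3)],
     [Fraction(1), Fraction(0), Fraction(0)]]
pi = [Fraction(3, 8), Fraction(3, 8), Fraction(2, 8)]

# Two arbitrary positive row scalings give two members of the random walk set.
a1, a2 = [2, 3, 1], [4, 6, 5]
W1, W2 = scale_rows(a1, P), scale_rows(a2, P)
print("both normalize back to P:", row_normalize(W1) == P, row_normalize(W2) == P)

# W1 and W2 are related to each other by the row scaling A3 = A1 A2^{-1}.
a3 = [Fraction(x, y) for x, y in zip(a1, a2)]
print("W1 == A3 W2:", scale_rows(a3, W2) == W1)

# Scaling the rows by the stationary probabilities gives a balanced member.
W_bal = scale_rows(pi, P)
print("balanced:", is_balanced(W1), is_balanced(W2), is_balanced(W_bal))
print("undirected:", is_symmetric(W_bal))
```

For this nonreversible example the balanced member is a directed graph, in line with theorem 9.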

As an illustration, in Figures 15a–15c, we show the random walk sets for the three Markov chains that were studied in Figures 6a, 6d, and 6j of section 2.6.^{7} In each case, the Markov chain is located in the center and is colored in black. Other graphs in the random walk sets are colored based on the graph type (undirected = green, balanced directed = blue, unbalanced = red), and are depicted as miniature graphs without edge weights (except for one representative example of each type). The Markov chains of Figures 15a and 15b are both recurrent, and we indeed see that their random walk sets contain balanced graphs (see theorem 8). In both figures, notice that more unbalanced graphs are drawn to reflect the fact that they are more numerous than the balanced cases. Furthermore, the chain in Figure 15a is reversible, whereas the chain in Figure 15b is nonreversible, meaning that in the former case, all balanced graphs in $RW(X)$ are undirected, and in the latter case, they are directed (see theorem 9). The Markov chain in Figure 15c is nonrecurrent, and in agreement with theorem 8, its random walk set contains only unbalanced graphs. Finally, note that we do not include a corresponding diagram for the semireversible example in Figure 6g. However, since such chains can be made reversible by removing nonrecurrent states, a simple extension of theorem 9 ensures that for such chains, there exist graphs in $RW(X)$ for which the edges between recurrent states are undirected.

To summarize these observations, in Figures 15d and 15e we show two Venn diagrams that illustrate the relationships between the different types of Markov chains and graphs considered. In Figure 15d, graphs are shown as an outer circle, with balanced graphs as a particular case, and undirected graphs as a special type of balanced graph, graphs ⊃ balanced graphs ⊃ undirected graphs (colored red, blue and green, respectively). In Figure 15e, Markov chains are organized in a similar way, Markov chains ⊃ recurrent chains ⊃ reversible chains. Moreover, the colors in the Markov chain diagram are based on the types of graphs allowed in $RW(X)$ for each type of chain. For example, reversible chains are shaded in red and green since they correspond to random walks on either undirected or unbalanced graphs.

The balanced graphs belonging to a recurrent Markov chain’s random walk set are in some sense special, since the vertex degrees have a simple relationship to the stationary probabilities:

Theorem 10. Let $X$ be a recurrent Markov chain and *G* one of the balanced graphs in $RW(X)$. Then, the degrees of this graph are related to one of the stationary distributions $\pi > 0$ of $X$, via

$$\pi_i = \frac{d_i}{z},$$

where *z* = ∑_{i}*d*_{i} is the volume of *G* (for the proof, see the appendix).

For example, evaluating $\frac{d_i}{z}$ for *v*_{1} in the balanced graphs in Figures 15a and 15b yields $\frac{20}{48} \approx 0.417$ and $\frac{290}{714} \approx 0.406$, respectively, which are indeed the stationary probabilities of state *s*_{1} for each of the corresponding Markov chains (see Figures 6b and 6e). This highlights something useful about balanced graphs, which is that the weight matrix $W$ allows direct calculation of one of the stationary distributions without having to simulate the random walk. Conversely, for an unbalanced graph, there is no universally valid expression relating stationary probabilities of the random walk to the vertex degrees.
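The relationship between degrees and stationary probabilities is easy to test directly. The following sketch compares a balanced and an unbalanced graph, both of which are hypothetical examples with integer weights (they are not the graphs of Figure 15), and checks whether the normalized degrees are left unchanged by one step of the random walk.

```python
from fractions import Fraction

def random_walk_matrix(W):
    """Row-normalize an integer weight matrix into a transition matrix."""
    return [[Fraction(w, sum(row)) for w in row] for row in W]

def degree_distribution(W):
    """Out-degrees divided by the volume: pi_i = d_i / z."""
    degrees = [sum(row) for row in W]
    z = sum(degrees)
    return [Fraction(d, z) for d in degrees]

def step(mu, P):
    """Advance a distribution by one time step: mu -> mu P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical balanced directed graph (row sums equal column sums).
W_balanced = [[0, 3, 0],
              [1, 0, 2],
              [2, 0, 0]]

# Hypothetical unbalanced directed graph.
W_unbalanced = [[0, 2, 0],
                [1, 0, 3],
                [0, 1, 0]]

for name, W in [("balanced", W_balanced), ("unbalanced", W_unbalanced)]:
    P = random_walk_matrix(W)
    pi = degree_distribution(W)
    print(name, [str(p) for p in pi], "stationary:", step(pi, P) == pi)
```

Only for the balanced graph do the normalized degrees form a stationary distribution; for the unbalanced graph the same formula produces a distribution that changes after one step.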

Perhaps the most important conclusion to draw from this section is that one can always describe a reversible Markov chain as a random walk on *some* undirected graph. Since undirected graphs have symmetric weight matrices and since matrices of this type have received a large amount of study in mathematics, this interpretation provides a number of tools for describing reversible chains in more detail. This is the focus of the next section. Directed graphs are as of yet far less understood, meaning that the same level of description for nonreversible chains is not possible. However, in section 4.3, we explore some cases where concepts can be extended to the directed/non-reversible case. Since balanced directed graphs are less common objects in graph theory, we do not dedicate a section to them and instead consider them briefly as a special case in section 4.3.

### 4.2 Random Walks on Undirected Graphs

#### 4.2.1 Relationship to Symmetric Matrices

In this section, we explore in more depth the connections between real symmetric matrices and the transition matrices of reversible chains. We start by providing the following two results for real symmetric matrices (Meyer, 2000):

A couple of details can be pointed out about theorem 11. First, by comparing equation 4.6 to our analysis of section 2.3, we see that the columns of $Y$ are a basis of right eigenvectors of $A$ and the rows of $Y^T$ are the corresponding dual basis of left eigenvectors. Second, both of these bases are orthonormal since $Y$ is orthogonal. Third, the matrix $\Delta$ contains the eigenvalues of $A$, which are guaranteed to be real since all other matrices in equations 4.6 and 4.7 are real. Finally, it is worth emphasizing the existential condition of theorem 11, since not all choices of eigenvectors of a symmetric matrix obey this result. On one hand, equation 4.6 requires that the sets of left and right eigenvectors are chosen together to be a biorthogonal system. On the other hand, even if we assume this property, when a symmetric matrix has repeated eigenvalues, the corresponding eigenvectors can be nonorthogonal. However, even in this case, it is always possible to apply the Gram-Schmidt procedure to make them orthonormal, and provided that we find such a basis, we can relate the left and right eigenvectors in the following way:

Proposition 8. Let $A$ be a real symmetric matrix. Then, if $Y$ is an orthonormal basis of right eigenvectors of $A$ and $Y^T$ is the corresponding dual basis, for each eigenvalue λ_{ω} the left and right eigenvectors are equal: $l_\omega = r_\omega$.

Clearly, for any Markov chain, reversible or otherwise, the transition matrix is itself rarely symmetric. Therefore, the results above do not directly apply to $P$. However, reversible Markov chains can be related in a number of ways to symmetric matrices, and by virtue of these relations, they satisfy variants of the results above (Brémaud, 1999). For example, from section 2.6, we know that any flow matrix of a reversible Markov chain is symmetric. This fact allows us to establish the following analogue of theorem 10:

_{i}(for a proof, see the appendix).

Furthermore, theorem 9 tells us that there exist certain row scalings that transform the transition matrix $P$ of a reversible Markov chain into symmetric matrices. In fact, this is not the only scaling operation that transforms $P$ into a matrix of this type. The next result shows that such a matrix is also formed when we scale both the rows *and* columns of $P$ as follows:

The first thing to note about theorem 13 is that the output of the scaling operation is somewhat different from that in theorem 9, since in the former case, there is only a single symmetric matrix, whereas in the latter there are an infinite number. Additionally, equation 4.10 implies that $P$ and $K$ are similar matrices (section 2.3). Since similar matrices have the same eigenvalues and related sets of eigenvectors, we can use this to establish the following generalizations of theorem 11 and proposition 8:

Theorem 14. Let $X$ be a Markov chain with transition matrix $P$. Then $X$ is reversible if and only if it is diagonalizable with real eigenvalues, and there exists a basis of right eigenvectors that are orthogonal with regard to $\langle \cdot, \cdot \rangle_{\Pi}$ and a corresponding dual basis of left eigenvectors that are orthogonal with regard to $\langle \cdot, \cdot \rangle_{\Pi^{-1}}$, where $\pi > 0$ is a stationary distribution of the chain (for the proof, see the appendix).

Let $X$ be a reversible Markov chain with transition matrix $P$ and $\pi > 0$ one of its stationary distributions. Furthermore, let $Y_R$ be a set of right eigenvectors of $P$ and $Y_L$ its dual basis, both of which obey the orthogonality relations of theorem 14. Then, if $r_\omega$ and $l_\omega$ are right and left eigenvectors of the same eigenvalue λ_{ω}, they are related via $l_\omega = \Pi r_\omega$ (for the proof, see the appendix).

Theorem 14 is particularly important for practical reasons. First, the diagonalizability of $P$ implies a full set of linearly independent eigenvectors. Linearly independent feature spaces are often desirable from a computational perspective since they (1) reduce the overall redundancy, (2) can express any function in $\mathbb{R}^N$, and (3) ensure that certain matrix operations are well defined. Furthermore, as already explained in section 2, having a diagonalizable transition matrix means that evolving a Markov chain becomes computationally cheaper. Second, since $P$ is a real matrix with real eigenvalues, we can always choose an eigenbasis consisting only of real valued vectors. This property is useful because real spaces are often more intuitive to deal with than complex spaces. Furthermore, in many applications of Markov chains, the underlying vector space is required to be real, either due to the semantic nature of the problem or because the algorithm being used is not suited to complex spaces.

While a linearly independent set of eigenvectors is useful, having a basis that is pairwise orthogonal with regard to the standard Euclidean product offers further analytical and numerical benefits. From theorem 14, it is clear that this is not the case for transition matrices of reversible Markov chains. However, the matrix $K$ is a normalized symmetric version of $P$, and from the proof of theorem 14, we know that the eigenvalues of these two matrices are the same and their eigenvectors are related simply by a multiplication of $\Pi^{\pm\frac{1}{2}}$. Thus, they contain similar information regarding the relationship between pairs of states in $S$, and so $K$ can be used as a surrogate for $P$ in situations where orthogonal eigenvectors are required.

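Since a reversible Markov chain can be described as a random walk on an undirected graph, it is useful to express $K$ in terms of the weight matrix $W$ and degree matrix $D$ of such a graph *G*. Starting with equation 4.10, this can be done by swapping stationary probabilities for vertex degrees:

$$K = D^{\frac{1}{2}} P D^{-\frac{1}{2}} = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}.$$

This expression appears in the next section, where we use $K$ to define a positive semidefinite matrix that has the same eigenvectors.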
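The symmetric surrogate $K$ can be constructed and compared with $P$ numerically. The sketch below assumes a small hypothetical undirected graph and the degree-based form of the scaling, that is, $K$ obtained by multiplying $P$ with the square root of the degree matrix from the left and its inverse from the right; the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical undirected graph: symmetric weight matrix with positive degrees.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
d = W.sum(axis=1)                 # vertex degrees (row sums)
P = W / d[:, None]                # random walk on the graph, P = D^{-1} W

D_sqrt = np.diag(np.sqrt(d))
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

K_from_P = D_sqrt @ P @ D_inv_sqrt        # similarity transform of P
K_from_W = D_inv_sqrt @ W @ D_inv_sqrt    # the same matrix written in terms of W

eig_P = np.sort(np.real(np.linalg.eigvals(P)))
eig_K = np.linalg.eigvalsh(K_from_W)      # symmetric solver, ascending order

print("two expressions for K agree:", np.allclose(K_from_P, K_from_W))
print("P symmetric:", np.allclose(P, P.T), "| K symmetric:", np.allclose(K_from_W, K_from_W.T))
print("eigenvalues of P:", np.round(eig_P, 6))
print("eigenvalues of K:", np.round(eig_K, 6))
```

Because $K$ is symmetric, the specialized solver `np.linalg.eigvalsh` can be used for it, which returns real eigenvalues in ascending order.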

Collectively, the results of this section demonstrate that transition matrices of reversible chains satisfy similar properties to symmetric matrices, but subject to a different type of normalization. This alternative normalization is important for the next section, and in order to make our analyses there more concise, we end this section by defining the following two coordinate transformations (Liu, 2011):

Definition 19. Let *G* be an undirected graph with *N* vertices and degree matrix $D$. If $x \in \mathbb{R}^N$ is a vector defined over these vertices, we define the following related vectors:

$$x_L := D^{\frac{1}{2}} x, \qquad x_R := D^{-\frac{1}{2}} x.$$

#### 4.2.2 Normalized Graph Laplacian

A symmetric matrix $A \in \mathbb{R}^{N \times N}$ is positive semidefinite if $x^T A x \geq 0 \;\; \forall x \in \mathbb{R}^N$, or equivalently if all its eigenvalues are nonnegative. Such matrices have a number of numerical properties that make them useful for solving optimization problems, and because of this, they often appear in computational applications. The matrix $K$ is not positive semidefinite, since it has the same eigenvalues as $P$ (which can be negative). However, it is straightforward to define a variant of $K$ that does have this property, while also having the same eigenvectors as $K$:

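Definition 20. For an undirected graph *G* with weight matrix $W$, the normalized graph Laplacian is given by

$$L := \mathbb{1} - K = \mathbb{1} - D^{-\frac{1}{2}} W D^{-\frac{1}{2}},$$

where $\mathbb{1}$ is the identity matrix. It has the following properties:

1. $L$ is symmetric and positive semidefinite, with eigenvalues in the interval [0, 2].
2. λ = 0 is guaranteed to be an eigenvalue, and its multiplicity is equal to the number of connected components in *G*.
3. If *C* is a connected component of *G*, then there is an eigenvector $y_C$ with eigenvalue λ = 0 whose entries are $d_i^{\frac{1}{2}}$ for *v*_{i} ∈ *C* and 0 otherwise, or, equivalently, $y_C = D^{\frac{1}{2}} \mathbf{1}_C = \mathbf{1}_{C,L}$, where $\mathbf{1}_C$ is an indicator vector for vertices in *C*.
4. Therefore, for connected graphs, $y = D^{\frac{1}{2}} \mathbf{1} = \mathbf{1}_L$ is the unique eigenvector with λ = 0 (up to scaling), where $\mathbf{1}$ is a vector of ones.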
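The listed properties of the normalized graph Laplacian can be verified on a small example. The sketch below assumes the standard form of this matrix (the identity minus the weight matrix scaled on both sides by the inverse square root of the degrees) and a hypothetical undirected graph with two connected components.

```python
import numpy as np

def normalized_laplacian(W):
    """Identity minus D^{-1/2} W D^{-1/2} for a symmetric weight matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    K = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return np.eye(len(W)) - K

# Hypothetical undirected graph with two components, {0, 1, 2} and {3, 4}.
W = np.array([[0, 1, 2, 0, 0],
              [1, 0, 1, 0, 0],
              [2, 1, 0, 0, 0],
              [0, 0, 0, 0, 3],
              [0, 0, 0, 3, 0]], dtype=float)

L = normalized_laplacian(W)
eigvals = np.linalg.eigvalsh(L)           # real eigenvalues in ascending order

# Multiplicity of the eigenvalue 0 (up to numerical tolerance).
n_zero = int(np.sum(np.isclose(eigvals, 0.0, atol=1e-10)))

# Eigenvector for the component C = {0, 1, 2}: square roots of the degrees on C.
indicator_C = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
y_C = np.sqrt(W.sum(axis=1)) * indicator_C

print("eigenvalues:", np.round(eigvals, 6))
print("number of zero eigenvalues:", n_zero)
print("L y_C == 0:", np.allclose(L @ y_C, 0.0))
```

The two-vertex component is bipartite, which is why this example also attains the upper end of the interval, λ = 2.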

There exist other types of graph Laplacians for undirected graphs, such as the unnormalized Laplacian $D - W$ and the random walk Laplacian $L_{RW} := \mathbb{1} - D^{-1}W$, where $\mathbb{1}$ denotes the identity matrix. Many useful properties of a graph *G* can be obtained from graph Laplacians, and they form the basis of spectral graph theory (Chung, 1997). Due to its close relationship to $P$, in this tutorial, we predominantly focus on the normalized Laplacian $L$. The unnormalized Laplacian has a weaker connection to $P$, which we only discuss briefly. Furthermore, to avoid redundancy, we do not discuss $L_{RW}$ at all, since it is the same as $-P$ but shifted by the identity. For a concise review of each of these objects, see Wiskott and Schönfeld (2019).

As the name suggests, graph Laplacians are analogues of the Laplace operator. In particular, a number of studies have found links between graph Laplacians and a variant of the Laplace operator on manifolds, known as the Laplace-Beltrami operator (Belkin & Niyogi, 2008; Hein et al., 2007). Broadly speaking, Laplace operators measure how much the value of a function at a point varies from its local average, which is informally related to the notion of mean curvature (Reilly, 1982). Furthermore, the eigenvalues of this operator are nonnegative real numbers and are related to how much the corresponding eigenfunctions vary over the domain of the function (Grebenkov & Nguyen, 2013). All of these properties have analogues in the case of $L$, which justifies the term given to this object. To see this, we start with the following result (von Luxburg, 2007):

Let *G* be an undirected graph with weight matrix $W$ and normalized Laplacian $L$. Then for any vector $x$ defining values on the vertices of *G*,

$$x^T L x = \frac{1}{2} \sum_{i,j=1}^{N} W_{ij} \left( \frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}} \right)^2, \tag{4.20}$$

which is guaranteed to produce a real nonnegative number since $L$ is positive semidefinite (for the proof, see the appendix).

In equation 4.20, the quadratic term measures the difference between the entries of a vector that is related to $x$ by normalization with $D^{-\frac{1}{2}}$, that is, the vector $x_R$. Since this is minimized when the entries of this vector are similar for pairs of vertices with large edge weight *W*_{ij}, it can be interpreted to describe the smoothness of $x_R$ with respect to the connectivity of *G*. Furthermore, since this measure describes the smoothness of $x_R$ as opposed to that of $x$, it is useful for dealing with graphs that have a very heterogeneous degree structure.
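The smoothness interpretation can be checked by evaluating both sides of the quadratic form identity on a concrete graph. The sketch below assumes the standard form of that identity (the quadratic form of the normalized Laplacian equals half the weighted sum of squared differences of degree-normalized entries) and uses a hypothetical undirected graph and test vector.

```python
import math

def normalized_laplacian(W):
    """Identity minus D^{-1/2} W D^{-1/2}, with W given as nested lists."""
    n = len(W)
    d = [sum(row) for row in W]
    return [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(d[i] * d[j])
             for j in range(n)] for i in range(n)]

def quadratic_form(L, x):
    """Left-hand side: x^T L x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def smoothness(W, x):
    """Right-hand side: 1/2 * sum_ij W_ij (x_i / sqrt(d_i) - x_j / sqrt(d_j))^2."""
    n = len(W)
    d = [sum(row) for row in W]
    return 0.5 * sum(
        W[i][j] * (x[i] / math.sqrt(d[i]) - x[j] / math.sqrt(d[j])) ** 2
        for i in range(n) for j in range(n))

# Hypothetical connected undirected graph and an arbitrary signal on its vertices.
W = [[0, 2, 1, 0],
     [2, 0, 1, 0],
     [1, 1, 0, 3],
     [0, 0, 3, 0]]
x = [1.0, -2.0, 0.5, 4.0]

L = normalized_laplacian(W)
print("x^T L x:         ", quadratic_form(L, x))
print("smoothness score:", smoothness(W, x))

# The vector with entries sqrt(d_i) is perfectly smooth after normalization.
x_smooth = [math.sqrt(sum(row)) for row in W]
print("score of sqrt-degree vector:", smoothness(W, x_smooth))
```

The vector of square-rooted degrees becomes constant after the normalization, so every squared difference vanishes and its score is zero.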

If $y_\omega$ is a normalized eigenvector of $L$ with eigenvalue $\lambda_\omega$, then inserting it into equation 4.20 gives

$$y_\omega^T L y_\omega = \lambda_\omega,$$

so that $\lambda_\omega$ describes precisely this notion of smoothness for $y_\omega$. Now assume that we have an orthonormal set of eigenvectors of $L$. From definition 20, we know that the corresponding eigenvalues are real numbers in the interval [0, 2] and that there is guaranteed to be at least one eigenvalue λ = 0. Thus, they can be ordered as follows:

$$0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_N \leq 2.$$

In this ordering, the eigenvectors are sorted from the smoothest to the least smooth with respect to *G*, according to equation 4.20. For example, taking an eigenvector with λ = 0 (see definition 20) and plugging it into equation 4.20 indeed yields zero. These are therefore the eigenvectors of $L$ that have the lowest smoothness score in equation 4.20. Similarly, the eigenvectors of $L$ with the second smallest eigenvalue are those that, subject to orthonormality, score the second lowest in equation 4.20.

In this way, the eigenvectors of $L$ can be viewed as a set of oscillatory modes on *G* that become less smooth as $\lambda_\omega$ increases. Since this basis is orthonormal, it can express any signal $x \in \mathbb{R}^N$ on *G*, with the resulting components describing the projection of $x$ onto each of these oscillatory modes. Due to its correspondence with Fourier analysis, this decomposition has been referred to as the *graph Fourier transform* of $x$ (Shuman et al., 2013). Moreover, a useful result in spectral graph theory is that a basis of eigenvectors corresponding to the *k* leading eigenvalues of $L$ in this ordering (that is, the *k* smallest) is one that, subject to orthonormality, has the lowest possible total score in equation 4.20. In other words, if $Y \in \mathbb{R}^{N \times k}$ represents such a basis of right eigenvectors, it solves the following optimization problem (Belkin & Niyogi, 2001, 2003; Wiskott & Schönfeld, 2019):

$$\min_{Y}\ \mathrm{Tr}\left(Y^T L Y\right) \tag{4.24}$$

$$\text{subject to}\quad Y^T Y = \mathbb{1}. \tag{4.25}$$
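To make the graph Fourier transform and the trace minimization concrete, the sketch below uses a hypothetical ring graph with unit weights. It assumes the objective is $\mathrm{Tr}(Y^T L Y)$ subject to $Y^T Y = \mathbb{1}$, and compares the first *k* eigenvectors of $L$ against a random orthonormal basis; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical undirected graph: a ring of N vertices with unit weights.
N, k = 12, 3
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0

d = W.sum(axis=1)
L = np.eye(N) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
lam, Y_full = np.linalg.eigh(L)

# Graph Fourier transform: project a signal onto the eigenbasis, then invert.
x = rng.standard_normal(N)
x_hat = Y_full.T @ x
x_back = Y_full @ x_hat

# Trace objective for the first k eigenvectors vs. a random orthonormal basis.
Y = Y_full[:, :k]
Q, _ = np.linalg.qr(rng.standard_normal((N, k)))
score_eig = np.trace(Y.T @ L @ Y)
score_rand = np.trace(Q.T @ L @ Q)
print(score_eig, score_rand)
```

The score of the eigenvector basis equals the sum of the *k* smallest eigenvalues, and no other orthonormal basis of the same size can score lower.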

These observations can be related to the random walk on *G*. Since $L$ shares an eigenbasis with $K$ and the eigenbasis of $K$ is related to that of $P$ by theorem 13, we can relate the eigenvectors of $L$ to those of $P$ as follows:

If $y_\omega$ is an eigenvector of $L$ with eigenvalue $\lambda_\omega$, then $l_\omega = D^{\frac{1}{2}} y_\omega = y_{\omega,L}$ and $r_\omega = D^{-\frac{1}{2}} y_\omega = y_{\omega,R}$ are left and right eigenvectors of $P$, respectively, with eigenvalue $1 - \lambda_\omega$.

Because of this, the eigenvectors of $P$ have a similar form to those of $L$, but subject to the coordinate transformations of definition 19. Therefore, they can also be interpreted using the notion of smoothness, with λ = 1 in some sense being the smoothest case.
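The relationship between the eigenvectors of $L$ and $P$ can be verified directly. The sketch below assumes a random symmetric weight matrix (hypothetical), builds $P = D^{-1}W$ and $L = \mathbb{1} - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, and checks that the transformed eigenvectors satisfy the left and right eigenvalue equations of $P$ with eigenvalue $1 - \lambda_\omega$.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 5
A = rng.random((N, N))
W = A + A.T                       # symmetric weights (hypothetical example)
d = W.sum(axis=1)

P = W / d[:, None]                # random walk transition matrix D^{-1} W
L = np.eye(N) - W / np.sqrt(np.outer(d, d))

lam, Y = np.linalg.eigh(L)        # orthonormal eigenvectors y_omega as columns

left = np.sqrt(d)[:, None] * Y    # columns l_omega = D^{1/2} y_omega
right = Y / np.sqrt(d)[:, None]   # columns r_omega = D^{-1/2} y_omega

# Residuals of l^T P = (1 - lam) l^T and P r = (1 - lam) r.
left_residual = np.abs(left.T @ P - (1 - lam)[:, None] * left.T).max()
right_residual = np.abs(P @ right - right * (1 - lam)).max()
print(left_residual, right_residual)
```

Because the columns of `Y` are orthonormal, the left and right eigenvectors obtained this way are also biorthogonal, that is, `left.T @ right` is the identity.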

To illustrate the relationship between the eigenvectors of $L$ and $P$, we consider a simple Markov chain consisting of a linear arrangement of 100 states, where only nearest neighbor transitions are allowed (see Figure 16a). Transitions to the left and right occur with probability 0.48 and 0.52, respectively, and the self loops at $s_1$ and $s_{100}$ mean that staying fixed is allowed in these states. Take a moment to verify that this chain is both ergodic and reversible (hint: in the latter case, try theorem 5). Once we know the stationary probabilities $\pi_i$ of this chain, we can calculate the corresponding matrices $K$ and $L$ by using equation 4.10 and then equation 4.16. In Figure 16b, we plot the left and right eigenvector of $P$ with λ = 1, which are the stationary distribution $\pi$ and $\eta = (1, 1, \dots, 1)^T$, respectively. The stationary probabilities are larger for states near the right end of the chain by virtue of the tendency for rightward transitions in this chain. Additionally, we plot the corresponding eigenvector of $L$ with λ = 0, which is $y = (\pi_1^{\frac{1}{2}}, \pi_2^{\frac{1}{2}}, \dots, \pi_{100}^{\frac{1}{2}})^T$. This vector is in some sense an intermediary between $\pi$ and $\eta$ since, in agreement with proposition 10, $\pi = \Pi^{\frac{1}{2}} y = D^{\frac{1}{2}} y = y_L$ and $\eta = \Pi^{-\frac{1}{2}} y = D^{-\frac{1}{2}} y = y_R$ (as indicated by the black arrows in Figure 16b).

In Figure 16c, we show four more eigenvectors of $L$ having eigenvalues closest to 0. In agreement with our discussion, the eigenvectors get less smooth as λ increases, becoming more oscillatory and resembling trigonometric functions over the state space. In Figures 16d and 16e, we do the same for the left and right eigenvectors of $P$, respectively. The smoothness of these eigenvectors also depends on the size of λ; however, this time we are interested in those with eigenvalues closest to 1, and the smoothness decreases as λ decreases. Furthermore, in comparison to Figure 16c, these eigenvectors have an additional weighting effect across the state space. The left eigenvectors have larger amplitudes for states on the right-hand side, which is intuitive since these states have higher stationary probabilities π_{i}, and the eigenvectors can be obtained from those of $L$ by the left coordinate transformation of definition 19. For the right eigenvectors, the weighting is the opposite, with states on the left-hand side having larger amplitudes. This is explained by an equivalent argument, except that this time, we apply the right coordinate transformation to the eigenvectors of $L$. It should be noted that for the purpose of visualization, all eigenvectors in Figures 16c to 16e are normalized to have Euclidean norm 1, even though for the eigenvectors of $P$, this is not the natural normalization (see theorem 14).
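The 100-state chain described above can be reproduced in a few lines. This is a sketch under stated assumptions: we take $K = \Pi^{\frac{1}{2}} P \Pi^{-\frac{1}{2}}$ and $L = \mathbb{1} - K$, which is consistent with Table 2 when $W_{ij} = \pi_i P_{ij}$ and $D = \Pi$, and we obtain $\pi$ from detailed balance rather than from an eigendecomposition.

```python
import numpy as np

N, p_left, p_right = 100, 0.48, 0.52

# Nearest-neighbor chain with self-loops at the two boundary states.
P = np.zeros((N, N))
for i in range(N):
    if i > 0:
        P[i, i - 1] = p_left
    else:
        P[i, i] = p_left        # s_1 stays put instead of moving left
    if i < N - 1:
        P[i, i + 1] = p_right
    else:
        P[i, i] = p_right       # s_100 stays put instead of moving right

# Detailed balance implies pi_{i+1} / pi_i = p_right / p_left.
pi = (p_right / p_left) ** np.arange(N)
pi /= pi.sum()

flow = pi[:, None] * P                                 # pi_i P_ij
K = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]    # Pi^{1/2} P Pi^{-1/2}
L = np.eye(N) - K

lam, Y = np.linalg.eigh(L)
y0 = Y[:, 0] * np.sign(Y[:, 0].sum())   # eigenvector for lambda ~ 0, sign fixed
print(lam[:5])
```

The symmetry of `flow` confirms reversibility, and `y0` should coincide with the entrywise square root of `pi`.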

In this example, *G* has a degree structure that varies somewhat smoothly from left to right, which is partly responsible for the other eigenvectors also being so smooth. To see how this generalizes as we make the transition probabilities less homogeneous, we consider a modified chain, which we show in Figure 17a. In this model, left transitions from a state $s_i$ happen with probability $0.48 + \xi_i$ and right transitions with probability $0.52 - \xi_i$, where $\xi_i$ is a number sampled from a uniform distribution on the interval [−0.1, 0.1]. Therefore, this chain is like the previous example, but with a random perturbation at each state. In Figures 17b to 17e, we show the corresponding versions of Figures 16b to 16e. As one can see, the eigenvectors of $L$ as well as the left eigenvectors of $P$ have similar shapes to the previous case, but they appear more rugged due to the inhomogeneity of the transition probabilities. Given that the perturbations made to the previous model are small, this is an illustration of the fact that these sets of eigenvectors are very sensitive to local information. Rather interestingly, the same is not true for the right eigenvectors of $P$: in Figure 17b, the right eigenvector with λ = 1 is the same as in the previous example, and in Figure 17e, the other right eigenvectors are partially distorted but still appear somewhat smooth visually. This can be understood in the following way. Since the right eigenvectors of $P$ are the right transformations of the eigenvectors of $L$, we can reformulate the optimization problem in equations 4.24 and 4.25 in terms of the basis $Y_R = D^{-\frac{1}{2}} Y$:

$$\min_{Y_R}\ \mathrm{Tr}\left(Y_R^T (D - W) Y_R\right) = \min_{Y_R}\ \frac{1}{2}\sum_{i,j} W_{ij} \left\lVert y_{R,i} - y_{R,j} \right\rVert^2$$

$$\text{subject to}\quad Y_R^T D Y_R = \mathbb{1},$$

where $y_{R,i}$ denotes the $i$th row of $Y_R$. In this form, the objective directly penalizes differences between the values that the right eigenvectors take on pairs of vertices with large edge weight $W_{ij}$. In the context of the example in Figure 17, this means that such eigenvectors have similar values for neighboring vertices, and a quick check reveals that this is indeed the case in Figure 17e. Conversely, for the eigenvectors in Figures 17c and 17d the values on average vary more between neighboring vertices.
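The reformulated problem for the right eigenvectors can also be checked numerically on a perturbed chain. The sketch below assumes perturbed probabilities of the form 0.48 + ξ for left moves and 0.52 − ξ for right moves, edge weights $W_{ij} = \pi_i P_{ij}$, and $D = \Pi$; it verifies that $Y_R = D^{-\frac{1}{2}} Y$ satisfies the constraint $Y_R^T D Y_R = \mathbb{1}$ and that its objective value equals the sum of the first *k* eigenvalues of $L$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, k = 100, 5
xi = rng.uniform(-0.1, 0.1, size=N)
p_left, p_right = 0.48 + xi, 0.52 - xi    # assumed form of the perturbation

P = np.zeros((N, N))
for i in range(N):
    P[i, max(i - 1, 0)] += p_left[i]        # self-loop at the left boundary
    P[i, min(i + 1, N - 1)] += p_right[i]   # self-loop at the right boundary

# Detailed balance along the chain: pi_{i+1} / pi_i = P[i, i+1] / P[i+1, i].
idx = np.arange(N - 1)
pi = np.concatenate(([1.0], np.cumprod(P[idx, idx + 1] / P[idx + 1, idx])))
pi /= pi.sum()

D = np.diag(pi)                     # degrees of the graph with weights W
W = pi[:, None] * P                 # W_ij = pi_i P_ij
K = W / np.sqrt(np.outer(pi, pi))
L = np.eye(N) - (K + K.T) / 2       # symmetrize to remove rounding asymmetry

lam, Y = np.linalg.eigh(L)
Y_R = Y[:, :k] / np.sqrt(pi)[:, None]     # right transformation D^{-1/2} Y

objective = np.trace(Y_R.T @ (D - W) @ Y_R)
constraint = Y_R.T @ D @ Y_R
residual = np.abs(P @ Y_R - Y_R * (1 - lam[:k])).max()
print(objective, lam[:k].sum(), residual)
```

The small residual confirms that the columns of `Y_R` are right eigenvectors of $P$ with eigenvalues $1 - \lambda_\omega$.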

Finally, we note that the ordering of eigenvalues used in this section is somewhat different from that in section 2.3. In this section, the eigenvalues of $L$ are ordered from 0 up to 2, which corresponds to the eigenvalues of $P$ being ordered from 1 to −1. To reflect our interpretation, we call this *ordering by smoothness*. In section 2, however, the eigenvalues of a transition matrix are ordered by their absolute value, which describes how long the contribution from the corresponding eigenvector persists as the chain evolves. We therefore call this choice *ordering by persistence*. The suitability of either of these types of ordering depends on the specific problem domain in which a Markov chain is being used. For example, in the case of spectral clustering, they correspond to distinct objectives, as demonstrated in Liu (2011). Furthermore, these two types of ordering appear in Creutzig and Sprekeler (2008), in which the authors compare slowness and predictability as objectives for dimensionality reduction of time series data.
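The difference between the two orderings is easy to see on a small set of eigenvalues. The values below are hypothetical and chosen only so that the two orderings disagree.

```python
import numpy as np

# Hypothetical eigenvalues of a reversible transition matrix P (real, in [-1, 1]).
eigvals_P = np.array([1.0, 0.9, -0.95, 0.2, -0.1])

# Ordering by smoothness: eigenvalues of P from 1 down to -1
# (equivalently, eigenvalues of L from 0 up to 2).
by_smoothness = eigvals_P[np.argsort(-eigvals_P, kind="stable")]

# Ordering by persistence: eigenvalues of P by decreasing absolute value.
by_persistence = eigvals_P[np.argsort(-np.abs(eigvals_P), kind="stable")]

print(by_smoothness)    # [ 1.    0.9   0.2  -0.1  -0.95]
print(by_persistence)   # [ 1.   -0.95  0.9   0.2  -0.1 ]
```

The eigenvalue −0.95 comes last when ordering by smoothness but second when ordering by persistence, since its eigenvector oscillates strongly yet decays slowly.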

This concludes our treatment of random walks on undirected graphs. As a summary of the results given, in Table 2 we compare the mathematical properties and relationships of $L$, $K$, and $P$. The material presented in this section is particularly important for applications in machine learning and data mining in which the data set can be formulated as a graph. In particular, it underlies work that has been done on problems such as spectral clustering (Meilă & Shi, 2000, 2001; Tishby & Slonim, 2001; Saerens et al., 2004; Liu, 2011; Weinan et al., 2008), manifold learning/graph embedding (Coifman et al., 2005a; Coifman & Lafon, 2006), graph-based classification (Kamvar et al., 2003; Szummer & Jaakkola, 2001; Joachims, 2003), and value function approximation in reinforcement learning (Mahadevan, 2005; Mahadevan & Maggioni, 2007; Petrik, 2007; Stachenfeld et al., 2014, 2017). In the next section, we consider how, if at all, the material presented in this section generalizes to directed graphs.

| | $P$ | $K$ | $L$ |
|---|---|---|---|
| Relationship to $W$ | $D^{-1}W$ | $D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$ | $\mathbb{1}-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$ |
| Diagonalizable | ✔ | ✔ | ✔ |
| Symmetric | ✗ | ✔ | ✔ |
| Positive semidefinite | ✗ | ✗ | ✔ |
| Eigenvalues | $\lambda_\omega \in [-1, 1]$ | $\lambda_\omega \in [-1, 1]$ | $\lambda_\omega \in [0, 2]$ |
| Eigenvectors | lin. indep. | orthogonal | orthogonal |
| Left eigenvectors | $y_{\omega,L}$ | $y_\omega$ | $y_\omega$ |
| Right eigenvectors | $y_{\omega,R}$ | $y_\omega$ | $y_\omega$ |


### 4.3 Random Walks on Directed Graphs

Broadening our consideration to directed graphs is necessary if we want to describe nonreversible Markov chains. However, since many of the guarantees established in section 4.2 do not hold for directed graphs, this case is a lot harder to treat analytically. In section 4.3.1 we explore the main challenges that occur when applying spectral graph theory to the transition matrices of nonreversible Markov chains, and in section 4.3.2 we describe methods for circumventing these issues. Finally, in section 4.3.3 we define a generalization of $L$ to directed graphs, and in section 4.3.4 we present a method for enforcing ergodicity on random walks of directed graphs.

#### 4.3.1 Key Difficulties

Perhaps the most important difficulty is that the transition matrices of nonreversible chains are neither guaranteed to be diagonalizable nor to have real eigenvalues (Weber, 2017), which can be observed even for simple cases such as those shown in Figures 18a–18c. There are nonetheless some cases for which both properties hold (Weber, 2017), as shown by the example in Figure 18d, but it is still not fully understood to what degree, if at all, the transition structure of a nonreversible chain determines either the diagonalizability of $P$ or whether its eigenvalues are real or complex.^{8} When $P$ is nondiagonalizable, it does not have a set of *N* linearly independent eigenvectors, which can cause numerical issues since in this case, some matrix operations are computationally more expensive or not well defined. Moreover, if an eigenvalue of $P$ is complex, then so are the corresponding eigenvectors. As in the real case, we can choose to order these eigenvectors based on persistence or smoothness. In the former case, the generalization is somewhat straightforward since |λ| still describes how long each eigenvector typically persists. In the latter case, however, the question of how to generalize the concept of smoothness to complex eigenvectors is nontrivial and is still an actively researched topic in the literature (Sevi et al., 2023; Marques et al., 2020). These factors make analyzing the transition matrices of nonreversible Markov chains more challenging than in the reversible case.
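Both difficulties can be reproduced with very small chains. The two matrices below are hypothetical examples of our own (they are not the chains in Figure 18): a biased walk around a directed 3-cycle, which has complex eigenvalues, and an upper triangular chain whose repeated eigenvalue has too few eigenvectors.

```python
import numpy as np

# Hypothetical nonreversible chain: a biased walk around a directed 3-cycle.
P_cycle = np.array([[0.0, 0.9, 0.1],
                    [0.1, 0.0, 0.9],
                    [0.9, 0.1, 0.0]])
eig_cycle = np.linalg.eigvals(P_cycle)
has_complex = bool(np.any(np.abs(eig_cycle.imag) > 1e-12))

# Hypothetical nondiagonalizable chain: the eigenvalue 0.5 appears twice on the
# diagonal, but its eigenspace is only one-dimensional.
P_defective = np.array([[0.5, 0.5, 0.0],
                        [0.0, 0.5, 0.5],
                        [0.0, 0.0, 1.0]])
geometric_mult = 3 - np.linalg.matrix_rank(P_defective - 0.5 * np.eye(3))

print(eig_cycle, has_complex, geometric_mult)
```

In the second chain, the algebraic multiplicity of 0.5 is two while `geometric_mult` is one, so there is no basis of three linearly independent eigenvectors.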

#### 4.3.2 Alternative Methods

One general technique for treating a nondiagonalizable matrix $X$ is to add a perturbation so that it becomes diagonalizable. This is based on the notion that diagonalizable matrices densely fill the set of all matrices (Golub & Van Loan, 2013), meaning that it is always possible to find some nearby matrix $X'$ that is diagonalizable. Pauwelyn and Guerry (2021) develop a method along these lines for dealing with nondiagonalizable transition matrices. In particular, for a starting transition matrix $P$, a perturbation matrix $E$ is found such that $P'=P+E$ preserves a number of the spectral properties of $P$ and is diagonalizable. However, two limitations of this method are that it has computational complexity $O(N^8)$ for $N \times N$ matrices and the resulting transition matrix can still have complex eigenvalues.

Other lines of work have attempted to circumvent these issues by using alternative matrix decompositions, with a prominent example being the real Schur decomposition (Stewart, 1994; Conrad et al., 2016; Weber, 2017; Fackeldey et al., 2018; Ghosh & Bellemare, 2020).^{9} This decomposition provides a set of real orthogonal basis vectors, known as Schur vectors, that spans the eigenspaces of $X$ (Golub & Van Loan, 2013). This basis is not unique and corresponds to some ordering of the eigenvalues of $X$, with the first *k* Schur vectors spanning the eigenspaces of the first *k* eigenvalues in this ordering. Therefore, given some Schur decomposition of $X$, if $U_k=(u_1,u_2,\dots,u_k)$ is the set of first *k* Schur vectors, then for any linear combination $\tilde{u}=\sum_{\gamma=1}^{k}c_\gamma u_\gamma$, it is guaranteed that $X\tilde{u}\in \mathrm{span}(U_k)$. For this reason, $\mathrm{span}(U_k)$ is said to be an *invariant subspace* of $X$ (Golub & Van Loan, 2013), and when $k \ll N$ it provides a low-dimensional description of the transformation that $X$ represents. The real Schur decomposition is therefore a useful alternative to the eigendecomposition. However, so that the basis captures the most important information about $X$, a reordering algorithm is needed to specify which eigenspaces of $X$ it should span, and various methods for this have been developed (Ng & Parlett, 1987; Dongarra et al., 1992; Bai & Demmel, 1993; Granat et al., 2009; Brandts, 2002). Furthermore, it is worth emphasizing that in contrast to the eigendecomposition, the real Schur decomposition is guaranteed to exist for any real square matrix $X$, meaning that it sidesteps the issues of nondiagonalizability and complex feature spaces that can occur with transition matrices of nonreversible Markov chains. In the field of machine learning, the real Schur decomposition of transition matrices has been used as a tool for clustering (Fackeldey et al., 2018) as well as for building state representations in reinforcement learning (Ghosh & Bellemare, 2020).
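The invariant subspace property can be illustrated with `scipy.linalg.schur`. The sketch below uses a hypothetical nonreversible chain and no reordering step, so the eigenvalue order is whatever the underlying routine returns. One caveat, stated as our own remark: in the real Schur form a complex-conjugate eigenvalue pair occupies a 2 × 2 block, so *k* must be chosen so that it does not split such a block.

```python
import numpy as np
from scipy.linalg import schur

# Hypothetical nonreversible chain on four states.
P = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.1, 0.0, 0.9, 0.0],
              [0.8, 0.1, 0.0, 0.1],
              [0.0, 0.0, 0.5, 0.5]])

# Real Schur form: P = Z T Z^T with Z real orthogonal and T quasi upper triangular.
T, Z = schur(P, output="real")

# Choose the smallest k >= 1 that does not split a 2x2 block of T.
N = P.shape[0]
k = next(j for j in range(1, N + 1) if j == N or abs(T[j, j - 1]) < 1e-12)
U_k = Z[:, :k]

# Invariance: P U_k = U_k T_k, so P maps span(U_k) into itself.
invariance_residual = np.abs(P @ U_k - U_k @ T[:k, :k]).max()
print(k, invariance_residual)
```

In practice, a reordering routine would be applied first so that the leading Schur vectors span the eigenspaces of interest, for example those with eigenvalues closest to 1.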

#### 4.3.3 Directed Normalized Graph Laplacian

In section 4.2.2, the normalized graph Laplacian $L$ was introduced as a way to get a more precise description of the left and right eigenvectors belonging to transition matrices of reversible Markov chains. Generalizing $L$ to directed graphs is challenging since two of its defining features are that it is symmetric and positive semidefinite, neither of which can be satisfied by equation 4.18 if $W$ is nonsymmetric. However, various definitions for directed graphs exist, and while some loosen the constraint that $L$ should be positive semidefinite (Agaev & Chebotarev, 2005; Caughman & Veerman, 2006; Li & Zhang, 2012; Singh et al., 2016), others strictly enforce this via a type of symmetrization (Chung, 2005). We here focus on the latter type and demonstrate connections that this has to some of the material in section 2.

Perhaps the simplest method along these lines is to symmetrize the weight matrix $W$ of a directed graph *G* to get an alternative weight matrix $W_{\mathrm{sym}}$, for example $W_{\mathrm{sym}}=W+W^T$ or $W_{\mathrm{sym}}=W^TW$, and then use the regular definition of the normalized Laplacian using this new matrix: $\mathbb{1}-D^{-\frac{1}{2}}W_{\mathrm{sym}}D^{-\frac{1}{2}}$. Since $W_{\mathrm{sym}}$ describes an undirected graph, the resulting object can be interpreted in the same way as section 4.2.2. However, a major drawback of this approach is that the graphs described by $W$ and $W_{\mathrm{sym}}$ can have very different structural properties. For instance, there is no guarantee that the random walks on these two graphs have stationary distributions that bear any resemblance to one another. Indeed, various studies in machine learning have indicated that symmetrizing $W$ leads to a significant erasure of structural information from a directed graph (Pentney & Meilă, 2005; Mahadevan et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007).

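An alternative approach, due to Chung (2005), is to symmetrize the transition matrix $P$ of the random walk rather than $W$. Writing the resulting directed normalized Laplacian as $L_{\mathrm{dir}}$, it is defined as

$$L_{\mathrm{dir}} := \mathbb{1} - \frac{1}{2}\left(\Pi^{\frac{1}{2}} P\, \Pi^{-\frac{1}{2}} + \Pi^{-\frac{1}{2}} P^T \Pi^{\frac{1}{2}}\right), \tag{4.29}$$

where $\Pi$ is the diagonal matrix of stationary probabilities. This definition requires the stationary distribution to be strictly positive. In section 4.3.4 we describe a way to ensure this for any directed graph *G*, but for now, we simply assume that $\pi > 0$. Equation 4.29 can be simplified as

$$L_{\mathrm{dir}} = \mathbb{1} - \Pi^{\frac{1}{2}} P_A \Pi^{-\frac{1}{2}},$$

where $P_A$ is the additive reversibilization of $P$. Although this symmetrization also discards some information about the directed graph *G*, it has been observed empirically that this effect is less severe in comparison to methods that symmetrize $W$ itself (Pentney & Meilă, 2005; Mahadevan et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007). For example, note that $P$ and $P_A$ describe random walks that have the same stationary distributions (see definition 14).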
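As a minimal sketch of this construction, the code below builds a directed normalized Laplacian in the style of Chung (2005) for a hypothetical three-state chain. It assumes the symmetrized form $\mathbb{1} - \frac{1}{2}(\Pi^{\frac{1}{2}} P \Pi^{-\frac{1}{2}} + \Pi^{-\frac{1}{2}} P^T \Pi^{\frac{1}{2}})$ and checks that it agrees with the form based on the additive reversibilization $P_A = \frac{1}{2}(P + \Pi^{-1} P^T \Pi)$; the names `L_dir` and `P_rev` are our own.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P with eigenvalue 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return v / v.sum()

# Hypothetical ergodic, nonreversible chain on a directed graph.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.7, 0.3, 0.0]])

pi = stationary_distribution(P)
Pi_sqrt = np.diag(np.sqrt(pi))
Pi_inv_sqrt = np.diag(1.0 / np.sqrt(pi))

# Symmetrized form.
S = Pi_sqrt @ P @ Pi_inv_sqrt
L_dir = np.eye(3) - 0.5 * (S + S.T)

# Equivalent form via the additive reversibilization of P,
# where P_rev = Pi^{-1} P^T Pi is the time reversal of P.
P_rev = np.diag(1.0 / pi) @ P.T @ np.diag(pi)
P_A = 0.5 * (P + P_rev)
L_dir_alt = np.eye(3) - Pi_sqrt @ P_A @ Pi_inv_sqrt

print(np.linalg.eigvalsh(L_dir))
```

The resulting matrix is symmetric with eigenvalues in [0, 2], and `pi` is stationary for both `P` and `P_A`.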

The directed normalized Laplacian has been used in various contexts of machine learning as a way to generalize methods that are restricted to undirected graphs. It has been applied to problems such as spectral clustering (Meilă & Pentney, 2007; Huang et al., 2006; Liu, 2011), graph embedding (Chen et al., 2007; Perrault-Joncas & Meilă, 2011), graph-based classification (Zhou et al., 2005), and value function approximation in reinforcement learning (Johns & Mahadevan, 2007). In most of these applications, ergodicity is enforced on the random walk described by $P$, and in the next section, we introduce the standard method for doing this.

#### 4.3.4 Random Surfer Model

As explained in section 2.5, ergodic Markov chains have the useful property that they are guaranteed to converge to a unique stationary distribution, and various reasons were given for why this is desirable in a general context. For directed graphs, it is a particularly beneficial property, since without it, a random walk can get trapped in a small cluster of states, or even a single absorbing state. Because of this, sometimes ergodicity is enforced for random walks on directed graphs (Page et al., 1999; Zhou et al., 2005; Huang et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007). We remind readers that one effect this has is that $\pi >0$, meaning that the directed normalized Laplacian is well defined.

The standard way to do this is as follows. For a random walk on a directed graph *G* with weight matrix $W$, at each time step, there are two possible outcomes: either a regular random walk is performed with probability α, or the process teleports randomly to any vertex with probability 1 − α. This is known as a *random surfer model* or *teleporting random walk*, and it appeared for the first time in the PageRank algorithm (Page et al., 1999). If $P$ and $P_{\mathrm{tel}}$ represent the transition probabilities of the two possible outcomes at each time point, then the overall process is described by the following transition matrix,

$$\alpha P + (1 - \alpha) P_{\mathrm{tel}},$$

which is sometimes referred to as the *Google matrix* (Franceschet, 2011).

It is worth noting that there exist many variants of the random surfer model that differ in the assumptions they make about $P_{\mathrm{tel}}$ (Berkhin, 2005). For example, teleporting transitions can either be uniformly random or biased toward certain vertices through a set of weights. Furthermore, the parameter α ∈ [0, 1] is known as the *damping factor* and determines how close the process is to a regular random walk. Typically, it is set close to 1, so that the process still accurately reflects the structure of the underlying graph *G*.

In the case of uniform teleportation on a graph *G* with *N* vertices, every entry of $P_{\mathrm{tel}}$ is equal to $\frac{1}{N}$, and we can interpret $P_{\mathrm{tel}}$ as the transition matrix of a random walk on a graph $G_C$ that has the same number of vertices as *G* but where each vertex $v_i$ is connected to all others, including itself.^{10} An example is shown in Figure 19, with panel a showing the starting graph *G* and panel b showing $G_C$. In panels c and d, we show the transition matrices $P$ and $P_{\mathrm{tel}}$ of the random walks on *G* and $G_C$, respectively. Finally, in panel e, the transition matrix of the overall teleporting random walk with α = 0.85 is shown.

Due to the teleportation term $P_{\mathrm{tel}}$, from any given state $s_i$ there is always a nonzero probability to access any other state or to stay in the same state. By virtue of this, such processes are guaranteed to be both irreducible and aperiodic, and therefore ergodic.
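A teleporting random walk with uniform teleportation can be sketched as follows. The starting chain is a hypothetical example with an absorbing state, the function name `google_matrix` is our own, and the stationary distribution is approximated by simple power iteration.

```python
import numpy as np

def google_matrix(P, alpha=0.85):
    """Teleporting random walk with uniform teleportation to all N vertices."""
    N = P.shape[0]
    P_tel = np.full((N, N), 1.0 / N)
    return alpha * P + (1.0 - alpha) * P_tel

# Hypothetical directed graph whose random walk has an absorbing state (state 2).
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

P_surf = google_matrix(P, alpha=0.85)

# Power iteration: repeatedly apply P_surf to an initial distribution.
pi = np.full(3, 1.0 / 3)
for _ in range(500):
    pi = pi @ P_surf

print(P_surf)
print(pi)
```

Without teleportation, all probability mass would eventually end up in the absorbing state. With teleportation, every entry of the transition matrix is positive, and the stationary distribution assigns nonzero probability to every state.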

### 4.4 Summary

This concludes our treatment of random walks. The material presented in this section forms a useful framework that connects Markov chains and graphs. On the one hand, describing a Markov chain as a process taking place on a graph is a useful interpretation since it provides intuition about underlying relationships between states. Furthermore, it allows one to apply the tool kit of spectral graph theory to Markov chains. On the other hand, graphs by themselves represent only static relationships between entities, and performing a random walk is one way to describe a graph in terms that are dynamic or temporal. Moreover, the fact that a transition matrix can be easily raised to integer powers (i.e., $P^k$) means that a random walk provides information about a graph *G* at multiple timescales, which is a property that has been exploited in the field of manifold learning (Coifman et al., 2005a, 2005b; Coifman & Lafon, 2006). In this section, many concepts and results from linear algebra are required, for which we recommend Meyer (2000) as a general resource and Stewart (1994) as a more specific summary of the application to transition matrices. Moreover, we recommend Spielman (2019) as a text on spectral graph theory, where readers can find a more in-depth exploration of graph Laplacians.

## 5 Related Applications

While this tutorial focuses on mathematical concepts, the material has a number of applications in machine learning, as well as computer science more generally. A number of these are mentioned in section 1 and various parts of the main text. In order to provide some concrete examples, this section explores two of the most actively researched areas of application.

### 5.1 Markov Chain Monte Carlo

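Monte Carlo (MC) methods are a family of techniques that use random sampling to estimate a parameter of a hypothetical population. Here, the term *hypothetical population* simply means some distribution $\mu(x)$, and the term *parameter* can be interpreted as the expectation of a well-behaved function $\phi(x)$ computed over this distribution. Thus, if *q* is a quantity of interest, then MC methods start by formalizing it as such an expectation (Cemgil, 2014),

$$q = \mathbb{E}_{\mu}[\phi(x)]. \tag{5.1}$$

The next step is to draw a set of samples from $\mu(x)$ (i.e., $x_1, x_2, \dots, x_k$) with which this expectation can be evaluated empirically. The resulting approximation of the quantity *q* is then:

$$\hat{q} = \frac{1}{k}\sum_{i=1}^{k}\phi(x_i). \tag{5.2}$$

A natural question is how well $\hat{q}$ approximates *q*. Fortunately, for independent and identically distributed samples, the law of large numbers (LLN) guarantees that

$$\hat{q} \to q \quad \text{as } k \to \infty, \tag{5.3}$$

and the variance of the estimate is

$$\mathrm{Var}[\hat{q}] = \frac{\sigma^2}{k}, \tag{5.4}$$

where $\sigma^2$ is the variance of $\phi(x)$. Therefore, Monte Carlo methods provide an unbiased estimate of *q* with a standard deviation that scales like $O(k^{-\frac{1}{2}})$.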

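As a simple example, consider estimating the value of π by observing raindrops that fall uniformly at random on a square surface containing an inscribed circle, with the uniform distribution $u(x)$ defined across the total surface. Clearly, the probability of a single raindrop falling inside the circle is given by

$$p(x \in \text{circle}) = \frac{A_{\text{circle}}}{A_{\text{square}}} = \frac{\pi}{4},$$

which can be rearranged as the expectation

$$\pi = 4\,\mathbb{E}_{u}[\phi(x)],$$

where $\phi(x)$ is a function that is equal to 1 when a raindrop falls inside the circle and 0 otherwise. Our next step is to construct an empirical approximation for π in the form of equation 5.2. If we sample *k* raindrops from the distribution $u(x)$, this approximation reads

$$\hat{\pi} = \frac{4}{k}\sum_{i=1}^{k}\phi(x_i) = \frac{4\, n_{\text{circle}}}{k},$$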

where $n_{\text{circle}}$ is the number of raindrops observed to fall inside the circle. Clearly, generating truly uniform rainfall in a physical context is not feasible. However, such a model can be easily simulated with the help of pseudo-random number generators. Figure 20a depicts the result of such a simulation for *k* = 100, for which the resulting Monte Carlo approximation is $\hat{\pi} \approx 3.24$. While this is not very accurate, equations 5.3 and 5.4 tell us that the approximation should improve on average as *k* increases. We visualize this in Figure 20b, where *k* ranges from 1 to 500, and for each value of *k*, the approximation $\hat{\pi}$ is carried out 100 times. The blue line shows the mean value of $\hat{\pi}$ found at *k*, which gets closer to the true value as *k* increases, and the gray area shows a single standard deviation, which indeed appears to scale like $O(k^{-\frac{1}{2}})$ and indicates that fluctuations from the true value get smaller on average as *k* increases.
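The raindrop simulation can be sketched as follows. We assume that the surface is the square $[-1, 1]^2$ with the unit circle inscribed, so that the estimate is four times the fraction of samples inside the circle; the function name and seed are illustrative, and the numbers will differ from those in Figure 20.

```python
import numpy as np

def estimate_pi(k, rng):
    """Monte Carlo estimate of pi from k uniform samples on the square [-1, 1]^2."""
    x = rng.uniform(-1.0, 1.0, size=(k, 2))
    n_circle = np.count_nonzero((x ** 2).sum(axis=1) <= 1.0)   # inside unit circle
    return 4.0 * n_circle / k

rng = np.random.default_rng(0)
pi_hat_small = estimate_pi(100, rng)
pi_hat_large = estimate_pi(1_000_000, rng)

# Empirical spread of the estimator over repeated runs, for two sample sizes.
std_100 = np.std([estimate_pi(100, rng) for _ in range(500)])
std_400 = np.std([estimate_pi(400, rng) for _ in range(500)])

print(pi_hat_small, pi_hat_large)
print(std_100 / std_400)
```

Since the standard deviation scales like $k^{-\frac{1}{2}}$, quadrupling the number of samples should roughly halve the spread, so the printed ratio is expected to be close to 2.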

In the example, the Monte Carlo approximation is particularly straightforward because the distribution we needed to sample from was very simple. However, often this is not the case. Indeed, a large portion of Monte Carlo techniques are specifically designed to deal with situations in which sampling directly from the target distribution is difficult or impossible.

In particular, Markov chain Monte Carlo (MCMC) involves the construction of a Markov chain, defined over the sample space of the problem being studied, which is guaranteed to converge to the target distribution. Thus, by initializing such a chain and waiting until it converges, one can eventually generate samples to use for a Monte Carlo approximation. We have already seen that in order to have guarantees of convergence to a unique stationary distribution, a chain needs to be ergodic, and so the goal of MCMC is to find such an ergodic chain for a given stationary distribution $\pi $. A wide variety of MCMC methods exist, but by far the most famous is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). Below we summarize this algorithm, and to maintain consistency with the rest of the tutorial, we focus on the case where the relevant state space is discrete. However, the ideas presented can be generalized to the continuous case, and indeed, most interesting applications of the algorithm involve continuous state spaces.

In the MH algorithm, a *proposal distribution* for generating transitions must be known a priori, which we denote $P^{\mathrm{prop}}_{ij}=p(s_j|s_i)$, and for reasons that become clear below, we assume that the chain corresponding to these transition probabilities is irreducible. The generated transitions are then accepted with the following probability:

$$a_{ij} = \min\left(1, \frac{\pi_j P^{\mathrm{prop}}_{ji}}{\pi_i P^{\mathrm{prop}}_{ij}}\right). \tag{5.8}$$

Two points can be made about the expression above. First, although equation 5.8 involves stationary probabilities, the MH algorithm does not require that $\pi$ is known explicitly. Instead, it is only assumed that some positive function^{11} proportional to $\pi$ is known, that is, $\pi = \frac{f}{Z}$, where *Z* is a normalizing constant that is typically unknown or otherwise intractable. This means that ratios of stationary probabilities in equation 5.8 can be evaluated without needing to take care of *Z*: $\frac{\pi_j}{\pi_i} = \frac{f_j/Z}{f_i/Z} = \frac{f_j}{f_i}$. Second, the denominator on the right-hand side of equation 5.8 is always positive because $P^{\mathrm{prop}}_{ij} > 0$ for any proposed transition and $\pi_i \propto f_i > 0\ \forall i$, meaning that the fraction is always well defined.

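The acceptance probabilities $a_{ij}$ induce a resulting Markov chain with the following transition probabilities:

$$P_{ij} = \begin{cases} P^{\mathrm{prop}}_{ij}\, a_{ij} & \text{if } i \neq j, \\ 1 - \sum_{l \neq i} P^{\mathrm{prop}}_{il}\, a_{il} & \text{if } i = j. \end{cases}$$

This chain is reversible with respect to $\pi$. To see this, we only need to check the case $i \neq j$ since detailed balance is guaranteed for $i = j$:

$$\pi_i P_{ij} = \pi_i P^{\mathrm{prop}}_{ij} \min\left(1, \frac{\pi_j P^{\mathrm{prop}}_{ji}}{\pi_i P^{\mathrm{prop}}_{ij}}\right) = \min\left(\pi_i P^{\mathrm{prop}}_{ij},\ \pi_j P^{\mathrm{prop}}_{ji}\right) = \pi_j P_{ji}.$$

Therefore, $\pi$ is a stationary distribution of the chain, and provided that the chain is ergodic, its distribution over states converges to $\pi$ as $t \to \infty$.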
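For a small discrete state space, the transition matrix induced by the MH algorithm can be built explicitly and its properties checked. The sketch below assumes the acceptance probability $a_{ij} = \min(1, \frac{f_j P^{\mathrm{prop}}_{ji}}{f_i P^{\mathrm{prop}}_{ij}})$, a hypothetical unnormalized target `f`, and a symmetric random-walk proposal on a ring; the function name is our own.

```python
import numpy as np

def metropolis_hastings_matrix(f, P_prop):
    """Transition matrix of the MH chain for unnormalized target f and proposal P_prop."""
    N = len(f)
    P = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j and P_prop[i, j] > 0:
                # Acceptance probability; Z cancels in the ratio f[j] / f[i].
                a_ij = min(1.0, (f[j] * P_prop[j, i]) / (f[i] * P_prop[i, j]))
                P[i, j] = P_prop[i, j] * a_ij
        P[i, i] = 1.0 - P[i].sum()    # rejected or self-proposed moves stay at s_i
    return P

# Hypothetical target on 5 states, known only up to normalization.
f = np.array([1.0, 4.0, 2.0, 8.0, 5.0])

# Symmetric random-walk proposal on a ring (irreducible).
N = len(f)
P_prop = np.zeros((N, N))
for i in range(N):
    P_prop[i, (i - 1) % N] = P_prop[i, (i + 1) % N] = 0.5

P_mh = metropolis_hastings_matrix(f, P_prop)
pi = f / f.sum()
flow = pi[:, None] * P_mh             # pi_i P_ij, symmetric under detailed balance

print(np.linalg.matrix_power(P_mh, 2000)[0])
print(pi)
```

The symmetry of `flow` is exactly the detailed balance condition, and the rows of a high power of `P_mh` approach the normalized target even though the normalizing constant was never used in constructing the chain.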

Two key challenges facing the application of this algorithm, as well as other MCMC methods more generally, are the following (Johansen & Ludger, 2007). First, the time required for a Markov chain to get close to its stationary distribution, known as the *mixing time*, can be very long. Because of this, MCMC methods typically involve an initial *burn-in* period in which samples are discarded. One issue with this procedure is that it is rarely straightforward to judge how long is sufficient for a given target distribution, proposal chain, and initialization. However, in the case of an ergodic and reversible chain, such as in the MH algorithm, some guidance on this can be taken from a well-known result in Markov chain theory, which says that the mixing time for such chains is upper- and lower-bounded by values related to the *spectral gap* γ = 1 − λ_{2}, where λ_{2} is the second largest eigenvalue of the associated transition matrix (Levin et al., 2009). Second, samples from a Markov chain that occur close together in time are often highly autocorrelated, which clearly violates the i.i.d. property and reduces the effective sample size. A typical method for treating this, known as *thinning*, is to use only every *m*th sample generated from the chain. Something common to both of the issues described above is that they often make MCMC methods very slow in practice. Because of this, techniques for acceleration have received a lot of attention in the literature and are still an active area of research (Robert et al., 2018).

### 5.2 Reinforcement Learning

Reinforcement learning (RL) is a framework for studying how agents can learn behaviors that maximize reward signals by interacting with their environment. The canonical model for this type of learning is the *Markov decision process* (MDP), which is based on Markov chains. In this section, we introduce MDPs and outline how the material presented so far can be applied to these models.

MDPs are stochastic control processes, whereby an agent is in a state *s* at each time point *t* and chooses an action *a* from those that are available in *s*, then finds itself in a state $s'$ at time *t* + 1 and receives a scalar reward $r_{t+1}$. Formally, this can be defined as a 5-tuple $M=(S,A,p,r,\gamma)$, where $S$ is a state space, $A$ is an action space, $p(s'|s,a)$ is a transition model describing the probability of moving to $s'$ when taking action *a* in state *s*, $r(s,a)$ is a reward function describing the instantaneous reward received when taking action *a* in state *s*, and $\gamma \in [0, 1)$ is a discount factor (Sutton & Barto, 2018). Together *p* and *r* define the dynamics of the MDP, and both can be deterministic or stochastic, as long as they respect the Markov property, which requires that $r_{t+1}$ and $s_{t+1}$ depend only on $s_t$ and $a_t$. Furthermore, while in the most general case $S$, $A$, and *t* can be either discrete or continuous, in order to maintain consistency, we here consider the simplest setting where all are discrete.

The behavior of an agent is described by a *policy* $\mu(a_t|s_t)$ that specifies how likely the agent is to take a given action in a given state. The general goal of RL involves interleaving the following: evaluating how well a given policy μ optimizes reward, known as the *prediction problem*, and finding a new policy μ′ that improves on μ, known as the *control problem*. In order to solve either of these problems, one must first define a meaningful measure of reward for a policy μ. Since RL is concerned with sequential forms of behavior, cumulative rewards are more relevant than the instantaneous rewards received at each time point. Therefore, the typical way to measure how much reward a policy μ receives is to consider each state $s \in S$ and calculate the cumulative future reward that an agent can expect when it is in state *s* at time *t* and following policy μ. This is known as the *state-value* function and can be written as

$$v^\mu(s) = \mathbb{E}_\mu\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\ \middle|\ s_t = s\right].$$

Collecting these values into a vector $v^\mu$, this expression can be rewritten recursively as

$$v^\mu = r^\mu + \gamma P^\mu v^\mu,$$

which is known as the *Bellman equation*. The matrix $P^\mu$ is the result of combining the policy μ and the environment’s transition model *p* by marginalizing over actions, $P^\mu_{ij} = \sum_a \mu(a|s_i)\, p(s_j|s_i,a)$, and describes the Markov chain induced across $S$ when the agent behaves according to μ. The vector $r^\mu$ involves the same type of marginalization, $r^\mu_i = \sum_a \mu(a|s_i)\, r(s_i,a)$, and its entries describe the instantaneous reward expected in each state *s*.
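When the transition model and reward function are known, the state values of a policy can be computed by treating the Bellman equation in its matrix form, $v^\mu = r^\mu + \gamma P^\mu v^\mu$, as a linear system. The sketch below uses a hypothetical MDP with three states and two actions; the array layouts and all numbers are assumptions made for illustration.

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions.
# p[a, i, j] = p(s_j | s_i, a);  r[i, a] = r(s_i, a);  mu[i, a] = mu(a | s_i).
p = np.array([[[0.9, 0.1, 0.0],
               [0.0, 0.9, 0.1],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.0, 0.1, 0.9],
               [0.9, 0.0, 0.1]]])
r = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [2.0, 0.0]])
mu = np.array([[0.5, 0.5],
               [0.2, 0.8],
               [1.0, 0.0]])
gamma = 0.9

# Marginalize over actions to get the induced chain and expected rewards.
P_mu = np.einsum("ia,aij->ij", mu, p)
r_mu = (mu * r).sum(axis=1)

# Solve v = r_mu + gamma * P_mu v as the linear system (1 - gamma P_mu) v = r_mu.
v_mu = np.linalg.solve(np.eye(3) - gamma * P_mu, r_mu)

# Cross-check by fixed-point iteration (converges because gamma < 1).
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = r_mu + gamma * P_mu @ v_iter

print(v_mu)
```

The linear system always has a unique solution here because the eigenvalues of $\gamma P^\mu$ have absolute value at most $\gamma < 1$, which is also why the fixed-point iteration converges to the same values.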

In this section, we outline a few of the key ways in which Markov chains are important in RL, in particular focusing on the relationships between value functions and transition matrices. One of the assumptions underlying our analysis is that the environment’s transition probabilities and reward function are known a priori. However, in virtually all practical applications, this will not be the case, meaning that $v^\mu$ must be computed using sampled interactions with the environment. Even in these settings, concepts and results from Markov chain theory are very often relevant, and we recommend Sutton and Barto (2018) as a general text that explores this further.

## 6 Conclusion

The key motivation of this tutorial is to provide a single introductory text on the spectral theory of Markov chains. By bringing together concepts and results from different areas of mathematics, we hope this work is a useful resource for readers aiming to gain a broad, yet concise, overview of the topic. Our presentation involves two different paradigms for interpreting and analyzing Markov chains. Section 2 presents a categorization based on the transition structure and asymptotic behavior, and section 3 formalizes Markov chains as a type of graph. In section 4, these two perspectives are connected by introducing the idea of a random walk, which provides a number of parallels between some categories of Markov chains and certain types of graphs. In particular, one theme that aligns the two perspectives is the distinction between reversible and nonreversible Markov chains on the one hand, and undirected and directed graphs on the other hand, where in both cases, the former option is easier to treat than the latter. With the additional use of results from linear algebra, we arrive at an in-depth description of the eigenvalues and eigenvectors of transition matrices in the reversible case. Furthermore, we discuss various attempts that have been made to generalize spectral methods to the nonreversible case. Finally, section 5 explores two areas of computer science literature in which various concepts and results from the foregoing sections are used. Although the material mostly consists of known results, two novel contributions are the categorization of eigenvalues given in Table 1 and Figure 3, as well as the notion of random walk sets (see definition 18). Since we only assume minimal exposure to concepts from linear algebra and probability theory, and since focus is placed on providing intuition rather than rigorous results, the material of this tutorial is accessible to researchers and students in a variety of quantitative disciplines. For those working in fields related to machine learning and data mining, this work is particularly relevant due to the applications discussed at various points.

## Appendix: Proofs

*r* communicating classes. Therefore, for a suitable indexing of states in $S$, the transition matrix of this chain has a block diagonal form:

*k*th class. Furthermore, if $\pi > 0$ is one of its stationary distributions, it can be written as $\pi = \sum_{k=1}^{r} \alpha_k \pi_k$, where each $\pi_k$ has nonzero entries only in the *k*th class and α_{k} > 0 (see proposition 3). Thus, given the same indexing of states, the matrices $\Pi$ and $\Pi^{-1}$ have the following forms:

*k*th class for the time-reversed Markov chain. Furthermore, we can easily evaluate these block matrices:

_{k} terms that parameterize the stationary distribution, meaning that the time-reversed Markov chain is the same regardless of which distribution is considered.

*j*th row or column of the flow matrix $F_\pi$ gives the same result:

(⇒) Assume that $P$ is the transition matrix of a recurrent Markov chain. Then, for any stationary distribution $\pi > 0$, the flow matrix $F_\pi = \Pi P$ corresponds to one of the allowed graphs in $RW(X)$. Therefore, using the same argument given in the proof of proposition 5, the row and column sums of $F_\pi$ are the same, meaning that this matrix describes a balanced graph.

(⇐) Assume that *G* is a balanced graph with weight matrix $W$ and that $X$ is the Markov chain realized by a random walk on this graph. Now consider the distribution $\pi = \frac{1}{z}(d_1, d_2, \ldots, d_N)^T$, where the normalization constant $z = \sum_{j=1}^{N} d_j = \mathrm{vol}(G)$ is needed in order for $\pi$ to sum to 1. We can then easily verify that the equations of global balance hold for this distribution (see equation 2.19):

and so $\pi^T P = \pi^T$, which means that $\pi$ is a stationary distribution of the chain. Note that the summation in the fourth expression defines the in-degree of vertex *v*_{j}, but since the graph *G* is balanced, we only have one degree *d*_{j} associated with each vertex. Finally, since isolated vertices are not allowed, each degree must be bigger than zero, meaning similarly that π_{i} > 0 $\forall s_i \in S$. Hence, there are no transient states, and $X$ is recurrent.
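The stationarity argument above can be checked numerically. The following is a minimal sketch on a small directed graph whose weight matrix `W` is an assumption chosen for illustration; it is balanced (each row sum equals the corresponding column sum) but not symmetric, and the random walk is assumed to use the usual normalization $P_{ij} = W_{ij} / d_i$.

```python
import numpy as np

# Hypothetical balanced directed graph: row sums equal column sums.
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [2.0, 1.0, 0.0]])

out_deg = W.sum(axis=1)          # out-degrees (row sums)
in_deg = W.sum(axis=0)           # in-degrees (column sums)
is_balanced = np.allclose(out_deg, in_deg)

d = out_deg                      # single degree per vertex for a balanced graph
z = d.sum()                      # vol(G), the normalization constant
P = W / d[:, None]               # random walk: P_ij = W_ij / d_i
pi = d / z                       # candidate stationary distribution

is_stationary = np.allclose(pi @ P, pi)   # checks pi^T P = pi^T
```

Here `pi @ P` equals the column sums of `W` divided by `z`, which is why the check only succeeds when in-degrees and out-degrees coincide.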

*G* is one of the balanced graphs in $RW(X)$. Using the same argument from the second part of the proof of theorem 8, the degrees of *G* are related to a stationary distribution $\pi > 0$ of the chain via $\pi_i = \frac{d_i}{z}$. From theorem 4, we know that $X$ is reversible if and only if the flow matrix associated with $\pi$ is symmetric, and it is straightforward to show that this is equivalent to *G* being undirected:

*G* arbitrarily, this applies to any balanced graph in $RW(X)$.
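The link between symmetry of the flow matrix and undirectedness of the graph can be illustrated numerically. This is a sketch under stated assumptions: the helper `flow_matrix` and both weight matrices are hypothetical, the graph is assumed balanced so that $\pi_i = d_i / z$ is stationary, and the random walk uses $P_{ij} = W_{ij} / d_i$. Under these assumptions the flow matrix $\Pi P$ reduces to $W / z$, so it is symmetric exactly when $W$ is.

```python
import numpy as np

def flow_matrix(W):
    """Flow matrix Pi @ P for a random walk on a balanced graph (hypothetical helper)."""
    d = W.sum(axis=1)            # degrees
    z = d.sum()                  # vol(G)
    P = W / d[:, None]           # P_ij = W_ij / d_i
    Pi = np.diag(d / z)          # diagonal matrix of the stationary distribution
    return Pi @ P

# Balanced but directed graph (W is not symmetric).
W_directed = np.array([[1.0, 2.0, 0.0],
                       [0.0, 1.0, 3.0],
                       [2.0, 1.0, 0.0]])

# Undirected graph (W is symmetric, hence automatically balanced).
W_undirected = np.array([[0.0, 1.0, 2.0],
                         [1.0, 0.0, 3.0],
                         [2.0, 3.0, 1.0]])

F_dir = flow_matrix(W_directed)
F_und = flow_matrix(W_undirected)

reversible_dir = np.allclose(F_dir, F_dir.T)   # False: the walk is nonreversible
reversible_und = np.allclose(F_und, F_und.T)   # True: the walk is reversible
```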

See the second part of the proof of theorem 8.

*i*th entry equal to 1 and zeros elsewhere, this reduces to

*j*th entry equal to one and zeros elsewhere, we get

*i* and *j* have been chosen arbitrarily, equation A.33 holds between any pair of states. Therefore, the chain is reversible (see theorem 4).

(*i*, *j*)th element of $K$ is

_{k} > 0 (see proposition 3). Since all entries of $\pi$ are greater than zero, $K_{ij} = \sqrt{\pi_i}\, P_{ij}\, \frac{1}{\sqrt{\pi_j}} \neq 0$ iff *P*_{ij} ≠ 0. For a recurrent chain, the latter can only be true for pairs of states in the same communicating class. If *s*_{i} and *s*_{j} belong to the *k*th communicating class, then

_{k, i} denotes the *i*th component of the stationary distribution associated with the *k*th class. Hence, the nonzero entries of $K$ do not depend on the values of α_{k}, meaning that they are the same irrespective of the stationary distribution used.
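The independence of $K$ from the mixing weights can be checked on a small example. The sketch below assumes $K = \Pi^{1/2} P\, \Pi^{-1/2}$, which is consistent with the eigenvector relations used in this appendix; the two-class chain, the helper names, and the two choices of weights are illustrative assumptions.

```python
import numpy as np

# Hypothetical recurrent chain with two communicating classes (block diagonal P).
P1 = np.array([[0.5, 0.5],
               [0.2, 0.8]])
P2 = np.array([[0.1, 0.9],
               [0.6, 0.4]])
P = np.block([[P1, np.zeros((2, 2))],
              [np.zeros((2, 2)), P2]])

def stationary(Pk):
    """Stationary distribution of a single irreducible class (left eigenvector for eigenvalue 1)."""
    w, V = np.linalg.eig(Pk.T)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

pi1, pi2 = stationary(P1), stationary(P2)

def K_matrix(alpha):
    """K_ij = sqrt(pi_i) * P_ij / sqrt(pi_j) for pi = alpha_1 * pi1 (+) alpha_2 * pi2."""
    pi = np.concatenate([alpha[0] * pi1, alpha[1] * pi2])
    s = np.sqrt(pi)
    return s[:, None] * P / s[None, :]

K_a = K_matrix([0.5, 0.5])
K_b = K_matrix([0.9, 0.1])
same_K = np.allclose(K_a, K_b)   # the weights cancel within each class
```

Within the *k*th class the factor $\sqrt{\alpha_k}$ appears in both the numerator and the denominator, which is why the two matrices agree.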

_{ω}, then $r_\omega = \Pi^{-1/2} y_\omega$ and $l_\omega = \Pi^{1/2} y_\omega$ are a pair of corresponding right and left eigenvectors of $P$ with the same eigenvalue. Using this, we see that if $r_\omega$ and $r_\gamma$ are a pair of right eigenvectors of $P$, then

_{ω}, then $r_\omega = \Pi^{-1/2} y_\omega$ and $l_\omega = \Pi^{1/2} y_\omega$ are right and left eigenvectors of $P$, respectively, with the same eigenvalue. Therefore:
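The eigenvector relations used in these proofs can be verified numerically for a reversible chain. The sketch below is an illustration under assumptions: the chain is a random walk on a hypothetical undirected graph `W`, $K = \Pi^{1/2} P\, \Pi^{-1/2}$ is taken as the symmetric matrix whose orthonormal eigenvectors are the $y_\omega$, and the two orthogonality checks at the end are standard consequences of these definitions rather than a reproduction of the article's equations.

```python
import numpy as np

# Hypothetical undirected (symmetric) weight matrix, giving a reversible random walk.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 1.0]])

d = W.sum(axis=1)
pi = d / d.sum()                      # stationary distribution
P = W / d[:, None]                    # transition matrix
s = np.sqrt(pi)

K = s[:, None] * P / s[None, :]       # K = Pi^{1/2} P Pi^{-1/2}, symmetric here
lam, Y = np.linalg.eigh(K)            # orthonormal eigenvectors y as columns of Y

R = Y / s[:, None]                    # columns are r = Pi^{-1/2} y
L = Y * s[:, None]                    # columns are l = Pi^{1/2} y

right_ok = np.allclose(P @ R, R * lam)                  # P r = lambda r
left_ok = np.allclose(L.T @ P, lam[:, None] * L.T)      # l^T P = lambda l^T
pi_orthonormal = np.allclose(R.T @ np.diag(pi) @ R, np.eye(3))  # r^T Pi r' = delta
biorthonormal = np.allclose(L.T @ R, np.eye(3))                 # l^T r' = delta
```

Using `np.linalg.eigh` on the symmetric matrix `K` guarantees real eigenvalues and an orthonormal basis, which is the practical advantage of working with `K` instead of `P` directly.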

*G* and $x \in \mathbb{R}^N$. Then:

## Acknowledgments

We thank Jonathan Hermon for useful discussions on Markov chain theory, as well as Josué Tonelli-Cueto for his insight into the Perron-Frobenius theorem and various concepts in linear algebra.

## Notes

^{1}

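In component notation, the outer product between two vectors $x = (x_1, x_2, \ldots, x_N)^T$ and $y = (y_1, y_2, \ldots, y_N)^T$ is $xy^T := \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix} \begin{pmatrix} y_1 & y_2 & \cdots & y_N \end{pmatrix} = \begin{pmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_N \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_N \\ \vdots & \vdots & \ddots & \vdots \\ x_N y_1 & x_N y_2 & \cdots & x_N y_N \end{pmatrix}.$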

^{2}

For continuous time Markov chains, this result is known as Kelly’s lemma.

^{3}

We restrict edge weights to be positive in order to maintain this notion of strength; however, it is worth noting that some conventions in graph theory allow negative weights.

^{4}

Since in graph theory unweighted graphs are more commonly studied than weighted graphs, the degree quantities we have defined are sometimes referred to as *weighted degrees* (Chapman & Mesbahi, 2011), but for simplicity we just use the term *degree*.

^{5}

We remind readers that the spectral radius of a square matrix $X$ is its largest eigenvalue in absolute value and is denoted $\rho (X)$. The name relates to the fact that all eigenvalues are contained within a disk of radius $\rho (X)$ centered at the origin of the complex plane.

^{6}

For this definition to work, we require that all vertices have $d_i^+ > 0$, meaning that isolated vertices or vertices with only incoming edges are forbidden.

^{7}

While the examples here each have a single communicating class, the analysis would apply equally to chains with multiple classes.

^{9}

It should be noted that most of these studies primarily consider nonreversible chains that are ergodic.

^{10}

Sometimes objects like *G*_{C} are referred to as *complete graphs* or *fully connected networks*. However, such terms typically do not include the possibility of self-loops, which we by definition need since we consider uniform teleportation.

^{11}

Note that some references make the milder assumption of a nonnegative function, but that this can be made positive by removing states for which π_{i} = 0.

## References

## Author notes

Note: This is a corrected article. See attached erratum.