A Tutorial on the Spectral Theory of Markov Chains

Abstract
Markov chains are a class of probabilistic models that have achieved widespread application in the quantitative sciences. This is in part due to their versatility, but is compounded by the ease with which they can be probed analytically. This tutorial provides an in-depth introduction to Markov chains and explores their connection to graphs and random walks. We use tools from linear algebra and graph theory to describe the transition matrices of different types of Markov chains, with a particular focus on exploring properties of the eigenvalues and eigenvectors corresponding to these matrices. The results presented are relevant to a number of methods in machine learning and data mining, which we describe at various stages. Rather than being a novel academic study in its own right, this text presents a collection of known results, together with some new concepts. Moreover, the tutorial focuses on offering intuition to readers rather than formal understanding and only assumes basic exposure to concepts from linear algebra and probability theory. It is therefore accessible to students and researchers from a wide variety of disciplines.


Introduction
Markov chains are a versatile tool for modelling stochastic processes, and have been applied in a wide variety of scientific disciplines, such as biology, computer science, and finance [1]. This is unsurprising considering the number of practical advantages they offer: (i) they are easy to describe analytically, (ii) in many domains they make complex computations tractable, and (iii) they are a well understood model type, meaning that they offer some level of interpretability when used as a component of an algorithm. Furthermore, as we show in this tutorial, Markov chains are temporal processes that take place on graphs. This makes them particularly suitable for modeling data generating processes that underlie time series and graph data sets, both of which have received much attention in the fields of machine learning and data mining [2].
The application of Markov chains requires the assumption that at least some aspect of the process being modelled has no memory. An important consequence of this assumption is that the process can be described in detail using a transition matrix. Furthermore, there exists a rich framework for describing distinct features of such processes based on the eigenvalues and eigenvectors of this matrix. This tutorial provides an in-depth exploration of this framework, making use of tools from probability theory, linear algebra and graph theory. Since the work is intended for readers from diverse academic backgrounds, we concentrate on providing intuition for the tools used rather than strict mathematical formalism.
The material presented underlies multiple methods from different areas of machine learning, and instead of exploring these methods individually we focus on the general properties that make Markov chains useful across these domains. Nonetheless, so that readers can appreciate the scope of the tutorial, we now briefly summarize the methods that it is relevant to. In graph-based unsupervised learning, it is related to nonlinear dimensionality reduction techniques such as Laplacian eigenmaps [3,4] and spectral clustering [5,6,7]. These two closely related methods both aim to represent data sets in a way that preserves local geometry, and are traditionally formulated using graph Laplacians. However, one line of work on spectral clustering instead uses Markov chains [8,9,10,11,12,13,14]. Furthermore, the method of diffusion maps [15,16] is a generalization of Laplacian eigenmaps that is based on Markov chains, and can be tuned to different length scales in a graph, thereby allowing a multiscale geometric analysis of data sets. An in-depth survey of Laplacian eigenmaps, spectral clustering, diffusion maps, as well as other related methods can be found in [17]. In the domain of time series analysis, the tutorial is relevant to slow feature analysis (SFA) [18], a dimensionality reduction technique that is based on the notion of temporal coherence and is conceptually related to Laplacian eigenmaps [19]. The ideas underlying Laplacian eigenmaps and spectral clustering have also been extended to classification problems, both for labelled [20], and partially labelled data sets [21,22,23]. Lastly, the material presented in this tutorial also forms the basis of various approaches to value function approximation in reinforcement learning, such as Mahadevan's proto-value functions [24,25,26], Stachenfeld's work on the successor representation [27,28], and other closely related methods [29,30]. 
Many of the applications mentioned thus far assume that all underlying graphs are undirected, or equivalently that the corresponding Markov chain is reversible. This provides a number of guarantees that are crucial for these methods to work, and we explore these guarantees in depth in this tutorial. In most cases, the extension to the directed/non-reversible setting faces a number of challenges and is still actively researched. We discuss these challenges and present various solutions that have been suggested in the literature.
The rest of the text is organized as follows. In Section 2, we give a general introduction to discrete-time, stationary Markov chains on finite state spaces and explore some specific types of chains in detail. Section 3 then gives a formal introduction to graphs in order to provide a more detailed description of Markov chains. In Section 4, random walks are presented as a canonical transformation that turns any graph into a Markov chain, and the undirected/directed cases are considered separately to better understand the types of Markov chains that they typically give rise to.

Definition
Markov processes are an elementary family of stochastic models describing the temporal evolution of an infinite sequence of random variables $X = \{X_t : t \in T\}$, defined on a state space $S$ and indexed by a time set $T$. Such processes respect the Markov property, in which the future evolution is conditionally independent of the past, given the present state of the chain. In this tutorial, we focus on models for which time is discretized, i.e. $T = \mathbb{N}_0$, known as Markov chains. Furthermore, we restrict our consideration to Markov chains defined on finite state spaces with $|S| = N$ states. In such settings, the Markov property can be formalized in terms of transition probabilities:
\[ \Pr(X_{t+1} = s_j \mid X_t = s_i, X_{t-1} = s_{i_{t-1}}, \dots, X_0 = s_{i_0}) = \Pr(X_{t+1} = s_j \mid X_t = s_i). \]
If these probabilities are themselves independent of time, the Markov chain is said to be homogeneous, and its evolution can be fully described by one-step transition probabilities between pairs of states in $S$: $\Pr(X_{t+1} = s_j \mid X_t = s_i) = P_{ij}$. Collectively, these probabilities can be represented as an $N \times N$ right-stochastic matrix:
\[ P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1N} \\ P_{21} & P_{22} & \cdots & P_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ P_{N1} & P_{N2} & \cdots & P_{NN} \end{pmatrix}, \]
which is called the transition matrix of the Markov chain and has the property that the rows sum to one, i.e. $\sum_{j=1}^{N} P_{ij} = 1$ for all $i$.
Markov chains can also be depicted visually in the form of a graph, with the state space $S$ drawn as a collection of circles and labelled arrows between these circles representing the non-zero transition probabilities $P_{ij}$. We call this diagram the transition graph of a Markov chain. A formal introduction to the mathematics of graphs is given in Section 3, but until then transition graphs are simply used as an illustrative tool.
As an example, imagine you are a PhD student who wants to evaluate how efficiently you work. In order to simplify your analysis, you posit that at any time of a working day you are doing one of four activities: (i) studying, (ii) speaking to your professor, (iii) eating food, and (iv) drinking coffee, which you denote as a set of states $s_1 = S$, $s_2 = P$, $s_3 = F$ and $s_4 = C$, respectively. As a further simplification, you assume that transitions between these activities are Markovian. After monitoring your activities for a few days, you come up with a set of empirical transition probabilities which you use to construct a transition graph, shown in Figure 1, and a corresponding transition matrix $P$.
With either of these two representations, it is straightforward to generate a realization of this Markov chain. To do this, we first need to pick a starting activity. Suppose that you are studying at time $t = 0$; then all activities are possible at $t = 1$. In order to choose from these possibilities, one must sample from a probability vector equal to the first row of $P$, i.e. $\Pr(X_1 \mid X_0 = S) = (0.5, 0.1, 0.2, 0.2)$. If our sample yields $X_1 = C$, then this becomes the current state and we repeat the process. Doing this iteratively can generate sequences of arbitrary length, each of which we refer to as a trajectory in the state space $S$.
As is often the case when studying a stochastic process, generating single trajectories is rather uninformative since it provides no collective description of how the process tends to evolve. Naively, one way we could try to achieve such a description would be to perform a type of Monte Carlo sampling by generating several trajectories from the same starting state and summarizing the frequency with which future activities occur. In Figure 2(a-d), we do this for $n = 10$ trajectories of length 4, each starting with $X_0 = S$.
Each trajectory is depicted in a specific color, and consists of points in a transition graph plotted across 4 time points, so that the position of each point indicates a state that one of the trajectories is in at time t. Other than a slight bias towards studying (S) at each time point, it is hard to pick out any clear patterns using only these 10 trajectories. Figure 2(e-h) shows similar plots for n = 100, but with all points colored black and the relative occupation of each state for t > 0 indicated by a percentage. Finally, we increase the number of trajectories to n = 1000 in Figure 2(i-l). Percentages are again used to indicate the relative state occupations, but instead of representing the trajectories with dots we color each state in gray-scale based on the percentage values. Comparing all the plots in this figure, one can note that in the first row it is possible to track each of the individual trajectories, whereas in the second and third rows the focus is instead on approximating the relative probability of doing each activity at each time.
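The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the first row of `P` is the one quoted in the text, whereas rows 2 to 4 are hypothetical values chosen for this sketch, and the function names `sample_trajectory` and `occupation_frequencies` are likewise introduced here.

```python
import random

# States in the order used in the text: S(tudy), P(rofessor), F(ood), C(offee).
STATES = ["S", "P", "F", "C"]

# Row 1 is quoted in the text; rows 2-4 are ASSUMED for illustration only.
P = [
    [0.5, 0.1, 0.2, 0.2],  # from S (given in the text)
    [1.0, 0.0, 0.0, 0.0],  # from P (assumed)
    [0.5, 0.0, 0.0, 0.5],  # from F (assumed)
    [0.5, 0.0, 0.5, 0.0],  # from C (assumed)
]


def sample_trajectory(P, start, length, rng):
    """Return the list of state indices [X_0, ..., X_length]."""
    path = [start]
    for _ in range(length):
        # The next state is drawn from the row of P of the current state.
        nxt = rng.choices(range(len(P)), weights=P[path[-1]])[0]
        path.append(nxt)
    return path


def occupation_frequencies(P, start, length, n, rng):
    """Fraction of n trajectories found in each state at t = 0, ..., length."""
    counts = [[0] * len(P) for _ in range(length + 1)]
    for _ in range(n):
        for t, s in enumerate(sample_trajectory(P, start, length, rng)):
            counts[t][s] += 1
    return [[c / n for c in row] for row in counts]


rng = random.Random(0)  # fixed seed so that a run can be repeated
print("-".join(STATES[s] for s in sample_trajectory(P, 0, 10, rng)))
for n in (10, 100, 1000):
    freqs = occupation_frequencies(P, 0, 4, n, rng)
    print(n, [round(f, 2) for f in freqs[1]])  # relative occupations at t = 1
```

As `n` grows, the printed occupations at $t = 1$ should settle near the first row of `P`, which mirrors the progression from the first to the third row of Figure 2.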
The plots of Figure 2 represent collections of random experiments, and therefore running them again would not yield exactly the same outcomes. However, taking a frequentist interpretation of probability, we can ask: in the limit $n \to \infty$, how often is each activity done at time $t$? The answer to this question for $t = 1$ is given by the vector $\Pr(X_1 \mid X_0 = S) = (0.5, 0.1, 0.2, 0.2)$, from which we sample at each time point in a trajectory. For $t = 2$, a distribution vector can be calculated by evaluating all the possible trajectories that lead up to each state after two steps. For example, first consider the probability with which we are drinking coffee at $t = 2$. Clearly, there are two ways this can happen: (i) $S - S - C$ (i.e. $s_1 - s_1 - s_4$), (ii) $S - F - C$ (i.e. $s_1 - s_3 - s_4$). By combining the corresponding probabilities, we get:
\begin{align}
\Pr(X_2 = s_4 \mid X_0 = s_1) &= \sum_{j=1}^{4} \Pr(X_1 = s_j \mid X_0 = s_1)\,\Pr(X_2 = s_4 \mid X_1 = s_j) \tag{4} \\
&= \sum_{j=1}^{4} P_{1j} P_{j4} \tag{5} \\
&= P_{11} P_{14} + P_{12} P_{24} + P_{13} P_{34} + P_{14} P_{44} \tag{6} \\
&= 0.1 + 0 + 0.1 + 0 \tag{7} \\
&= 0.2 \tag{8}
\end{align}
By performing a similar calculation for the other activities at $t = 2$, we get the following probability vector: $\Pr(X_2 \mid X_0 = S) = (0.55, 0.05, 0.2, 0.2)$.
A number of comments can be made at this stage. Firstly, while we can extend this type of calculation to $t > 2$, this quickly becomes infeasible to do by hand as the number of steps increases. It turns out that there is a simple mathematical formalism which makes these computations both more efficient and more interpretable. We introduce this formalism in the following section. Secondly, we can also apply the distribution picture at $t = 0$. In the example we gave, we always started in the same state, but we can easily generalize this to the case where $X_0$ is not fully determined. For example, $\Pr(X_0) = (0.5, 0, 0, 0.5)$ indicates that studying and drinking coffee both occur at $t = 0$ with probability 0.5.
Lastly, while the trajectories of a Markov chain were random, the two distributions we obtained for $t = 1$ and $t = 2$ were fully determined by our initial condition of $X_0 = s_1 = S$. Thus, moving from the trajectory perspective to the distribution perspective rather interestingly makes the evolution of our Markov chain look deterministic.
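The two-step calculation can be reproduced by enumerating the intermediate state explicitly. The sketch below uses the first row of the transition matrix as quoted in the text; rows 2 to 4 are assumptions, chosen here only because they agree with the two-step values quoted in the text, and should not be read as the original matrix. The helper names are hypothetical.

```python
STATES = ["S", "P", "F", "C"]

# Row 1 is quoted in the text; rows 2-4 are ASSUMED for illustration only.
P = [
    [0.5, 0.1, 0.2, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]


def two_step_paths(P, start, end):
    """Probability of each path start -> j -> end, keyed by the middle state j."""
    return {j: P[start][j] * P[j][end] for j in range(len(P))}


def two_step_distribution(P, start):
    """Pr(X_2 = s_k | X_0 = s_start) for every k, summed over middle states."""
    return [sum(two_step_paths(P, start, k).values()) for k in range(len(P))]


coffee = two_step_paths(P, start=0, end=3)
for j, prob in coffee.items():
    print(f"S-{STATES[j]}-C: {prob:.2f}")
print("Pr(X_2 = C | X_0 = S) =", round(sum(coffee.values()), 2))  # 0.2
print([round(p, 2) for p in two_step_distribution(P, 0)])  # [0.55, 0.05, 0.2, 0.2]
```

Only the paths through S and F contribute to the coffee probability, matching the two routes identified in the text.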

Evolution via matrix-vector multiplication
The structure and action of $P$, just like any other matrix, can be evaluated using tools from linear algebra. In particular, $P$ can either multiply column vectors from the left, or row vectors from the right. While typical conventions formulate matrix multiplication in the former way, the latter is more common for right-stochastic matrices due to its semantic interpretation. Nonetheless, as both operations offer their own insight into the descriptive capacity of Markov chains, we outline both in this section. For any vector $x \in \mathbb{R}^N$, multiplying it in row form with $P$ from the right gives a new vector $z \in \mathbb{R}^N$:
\[ x^T P = z^T, \]
where $z_j = \sum_{i=1}^{N} x_i P_{ij}$. By summing over the elements of $z$, we see that:
\[ \sum_{j=1}^{N} z_j = \sum_{j=1}^{N} \sum_{i=1}^{N} x_i P_{ij} = \sum_{i=1}^{N} x_i \sum_{j=1}^{N} P_{ij} = \sum_{i=1}^{N} x_i, \]
meaning that multiplying with $P$ from the right preserves the sum over vector entries. Furthermore, note that since all entries of $P$ are non-negative, if $x$ is non-negative then so too is $z$. These two properties are particularly useful when dealing with probability vectors, which by definition both sum to 1 and have non-negative entries, since it means that a probability vector $x$ is mapped into another probability vector when multiplied from the right by $P$. As we have seen, probability vectors are a suitable representation for tracking the evolution of a Markov chain, and as a shorthand we describe the distribution of a chain at time $t$ by $\mu(t) = (\mu_1(t), \mu_2(t), \dots, \mu_N(t))^T$. Such a distribution can easily be evolved into another distribution describing the chain at time $t + 1$. To see this, consider the probability of being in some state $s_j$ at time $t + 1$, i.e. $\mu_j(t + 1)$. This depends both on the probability of being in each state $s_i$ at time $t$, i.e. $\mu_i(t)$, as well as the probability of making a transition from each state $s_i$ to $s_j$, i.e. $P_{ij}$.
By summing over all possible states $s_i$, we therefore see that:
\[ \mu_j(t+1) = \sum_{i=1}^{N} \mu_i(t) P_{ij}. \tag{11} \]
Using these probabilities, we can now form the probability vector $\mu(t+1) = (\mu_1(t+1), \mu_2(t+1), \dots, \mu_N(t+1))^T$, which is described by the vector-matrix analogue of Equation (11):
\[ \mu(t+1)^T = \mu(t)^T P. \]
Thus, multiplying probability vectors with $P$ from the right represents the one-step evolution of a Markov chain. Furthermore, this operation can be extended to multiple steps of evolution by raising the power of $P$:
\[ \mu(t+k)^T = \mu(t)^T \underbrace{P \cdots P}_{k \text{ times}} = \mu(t)^T P^k, \tag{17} \]
where it is straightforward to establish that $P^k$ is itself stochastic. The fact that the $k$-step evolution of a chain is represented simply by $P^k$ is a result known as the Chapman-Kolmogorov equation [31], and it tells us that the transition matrix is the only thing needed to evolve a starting distribution of a Markov chain arbitrarily far into the future. A particularly special type of distribution for a Markov chain is one which is invariant under its evolution:
Definition 2.2.1 (Stationary distribution). A distribution $\pi$ that is invariant when multiplied by $P$ from the right, i.e.
\[ \pi^T = \pi^T P, \tag{18} \]
is said to be a stationary distribution of the Markov chain, and for finite state spaces it is guaranteed that at least one such distribution exists.
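The $k$-step evolution of Equation (17) can be checked numerically. The following is a minimal sketch in plain Python; the matrix is the illustrative one used for the PhD example, in which only the first row is quoted in the text and rows 2 to 4 are assumptions, and the helper names `matmul`, `mat_power` and `evolve` are hypothetical.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(P, k):
    """P raised to the non-negative integer power k (P^0 is the identity)."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, P)
    return result


def evolve(mu, P, k):
    """Row vector mu(t) multiplied by P^k from the right, i.e. mu(t + k)."""
    Pk = mat_power(P, k)
    n = len(P)
    return [sum(mu[i] * Pk[i][j] for i in range(n)) for j in range(n)]


# Row 1 is quoted in the text; rows 2-4 are ASSUMED for illustration only.
P = [
    [0.5, 0.1, 0.2, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]

mu0 = [1.0, 0.0, 0.0, 0.0]  # start in S with certainty
for k in (1, 2, 5, 50):
    print(k, [round(v, 3) for v in evolve(mu0, P, k)])

# P^k is itself stochastic: every row still sums to one.
print([round(sum(row), 12) for row in mat_power(P, 5)])
```

Repeated multiplication is used for clarity; for large `k`, exponentiation by squaring would need far fewer matrix products.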
The set of $N$ equations implied by Equation (18) are often referred to as the equations of global balance. To understand why, consider the $j$-th equation in this set:
\[ \pi_j = \sum_{i=1}^{N} \pi_i P_{ij}. \tag{19} \]
The right-hand term in Equation (19) represents the total flow of probability mass into state $s_j$ from all other states (including $s_j$ itself when $P_{jj} \neq 0$). Since $\pi_j$ remains invariant under this flow, an equal amount of probability mass must be flowing from $s_j$ to all other states in $S$, hence the name global balance. An important implication of Equation (18) and Equation (19) is that when a chain is in a stationary distribution at time $t$, it remains there for all future time steps. We can therefore interpret such distributions as a type of steady state of the underlying process. We can also ask how much probability mass moves from one state to another in a given stationary distribution $\pi$. This is described by the following matrix:
Definition 2.2.2 (Flow matrix). Given that a Markov chain is in one of its stationary distributions $\pi$, the corresponding flow matrix is defined as:
\[ F_\pi = \Pi P, \]
where $\Pi := \mathrm{diag}(\pi)$ is a diagonal matrix consisting of the entries of $\pi$, and $(F_\pi)_{ij} = \pi_i P_{ij}$ is the stationary flow of probability mass from one state $s_i$ to another state $s_j$.
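Global balance and the flow matrix can be illustrated numerically. The sketch below finds a stationary distribution by repeatedly applying $P$ to a uniform starting distribution, which is a simple technique chosen for this illustration and only converges for suitable chains (the assumed matrix below is one such case); it is not a method prescribed by the text. As before, only the first row of `P` is quoted in the text, rows 2 to 4 are assumptions, and the function names are hypothetical.

```python
# Row 1 is quoted in the text; rows 2-4 are ASSUMED for illustration only.
P = [
    [0.5, 0.1, 0.2, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]


def step(mu, P):
    """One step of evolution: mu_j(t+1) = sum_i mu_i(t) P_ij."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_by_iteration(P, tol=1e-12, max_steps=10_000):
    """Iterate mu <- mu P from the uniform distribution until it stops changing."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(max_steps):
        nxt = step(mu, P)
        if max(abs(a - b) for a, b in zip(mu, nxt)) < tol:
            return nxt
        mu = nxt
    raise RuntimeError("no convergence; the chain may be periodic")


def flow_matrix(pi, P):
    """(F_pi)_ij = pi_i * P_ij, i.e. diag(pi) multiplied by P."""
    n = len(P)
    return [[pi[i] * P[i][j] for j in range(n)] for i in range(n)]


pi = stationary_by_iteration(P)
F = flow_matrix(pi, P)
outflow = [sum(row) for row in F]                              # mass leaving s_i
inflow = [sum(F[i][j] for i in range(len(P))) for j in range(len(P))]  # mass entering s_j

print("pi      =", [round(v, 4) for v in pi])
print("inflow  =", [round(v, 4) for v in inflow])
print("outflow =", [round(v, 4) for v in outflow])
```

The row sums and column sums of the flow matrix both reproduce $\pi$, which is the global balance condition of Equation (19) expressed in terms of flows.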
So far we have only considered row vectors, and their interpretation as probability vectors was fitting due to the way they transform when multiplied by $P$. For reasons that we outline below, this interpretation does not apply in the case of column vectors, which are multiplied by $P$ from the left. To show this, we first assume that $x = (x_1, x_2, \dots, x_N)^T$ is any vector in $\mathbb{R}^N$. Since this vector assigns a value to each state, we can interpret it as a function on $S$. If the chain is described at time $t$ by $\mu(t)$, we can use this distribution to calculate the expected value of $x$:
\[ \mathbb{E}_{\mu(t)}[x] = \sum_{i=1}^{N} \mu_i(t)\, x_i = \mu(t)^T x. \]
If we want to look $k$ time steps into the future, then expected values are calculated in the same way but with the additional evolution of $\mu(t)$ in accordance with Equation (17):
\[ \mathbb{E}_{\mu(t+k)}[x] = \mu(t+k)^T x = \mu(t)^T P^k x. \tag{25} \]
Hence, given any starting distribution $\mu(t)$, the transition matrix can produce expected values of any function on the state space arbitrarily far into the future. Often, the initial condition of a Markov chain has zero uncertainty, with all probability occupying a single state $s_i$. In such settings, the starting distribution is a row vector with one component equal to 1 and all others zero, known as a one-hot vector and denoted by $e_i$. If we evaluate the $k$-step expectation in Equation (25) with $e_i$ as a starting distribution, we get:
\[ e_i^T P^k x = (P^k x)_i = \sum_{j=1}^{N} (P^k)_{ij}\, x_j. \]
This tells us how a vector $x$ is transformed when multiplied by $P^k$ from the left: it produces a new vector $x' = P^k x$, whose $i$-th element is the expected value of $x$ after $k$ steps, conditioned on the starting state being $s_i$. Formally, this is a conditional expectation:
\[ x'_i = \mathbb{E}\left[ x(X_{t+k}) \mid X_t = s_i \right], \]
which is the main operation underlying value functions in reinforcement learning [32]. Summing over the elements of $x'$ gives:
\[ \sum_{i=1}^{N} x'_i = \sum_{j=1}^{N} \Big( \sum_{i=1}^{N} (P^k)_{ij} \Big) x_j, \]
which in general differs from $\sum_j x_j$, since the columns of $P^k$, unlike its rows, do not sum to one. Hence, multiplying column vectors with $P^k$ does not preserve the vector sum, which is why they are best interpreted as functions rather than distributions.
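The interpretation of $P^k x$ as a vector of conditional expectations can be sketched as follows. The function values in `x` are hypothetical scores invented for this illustration, rows 2 to 4 of `P` are assumptions (only the first row is quoted in the text), and the helper names are not taken from any library.

```python
# Row 1 is quoted in the text; rows 2-4 are ASSUMED for illustration only.
P = [
    [0.5, 0.1, 0.2, 0.2],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
]


def apply_left(P, x):
    """Column-vector product P x: entry i is sum_j P_ij x_j."""
    n = len(P)
    return [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]


def k_step_expectation(P, x, k):
    """P^k x: entry i is the expected value of x after k steps, starting in s_i."""
    for _ in range(k):
        x = apply_left(P, x)
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x = [1.0, 0.5, 0.0, 0.2]  # hypothetical value assigned to S, P, F, C
x1 = k_step_expectation(P, x, 1)
print("P x       =", [round(v, 3) for v in x1])
print("sum(x)    =", round(sum(x), 3), " sum(P x) =", round(sum(x1), 3))

# Expected value one step ahead under an uncertain start, as in Equation (25).
mu0 = [0.5, 0.0, 0.0, 0.5]
print("mu0^T P x =", round(dot(mu0, x1), 3))

# The all-ones vector is left unchanged, since every row of P sums to one.
print("P eta     =", apply_left(P, [1.0, 1.0, 1.0, 1.0]))
```

The sums of `x` and `P x` differ, which illustrates why column vectors are treated as functions on the state space rather than as distributions.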

Eigenvalues and eigenvectors
Every real matrix $A \in \mathbb{R}^{N \times N}$ can be interpreted as a linear transformation. A central task of linear algebra is to shed light on the relationship between the numerical properties of a matrix and various aspects of the transformation that it represents. Often, a single matrix represents a combination of several distinct transformations, e.g. an object in two or more dimensions can simultaneously be rotated and stretched. Finding the eigenvalues and eigenvectors of a matrix is one way to partition a linear transformation into its component parts and to reveal their relative magnitudes. In this section, we give a brief and informal summary of how this works, and apply this to transition matrices. An eigenvector of a matrix $A$ is a non-zero vector which is only multiplied by some number $\lambda$ when multiplied by $A$. Like all vectors, eigenvectors can either be rows, i.e. $l^T A = \lambda l^T$, or columns, i.e. $A r = \lambda r$, which are known as left eigenvectors and right eigenvectors, respectively. In both cases, the number $\lambda$ is called the eigenvalue of the respective eigenvector. Lastly, it is worth noting that real matrices, including transition matrices, can have complex eigenvalues, i.e. $\lambda \in \mathbb{C}$, in which case the corresponding eigenvector is also complex, i.e. $v \in \mathbb{C}^N$. However, such solutions can only occur in complex conjugate pairs, meaning that $\lambda^*$ and $v^*$ are also guaranteed to be an eigenvalue-eigenvector pair, where $^*$ denotes the complex conjugate.
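A quick inspection of Equation (18) reveals that we have already encountered an example of an eigenvector in the case of a transition matrix: a stationary distribution $\pi$ is a left eigenvector of $P$ with eigenvalue $\lambda = 1$, normalized such that it sums to 1. Similarly, it is straightforward to see that $\eta = (1, 1, \dots, 1)^T$ is a right eigenvector with eigenvalue $\lambda = 1$:
\[ (P\eta)_i = \sum_{j=1}^{N} P_{ij}\, \eta_j = \sum_{j=1}^{N} P_{ij} = 1 = \eta_i, \]
so that $P\eta = \eta$. We can similarly describe the action of a matrix $A$ on a set of $k$ eigenvectors. Let $Y_R \in \mathbb{R}^{N \times k}$ be a matrix whose columns are equal to $k$ right eigenvectors of $A$:
\[ Y_R = \begin{pmatrix} r_1 & r_2 & \cdots & r_k \end{pmatrix}. \]
Then the action of $A$ on this matrix is simply:
\[ A Y_R = \begin{pmatrix} \lambda_1 r_1 & \lambda_2 r_2 & \cdots & \lambda_k r_k \end{pmatrix} = Y_R \Delta, \tag{37} \]
where $\Delta := \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)$ is the diagonal matrix of the corresponding eigenvalues. If we had alternatively used a matrix $Y_L$ with rows equal to left eigenvectors of $A$, then an equivalent argument can show that $Y_L A = \Delta Y_L$.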
In linear algebra, one is often interested in finding linearly independent sets of eigenvectors. Without this condition, a set of eigenvectors can contain a lot of redundancy. For example, if $r$ is an eigenvector of $A$ with eigenvalue $\lambda$, then so too is the vector $cr$ for any non-zero $c \in \mathbb{C}$. Therefore, it is possible to construct an arbitrarily large set of eigenvectors using $r$ alone, and the set of all possible vectors that can be formed in this way is known as the eigenspace of $A$ corresponding to $\lambda$. If we instead look for a set of linearly independent eigenvectors of $A$, there can be at most $N$. When a set of $N$ linearly independent eigenvectors exists, they form a basis for $\mathbb{R}^N$ (i.e. the domain on which $A$ acts). In this case, the set of eigenvectors is called an eigenbasis, and the matrix $A$ is said to be diagonalizable. To justify the use of this latter term, imagine that the matrix $Y_R$ contains $N$ linearly independent right eigenvectors. Then, by definition it is full rank, meaning that its inverse $Y_R^{-1}$ exists. If we then multiply both sides of Equation (37) by this inverse from the left, we get:
\[ Y_R^{-1} A Y_R = \Delta. \tag{38} \]
Before interpreting this expression, it is worth reflecting on its general form. If two matrices $B$ and $C$ can be related via $U^{-1} B U = C$ for some invertible matrix $U$, then they are said to be similar. Alternatively, $C$ is said to be a similarity transformation on $B$, and since we could instead write $B = U C U^{-1}$ the converse also holds. In words, similar matrices represent the same linear transformation, but expressed in two different bases, where the matrix $U$ is known as the change-of-basis matrix. In the case of Equation (38), this means that $A$ behaves like a diagonal matrix when acting on vectors that are expressed in its eigenbasis $Y_R$, hence the term diagonalizable. Equation (38) can easily be rearranged to get the following expression for $A$:
\[ A = Y_R \Delta Y_R^{-1}, \tag{39} \]
and since this only involves the eigenvalue and eigenvector matrices $\Delta$ and $Y_R$, it is known as the eigendecomposition of $A$.
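Furthermore, multiplying Equation (39) by $Y_R^{-1}$ from the left gives $Y_R^{-1} A = \Delta Y_R^{-1}$, which has the same form as $Y_L A = \Delta Y_L$. The rows of $Y_L := Y_R^{-1}$ are therefore a set of $N$ linearly independent left eigenvectors of $A$. Therefore:
\[ A = Y_R \Delta Y_L = \sum_{\omega=1}^{N} \lambda_\omega\, r_\omega l_\omega^T, \tag{42} \]
where each term $r_\omega l_\omega^T$ in Equation (42) is a matrix of rank 1 and is known as the outer product between $r_\omega$ and $l_\omega$. (In component notation, the outer product between two vectors $x = (x_1, x_2, \dots, x_N)^T$ and $y = (y_1, y_2, \dots, y_N)^T$ is $(x y^T)_{ij} = x_i y_j$.) Writing the matrix $A$ in this way therefore gives a good insight into how a diagonalizable matrix can be partitioned into distinct modes.
Lastly, if $l_\omega$ and $r_\gamma$ are left and right eigenvectors of $A$ with distinct eigenvalues, i.e. $\lambda_\omega \neq \lambda_\gamma$, then:
\[ \lambda_\omega\, l_\omega^T r_\gamma = l_\omega^T A\, r_\gamma = \lambda_\gamma\, l_\omega^T r_\gamma \quad \Longrightarrow \quad (\lambda_\omega - \lambda_\gamma)\, l_\omega^T r_\gamma = 0. \]
Since $(\lambda_\omega - \lambda_\gamma) \neq 0$ by assumption, this means that $l_\omega$ and $r_\gamma$ must be orthogonal. Thus, when all eigenvalues are distinct and the corresponding eigenvectors are suitably normalized, the following relation between the columns of $Y_R$ and the rows of $Y_R^{-1}$ holds:
\[ l_\omega^T r_\gamma = \delta_{\omega\gamma}, \tag{46} \]
where $\delta_{\omega\gamma}$ is the Kronecker delta. Because of this, $Y_R^{-1}$ is sometimes called the dual basis of $Y_R$ and the two sets of vectors are collectively referred to as a biorthogonal system. It is worth noting that when a diagonalizable matrix has repeated eigenvalues, there is extra freedom in the choice of left and right eigenvectors. Consequently, for such matrices Equation (46) is not guaranteed to hold; however, there exist certain choices of bases for which it does [33].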
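To make the decomposition into outer products and the biorthogonality relation concrete, the following sketch works through a two-state transition matrix whose eigenvalues and eigenvectors can be written down by hand. The chain itself, with switching probabilities `a` and `b`, is a hypothetical example introduced here and does not appear in the text; the eigenvectors are scaled so that each left-right pair has unit inner product.

```python
a, b = 0.3, 0.1  # hypothetical probabilities of leaving state 1 and state 2
P = [[1 - a, a],
     [b, 1 - b]]

# Closed-form eigenpairs of this 2x2 matrix.
eigenvalues = [1.0, 1.0 - a - b]
right = [[1.0, 1.0],                    # r_1 = eta
         [a, -b]]                       # r_2
left = [[b / (a + b), a / (a + b)],     # l_1 = stationary distribution
        [1 / (a + b), -1 / (a + b)]]    # l_2


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def outer(u, v):
    """Outer product u v^T, with entries u_i * v_j."""
    return [[ui * vj for vj in v] for ui in u]


# Biorthogonality: l_w^T r_g equals 1 when w == g and 0 otherwise.
gram = [[dot(left[w], right[g]) for g in range(2)] for w in range(2)]
print("l_w^T r_g =", [[round(v, 12) for v in row] for row in gram])

# Reconstruction of P as a sum of rank-1 modes lambda_w * r_w l_w^T.
modes = [outer(right[w], left[w]) for w in range(2)]
rebuilt = [[sum(eigenvalues[w] * modes[w][i][j] for w in range(2))
            for j in range(2)] for i in range(2)]
print("rebuilt P =", [[round(v, 12) for v in row] for row in rebuilt])
```

The first mode is the matrix whose rows are all equal to the stationary distribution, and the second mode carries the remaining, decaying part of the dynamics.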
In the case of a Markov chain, these concepts can be applied when the transition matrix $P$ is diagonalizable. In this case, any probability distribution $\mu(t)$ can be expressed in terms of the eigenbasis of $P$. Since we are treating distributions as row vectors, we do this using the left eigenvectors of $P$:
\[ \mu(t)^T = \sum_{\omega=1}^{N} c_\omega\, l_\omega^T, \tag{47} \]
where the components $c_\omega$ are the coordinates of $\mu(t)$ in this basis. We can then use this to re-express the one-step evolution of the chain as:
\[ \mu(t+1)^T = \mu(t)^T P = \sum_{\omega=1}^{N} c_\omega\, l_\omega^T P = \sum_{\omega=1}^{N} c_\omega \lambda_\omega\, l_\omega^T, \]
which can easily be extended to multiple steps of evolution:
\[ \mu(t+k)^T = \mu(t)^T P^k = \sum_{\omega=1}^{N} c_\omega \lambda_\omega^k\, l_\omega^T. \tag{53} \]
Comparing Equations (17) and (53), one can see that by working in the eigenbasis of $P$ we have transformed the evolution of the Markov chain from a matrix multiplication to a scalar multiplication along each basis vector $l_\omega$ by $\lambda_\omega^k$. We have therefore improved our computational complexity from $O(N^2)$ to $O(N)$. This type of manipulation can be done for any diagonalizable matrix, and does not depend on $P$ being stochastic. However, one important result we present in Section 3.5 is that all eigenvalues of a transition matrix have absolute value less than or equal to 1. Using this fact, we can order the eigenvalues of $P$ by their absolute value and separate the terms in Equation (53) corresponding to eigenvalues with $|\lambda_\omega| = 1$, and those corresponding to $|\lambda_\omega| < 1$:
\[ \mu(t+k)^T = \sum_{\omega=1}^{m} c_\omega \lambda_\omega^k\, l_\omega^T + \sum_{\omega=m+1}^{N} c_\omega \lambda_\omega^k\, l_\omega^T, \tag{54} \]
where $m$ is the number of eigenvalues of $P$ with $|\lambda_\omega| = 1$. In the long time limit ($k \to \infty$) the terms with $|\lambda_\omega| = 1$ survive whereas those with $|\lambda_\omega| < 1$ die off, with $|\lambda_\omega|$ measuring the rate of decay in the latter case. This allows us to interpret the first and second sums in Equation (54) as the persistent and transient behavior of the Markov chain, respectively. In the persistent case, we can partition the terms into three types based on the eigenvalues: (i) $\lambda = 1$, (ii) $\lambda = -1$, and (iii) $\lambda \in \mathbb{C}$, $|\lambda| = 1$.
We have already seen that a stationary distribution $\pi$ and the vector $\eta$ are left and right eigenvectors of type (i), and in both cases these eigenvectors represent fixed structures on the state space that persist when acted on by $P$. To capture this property, we call such eigenvectors persistent structures. In case (ii), eigenvectors flip their sign when acted on by the transition matrix, i.e. $y^T P = -y^T$, and return after two steps, i.e. $y^T P^2 = y^T$. Eigenvectors of this type therefore correspond to permanent oscillations of probability mass between states in $S$, and we therefore refer to them as persistent oscillations. Eigenvalues of type (iii) are explored in more depth in Section 3.5, where we show that they are always complex roots of unity, i.e. $\lambda^k = 1$ for some $k > 2$. Therefore, when their corresponding eigenvectors are acted on repeatedly by $P$ they return to themselves after $k$ steps, i.e. $y^T P^k = y^T$. Thus, in analogy to (ii) they describe permanent cycles of probability mass through the state space $S$, and we therefore call such eigenvectors persistent cycles.
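The evolution of a distribution in the eigenbasis, Equation (53), can be compared against direct matrix multiplication on a small example. The sketch below uses a hypothetical two-state chain with switching probabilities `a` and `b` (not an example from the text), for which the eigenpairs are known in closed form; the coordinates are obtained as $c_\omega = \mu(t)^T r_\omega$, which relies on the left and right eigenvectors being scaled to be biorthonormal. The function names are introduced here for illustration.

```python
def two_state_modes(a, b):
    """Eigenvalues and biorthonormal eigenvectors of [[1-a, a], [b, 1-b]], a + b > 0."""
    eigenvalues = [1.0, 1.0 - a - b]
    right = [[1.0, 1.0], [a, -b]]
    left = [[b / (a + b), a / (a + b)], [1 / (a + b), -1 / (a + b)]]
    return eigenvalues, left, right


def evolve_in_eigenbasis(mu, a, b, k):
    """mu(t+k) = sum_w c_w * lambda_w**k * l_w, with c_w = mu . r_w."""
    eigenvalues, left, right = two_state_modes(a, b)
    coeffs = [sum(m * r for m, r in zip(mu, rw)) for rw in right]
    return [sum(c * lam ** k * lw[j]
                for c, lam, lw in zip(coeffs, eigenvalues, left))
            for j in range(2)]


def evolve_directly(mu, a, b, k):
    """Repeated row-vector multiplication mu <- mu P, for comparison."""
    P = [[1 - a, a], [b, 1 - b]]
    for _ in range(k):
        mu = [mu[0] * P[0][0] + mu[1] * P[1][0],
              mu[0] * P[0][1] + mu[1] * P[1][1]]
    return mu


mu0 = [1.0, 0.0]

# Transient structure: lambda_2 = 0.6, so the second mode decays geometrically.
for k in (0, 1, 5, 30):
    print(k, [round(v, 4) for v in evolve_in_eigenbasis(mu0, 0.3, 0.1, k)])

# Persistent oscillation: a = b = 1 gives lambda_2 = -1 and a period-two flip.
for k in range(4):
    print(k, [round(v, 4) for v in evolve_in_eigenbasis(mu0, 1.0, 1.0, k)])
```

In the first case the distribution approaches the stationary distribution $(0.25, 0.75)$, while in the second case the probability mass alternates between the two states indefinitely.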

In the transient case, an analogous categorization based on the eigenvalues can be applied, consisting of the following three types: (i') $\lambda \in [0, 1)$, (ii') $\lambda \in (-1, 0)$, and (iii') $\lambda \in \mathbb{C}$, $|\lambda| < 1$. For type (i'), the corresponding eigenvectors represent perturbations to the persistent behavior that decay over time, i.e. $y^T P = \lambda y^T$ where $|\lambda y| = |\lambda||y| < |y|$. We therefore call such eigenvectors transient structures. When $|\lambda| \approx 1$ these structures describe sets of states that, on average, a chain spends a long time in before converging, which are known as metastable sets [34]. Furthermore, $\lambda = 0$ can be thought of as a limiting case of type (i') in the sense that any corresponding eigenvector decays infinitely quickly, i.e. $y^T P = 0$, and does not exhibit oscillatory or cyclic behavior. Eigenvalues of type (ii') also decay over time, but like case (ii) their negative sign means that the corresponding eigenvectors exhibit oscillatory behavior when acted on by $P$, i.e. $y^T P = \lambda y^T = -|\lambda| y^T$ where $|\lambda y| = |\lambda||y| < |y|$. We refer to eigenvectors of this type as transient oscillations. Eigenvalues of type (iii') generalize those of type (iii) to the transient case, i.e. $y^T P^k = \lambda^k y^T$ where $|\lambda^k y| = |\lambda^k||y| < |y|$, and we therefore call them transient cycles. When $|\lambda| \approx 1$, these cycles can persist for a long time, and have been referred to as dominant cycles [34].

Table 1.
              Structures                Oscillations              Cycles
Persistent    $\lambda = 1$             $\lambda = -1$            $\lambda \in \mathbb{C}$, $|\lambda| = 1$
Transient     $\lambda \in [0, 1)$      $\lambda \in (-1, 0)$     $\lambda \in \mathbb{C}$, $|\lambda| < 1$
The above categorizations are summarized in Table 1, where structures, oscillations and cycles are colored in green, blue and red, respectively, and the persistent and transient cases are shaded bright and pale, respectively. In Figure 3, we visualize the six different types of eigenvalue using this color scheme by shading the respective regions of the unit circle in which they can occur.
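The six categories can be expressed as a small classification routine. The following is a sketch only: the function name `classify_eigenvalue` and the numerical tolerance are choices made for this illustration, and $\lambda = 0$ is grouped with the transient structures, in line with its description as a limiting case of type (i').

```python
import cmath
import math


def classify_eigenvalue(lam, tol=1e-9):
    """Name the type of an eigenvalue of a transition matrix."""
    lam = complex(lam)
    modulus = abs(lam)
    if modulus > 1.0 + tol:
        raise ValueError("eigenvalues of a transition matrix satisfy |lambda| <= 1")
    regime = "persistent" if modulus >= 1.0 - tol else "transient"
    if abs(lam.imag) > tol:
        kind = "cycle"        # non-real eigenvalue
    elif lam.real < -tol:
        kind = "oscillation"  # negative real eigenvalue
    else:
        kind = "structure"    # non-negative real eigenvalue (includes 0)
    return f"{regime} {kind}"


# A cube root of unity, as arises for a deterministic cycle through three states.
cube_root = cmath.exp(2j * math.pi / 3)

for lam in [1, -1, cube_root, 0.6, -0.5, 0.3 + 0.3j, 0]:
    print(lam, "->", classify_eigenvalue(lam))
```

Because eigenvalues computed numerically are rarely exact, the tolerance decides when a modulus counts as 1 and when an imaginary part counts as zero; the value used here is an arbitrary default.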
Before moving on, a couple of details are worth pointing out. Firstly, the above analysis is possible only if the transition matrix $P$ is diagonalizable. One case in which this is guaranteed to hold is for symmetric matrices. However, given that transition probabilities are rarely pairwise symmetric (i.e. $P_{ij} = P_{ji}$), this restriction is clearly too strong. Thus, a more detailed investigation is needed in order to identify the conditions under which the above decomposition can be made, which we do in Section 3 and Section 4. For a more in-depth account of the theory of diagonalizable matrices, we recommend [35]. Secondly, a quick check reveals that all terms in Equation (54) are dependent on the components $c_\omega$ that describe the starting distribution (Equation (47)). Because of this, in the most general case both the persistent and transient behavior of a Markov chain can be sensitive to initial conditions. In a later section, we consider a particular type of Markov chain for which this is not the case, and provide a simplified analysis of its evolution over time.

Classification of states
In the following sections, we explore three types of Markov chains. In order to describe each type in detail, we first need to define various properties that apply to individual states, or sets thereof, in a state space S. This is the focus of the current section.

Communicating classes
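We start by making the following definitions related to how states in $S$ are connected:
Definition 2.4.1 (Accessibility). For two states $s_i, s_j \in S$, we say that $s_j$ is accessible from $s_i$, denoted $s_i \to s_j$, when it is possible to reach $s_j$ from $s_i$ in $k \geq 0$ steps, i.e. $\exists\, k : (P^k)_{ij} > 0$.
Definition 2.4.2 (Communication). Two states $s_i, s_j \in S$ are said to communicate, denoted $s_i \leftrightarrow s_j$, when each is accessible from the other, i.e. $s_i \to s_j$ and $s_j \to s_i$.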
Communication is a useful property for describing states in a Markov chain, as exemplified by the following result:
Proposition 2.4.3 (Communicating class). Communication is an equivalence relation, meaning that:
• $s_i \leftrightarrow s_i$,
• if $s_i \leftrightarrow s_j$, then $s_j \leftrightarrow s_i$,
• if $s_i \leftrightarrow s_j$ and $s_j \leftrightarrow s_k$, then $s_i \leftrightarrow s_k$,
and the state space $S$ can be partitioned into communicating classes, each containing states that all communicate with one another.
Furthermore, we can make a useful categorization of Markov chains based on the number of communicating classes they have: Definition 2.4.4 (Number of communicating classes). Let n be the number of communicating classes of a Markov chain. When n = 1, the chain is said to be irreducible, otherwise it is reducible.
In words, an irreducible Markov chain is one in which for any pair of states there exists a connecting path in both directions. In Figure 4, three example Markov chains are shown, with the communicating classes indicated by the dashed boxes. Take a moment to double-check why each state belongs to its communicating class, and verify that the examples in Figure 4(a) and Figure 4(b) are reducible whereas the example in Figure 4(c) is irreducible.
An irreducible Markov chain on a finite state space has a unique stationary distribution, and all entries of this distribution are strictly positive. This result can be applied to the example of Figure 4(c) by using the observation of the previous section that stationary distributions of a given Markov chain are eigenvectors of the associated transition matrix with eigenvalue 1. If we were to find the transition matrix of the chain in Figure 4(c) and compute its eigenvectors, we would indeed find a one-dimensional left eigenspace of $P$ corresponding to eigenvalue 1, spanned by a vector with strictly positive elements. When normalized such that its entries sum to 1, the resulting vector is $\pi = (\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6)^T = (0.09, 0.09, 0.07, 0.31, 0.18, 0.26)^T$, which we encourage readers to check themselves. For reducible Markov chains, the guarantees of uniqueness and positivity of the stationary probabilities no longer hold. In order to describe the stationary distributions of such chains, we first introduce some new concepts for describing states.
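Communicating classes can be found mechanically from the pattern of non-zero entries in a transition matrix. The sketch below does this by computing, for each state, the set of states accessible from it, and then grouping states that are mutually accessible. The four-state matrix is a hypothetical example constructed for this illustration (it is not one of the chains in Figure 4), and the function names are likewise introduced here.

```python
def accessible_from(P, i):
    """Indices of all states reachable from state i in k >= 0 steps."""
    seen = {i}  # k = 0: every state is accessible from itself
    stack = [i]
    while stack:
        current = stack.pop()
        for j, prob in enumerate(P[current]):
            if prob > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def communicating_classes(P):
    """Partition of the state indices into classes of mutually accessible states."""
    n = len(P)
    reach = [accessible_from(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        cls = sorted(j for j in reach[i] if i in reach[j])
        classes.append(cls)
        assigned.update(cls)
    return classes


def is_irreducible(P):
    return len(communicating_classes(P)) == 1


# Hypothetical chain: states 0 and 1 communicate and can leak into {2, 3},
# from which there is no way back.
P_reducible = [
    [0.5, 0.4, 0.1, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0, 0.0],
]
P_irreducible = [[0.7, 0.3], [0.1, 0.9]]

print(communicating_classes(P_reducible), is_irreducible(P_reducible))
print(communicating_classes(P_irreducible), is_irreducible(P_irreducible))
```

This quadratic-time grouping is chosen for readability; for large state spaces, a strongly connected components algorithm on the transition graph would compute the same partition more efficiently.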

Recurrence and transience
Each state $s_i \in S$ can be categorized based on how likely it is to be revisited, given that it is currently occupied. This is formalized by the following definition: Definition 2.4.6 (Recurrence/transience). Given that the system is in state $s_i$ initially, the probability of returning to $s_i$ is defined as:
$$f_i = \Pr(X_k = s_i \text{ for some } k \geq 1 \mid X_0 = s_i).$$
States for which $f_i = 1$ are called recurrent, and those for which $f_i < 1$ are called transient. As an illustration, we apply these concepts to the reducible chains depicted in Figure 4. In Figure 4(a), there are two communicating classes and in each one there is no possibility to exit. This means that if a state in one of these classes is occupied, it is guaranteed to be revisited at some future time step, i.e. $f_i = 1$ for all states in each class. Therefore, both classes are recurrent and the chain as a whole is recurrent. In Figure 4(b), the main difference is that there is now the possibility to exit the blue class without returning. For example, assuming $s_1$ is occupied, although there is a possibility that $s_1$ can be visited again later (e.g. $s_1 \to s_2 \to s_1$), as soon as a transition $s_1 \to s_4$ takes place this will no longer be possible. This is why $f_i < 1$ for states in the blue class. Hence, while the red class is recurrent, the blue class is transient, and because of this the chain as a whole is not recurrent.
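The reasoning used for Figure 4(a, b) can be turned into a simple numerical test. The sketch below relies on a standard fact for finite chains that is not stated explicitly above: a communicating class is recurrent exactly when no probability mass can leave it. The helper name `is_closed` and the 4-state matrix are hypothetical; the matrix only mimics the structure of Figure 4(b), not its actual probabilities.

```python
import numpy as np

def is_closed(P, members):
    """True when no probability mass leaves the given set of states.

    For a finite chain, a communicating class is recurrent exactly when it
    is closed; `members` is assumed to be one communicating class.
    """
    within = P[np.ix_(members, members)].sum(axis=1)
    return bool(np.allclose(within, 1.0))

# Hypothetical chain: the "blue" class {0, 1} can leak into the "red" class
# {2, 3}, but there is no way back.
P = np.array([[0.5, 0.4, 0.1, 0.0],
              [0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.3, 0.7],
              [0.0, 0.0, 0.8, 0.2]])
blue, red = [0, 1], [2, 3]

blue_is_recurrent = is_closed(P, blue)  # the blue class leaks, so it is transient
red_is_recurrent = is_closed(P, red)    # the red class has no exit
```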
For an irreducible Markov chain, such as the example shown in Figure 4(c), every state is guaranteed to be recurrent, leading to the following proposition: every irreducible Markov chain is recurrent. However, from the example in Figure 4(a) it is clear that the converse does not hold, since it is also possible for a reducible chain to be recurrent. In fact, whether a reducible chain is recurrent or not determines certain features of the stationary distributions belonging to the chain. This is outlined by the following proposition: Proposition 2.4.9 (Stationary distributions of reducible chains). For a reducible Markov chain with $r$ recurrent classes and $t$ transient classes:
• Any stationary distribution has probability zero for states belonging to a transient class.
• For the $k$-th recurrent class, there exists a unique stationary distribution $\pi_k$ with non-zero probabilities only for states in that class.
• When the number of recurrent classes $r$ is bigger than 1, stationary distributions can be formed via convex combinations of the distributions $\pi_k$, i.e. $\pi = \sum_{k=1}^{r} \alpha_k \pi_k$ with $\alpha_k \geq 0$ and $\sum_{k=1}^{r} \alpha_k = 1$.
• Furthermore, when the number of transient classes $t$ is zero, or in other words when the chain is recurrent, performing the procedure above with nonzero coefficients always yields distributions that are strictly positive.
We can apply these results to the two reducible chains in Figure 4. In the example of Figure 4(a), the stationary distribution associated to the blue class is π 1 = (0.35, 0.36, 0.29, 0, 0, 0) T and the one associated to the red class is π 2 = (0, 0, 0, 0.39, 0.26, 0.35) T . We can then generate an arbitrary number of extra stationary distributions by taking convex combinations of π 1 and π 2 with coefficients α 1 and α 2 . For example, with α 1 = 0.25 and α 2 = 0.75 we get π = (0.09, 0.09, 0.07, 0.29, 0.2, 0.26) T . Furthermore, the last bullet point in Proposition 2.4.9 tells us that since this chain is recurrent we can be sure that any convex combination with positive coefficients yields a distribution with strictly positive entries. This property of recurrent chains is particularly relevant to our treatment of both reversible chains in Section 2.6 and random walks in Section 4, and we henceforth use π > 0 to denote a stationary distribution with this property. In the example of Figure 4(b), the red class is the only recurrent class, meaning that the stationary distribution associated to this class is the only stationary distribution of the chain. Since the transition probabilities for states in this class are the same as in the example of Figure 4(a), this stationary distribution is π = π 2 . Furthermore, in agreement with Proposition 2.4.9, we see that the transient states in this chain have a stationary probability of 0.
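The convex-combination construction of Proposition 2.4.9 can be reproduced on a small example. The following sketch uses a hypothetical 4-state recurrent chain with two closed classes (the transition matrices of Figure 4 are not available here), and a hypothetical helper `stationary_on_class` that solves the stationarity equations restricted to one class in the least-squares sense.

```python
import numpy as np

def stationary_on_class(P, members):
    """Stationary distribution supported on one closed (recurrent) class."""
    sub = P[np.ix_(members, members)]   # transitions within the class
    k = len(members)
    # Solve pi^T sub = pi^T together with the normalization sum(pi) = 1.
    A = np.vstack([sub.T - np.eye(k), np.ones((1, k))])
    b = np.append(np.zeros(k), 1.0)
    pi_sub = np.linalg.lstsq(A, b, rcond=None)[0]
    pi = np.zeros(P.shape[0])           # zero probability outside the class
    pi[members] = pi_sub
    return pi

# Hypothetical reducible, recurrent chain with classes {0, 1} and {2, 3}.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.1, 0.9],
              [0.0, 0.0, 0.6, 0.4]])

pi_1 = stationary_on_class(P, [0, 1])
pi_2 = stationary_on_class(P, [2, 3])

# Any convex combination is again stationary; with positive coefficients
# (and no transient classes) it is strictly positive.
alpha_1, alpha_2 = 0.25, 0.75
pi_mix = alpha_1 * pi_1 + alpha_2 * pi_2
```

The same coefficients $\alpha_1 = 0.25$ and $\alpha_2 = 0.75$ are used as in the text, but the resulting numbers differ because the chain is a different one.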

Periodicity
The notion that states can be revisited is also meaningful in another sense. We define the following quantity, which describes how frequently such revisits can take place.
Definition 2.4.10 (Periodic chains). For each state $s_i \in S$, the period is defined as:
$$d_i = \gcd\{k \geq 1 : (P^k)_{ii} > 0\}, \tag{57}$$
where gcd indicates the greatest common divisor. Equation (57) says that when starting in state $s_i$, it is only possible to return to $s_i$ after a number of steps that is a multiple of the period $d_i$. States for which $d_i > 1$ are called periodic and those for which $d_i = 1$ are aperiodic.
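The period of a state can be computed directly from this definition by tracking at which step counts a return is possible. The sketch below is one possible implementation, not a method taken from the text: the function name `period` is hypothetical, and the default cut-off of $3n$ steps for a chain with $n$ states is a conservative choice made for this example.

```python
import numpy as np
from math import gcd

def period(P, i, max_steps=None):
    """gcd of the step counts k >= 1 at which a return to state i is possible.

    Returns 0 when state i can never be revisited within max_steps.
    """
    n = P.shape[0]
    max_steps = 3 * n if max_steps is None else max_steps
    support = (P > 0).astype(int)   # which one-step transitions are possible
    reach = np.eye(n, dtype=int)
    d = 0                           # gcd(0, k) == k, so 0 is a neutral start
    for k in range(1, max_steps + 1):
        reach = ((reach @ support) > 0).astype(int)  # support of P^k
        if reach[i, i]:
            d = gcd(d, k)
    return d

# Hypothetical two-state chain that flips deterministically: returns are only
# possible after an even number of steps.
P_flip = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

# Hypothetical chain with a self-loop on state 0, which breaks the periodicity.
P_loop = np.array([[0.5, 0.5],
                   [1.0, 0.0]])

periods_flip = [period(P_flip, i) for i in range(2)]
periods_loop = [period(P_loop, i) for i in range(2)]
```

In the second example both states have the same period even though only state 0 has a self-loop, which illustrates that period is shared by all states of a communicating class.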
Like transience/recurrence, period is also a class property, and we use $d$ to refer to the period of a whole class. This in turn allows us to define the period of a Markov chain: Definition 2.4.11 (Periodicity). When all communicating classes in $S$ have period $d > 1$, the Markov chain is said to be periodic, with period $d$. Consider the example shown in Figure 5(a). This chain has two communicating classes, the red one being a transient class with $d = 2$ and the blue one being a recurrent class with $d = 1$, meaning that this chain has mixed periodicity.
Markov processes are a broad class of models, and even under the restricted settings considered in this tutorial (discrete time, homogeneous and finite state spaces), there are many distinct types of chains. In the following sections, we concentrate on three particular types that are relevant in applied domains.

Ergodic chains
When modeling a system that evolves over time, it is important to ask what can be said, if anything, about its long term behavior. For a Markov chain, this question can be phrased in two ways. On one hand, we can sample a single trajectory starting from some initial state and ask what the average behavior is over time, i.e. how often is it found in each state s i ∈ S for a trajectory of length t? On the other hand, we can describe our starting conditions as a distribution µ(0) and ask what this evolves to in the future, i.e. what is the probability of being in each state s i at a later time t? We refer to these two notions of long term behavior as the trajectory and distribution perspectives, respectively. While the analyses given so far predominantly use the latter perspective, we remind readers that in Section 2.1 we have introduced the idea of a distribution over S by taking the limit of an infinitely large ensemble of trajectories, meaning that the two concepts are closely related.
This two-way view originates from the field of statistical physics, where physical processes can either be analyzed with temporal averages (i.e. the trajectory perspective) or ensemble averages (i.e. the distribution perspective). One class of systems that has received a lot of study in this field consists of those for which these two types of averaging yield the same result as $t \to \infty$. Such systems are known as ergodic systems, and this equivalence means that their long term statistical behavior can be characterized by a single, sufficiently long sample. One implication of this is that initial conditions are forgotten over time, which makes ergodic systems particularly attractive from a simulation or modeling perspective. Finally, with this in mind we can define an ergodic Markov chain as follows: Definition 2.5.1 (Ergodic Markov chain). An ergodic Markov chain is one that is guaranteed to converge to a unique stationary distribution.
Clearly, in order for a chain to be ergodic it must have a unique stationary distribution. Therefore, by virtue of Theorem 2.4.5, a necessary condition for a chain to be ergodic is that it is irreducible. However, there is no guarantee that an irreducible chain converges, which is the second condition of Definition 2.5.1. The convergence of a Markov chain is related to its periodicity, as explained by the following result: Theorem 2.5.2 (Convergence of a Markov chain). The evolution of a Markov chain with period $d$ can lead to a permanent, repeating sequence of $d$ distributions, i.e.
$$\mu(k) \to \mu(k+1) \to \cdots \to \mu(k+d-1) \to \mu(k),$$
which for $d = 2$ and $d > 2$ correspond to persistent oscillations and persistent cycles, respectively. In the case of an aperiodic Markov chain ($d = 1$), only sequences of length 1 are allowed, which means that one of the stationary distributions is guaranteed to be reached.
We can apply this result to the irreducible Markov chain in Figure 5(c). This chain has a unique stationary distribution $\pi = (0.35, 0.15, 0.5)^T$ and a period of $d = 2$. Therefore, there is no guarantee that the chain converges to $\pi$, since it can get trapped in persistent oscillations. To observe this, we can try out different initial conditions and iteratively apply the update rule in Equation (12). For example, starting with the distribution $\mu(0) = (0.25, 0.5, 0.25)^T$ we get a persistent oscillation between the following two distributions: $\mu(k) = (0.175, 0.075, 0.75)^T$ and $\mu(k+1) = (0.525, 0.225, 0.25)^T$. However, this is not the only persistent oscillation possible for this chain, which can be observed by trying out different initial conditions. Lastly, in Section 3.5 we gain more insight on Theorem 2.5.2 by using tools from graph theory to describe the eigenvectors of $P$ with $|\lambda| = 1$.
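The persistent oscillation described above can be reproduced numerically. Since the transition matrix of Figure 5(c) is not reproduced in the text, the matrix below is an assumption: it is the 3-state matrix with period 2 that is consistent with the stationary distribution and the two oscillating distributions quoted above. The row-vector update `mu @ P` is likewise an assumed form of the update rule in Equation (12).

```python
import numpy as np

# Assumed transition matrix, inferred from the numbers quoted in the text:
# states 0 and 1 always move to state 2, and state 2 moves back to them.
P = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.7, 0.3, 0.0]])

pi = np.array([0.35, 0.15, 0.5])   # stationary distribution quoted in the text
mu = np.array([0.25, 0.5, 0.25])   # initial distribution mu(0)

history = [mu]
for _ in range(6):
    mu = mu @ P                    # one step of the distribution update
    history.append(mu)

# From the first step onwards, the distribution alternates between two values
# and never settles on pi, although pi itself is a fixed point of the update.
odd_steps = history[1]
even_steps = history[2]
```

Averaging the two oscillating distributions gives back $\pi$, which is one way to see that the oscillation is centered on the stationary distribution without ever reaching it.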
A key insight from Theorem 2.5.2 is that a Markov chain is guaranteed to converge only if it is aperiodic. Together with irreducibility, this therefore provides the conditions under which a chain is ergodic:
2.5.3 (Conditions for Ergodicity). A Markov chain is ergodic if and only if it is both irreducible and aperiodic, which respectively ensure that (i) there is a unique stationary distribution $\pi$, and (ii) the chain always converges to this distribution. Furthermore, the distribution $\pi$ is said to be the limiting distribution of the chain.
Ergodic Markov chains have a number of beneficial properties. Firstly, as with any ergodic system, the initial conditions are eventually forgotten, which means that when studying the statistical behavior of an ergodic chain it is not necessary to explore different starting states. This advantageous property underlies a number of methods in machine learning and beyond, with Markov chain Monte Carlo methods being a particularly well known example [36]. Secondly, the equivalence between the trajectory and distribution perspectives allows us to interpret the limiting probabilities π i as the long run fraction of steps that the chain spends in each state s i [37]. Lastly, analyzing the evolution of a Markov chain in terms of the eigenvectors of its transition matrix becomes somewhat simpler in the case of an ergodic chain. We have already seen that when the transition matrix of a Markov chain is diagonalizable, we can vastly simplify the computation of the k-step evolution of the Markov chain (Equation (54)). When the Markov chain is also ergodic, we know that its persistent behavior is fully described by l 1 = π, meaning that this expression simplifies to:

Reversible chains
One of the defining characteristics of Markov chains is that the future ($X$) is conditionally independent of the past ($Z$), given the present ($Y$). A simple calculation demonstrates that the relationship of conditional independence is symmetric. Assuming that $\Pr(X \mid Y, Z) = \Pr(X \mid Y)$, we find that:
$$\Pr(Z \mid Y, X) = \frac{\Pr(X \mid Y, Z)\,\Pr(Z \mid Y)}{\Pr(X \mid Y)} = \Pr(Z \mid Y).$$
Assuming that we have an initial Markov chain $\mathcal{X}$ described by a transition matrix $P$, this symmetry tells us that reversing the direction of time produces a new process which itself satisfies the Markov property. Hence, we can define this new Markov chain as the time-reversal of $\mathcal{X}$, which we denote as $\tilde{\mathcal{X}}$. A natural question we can ask is how the transition matrix of $\tilde{\mathcal{X}}$ is related to $P$. In order to answer this, we make the assumption that at time $t$ the chain is described by one of its stationary distributions $\pi$. Therefore:
$$\Pr(X_t = s_i) = \Pr(X_{t+1} = s_i) = \pi_i \quad \forall s_i \in S.$$
We can make use of this, together with Bayes' theorem, to work out the transition probabilities $(P_{\mathrm{rev}})_{ij}$ of $\tilde{\mathcal{X}}$ [38]:
$$(P_{\mathrm{rev}})_{ij} = \Pr(X_t = s_j \mid X_{t+1} = s_i) = \frac{\Pr(X_{t+1} = s_i \mid X_t = s_j)\,\Pr(X_t = s_j)}{\Pr(X_{t+1} = s_i)} = \frac{\pi_j P_{ji}}{\pi_i}. \tag{66}$$
A couple of reflections can be made about the result above. Firstly, since we initialized $\mathcal{X}$ to one of its stationary distributions, the time reversal relationship between $\mathcal{X}$ and $\tilde{\mathcal{X}}$ only holds once they have converged. Secondly, since we have $\pi_i$ in the denominator of Equation (66), the time reversal of a Markov chain is only valid for a starting distribution with $\pi_i > 0 \;\forall s_i \in S$. In Section 2.4.1, we have established that only recurrent chains have stationary distributions of this type, meaning that the time reversal of a Markov chain is only well-defined if the chain is recurrent. Lastly, using Equation (66) it is possible to define $P_{\mathrm{rev}}$ in matrix notation as follows: Definition 2.6.1 (Time reversal). Let $\mathcal{X}$ be a recurrent Markov chain with transition matrix $P$ and $\pi > 0$ one of its stationary distributions. Then, the transition matrix of its time reversal $\tilde{\mathcal{X}}$ is given by:
$$P_{\mathrm{rev}} = \Pi^{-1} P^T \Pi,$$
where $\Pi$ denotes the diagonal matrix with the entries of $\pi$ on its diagonal.
It is worth emphasizing that since no assumption is made in Definition 2.6.1 about the number of communicating classes, it also applies in the case of reducible recurrent chains where there are an infinite number of distributions $\pi > 0$ to choose from. However, in such cases the choice of $\pi$ makes no difference: Proposition 2.6.2 (Time reversal of reducible chains). For a reducible recurrent Markov chain $\mathcal{X}$, the time reversal $\tilde{\mathcal{X}}$ is uniquely defined, with $P_{\mathrm{rev}}$ being independent of which stationary distribution $\pi > 0$ is used (proof: see Appendix A).
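The matrix form of the time reversal is straightforward to evaluate. The following is a minimal sketch under stated assumptions: the function name `time_reversal` and the 3-state example are hypothetical. The example matrix is doubly stochastic, so its stationary distribution is uniform and the time reversal reduces to the transpose, which makes the result easy to check by hand.

```python
import numpy as np

def time_reversal(P, pi):
    """P_rev = Pi^{-1} P^T Pi for a strictly positive stationary distribution."""
    if np.any(pi <= 0):
        raise ValueError("pi must be strictly positive")
    # Entry (i, j) equals pi_j * P_ji / pi_i, as in the component form above.
    return (P.T * pi[np.newaxis, :]) / pi[:, np.newaxis]

# Hypothetical chain that mostly cycles 0 -> 1 -> 2 -> 0.
P = np.array([[0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9],
              [0.9, 0.1, 0.0]])
pi = np.full(3, 1.0 / 3.0)   # uniform, since every column of P also sums to 1

P_rev = time_reversal(P, pi)  # mostly cycles in the opposite direction
```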
Moreover, the set of stationary distributions $\pi > 0$ belonging to a recurrent Markov chain is the same as the set belonging to the corresponding time reversal: Proposition 2.6.3 (Stationary distributions of time reversal). Let $\mathcal{X}$ be a recurrent Markov chain. Then $\pi > 0$ is a stationary distribution of $\tilde{\mathcal{X}}$ if and only if it is a stationary distribution of $\mathcal{X}$.
We now consider the special case where a Markov chain $\mathcal{X}$ is indistinguishable from its time reversal $\tilde{\mathcal{X}}$. Such processes are known as reversible Markov chains, since in any stationary distribution the forward and backwards dynamics of the chain are statistically equivalent, i.e. any trajectory $X_1, X_2, \ldots, X_{k-1}, X_k$ occurs with the same probability as the corresponding reversed trajectory $X_k, X_{k-1}, \ldots, X_2, X_1$. The stationary dynamics of such chains therefore has no inherent arrow of time. Furthermore, since $\mathcal{X}$ and $\tilde{\mathcal{X}}$ are indistinguishable, they have the same transition matrix, i.e. $P = P_{\mathrm{rev}}$, and for any stationary distribution $\pi > 0$ the forward and backwards flow matrices are the same. Therefore:
$$\Pi P = P^T \Pi. \tag{70}$$
By expressing Equation (70) in component form, we arrive at the following theorem which is used throughout the rest of the tutorial: Theorem 2.6.4 (Detailed balance). A recurrent Markov chain is reversible if and only if, for any stationary distribution $\pi > 0$:
$$\pi_i P_{ij} = \pi_j P_{ji} \quad \forall s_i, s_j \in S, \tag{71}$$
which are known as the equations of detailed balance.
A number of observations can be made about this theorem. Firstly, the left (right) terms represent the flow of probability from $s_i$ to $s_j$ ($s_j$ to $s_i$), given that the chain is described by distribution $\pi$. Thus, for a reversible Markov chain in one of its stationary distributions, the flow from one state to another is completely balanced by the flow in the reverse direction, meaning that the flow matrix $F_\pi$ is always symmetric for such chains. By comparison with Equation (19), we see that detailed balance is a stronger condition than global balance, since in the latter case there is only an equivalence between the total flow in and out of each state. Secondly, since $\pi_i$ and $\pi_j$ are non-zero, it follows that $P_{ij} = 0$ if and only if $P_{ji} = 0$. Thus, the transition structure of a reversible Markov chain always permits the return to the previous state, and because of this the period of a reversible Markov chain can be at most 2. Thirdly, while some sources assume irreducibility as a precondition of reversibility, we instead base our definition on the weaker condition of recurrence [37]. This is due to the fact that we only need recurrence in order to define the time reversal of a Markov chain. Furthermore, with this convention Theorem 2.6.4 applies more broadly to reducible Markov chains, which lets us make a closer comparison between reversible Markov chains and undirected graphs in Section 4. Fourthly, Theorem 2.6.4 implies that there are two distinct ways in which Markov chains can be non-reversible: (i) they can be recurrent without satisfying detailed balance, or (ii) they can be non-recurrent. In the case of (i), $\Pi P \neq P^T \Pi$ for any distribution $\pi > 0$, meaning that the flow matrix $F_\pi$ is asymmetric, and in the case of (ii) no positive stationary distribution exists. Finally, note that for a non-recurrent chain, there exists the possibility that $F_\pi$ is symmetric for all stationary distributions despite none of those distributions being strictly positive. For such chains, removing all transient states from $S$ produces a reversible chain. We therefore refer to such chains as semi-reversible.
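Detailed balance can be checked numerically by testing whether the flow matrix is symmetric. In the sketch below, the function names and both 3-state chains are hypothetical examples rather than the chains of Figure 6. The first chain only allows moves between neighbouring states and satisfies detailed balance; the second has a preferred direction of rotation and only satisfies global balance.

```python
import numpy as np

def flow_matrix(P, pi):
    """F = Pi P, with entries F_ij = pi_i * P_ij (flow from s_i to s_j)."""
    return pi[:, np.newaxis] * P

def satisfies_detailed_balance(P, pi):
    """True when pi_i P_ij == pi_j P_ji for all pairs, i.e. F is symmetric."""
    F = flow_matrix(P, pi)
    return bool(np.allclose(F, F.T))

def satisfies_global_balance(P, pi):
    """True when the total flow into each state equals the total flow out."""
    F = flow_matrix(P, pi)
    return bool(np.allclose(F.sum(axis=0), F.sum(axis=1)))

# Hypothetical reversible chain with stationary distribution (0.25, 0.5, 0.25).
P_line = np.array([[0.50, 0.50, 0.00],
                   [0.25, 0.50, 0.25],
                   [0.00, 0.50, 0.50]])
pi_line = np.array([0.25, 0.5, 0.25])

# Hypothetical non-reversible chain with a uniform stationary distribution.
P_cycle = np.array([[0.0, 0.9, 0.1],
                    [0.1, 0.0, 0.9],
                    [0.9, 0.1, 0.0]])
pi_cycle = np.full(3, 1.0 / 3.0)

line_reversible = satisfies_detailed_balance(P_line, pi_line)
cycle_reversible = satisfies_detailed_balance(P_cycle, pi_cycle)
```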
To illustrate some of these points, we consider four Markov chains in Figure 6, with (a, d, g, j) showing the transition graphs of each example, and (b, e, h, k) showing the respective transition matrix, stationary distribution and flow matrix (for simplicity, each example has a single recurrent class, so that both the stationary distribution and the associated flow matrix are unique). As a visual illustration of the pairwise stationary flow between states in each example, in (c, f, i, l) the stationary distribution is represented as a bar plot, with the portions of probability mass flowing in (left) and out (right) of each state shown as portions of each bar. The example in (a) is reversible, as can be seen by the symmetry of $F_{\pi_1}$ in (b), or equivalently by the matching between left and right portions of all bars in (c). The example in (d) is almost equivalent to the one in (a), except that the outgoing transition probabilities from $s_1$ have been slightly modified (indicated by the colored arrows in (a, d) and the colored entries of $P_1$ and $P_2$ in (b, e)). This modification is enough to violate detailed balance, as can be seen by the asymmetry of $F_{\pi_2}$ or the bar plot in (f). Lastly, the chains depicted in (g, j) are non-recurrent since in both cases state $s_3$ has only outgoing transitions. Therefore, $\pi_3 = 0$ and both examples are non-reversible. However, a quick check of $F_{\pi_3}$ or the bar plot in (i) reveals that the example in (g) is semi-reversible. Conversely, the example in (j) is identical to the one in (g) except for the outgoing transitions from $s_4$ (again indicated by the colored arrows in (g, j) and the colored entries of $P_3$ and $P_4$ in (h, k)), which leads to an asymmetric stationary flow between the recurrent states.
It is worth pointing out that in our analysis above, reversibility was checked by inspecting the stationary distributions and the corresponding flow matrices of each example. However, since reversibility is a property associated to Markov chains and not to distributions, one might wonder whether there is an alternative way to formalize it based purely on the transition probabilities $P_{ij}$. Clearly, Equation (71) prohibits one way transitions (i.e. $P_{ij} > 0$ and $P_{ji} = 0$), but this is only a necessary condition of reversibility. Can we offer anything more precise? Fortunately, the answer is yes, and it is given by Kolmogorov's criterion [39]: Theorem 2.6.5 (Kolmogorov's criterion). A recurrent Markov chain is reversible if and only if the product of one-step transition probabilities along any finite closed path of length more than two is the same as the product of one-step transition probabilities along the reversed path. In other words:
$$P_{i_1 i_2} P_{i_2 i_3} \cdots P_{i_{n-1} i_n} P_{i_n i_1} = P_{i_1 i_n} P_{i_n i_{n-1}} \cdots P_{i_3 i_2} P_{i_2 i_1} \tag{72}$$
for any $n \geq 2$ and any sequence of states $s_{i_1}, s_{i_2}, s_{i_3}, \ldots, s_{i_{n-1}}, s_{i_n} \in S$.
One way to understand this theorem is that for reversible Markov chains, the probability for traversing any closed path in the state space $S$ is independent of the direction of traversal. Hence, reversible Markov chains can be thought of as having zero net circulation. By contrast, recurrent Markov chains that are non-reversible have at least one path that violates Equation (72), over which there is a higher probability to traverse in one direction than the other. For the example in Figure 6(a), the relevant closed paths are (up to a cyclic permutation, and written in the clockwise direction): (i) $s_1 \to s_2 \to s_4 \to s_3 \to s_1$, (ii) $s_1 \to s_2 \to s_4 \to s_1$, and (iii) $s_1 \to s_4 \to s_3 \to s_1$. In any of these cases, going around clockwise is equally probable as going around anticlockwise, which is to be expected since this Markov chain is reversible. The example in Figure 6(d) has the same closed paths available, except that the outgoing transition probabilities from $s_1$ have been changed. This small adjustment is enough to introduce circulation on all the closed paths: for both (i) and (iii) the anticlockwise direction is more probable since $P_{13} P_{34} P_{42} P_{21} > P_{12} P_{24} P_{43} P_{31}$ and $P_{13} P_{34} P_{41} > P_{14} P_{43} P_{31}$, respectively, and for (ii) the clockwise direction is more probable since $P_{12} P_{24} P_{41} > P_{14} P_{42} P_{21}$. Therefore, by virtue of having at least one path with net circulation, Equation (72) confirms that this chain is indeed non-reversible. The foregoing analyses illustrate how Theorem 2.6.4 and Theorem 2.6.5 provide two alternative but equivalent definitions of reversibility. Something common to both of these interpretations is that reversible Markov chains satisfy a type of equilibrium, either between the exchange of probability mass between pairs of states or the circulation along closed paths, respectively. In fact, the concept of detailed balance stems from early work in the field of statistical mechanics aimed at formalizing the notion of thermodynamic equilibrium on a microscopic level [40].
More recently, Markov chain Monte Carlo methods, which are predominantly based on reversible ergodic chains, have received widespread application in the natural sciences as a way to model systems that are in thermodynamic equilibrium [36]. Conversely, Markov chains that violate detailed balance, or equivalently those with net circulation, have been applied to the less well understood case of systems which are out of equilibrium [41,42,43]. Furthermore, their stationary distributions have been referred to as non-equilibrium steady states (or NESS) [41,42,43,34,44], which reflects the fact that such distributions are kept fixed over time via unequal flows of probability mass between states ($\pi_2$ is an example of a NESS, as can be seen in the bar plot in Figure 6(f)).
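Kolmogorov's criterion can be evaluated path by path without computing any stationary distribution. The sketch below compares the product of transition probabilities along a closed path with the product along its reversal. The helper name `path_products` and both 3-state chains are hypothetical: the first is a random walk on an undirected triangle with edge weights 1, 2 and 3, and the second is a chain with a preferred direction of rotation. Checking a single path only gives a necessary condition; the criterion itself requires equality on every closed path.

```python
import numpy as np

def path_products(P, path):
    """Products of one-step probabilities along a closed path and its reversal.

    `path` lists the visited state indices once; the return to the first
    state is added automatically.
    """
    closed = list(path) + [path[0]]
    steps = list(zip(closed[:-1], closed[1:]))
    forward = np.prod([P[a, b] for a, b in steps])
    backward = np.prod([P[b, a] for a, b in steps])
    return float(forward), float(backward)

# Hypothetical reversible chain: random walk on an undirected weighted triangle.
W = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
P_triangle = W / W.sum(axis=1, keepdims=True)

# Hypothetical non-reversible chain that mostly cycles 0 -> 1 -> 2 -> 0.
P_cycle = np.array([[0.0, 0.9, 0.1],
                    [0.1, 0.0, 0.9],
                    [0.9, 0.1, 0.0]])

fwd_tri, bwd_tri = path_products(P_triangle, [0, 1, 2])   # equal products
fwd_cyc, bwd_cyc = path_products(P_cycle, [0, 1, 2])      # net circulation
```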
Reversible Markov chains are significantly easier to treat both analytically and numerically than non-reversible chains. Because of this, there exist various procedures for modifying a non-reversible Markov chain so that it becomes reversible, which is sometimes referred to as reversibilization [45,38]. For a recurrent chain, this can be done by taking an average of the forward and backwards transition probabilities, $P$ and $P_{\mathrm{rev}}$, that describe the chain and its time reversal, respectively. This averaging process can be either additive or multiplicative, leading to the following two definitions: Definition 2.6.6 (Additive Reversibilization). Let $\mathcal{X}$ be a recurrent non-reversible Markov chain with transition matrix $P$ and a stationary distribution $\pi > 0$. Then the additive reversibilization of $\mathcal{X}$ is a chain with the following transition matrix:
$$P_{\mathrm{add}} = \frac{1}{2}\left(P + P_{\mathrm{rev}}\right),$$
and for which $\pi$ is also a stationary distribution.
Definition 2.6.7 (Multiplicative Reversibilization). For a non-reversible Markov chain with transition matrix $P$ and a strictly positive stationary distribution $\pi$, the multiplicative reversibilization produces a Markov chain described by the following transition matrix:
$$P_{\mathrm{mult}} = P\,P_{\mathrm{rev}},$$
and for which $\pi$ is also a stationary distribution.
Since both definitions produce chains that have the same set of stationary distributions as the starting chain, a simple way to verify their reversibility is to calculate the flow matrix for the distribution $\pi$. For the additive reversibilization this gives:
$$\Pi P_{\mathrm{add}} = \frac{1}{2}\left(\Pi P + \Pi P_{\mathrm{rev}}\right) = \frac{1}{2}\left(\Pi P + P^T \Pi\right) = \frac{1}{2}\left(F_\pi + (F_\pi)^T\right), \tag{80}$$
and for the multiplicative reversibilization:
$$\Pi P_{\mathrm{mult}} = \Pi P\,\Pi^{-1} P^T \Pi = F_\pi\,\Pi^{-1} (F_\pi)^T, \tag{84}$$
both of which are symmetric by construction. While Equation (84) does not admit a simple interpretation, Equation (80) says that for the additive reversibilization the flow matrix is symmetric because it corresponds to an average of the forwards flow, $F_\pi$, and backwards flow, $(F_\pi)^T$, of the starting chain $\mathcal{X}$. This interpretation is used when we consider random walks on directed graphs in Section 4.3.
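Both reversibilizations can be tested numerically. The following sketch assumes the additive form is the arithmetic mean of $P$ and $P_{\mathrm{rev}}$ and the multiplicative form is the product $P\,P_{\mathrm{rev}}$; the function names and the 3-state example matrix are hypothetical, and the stationary distribution is obtained by solving the stationarity equations in the least-squares sense.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi^T P = pi^T with sum(pi) = 1 (assumes a unique solution)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def time_reversal(P, pi):
    """P_rev with entries pi_j * P_ji / pi_i."""
    return (P.T * pi[np.newaxis, :]) / pi[:, np.newaxis]

def additive_reversibilization(P, pi):
    """Arithmetic mean of the forward and backward transition matrices."""
    return 0.5 * (P + time_reversal(P, pi))

def multiplicative_reversibilization(P, pi):
    """One forward step followed by one backward step."""
    return P @ time_reversal(P, pi)

def has_symmetric_flow(P, pi):
    """True when the flow matrix Pi P is symmetric (detailed balance)."""
    F = pi[:, np.newaxis] * P
    return bool(np.allclose(F, F.T))

# Hypothetical irreducible, non-reversible chain.
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.2, 0.2, 0.6]])
pi = stationary_distribution(P)

P_add = additive_reversibilization(P, pi)
P_mult = multiplicative_reversibilization(P, pi)
```

In this example the original chain has an asymmetric flow matrix, whereas both `P_add` and `P_mult` have symmetric flow matrices with respect to the same distribution `pi`.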

Absorbing chains
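Finally, one concept in the theory of Markov chains that is particularly relevant to applied domains is absorption. A state $s_i \in S$ is called absorbing if it is possible to transition into the state but not out of it, meaning that $P_{ii} = 1$ and the chain stays in $s_i$ for all future time steps. An absorbing Markov chain is one for which from every state $s_i \in S$ there exists some path to an absorbing state. Since it is possible to start in a non-absorbing state and never return, all non-absorbing states are transient, and the presence of such states means that absorbing chains can be neither reversible nor ergodic. Absorbing chains often occur in Markov Decision Processes (MDPs), which are central to the field of reinforcement learning [32].
For an absorbing chain with $t$ transient states and $r$ absorbing states, there are three types of transitions: (i) between transient states, (ii) from a transient state to an absorbing state, and (iii) from an absorbing state to itself. Ordering the states so that the transient states come first, the transition matrix can be written in block form as:
$$P = \begin{pmatrix} Q & R \\ 0 & \mathbb{1} \end{pmatrix},$$
where $Q \in \mathbb{R}^{t \times t}$, $R \in \mathbb{R}^{t \times r}$ and $\mathbb{1} \in \mathbb{R}^{r \times r}$ describe transitions of type (i), (ii), and (iii), respectively, and $0 \in \mathbb{R}^{r \times t}$ is a matrix of zeros. Thus, matrix $Q$ is what remains of $P$ when we remove any absorbing states from $S$.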
We depict this partitioning of transition probabilities in Figure 7(a). An absorbing chain with one absorbing state is shown, with the transitions belonging to Q, R and 1 colored black, red and blue, respectively.
Furthermore, in Figure 7(b) we show the matrices $Q$ and $R$. Since any transient state can reach an absorbing state in a finite number of steps, the probability that the chain ends up in an absorbing state at some future time is 1. For this reason, in the infinite time limit we can expect to see no transitions taking place between transient states, i.e. $\lim_{n \to \infty} Q^n = 0$. This is an advantageous property, since it means that if we sum up all powers of $Q$, known as the Neumann series of $Q$, then the contributions of larger powers get progressively smaller and the sum converges to $(\mathbb{1} - Q)^{-1}$ (see [35] p. 618). Calculating this sum for $Q$ leads to the following useful quantity which relates transient states in $S$ [37]: Definition 2.7.2 (Fundamental Matrix). For any absorbing Markov chain, the Neumann series of matrix $Q$ is given by:
$$N = \sum_{k=0}^{\infty} Q^k = (\mathbb{1} - Q)^{-1},$$
and is known as the fundamental matrix of the Markov chain. The elements $N_{ij}$ of this matrix give the expected number of times the chain visits a transient state $s_j$ before absorption, given that the chain started in a transient state $s_i$.
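The fundamental matrix is easy to compute for small chains, and doing so also illustrates the convergence of the Neumann series. The sketch below uses a hypothetical absorbing chain with two transient states and one absorbing state (not the chain of Figure 7). The last two quantities, the expected number of steps before absorption (row sums of $N$) and the absorption probabilities ($NR$), are standard results for absorbing chains that are not derived in the text above.

```python
import numpy as np

# Hypothetical absorbing chain: two transient states and one absorbing state.
Q = np.array([[0.50, 0.25],
              [0.25, 0.50]])   # transient -> transient
R = np.array([[0.25],
              [0.25]])         # transient -> absorbing

N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix

# Partial sums of the Neumann series approach N as more powers are added.
partial_sum = np.zeros_like(Q)
power = np.eye(2)                  # Q^0
for _ in range(200):
    partial_sum += power
    power = power @ Q

expected_steps = N.sum(axis=1)     # expected number of steps before absorption
absorption_probs = N @ R           # probability of ending in each absorbing state
```

In this example the chain is absorbed with probability 0.25 at every step regardless of the current state, so the expected number of steps before absorption is 4 from either transient state, in agreement with the row sums of `N`.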
When analyzing an absorbing chain, it is very handy to have access to the fundamental matrix. By taking into account all non-negative powers of Q, it contains information about all possible paths available between pairs of transient states. Because of this, it is a useful predictive tool that allows several properties of the Markov chain to be deduced [37]. Furthermore, in the field of reinforcement learning, it is closely related to the successor representation [46].

Summary
This concludes our exploration of different types of Markov chains. In Figure 8, we provide a summary of the material presented in this section in the form of a Venn diagram. In this diagram, each type of Markov chain is drawn as a circle or ellipse, with defining properties/results listed in each case. Take a moment to look at this image and pay attention to the overlapping regions which indicate how different types of chains are related. Furthermore, for a more in-depth presentation of the material in this section, we recommend [37] and [38].
In the next section, we introduce graphs as an alternative way to describe Markov chains and summarize insights that emerge from this description. Then, in Section 4 the connection between graphs and Markov chains is explored in more depth using the notion of random walks, which allows various relationships to be made between specific types of graphs and some types of Markov chains introduced in this section.

Graphs
So far, we have implicitly been interpreting Markov chains as graphs whenever we draw a transition graph. In the current section, we formally introduce the concept of graphs, which provides a foundation to the material on random walks in Section 4. Readers should note that definitions in graph theory often vary between different sources. Here we use a convention that can encompass a wider variety of graphs, thereby offering greater generality.

Definition
A graph $G = (V, E)$ is a set of $N$ vertices $V = \{v_1, v_2, \cdots, v_N\}$ together with an edge set $E$ containing pairs of vertices in $V$. Conceptually, $V$ might represent a collection of objects, and $E$ a specification of how some pairs in this collection are related to one another. A natural way to categorize graphs is based on the way in which edges are defined. For instance, in an undirected graph each edge has no direction and is typically denoted as $(v_i, v_j) \in E$, whereas in a directed graph each edge has a specified starting and ending vertex and is usually denoted as $(v_i \to v_j) \in E$. Examples of undirected and directed graphs can be seen in the first and second rows of Figure 9, respectively. Unless otherwise stated, we depict undirected edges as straight lines and directed edges as curved lines with arrowheads indicating the direction. A second distinction we can make is between unweighted graphs, in which one only cares about whether two vertices are related or not, and weighted graphs, in which each edge has a positive weight $w_{ij}$ describing the strength of the relationship. In Figure 9, the examples in the first column are unweighted and all other graphs are weighted, with weights indicated by numbers next to each edge. The type of edges that a graph has is often chosen based on the type of relationship that one wants to describe. For example, assume that we have a graph $G$ where vertices represent PhD students. Then, if we want to represent the relationship of being in the same research group, undirected unweighted edges are a natural choice (such as Figure 9(a)). Conversely, if we want edges to describe whether one student has participated in a main project of another student, then this clearly requires directed unweighted edges (such as Figure 9(d)). If we now consider variants of the first and second examples, instead focusing on how similar the research topics of two students are, or how much work one student has contributed to another student's project, then we need undirected weighted and directed weighted edges, respectively (such as Figure 9(b,c) and Figure 9(e,f)). It is worth noting that in order to assign weights to edges, one needs to specify a scale on which to measure the strength of relationships between vertices.
One can also describe graphs based on their connectivity. In an undirected graph, if there exists a path between each pair of vertices then the graph is said to be connected, otherwise it is disconnected. The notion of connectivity can generalize to disconnected graphs if we instead consider subsets of vertices in $G$, which are known as subgraphs. Any subgraph that is connected but is not part of any larger connected subgraph is called a connected component. Both of the undirected graphs in Figure 9(a, b) are connected, whereas Figure 9(c) shows an example that is disconnected, with two connected components. In particular, this latter example even has a vertex that does not have any edges at all, which is known as an isolated vertex. For a directed graph, if there are directed paths running from $v_i$ to $v_j$ and from $v_j$ to $v_i$ for all pairs of vertices $v_i, v_j \in V$ then the graph is said to be strongly connected. Alternatively, a directed graph is weakly connected if for all pairs of vertices $v_i, v_j \in V$ it is possible to get from $v_i$ to $v_j$ and from $v_j$ to $v_i$ by some path when the direction of edges is ignored. Clearly, a directed graph is weakly connected if it is strongly connected, but not vice versa. Furthermore, strongly or weakly connected subgraphs that are not part of any larger such subgraphs are referred to as strongly or weakly connected components, respectively. The directed graphs in Figure 9(d,e) are strongly connected, whereas the one in Figure 9(f) is only weakly connected and has two strongly connected components (take a moment to verify this).
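Strong and weak connectivity can be tested with the same reachability computation. The sketch below represents a graph by a non-negative square matrix whose entry $(i, j)$ is positive when there is an edge from $v_i$ to $v_j$; the function names and the two 3-vertex examples are hypothetical and are not the graphs of Figure 9. Weak connectivity is checked by ignoring edge directions, which here amounts to adding the matrix to its transpose.

```python
import numpy as np

def reachability(W):
    """reach[i, j] is True when a directed path leads from v_i to v_j."""
    n = W.shape[0]
    reach = (W > 0) | np.eye(n, dtype=bool)
    for k in range(n):  # Warshall's algorithm for the transitive closure
        reach |= reach[:, [k]] & reach[[k], :]
    return reach

def is_strongly_connected(W):
    """True when every vertex can reach every other along directed paths."""
    return bool(reachability(W).all())

def is_weakly_connected(W):
    """True when the graph is connected once edge directions are ignored."""
    return is_strongly_connected(W + W.T)

# Hypothetical directed path v_0 -> v_1 -> v_2: weakly but not strongly connected.
W_path = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 2.0],
                   [0.0, 0.0, 0.0]])

# Adding the edge v_2 -> v_0 closes a directed cycle: strongly connected.
W_cycle = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 2.0],
                    [3.0, 0.0, 0.0]])
```

For an undirected graph with a symmetric matrix, `is_strongly_connected` coincides with the usual notion of a connected graph.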

Matrix representation
A natural way to numerically represent a graph with $|V| = N$ vertices is with an $N \times N$ matrix. In the unweighted case, this matrix is a binary matrix $A$, with entries:

$$A_{ij} = \begin{cases} 1 & \text{if } v_i \text{ and } v_j \text{ share an edge} \\ 0 & \text{otherwise} \end{cases}$$

when the graph is undirected, and:

$$A_{ij} = \begin{cases} 1 & \text{if there is an edge } (v_i \to v_j) \\ 0 & \text{otherwise} \end{cases}$$

when it is directed. The matrix $A$ is usually referred to as the adjacency matrix of $G$. This extends easily to the weighted case, where instead we have a non-negative matrix $W$, with entries:

$$W_{ij} = \begin{cases} \text{weight of the edge } (v_i \to v_j) & \text{if this edge exists} \\ 0 & \text{otherwise} \end{cases}$$

which is often referred to as the weight matrix of $G$. In Figure 9, the relevant matrix is shown below each graph, and we encourage readers to verify that they match in each case. It is also worth pointing out that undirected graphs always have symmetric matrices, as can be seen in Figure 9(a-c).
For the remainder of this tutorial, we assume that the graphs we deal with are both weighted and directed. The reason we choose this convention is that it is more general. On one hand, any unweighted graph can be considered as a special case of a weighted graph where the weights are all set to 1. Thus, we henceforth only talk about weight matrices as opposed to adjacency matrices when describing graphs numerically. On the other hand, there is one sense in which directed graphs can be thought of as a generalization of undirected graphs. If $v_i$ and $v_j$ are two distinct vertices that share an edge in an undirected graph, then the weight of this edge is guaranteed to appear twice in the weight matrix $W$ by virtue of it being symmetric. If instead these vertices belong to a directed graph and there is an edge $(v_i \to v_j)$, then this edge appears only once in $W$. Therefore, it is possible to interpret an undirected edge between $v_i$ and $v_j$ as being equivalent to a pair of directed edges of the same weight, with one connecting $v_i$ to $v_j$ and the other connecting $v_j$ to $v_i$. This is the interpretation we use throughout the rest of the tutorial whenever we refer to undirected graphs.
As an example, in Figure 10 we show two equivalent depictions of an undirected graph, with the top image drawn in the usual way, and the bottom image drawn using pairs of directed edges. Below these two drawings, the weight matrix of this graph is shown. One must note that interpreting undirected graphs in this way is somewhat atypical, however it allows us a greater level of generality when dealing with different types of graphs in Section 4. Furthermore, this interpretation only applies to edges between distinct vertices, and edges that connect vertices to themselves are discussed in Section 3.4.
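As a minimal sketch of this convention, the following Python snippet builds a weight matrix from a list of edges. The use of NumPy, the helper name `weight_matrix` and the example edges are our own illustrative assumptions and are not taken from the figures. An undirected edge between distinct vertices is stored as a pair of directed edges of equal weight, which is what makes the resulting matrix symmetric.

```python
import numpy as np

def weight_matrix(n, edges, directed=True):
    """Build an n x n weight matrix from (i, j, w) triples (0-based indices).

    Hypothetical helper: W[i, j] holds the weight of the edge from v_i to v_j.
    """
    W = np.zeros((n, n))
    for i, j, w in edges:
        W[i, j] = w
        if not directed and i != j:
            # An undirected edge is a pair of directed edges of equal weight.
            W[j, i] = w
    return W

# Hypothetical example edges (not taken from Figure 9 or Figure 10).
edges = [(0, 1, 2.0), (1, 2, 0.5), (0, 2, 1.0)]
W_undirected = weight_matrix(3, edges, directed=False)
W_directed = weight_matrix(3, edges, directed=True)

print(np.allclose(W_undirected, W_undirected.T))  # symmetric
print(np.allclose(W_directed, W_directed.T))      # not symmetric
```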

Vertex degrees
Once the weight matrix of a graph is known, it is easy to calculate the total weight coming in and out of each vertex. The total incoming weight of a vertex $v_i$ can be found by summing over the $i$-th column of $W$, and is known as the in-degree of $v_i$, i.e. $d_i^- := \sum_{j=1}^{N} W_{ji}$. Conversely, the total outgoing weight of $v_i$ is calculated by the sum over the $i$-th row of $W$, i.e. $d_i^+ := \sum_{j=1}^{N} W_{ij}$, and is known as the out-degree of $v_i$. Since undirected graphs always have bidirectional edges and symmetric weight matrices, the in- and out-degrees of such graphs are always equal, and are simply referred to as vertex degrees, denoted by $d_i$.
For directed graphs the in- and out-degrees of a vertex generally differ, but summing either type over all vertices gives the same total, $\sum_{i=1}^{N} d_i^- = \sum_{i=1}^{N} d_i^+ = \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij}$, which is sometimes referred to as the volume of $G$ and written $\text{vol}(G)$. As an example, consider the directed graph in Figure 9(e): each vertex has different in- and out-degrees, e.g. $d_4^- = 2$ and $d_4^+ = 1.8$, but summing over either of the degree types yields $\text{vol}(G) = 6.3$. Nonetheless, some directed graphs can have $d_i^+ = d_i^-$ for each vertex, and such cases are known as balanced graphs [48,49]. In keeping with the notation of undirected graphs, we denote the vertex degrees of a balanced graph as $d_i = d_i^+ = d_i^-$. An example of a balanced graph along with its corresponding weight matrix is shown in Figure 11, and a quick check reveals that summing over the rows or columns of $W$ indeed yields the same values. Just as a balanced graph is a special case of a directed graph, we can similarly say that an undirected graph is a special case of a balanced graph, and this interpretation is important in Section 4.
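The degree definitions above translate directly into row and column sums. The sketch below uses two hypothetical weight matrices (they are not the graphs of Figure 9 or Figure 11) to show that the in- and out-degrees of a directed graph can differ per vertex while still summing to the same volume, and that a balanced graph has equal in- and out-degrees at every vertex.

```python
import numpy as np

# Hypothetical directed graph: W[i, j] is the weight of the edge v_i -> v_j.
W = np.array([[0.0, 1.0, 2.0],
              [0.5, 0.0, 0.0],
              [0.0, 3.0, 0.0]])

d_out = W.sum(axis=1)  # out-degrees: row sums
d_in = W.sum(axis=0)   # in-degrees: column sums
vol = W.sum()          # volume of the graph

# Hypothetical balanced graph: a directed 3-cycle with unit weights.
W_bal = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])

def is_balanced(W):
    """True if every vertex has equal in- and out-degree."""
    return np.allclose(W.sum(axis=0), W.sum(axis=1))

print(d_out, d_in, vol)
print(is_balanced(W), is_balanced(W_bal))
```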

Self-loops
In the examples considered so far, all edges connect pairs of vertices that are distinct, i.e. $v_i \neq v_j$. While this is sometimes enforced as a rule, some conventions also allow edges to connect vertices to themselves, which are known as self-loops. For undirected graphs, the standard convention is that a self-loop at vertex $v_i$ counts doubly to the vertex degree $d_i$, while other edges only count singly, i.e. $d_i = 2 W_{ii} + \sum_{j \neq i} W_{ij}$. This somewhat counter-intuitive property is typically demonstrated using the degree sum formula [50]. For undirected graphs, this states that each edge contributes twice its weight to the volume. Since self-loops only involve a single vertex, the only way that this rule can be respected is if they count twice as much to the vertex degrees as other edges. A property that we require when dealing with undirected graphs in Section 4 is that the vertex degrees are calculated by the row sums of $W$. Clearly, this property is violated by the factor of 2 that applies to undirected self-loops. As a result, in this tutorial we assume that self-loops are always directed, regardless of whether they occur in undirected or directed graphs. This is an atypical definition, since undirected graphs typically are not allowed to have directed edges. However, as can be seen from the examples in Figure 12, this preserves the fact that undirected graphs have symmetric weight matrices, whereas directed graphs have non-symmetric weight matrices, which is sufficient for the scope of this tutorial.
We close this section by noting some similarities between our definitions of Markov chains and graphs. Firstly, the transition matrices of Markov chains, like the weight matrices of graphs, are non-negative. Secondly, in a directed graph any entry $W_{ij} \neq 0$ of the weight matrix describes an outgoing edge from vertex $v_i$ to $v_j$, and analogously any entry $P_{ij} \neq 0$ of a transition matrix describes an outgoing transition probability from $s_i$ to $s_j$. Putting these together, we see that in the most general sense any Markov chain can be thought of as a directed graph, with $P$ being the associated weight matrix. Indeed, this interpretation is precisely what justifies us in visualizing a Markov chain by its transition graph. In the next section, we present some useful results that emerge as a result of this way of thinking about a Markov chain. Lastly, for a comprehensive text on graph theory that covers much of the material in this section, we recommend [50].

Eigenspaces of transition matrices
Non-negative matrices have received widespread attention in mathematics, and in particular their eigenvalues and eigenvectors are the focus of spectral graph theory [51]. In this section, we apply some results from this field to transition matrices, considering first irreducible chains and then subsequently exploring the generalization to reducible chains.

Irreducible chains
A fundamental result used in spectral graph theory is the Perron-Frobenius theorem, and while a full treatment of it is beyond the scope of this tutorial, we now summarize its key implications for transition matrices of irreducible Markov chains.
Theorem 3.5.1 (Perron-Frobenius theorem for irreducible Markov chains). If $P$ is the transition matrix of an irreducible Markov chain, then:
• λ = 1 is guaranteed to be an eigenvalue.
• λ = 1 is a simple eigenvalue, meaning that it occurs only once.
• All other eigenvalues have |λ| ≤ 1, where | · | is the complex modulus, meaning that the spectral radius of $P$ is 1.
To illustrate the above theorem, in Figure 13(a-c) we show the transition graphs and eigenvalue plots of three irreducible Markov chains. The first observation to make is that, in agreement with Theorem 3.5.1, λ = 1 is an eigenvalue in each case and occurs only once. Furthermore, as a quick exercise we encourage readers to find the eigenvectors of λ = 1 for each example and normalize them to obtain π and η. Lastly, the eigenvalue plots show that in each example all eigenvalues indeed lie either on the unit circle (|λ| = 1), or within it (|λ| < 1).
Other than λ = 1, Theorem 3.5.1 implies that the following two types of eigenvalues are also possible: (i) those that lie within the unit circle (|λ| < 1, transient), and (ii) those that lie on other points of the unit circle (|λ| = 1, persistent). In either case, Equation (45) tells us that if $l_\omega$ is a corresponding left eigenvector, then it must be orthogonal to the unique right eigenvector with λ = 1, which is η. Therefore:

$$\langle l_\omega, \eta \rangle = \sum_{i=1}^{N} l_{\omega,i} = 0 \qquad (91)$$

where $l_{\omega,i}$ denotes the $i$-th component of $l_\omega$. Consequently, left eigenvectors with λ ≠ 1 sum to zero, meaning that unlike stationary distributions they are not probability vectors.
Using our terminology from Section 2.3, case (i) includes transient structures, transient oscillations and transient cycles. Of the irreducible chains in Figure 13, only (a) and (b) have eigenvalues of this type, and in both cases they are complex conjugate pairs describing transient cycles. Looking at the transition probabilities in each example, it is clear that these transient cycles flow clockwise around the state space.
Case (ii) on the other hand includes persistent oscillations and persistent cycles. Theorem 2.5.2 tells us that these are only possible when a chain is periodic. The following result sheds light on this by relating the eigenvalues with |λ| = 1 to the period of a chain [52]:
Proposition 3.5.2. If $P$ is the transition matrix of an irreducible Markov chain with period $d$, then there are $d$ distinct eigenvalues with modulus 1, given by:

$$\lambda_k = e^{2\pi i k / d}, \qquad k = 0, 1, \dots, d - 1$$

In simple terms, Proposition 3.5.2 says that the eigenvalues of $P$ with modulus 1 are always $d$-th roots of unity. We can verify this by checking the periodic examples in Figure 13(b,c). In both cases the number of eigenvalues on the unit circle is indeed equal to the period of the chain and they are also equally spaced. Furthermore, Proposition 3.5.2 offers an alternative perspective on how the periodicity affects the persistent behavior of a Markov chain. For example, the chain in Figure 13(a) has only a single eigenvalue on the unit circle, corresponding to its unique stationary distribution. It is therefore guaranteed to end up in this distribution since all other eigenvalues have |λ| < 1. This is equivalent to the statement that this chain is ergodic, which a quick check of the transition graph confirms. Conversely, the chain in Figure 13(b) has an additional eigenvalue $\lambda = e^{\pi i} = -1$ on the unit circle, by virtue of the fact that it has period 2. Therefore, its persistent behavior can only be fully described using both the unique stationary distribution π and the eigenvector $y$ associated to λ = −1. For example, we know from Theorem 2.5.2 that such a chain can get trapped in a persistent oscillation (i.e. $\mu_1 \to \mu_2 \to \mu_1 \to \mu_2 \to \dots$). For any such oscillation, $\mu_1$ and $\mu_2$ can always be expressed as a linear combination of π and $y$, meaning that this sequence indeed oscillates between two points in the space spanned by these eigenvectors.
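The period-2 picture described above can be checked numerically. The sketch below uses a hypothetical three-state chain of period 2 (a walk on a path of three states, not one of the chains in Figure 13), whose stationary distribution `pi` and left eigenvector `y` for λ = −1 were worked out by hand and are verified in the code. Starting from a point mass, the distribution settles into an oscillation between `pi - y` and `pi + y`.

```python
import numpy as np

# Hypothetical irreducible chain with period 2 (states 0 - 1 - 2 on a path).
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

# Eigenvalues on the unit circle: expected to be the two square roots of unity.
eigvals = np.linalg.eigvals(P)
on_circle = eigvals[np.isclose(np.abs(eigvals), 1.0)]

# Left eigenvectors found by hand for this example (checked below).
pi = np.array([0.25, 0.5, 0.25])   # stationary distribution, pi P = pi
y = np.array([0.25, -0.5, 0.25])   # left eigenvector with y P = -y
assert np.allclose(pi @ P, pi)
assert np.allclose(y @ P, -y)

# Evolve a point mass on state 0 and record the distributions.
mu = np.array([1.0, 0.0, 0.0])
trajectory = []
for _ in range(6):
    mu = mu @ P
    trajectory.append(mu)

# The two points of the persistent oscillation.
mu_1, mu_2 = trajectory[0], trajectory[1]
print(np.sort(on_circle.real))
print(mu_1, mu_2)
```

In this example both points of the oscillation lie in the span of `pi` and `y`, with coefficients ±1 on `y`. Other starting distributions change the coefficient on `y` but not the structure of the oscillation.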
While this example only involves real eigenvalues and therefore only real eigenvectors, the interpretation extends to $d > 2$, for which Proposition 3.5.2 tells us that there must be complex eigenvalues with |λ| = 1. For example, the chain in Figure 13(c) has period $d = 3$, and it has the following 3 eigenvalues on the unit circle: $\lambda_1 = 1$, $\lambda_2 = e^{2\pi i / 3}$ and $\lambda_3 = e^{4\pi i / 3}$. Analogous to the $d = 2$ case, any persistent cycle of this chain can be expressed using the three corresponding eigenvectors, which in the case of $\lambda_2$ and $\lambda_3$ must have complex entries. Rather interestingly, this means that for chains with period $d > 2$, persistent cycles are cycles in a complex space despite being sequences of real valued distributions.
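To make the period-3 case concrete, the following sketch builds a hypothetical four-state irreducible chain with period 3 (again, not the chain of Figure 13(c)) and checks three claims of this subsection numerically: the eigenvalues on the unit circle are the cube roots of unity, the eigenvectors belonging to the complex roots have complex entries, and every left eigenvector with λ ≠ 1 sums to zero. Left eigenvectors are obtained as right eigenvectors of the transpose of P, which is our implementation choice.

```python
import numpy as np

# Hypothetical irreducible chain with period 3:
# state 0 -> 1, state 1 -> 2 or 3 (equally likely), states 2 and 3 -> 0.
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0]])
d = 3

eigvals = np.linalg.eigvals(P)
on_circle = eigvals[np.isclose(np.abs(eigvals), 1.0)]
expected = np.exp(2j * np.pi * np.arange(d) / d)  # d-th roots of unity

# Left eigenvectors of P are right eigenvectors of P.T (stored as columns).
left_vals, L = np.linalg.eig(P.T)
is_one = np.isclose(left_vals, 1.0)
column_sums = L.sum(axis=0)

# Left eigenvector belonging to lambda_2 = exp(2 pi i / 3).
idx = np.argmin(np.abs(left_vals - expected[1]))
l_2 = L[:, idx]

print(np.round(on_circle, 6))
print(np.round(np.abs(column_sums[~is_one]), 12))  # numerically zero
print(np.abs(l_2.imag).max() > 1e-8)               # complex entries
```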

Reducible chains
Applying spectral graph theory to the transition matrices of reducible Markov chains produces a weaker set of results. For example, the generalization of Theorem 3.5.1 to reducible chains is the following:
Theorem 3.5.3. If $P$ is the transition matrix of a reducible Markov chain, then:
• λ = 1 is guaranteed to be an eigenvalue.
• The number of linearly independent eigenvectors with λ = 1 is equal to the number $r$ of recurrent communicating classes in the Markov chain.
• There are many choices of left and right eigenvectors for λ = 1. However, a convenient choice which mirrors the irreducible case is to choose $\pi_k$ and $\eta_k$ as a pair of left and right eigenvectors for each of the recurrent communicating classes, with $\pi_k$ being the unique stationary distribution associated to each class and $\eta_k$ being an indicator vector with entry 1 for states in this class and zeros elsewhere.
• All other eigenvalues have |λ| ≤ 1, where | · | is the complex modulus, meaning that the spectral radius of $P$ is 1.
To understand Theorem 3.5.3 in more depth, consider the example in Figure 13(d). This Markov chain has 2 recurrent communicating classes, which means that λ = 1 has a multiplicity of 2. We indicate this on the eigenvalue plot by a larger size circle. Furthermore, we color this circle half red and half blue to reflect the fact that we can choose the eigenvectors for λ = 1 based on the two recurrent communicating classes, i.e. $\pi_1 = (0.58, 0.42, 0, 0)^T$ and $\eta_1 = (1, 1, 0, 0)^T$ as a pair of left and right eigenvectors for the red class, and $\pi_2 = (0, 0, 0.43, 0.57)^T$ and $\eta_2 = (0, 0, 1, 1)^T$ as a pair of left and right eigenvectors for the blue class. Then, $\pi_1$ and $\pi_2$ together span the left eigenspace of λ = 1 (including all possible stationary distributions), whereas $\eta_1$ and $\eta_2$ span the right eigenspace of λ = 1. It is worth emphasizing that while there are an infinite number of other ways to choose the basis vectors for λ = 1, this is a convenient choice since it is the only one for which all basis vectors have non-negative entries. For example, consider the choice of $\tilde{\pi}_1 = \frac{1}{3} \pi_1 + \frac{2}{3} \pi_2 = (0.19, 0.14, 0.29, 0.38)^T$ and $\tilde{\eta}_1 = \eta = (1, 1, 1, 1)^T$ as a first pair of eigenvectors. If we then choose a second pair $\tilde{\pi}_2$ and $\tilde{\eta}_2$ that satisfy biorthogonality (Equation (46)) within this space, they are guaranteed to contain negative entries, for example $\tilde{\pi}_2 = \pi_1 - \pi_2 = (0.58, 0.42, -0.43, -0.57)^T$ and $\tilde{\eta}_2 = 3 \eta_1 - \frac{3}{2} \eta_2 = (3, 3, -\frac{3}{2}, -\frac{3}{2})^T$. Thus, the choice of eigenvectors stated in Theorem 3.5.3 is in some sense special since it is the only one that preserves our intuition that left eigenvectors with λ = 1 correspond to distributions over the state space $S$. For this reason, we henceforth assume this convention when referring to eigenvectors with λ = 1.
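The same structure can be reproduced on a small example. The sketch below uses a hypothetical recurrent reducible chain with two recurrent classes written in block diagonal form (the numbers are ours and do not correspond to Figure 13(d)). It checks that λ = 1 occurs twice, that the class-wise stationary distributions and indicator vectors are left and right eigenvectors, and that any convex combination of the class-wise stationary distributions is again stationary.

```python
import numpy as np

# Hypothetical reducible chain: classes {0, 1} and {2, 3} do not communicate.
P = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.4, 0.6, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.8],
              [0.0, 0.0, 0.6, 0.4]])

eigvals = np.linalg.eigvals(P)
multiplicity_of_one = int(np.isclose(eigvals, 1.0).sum())

# Class-wise stationary distributions (found by hand) and indicator vectors.
pi_1 = np.array([4 / 7, 3 / 7, 0.0, 0.0])
pi_2 = np.array([0.0, 0.0, 3 / 7, 4 / 7])
eta_1 = np.array([1.0, 1.0, 0.0, 0.0])
eta_2 = np.array([0.0, 0.0, 1.0, 1.0])

# Gram matrix of the left vectors against the right vectors.
gram = np.array([[pi_1 @ eta_1, pi_1 @ eta_2],
                 [pi_2 @ eta_1, pi_2 @ eta_2]])

# A convex combination of pi_1 and pi_2 is also a stationary distribution.
mix = (1 / 3) * pi_1 + (2 / 3) * pi_2

print(multiplicity_of_one)
print(gram)
print(np.allclose(mix @ P, mix))
```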
Unfortunately, there is no general analogue of Equation (91) for reducible chains. One reason for this is that, like the λ = 1 case, there are many choices of eigenvectors for λ ≠ 1. However, for a recurrent reducible chain the transition matrix can be written in block diagonal form (see proof of Proposition 2.6.2), which means we can mirror the λ = 1 case by choosing eigenvectors with λ ≠ 1 to have non-zero entries only in a single recurrent class. If the $k$-th recurrent class has $n$ states, then the corresponding block is an $n \times n$ matrix, meaning that there are a total of $n$ pairs of left and right eigenvectors with non-zero entries for states in this class, i.e. one pair with λ = 1 ($\pi_k$ and $\eta_k$), and another $n - 1$ pairs with λ ≠ 1. We can therefore apply an equivalent argument to Equation (91) for the $k$-th class, but instead using the vector $\eta_k$. Thus, for each recurrent class, the left eigenvectors with λ ≠ 1 can also be chosen such that they all sum to zero. Conversely, non-recurrent chains cannot be written in block diagonal form, which means that this argument cannot be applied. Therefore, some eigenvectors will not sum to zero for such chains, although to our knowledge this case has not received significant attention so far in the literature.
Looking at the eigenvalue plot of Figure 13(d), we see that there are two eigenvalues with |λ| < 1, both of which are real and negative. Using the terminology from Section 2.3, they therefore correspond to transient oscillations of the chain. Furthermore, since the chain is recurrent we can apply the procedure described above and choose one eigenvector to have non-zero entries only in the red class, and the other eigenvector to have non-zero entries only in the blue class. With this choice, we see that each transient oscillation takes place on a distinct communicating class, which we indicate on the plot by coloring the eigenvalues red and blue.
In the case of Proposition 3.5.2, the extension to reducible chains is straightforward since one can simply apply this result individually to each recurrent communicating class of a reducible chain. Therefore, for each class of period d, there are d eigenvalues of modulus 1 that satisfy the same properties as in the irreducible case.
Lastly, a couple of similarities between Theorem 3.5.3 and Theorem 3.5.1 can be pointed out. Firstly, in both theorems λ = 1 is guaranteed to be an eigenvalue. Since every eigenvalue has at least one eigenvector, this means that we can always find a left eigenvector with λ = 1. Provided that we choose an eigenvector with non-negative entries and normalize it to one, it is a stationary distribution of the chain. This therefore justifies our claim from Section 2.2 that every finite Markov chain has at least one stationary distribution. Secondly, in both theorems the eigenvalues cannot have absolute value greater than one, which is one of the assumptions we made when studying the evolution of a chain in terms of its eigenvectors and eigenvalues in Section 2.3, and which justified our partitioning of Equation (54) into persistent and transient terms.
The results of this section emerge by treating Markov chains as graphs. However, in most graphs the outgoing edges from each vertex do not sum up to 1, meaning that they cannot be interpreted as transition probabilities. Because of this, Markov chains can be more precisely interpreted as a type of normalized graph. This idea is formalized in the next section, where we introduce a well-known method for transforming any graph G into a Markov chain.

Random walks

Definition
Suppose we have a graph $G$ with weight matrix $W$ that we want to normalize such that the outgoing weights from each vertex $v_i$ sum up to 1. The most obvious way to do this is to divide all entries in the $i$-th row of $W$ by $d_i^+$. By scaling each edge weight $W_{ij}$ by the out-degree of the starting vertex $v_i$, we obtain the transition probabilities $P_{ij} = W_{ij} / d_i^+$. In order to write this in matrix notation, we first define a degree matrix $D$ whose elements are given by:

$$D_{ij} = \begin{cases} d_i^+ & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

Since $d_i^+ > 0$, we can invert $D$ to produce a diagonal matrix with the entries $1 / d_i^+$ on the diagonal. Using this inverse, $W$ can be row normalized by multiplying with $D^{-1}$ from the left:

$$P = D^{-1} W \qquad (94)$$

The Markov chain represented by Equation (94) is called the random walk on $G$ [53,54]. This is a fitting name, since if we imagine an agent walking between vertices in $G$, and randomly choosing where to go at each time step based on the weights of outgoing edges, then Equation (94) would describe the resulting Markov chain. Qualitatively, we can say that the transformation in Equation (94) is useful when we have a starting graph $G$ that we would like to describe in probabilistic and/or temporal terms. Conversely, if we have a starting Markov chain and transition matrix $P$, knowing some matrix $W$ for which Equation (94) holds can offer insight into the type of relationships between states that give rise to the chain. However, the latter of these two perspectives is partially complicated by the fact that the mapping from $P$ to $W$ is one-to-many, and there are in fact an infinite number of different graphs $G$ which produce the same Markov chain. As an example, in Figure 14 we show two distinct weight matrices $W_1$ and $W_2$ that get transformed to the same transition matrix $P$ using Equation (94). How can we describe the infinite set of graphs corresponding to a single Markov chain? In principle, it involves undoing the row normalization of Equation (94).
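The row normalization of Equation (94) can be sketched in a few lines. The function name `random_walk` and the example weights are hypothetical, and the explicit check for positive out-degrees reflects the assumption that every vertex has at least one outgoing edge. Dividing each row by its sum is equivalent to multiplying by the inverse degree matrix from the left, which the last lines verify.

```python
import numpy as np

def random_walk(W):
    """Row-normalize a weight matrix W into a transition matrix P.

    Hypothetical helper: assumes every vertex has positive out-degree.
    """
    W = np.asarray(W, dtype=float)
    d_out = W.sum(axis=1)
    if np.any(d_out <= 0):
        raise ValueError("every vertex needs a positive out-degree")
    return W / d_out[:, None]

# Hypothetical weight matrix of a directed graph.
W = np.array([[0.0, 2.0, 2.0],
              [1.0, 0.0, 3.0],
              [4.0, 4.0, 0.0]])

P = random_walk(W)

# Same result via the degree matrix D and its inverse.
D = np.diag(W.sum(axis=1))
P_via_D = np.linalg.inv(D) @ W

print(P)
print(np.allclose(P, P_via_D))
```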
Thus, for a given Markov chain $\mathcal{X}$ with transition matrix $P$, we consider all possible scalings of the rows of $P$ by positive constants. Every such scaling can be described by a diagonal matrix $A \in \mathbb{R}^{N \times N}_{>0}$ that multiplies $P$ from the left to produce a single corresponding weight matrix, i.e. $W = AP$. As an illustration, in Figure 14(d, e) we show the two scaling matrices that undo the row normalization of the transition matrix in Figure 14(c) and transform it back into the weight matrices $W_1$ and $W_2$, respectively. The following definition generalizes this to the set of all such weight matrices that can be realized in this way:
Definition 4.1.1. For a Markov chain $\mathcal{X}$ defined on a state space with $|S| = N$ states and with a transition matrix $P$, the following defines the set of all graphs from which the chain can be generated via a random walk:

$$\text{RW}(\mathcal{X}) := \{ W = AP \mid A \in \mathbb{R}^{N \times N}_{>0} \text{ is a positive diagonal matrix} \} \qquad (95)$$

which we call the random walk set of $\mathcal{X}$.
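The one-to-many relationship between a transition matrix and its weight matrices can be illustrated as follows. The matrices `W1` and `A3` below are hypothetical (they are not those of Figure 14). Rescaling the rows of `W1` by positive constants gives a different graph `W2` with the same random walk, and scaling the rows of the transition matrix by the out-degrees recovers each weight matrix.

```python
import numpy as np

def random_walk(W):
    """Row-normalize W (assumes positive out-degrees)."""
    return W / W.sum(axis=1, keepdims=True)

# Hypothetical weight matrix and a positive diagonal row scaling.
W1 = np.array([[0.0, 2.0, 2.0],
               [1.0, 0.0, 3.0],
               [4.0, 4.0, 0.0]])
A3 = np.diag([0.5, 2.0, 0.25])
W2 = A3 @ W1  # different edge weights, same pattern of edges

P1 = random_walk(W1)
P2 = random_walk(W2)

# Undo the normalization: W = A P with A = diag(out-degrees of W).
A1 = np.diag(W1.sum(axis=1))
A2 = np.diag(W2.sum(axis=1))

print(np.allclose(P1, P2))       # both graphs give the same chain
print(np.allclose(A1 @ P1, W1))  # scaling P recovers W1
print(np.allclose(A2 @ P1, W2))  # a different scaling recovers W2
```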
A few details are worth noting about Definition 4.1.1. Firstly, since the trivial scaling $A = I$ (the identity matrix) is allowed, $P \in \text{RW}(\mathcal{X})$ for any Markov chain. Secondly, the fact that $A \in \mathbb{R}^{N \times N}_{>0}$ means that each of these matrices is invertible, with $A^{-1}$ also being diagonal and having entries equal to the reciprocals of the diagonals of $A$. Consequently, if $W_1 = A_1 P$ and $W_2 = A_2 P$ are two weight matrices in the random walk set of a given Markov chain, we can always write $W_1 = A_1 A_2^{-1} W_2 = A_3 W_2$, meaning that $W_1$ and $W_2$ are also related simply by a row scaling with positive constants, with $A_3 = A_1 A_2^{-1}$ describing this scaling. Therefore, the row scaling defined in Equation (95) effectively partitions the set of all non-negative matrices into equivalence classes. Lastly, since Definition 4.1.1 allows any positive scaling of the rows of $P$, the random walk set of any Markov chain predominantly consists of graphs which are neither undirected nor balanced. In fact, only certain types of Markov chains have random walk sets that contain undirected or balanced graphs, as explained by Theorems 4.1.2 and 4.1.3, which relate balanced graphs in $\text{RW}(\mathcal{X})$ to recurrent chains and undirected graphs in $\text{RW}(\mathcal{X})$ to reversible chains, respectively.
Figure 14: (a, b) The weight matrices of two graphs $G_1$ and $G_2$ that have edges connecting the same pairs of states but with different weights, (c) the transition matrix of the Markov chain produced via a random walk on either of these graphs, (d, e) the scaling matrices that transform $P$ into $W_1$ and $W_2$, respectively.
As an illustration, in Figure 15(a-c) we show the random walk sets for the three Markov chains that were studied in Figure 6(a, d, j) of Section 2.6. In each case, the Markov chain is located in the center and is colored in black. Other graphs in the random walk sets are colored based on the graph type (undirected=green, balanced directed=blue, unbalanced=red), and are depicted as miniature graphs without edge weights (except for one representative example of each type). The Markov chains of Figure 15(a, b) are both recurrent, and we indeed see that their random walk sets contain balanced graphs (Theorem 4.1.2). In both figures, notice that more unbalanced graphs are drawn to reflect the fact that they are more numerous than the balanced cases. Furthermore, the chain in Figure 15(a) is reversible, whereas the chain in Figure 15(b) is non-reversible, meaning that in the former case all balanced graphs in $\text{RW}(\mathcal{X})$ are undirected, and in the latter case they are directed (Theorem 4.1.3). The Markov chain in Figure 15(c) is non-recurrent, and in agreement with Theorem 4.1.2 its random walk set contains only unbalanced graphs. Lastly, note that we do not include a corresponding diagram for the semi-reversible example in Figure 6(g). However, since such chains can be made reversible by removing non-recurrent states, a simple extension of Theorem 4.1.3 ensures that for such chains there exist graphs in $\text{RW}(\mathcal{X})$ for which the edges between recurrent states are undirected.
To summarize these observations, in Figure 15(d,e) we show two Venn diagrams that illustrate the relationships between the different types of Markov chains and graphs considered. In Figure 15(d), graphs are shown as an outer circle, with balanced graphs as a particular case, and undirected graphs as a special type of balanced graph, i.e. graphs ⊃ balanced graphs ⊃ undirected graphs (colored red, blue and green, respectively). In Figure 15(e), Markov chains are organized in a similar way, i.e. Markov chains ⊃ recurrent chains ⊃ reversible chains. Moreover, the colors in the Markov chain diagram are based on the types of graphs allowed in RW(X ) for each type of chain. For example, reversible chains are shaded in red and green since they correspond to random walks on either undirected or unbalanced graphs.

Figure 15: (a-c) The random walk sets of the three Markov chains in Figure 6(a, d, j), including undirected (green), balanced directed (blue), and unbalanced (red) graphs, as well as the chains themselves (black) which are reversible (a), recurrent and non-reversible (b), and non-recurrent (c). Venn diagrams in (d, e) illustrate the relationships between the three types of graph (graphs, balanced, undirected), and the three types of Markov chain (Markov chains, recurrent, reversible), respectively. In (d) the colors indicate the type of graph, whereas in (e) they indicate the types of graphs that exist in the random walk sets of each type of Markov chain. For both Venn diagrams, the color scheme is in accordance with (a-c).
The balanced graphs belonging to a recurrent Markov chain's random walk set are in some sense special, since the vertex degrees have a simple relationship to the stationary probabilities:
Proposition 4.1.4. Let $\mathcal{X}$ be a recurrent Markov chain and $G$ one of the balanced graphs in $\text{RW}(\mathcal{X})$. Then, the degrees of this graph are related to one of the stationary distributions π > 0 of $\mathcal{X}$, via:

$$\pi_i = \frac{d_i}{z}$$

where $z = \sum_i d_i$ is the volume of $G$ (proof: see Appendix A).
For example, evaluating $d_i / z$ for $v_1$ in the balanced graphs in Figure 15(a) and Figure 15(b) yields $\frac{20}{48} \approx 0.417$ and $\frac{290}{714} \approx 0.406$, respectively, which are indeed the stationary probabilities of state $s_1$ for each of the corresponding Markov chains (see Figure 6(b, e)). This highlights something useful about balanced graphs, which is that the weight matrix $W$ allows direct calculation of one of the stationary distributions without having to simulate the random walk. Conversely, for an unbalanced graph there is no universally valid expression relating stationary probabilities of the random walk to the vertex degrees.
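A short numerical check of Proposition 4.1.4 is given below. Both weight matrices are hypothetical (they are not the graphs of Figure 15): `W_bal` is a balanced directed graph, for which the normalized degrees are a stationary distribution of the random walk, while `W_unb` is an unbalanced graph, for which the normalized out-degrees fail to be stationary.

```python
import numpy as np

def random_walk(W):
    """Row-normalize W (assumes positive out-degrees)."""
    return W / W.sum(axis=1, keepdims=True)

# Hypothetical balanced directed graph: row sums equal column sums (3, 3, 2).
W_bal = np.array([[0.0, 3.0, 0.0],
                  [1.0, 0.0, 2.0],
                  [2.0, 0.0, 0.0]])
d = W_bal.sum(axis=1)
z = d.sum()                 # volume of the graph
pi = d / z                  # candidate stationary distribution
P_bal = random_walk(W_bal)

# Hypothetical unbalanced graph with the same pattern of edges.
W_unb = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 2.0],
                  [2.0, 0.0, 0.0]])
q = W_unb.sum(axis=1) / W_unb.sum()  # normalized out-degrees
P_unb = random_walk(W_unb)

print(np.allclose(pi @ P_bal, pi))  # stationary for the balanced graph
print(np.allclose(q @ P_unb, q))    # not stationary for the unbalanced graph
```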
Perhaps the most important conclusion to draw from this section is that one can always describe a reversible Markov chain as a random walk on some undirected graph. Since undirected graphs have symmetric weight matrices, and since matrices of this type have received a large amount of study in mathematics, this interpretation provides a number of tools for describing reversible chains in more detail. This is the focus of the next section. Directed graphs, on the other hand, are as of yet far less understood, meaning that the same level of description for non-reversible chains is not possible. However, in Section 4.3 we explore some cases where concepts can be extended to the directed/non-reversible case. Since balanced directed graphs are less common objects in graph theory, we do not dedicate a section to them and instead consider them briefly as a special case in Section 4.3.

Relationship to symmetric matrices
In this section, we explore in more depth the connections between real symmetric matrices and the transition matrices of reversible chains. We start by providing the following two results for real symmetric matrices:
Theorem 4.2.1. A real matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if and only if:

$$\langle x, Ay \rangle = \langle Ax, y \rangle \qquad \forall x, y \in \mathbb{R}^n$$

where $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean inner product.
Theorem 4.2.2. A real matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if and only if it is orthogonally diagonalizable, meaning that there exists an orthogonal matrix $Y$ for which:

$$A = Y \Delta Y^T \qquad (98)$$

where $\Delta$ is a diagonal matrix.
A couple of details can be pointed out about Theorem 4.2.2. Firstly, by comparing Equation (98) to our analysis of Section 2.3, we see that the columns of $Y$ are a basis of right eigenvectors of $A$ and the rows of $Y^T$ are the corresponding dual basis of left eigenvectors. Secondly, both of these bases are orthonormal since $Y$ is orthogonal. Thirdly, the matrix $\Delta$ contains the eigenvalues of $A$, which are guaranteed to be real since all other matrices in Equations (98) and (99) are real. Lastly, it is worth emphasizing the existential condition of Theorem 4.2.2, since not all choices of eigenvectors of a symmetric matrix obey this result. On one hand, Equation (98) requires that the sets of left and right eigenvectors are chosen together to be a biorthogonal system. On the other hand, even if we assume this property, when a symmetric matrix has repeated eigenvalues the corresponding eigenvectors can be non-orthogonal. However, even in this case it is always possible to apply the Gram-Schmidt procedure to make them orthonormal, and provided that we find such a basis we can relate the left and right eigenvectors in the following way:
Proposition 4.2.3. Let $A$ be a real symmetric matrix. Then, if $Y$ is an orthonormal basis of right eigenvectors of $A$ and $Y^T$ is the corresponding dual basis, for each eigenvalue $\lambda_\omega$ the left and right eigenvectors are equal, i.e. $l_\omega = r_\omega$.
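For a concrete symmetric matrix, these properties can be verified with a standard eigensolver. The sketch below uses NumPy's `numpy.linalg.eigh`, which returns real eigenvalues and an orthonormal set of eigenvectors for real symmetric input; the matrix `A` and the random test vectors are hypothetical.

```python
import numpy as np

# Hypothetical real symmetric matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh returns real eigenvalues and orthonormal eigenvectors (columns of Y).
eigenvalues, Y = np.linalg.eigh(A)
Delta = np.diag(eigenvalues)

# Orthogonal diagonalization: A = Y Delta Y^T with Y^T Y = I.
reconstructed = Y @ Delta @ Y.T
gram = Y.T @ Y

# Symmetry in terms of the Euclidean inner product: <x, A y> = <A x, y>.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)
lhs, rhs = x @ (A @ y), (A @ x) @ y

print(np.allclose(reconstructed, A), np.allclose(gram, np.eye(3)))
print(np.isclose(lhs, rhs))
```

Since the rows of $Y^T$ are simply the columns of $Y$, the left and right eigenvectors coincide here, in line with Proposition 4.2.3.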
Clearly, for any Markov chain, reversible or otherwise, the transition matrix is itself rarely symmetric. Therefore, the results above do not directly apply to $P$. However, reversible Markov chains can be related in a number of ways to symmetric matrices, and by virtue of these relations they satisfy variants of the results above [38]. For example, from Section 2.6 we know that any flow matrix of a reversible Markov chain is symmetric. This fact allows us to establish the following analogue of Theorem 4.2.1:
Theorem 4.2.4. Let $\mathcal{X}$ be a Markov chain with transition matrix $P$. Then $\mathcal{X}$ is reversible if and only if there exists a stationary distribution π > 0, such that:

$$\langle x, Py \rangle_\Pi = \langle Px, y \rangle_\Pi \qquad \forall x, y \in \mathbb{R}^N$$

where:

$$\langle x, y \rangle_\Pi := \sum_{i} x_i y_i \pi_i$$

defines an inner product normalized by the stationary probabilities $\pi_i$ (proof: see Appendix A).
Furthermore, Theorem 4.1.3 tells us that there exist certain row scalings which transform the transition matrix $P$ of a reversible Markov chain into symmetric matrices. In fact, this is not the only scaling operation that transforms $P$ into a matrix of this type. The next result shows that such a matrix is also formed when we scale both the rows and columns of $P$ as follows:
Theorem 4.2.5. Let $\mathcal{X}$ be a Markov chain with transition matrix $P$. Then $\mathcal{X}$ is reversible if and only if for any stationary distribution π > 0:

$$K := \Pi^{\frac{1}{2}} P \, \Pi^{-\frac{1}{2}} \qquad (102)$$

is a symmetric matrix. Furthermore, regardless of which stationary distribution is used, the matrix $K$ is unique (proof: see Appendix A).
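The scaling of Theorem 4.2.5 can be sketched as follows, taking Π to be the diagonal matrix of stationary probabilities. The helper name `symmetrize` and both example graphs are hypothetical. The first chain is a random walk on an undirected graph and is therefore reversible, so the scaled matrix is symmetric; the second is a random walk on a balanced directed graph and is not reversible, so the scaled matrix is not symmetric. In both cases the stationary distribution is obtained from the degrees via Proposition 4.1.4.

```python
import numpy as np

def symmetrize(P, pi):
    """Return K = Pi^(1/2) P Pi^(-1/2) for Pi = diag(pi), pi > 0."""
    s = np.sqrt(pi)
    return (s[:, None] * P) / s[None, :]

def walk_and_pi(W):
    """Random walk and degree-based stationary distribution of a balanced graph."""
    d = W.sum(axis=1)
    return W / d[:, None], d / d.sum()

# Hypothetical undirected graph (symmetric W): reversible random walk.
W_und = np.array([[0.0, 2.0, 1.0],
                  [2.0, 0.0, 3.0],
                  [1.0, 3.0, 0.0]])
P_rev, pi_rev = walk_and_pi(W_und)
K_rev = symmetrize(P_rev, pi_rev)

# Hypothetical balanced directed graph: recurrent but non-reversible walk.
W_dir = np.array([[0.0, 3.0, 0.0],
                  [1.0, 0.0, 2.0],
                  [2.0, 0.0, 0.0]])
P_non, pi_non = walk_and_pi(W_dir)
K_non = symmetrize(P_non, pi_non)

print(np.allclose(K_rev, K_rev.T))  # symmetric
print(np.allclose(K_non, K_non.T))  # not symmetric
```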
The first thing to note about Theorem 4.2.5 is that the output of the scaling operation is somewhat different to that in Theorem 4.1.3, since in the former case there is only a single symmetric matrix, whereas in the latter there are an infinite number. Additionally, Equation (102) implies that $P$ and $K$ are similar matrices (Section 2.3). Since similar matrices have the same eigenvalues and related sets of eigenvectors, we can use this to establish the following generalizations of Theorem 4.2.2 and Proposition 4.2.3:
Theorem 4.2.6. Let $\mathcal{X}$ be a Markov chain with transition matrix $P$. Then $\mathcal{X}$ is reversible if and only if $P$ is diagonalizable with real eigenvalues, and there exists a basis of right eigenvectors that are orthogonal w.r.t. $\langle \cdot, \cdot \rangle_\Pi$ and a corresponding dual basis of left eigenvectors that are orthogonal w.r.t. $\langle \cdot, \cdot \rangle_{\Pi^{-1}}$, where π > 0 is a stationary distribution of the chain (proof: see Appendix A).
Proposition 4.2.7. Let $\mathcal{X}$ be a reversible Markov chain with transition matrix $P$, and π > 0 one of its stationary distributions. Furthermore, let $Y_R$ be a set of right eigenvectors of $P$ and $Y_L$ its dual basis, both of which obey the orthogonality relations of Theorem 4.2.6. Then, if $r_\omega$ and $l_\omega$ are right and left eigenvectors of the same eigenvalue $\lambda_\omega$, they are related via $l_\omega = \Pi r_\omega$ (proof: see Appendix A).
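One way to obtain eigenvectors with these properties in practice is to diagonalize the symmetric matrix K and map its orthonormal eigenvectors back to P. The sketch below does this for a hypothetical reversible chain (a random walk on an undirected graph); the variable names and the route via `numpy.linalg.eigh` are our own choices rather than a procedure prescribed by the results above.

```python
import numpy as np

# Hypothetical reversible chain: random walk on an undirected graph.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
d = W.sum(axis=1)
P = W / d[:, None]
pi = d / d.sum()
Pi = np.diag(pi)

# Symmetric matrix K = Pi^(1/2) P Pi^(-1/2) and its orthonormal eigenvectors U.
s = np.sqrt(pi)
K = (s[:, None] * P) / s[None, :]
lam, U = np.linalg.eigh(K)

# Map back to P: right eigenvectors R = Pi^(-1/2) U, left eigenvectors L = Pi^(1/2) U.
R = U / s[:, None]
L = U * s[:, None]

print(np.allclose(P @ R, R * lam))                    # P r = lambda r
print(np.allclose(L.T @ P, lam[:, None] * L.T))       # l^T P = lambda l^T
print(np.allclose(L, Pi @ R))                         # l = Pi r
print(np.allclose(R.T @ Pi @ R, np.eye(3)))           # orthonormal w.r.t. <.,.>_Pi
print(np.allclose(L.T @ np.diag(1 / pi) @ L, np.eye(3)))  # w.r.t. <.,.>_{Pi^-1}
```

All eigenvalues in `lam` are real, and the columns of `L` and `R` form a biorthogonal system, i.e. `L.T @ R` is the identity matrix.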
Theorem 4.2.6 is particularly important for practical reasons. Firstly, the diagonalizability of P implies a full set of linearly independent eigenvectors. Linearly independent feature spaces are often desirable from a computational perspective since they (i) reduce the overall redundancy, (ii) can express any function in $\mathbb{R}^N$, and (iii) ensure that certain matrix operations are well-defined. Furthermore, as already explained in Section 2, having a diagonalizable transition matrix means that evolving a Markov chain becomes computationally cheaper. Secondly, since P is a real matrix with real eigenvalues, we can always choose an eigenbasis consisting only of real-valued vectors. This property is useful because real spaces are often more intuitive to deal with than complex spaces. Furthermore, in many applications of Markov chains the underlying vector space is required to be real, either due to the semantic nature of the problem, or because the algorithm being used is not suited to complex spaces.
While a linearly independent set of eigenvectors is useful, having a basis that is pairwise orthogonal w.r.t. the standard Euclidean inner product offers further analytical and numerical benefits. From Theorem 4.2.6, it is clear that this is not the case for transition matrices of reversible Markov chains. However, the matrix K is a normalized symmetric version of P, and from the proof of Theorem 4.2.6 we know that the eigenvalues of these two matrices are the same and their eigenvectors are related simply by a multiplication with $\Pi^{\pm \frac{1}{2}}$. Thus, they contain similar information regarding the relationship between pairs of states in S, and so K can be used as a surrogate for P in situations where orthogonal eigenvectors are required.
When using the matrix K, it is sometimes useful to express it purely in terms related to a graph G. Starting with Equation (102), this can be done by swapping stationary probabilities for vertex degrees:
$$K = D^{\frac{1}{2}} P D^{-\frac{1}{2}} = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}.$$
This expression appears in the next section, where we use K to define a positive semi-definite matrix which has the same eigenvectors.

Collectively, the results of this section demonstrate that transition matrices of reversible chains satisfy similar properties to symmetric matrices, but subject to a different type of normalization. This alternative normalization is important for the next section, and in order to make our analyses more concise we end this section by defining the following two coordinate transformations:

Definition 4.2.8 (Left and right coordinate transformations). Let G be an undirected graph with N vertices and degree matrix D. If $x \in \mathbb{R}^N$ is a vector defined over these vertices, we define the following related vectors:
$$x_L := D^{\frac{1}{2}} x, \qquad x_R := D^{-\frac{1}{2}} x,$$
which we call the left and right coordinate transformations of x, respectively.

Normalized Graph Laplacian
A symmetric matrix $A \in \mathbb{R}^{N \times N}$ is called positive semi-definite if $x^T A x \geq 0$ for all $x \in \mathbb{R}^N$, or equivalently if all its eigenvalues are non-negative. Such matrices have a number of numerical properties that make them useful for solving optimization problems, and because of this they often appear in computational applications. The matrix K is in general not positive semi-definite, since it has the same eigenvalues as P (which can be negative). However, it is straightforward to define a variant of K which does have this property, while also having the same eigenvectors as K:

Definition 4.2.9. For an undirected graph G with weight matrix W, the normalized graph Laplacian is given by:
\begin{align}
\mathcal{L} &:= I - K \tag{108} \\
&= I - D^{\frac{1}{2}} P D^{-\frac{1}{2}} \tag{109} \\
&= I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}, \tag{110}
\end{align}
with elements equal to:
$$\mathcal{L}_{ij} = \begin{cases} 1 - \dfrac{W_{ii}}{d_i} & \text{if } i = j, \\ -\dfrac{W_{ij}}{\sqrt{d_i d_j}} & \text{if } i \neq j, \end{cases}$$
and for which the following properties hold:
• $\mathcal{L}$ is symmetric and positive semi-definite, with eigenvalues in the interval [0, 2].
• λ = 0 is guaranteed to be an eigenvalue and its multiplicity is equal to the number of connected components in G.
• If C is a connected component of G, then there is an eigenvector $y_C$ with eigenvalue λ = 0 whose entries are $d_i^{\frac{1}{2}}$ for $v_i \in C$ and 0 otherwise, or equivalently $y_C = D^{\frac{1}{2}} \mathbf{1}_C = \mathbf{1}_{C,L}$ where $\mathbf{1}_C$ is an indicator vector for vertices in C.
• Therefore, for graphs consisting of a single connected component, $y = D^{\frac{1}{2}} \mathbf{1} = \mathbf{1}_L$ is, up to scaling, the only eigenvector with eigenvalue λ = 0.

There exist other types of graph Laplacians for undirected graphs, such as the unnormalized Laplacian $L := D - W$, and the random walk Laplacian $L_{RW} := I - D^{-1} W$. Many useful properties of a graph G can be obtained from graph Laplacians, and they form the basis of spectral graph theory [51]. Due to its close relationship to P, in this tutorial we predominantly focus on the normalized Laplacian $\mathcal{L}$. In the case of the unnormalized Laplacian $L$, there is a weaker connection to P which we only discuss briefly. Furthermore, to avoid redundancy we do not discuss $L_{RW}$ at all, since it is the same as −P but shifted by the identity. For a concise review of each of these objects, see [55].
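The spectral properties listed in Definition 4.2.9 can be illustrated on a small example. The sketch below assumes a hypothetical unweighted graph with two connected components (a triangle and a single edge), builds the normalized Laplacian as the identity minus the degree-normalized weight matrix, and checks the eigenvalue range, the multiplicity of the zero eigenvalue, and the form of the zero eigenvector for one component.

```python
import numpy as np

# Hypothetical graph: a triangle on vertices {0, 1, 2} and a single edge on {3, 4}.
N = 5
W = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)
inv_sqrt_d = 1.0 / np.sqrt(d)
# Normalized Laplacian: identity minus D^{-1/2} W D^{-1/2}.
Lap = np.eye(N) - inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]

eigenvalues = np.linalg.eigvalsh(Lap)                  # sorted in ascending order
num_zero = int(np.sum(np.abs(eigenvalues) < 1e-10))    # multiplicity of lambda = 0

# Zero eigenvector for the triangle component: sqrt(d_i) on the component, 0 elsewhere.
indicator = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
y_C = np.sqrt(d) * indicator
residual = np.linalg.norm(Lap @ y_C)

print(np.round(eigenvalues, 6), num_zero, residual)
```

In this example the largest eigenvalue reaches the upper bound of 2, which happens because the single-edge component is bipartite.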
As the name suggests, graph Laplacians are analogues of the Laplace operator. In particular, a number of studies have found links between graph Laplacians and a variant of the Laplace operator on manifolds, known as the Laplace-Beltrami operator [56,57]. Broadly speaking, Laplace operators measure how much the value of a function at a point varies from its local average, which is informally related to the notion of mean curvature [58]. Furthermore, the eigenvalues of this operator are non-negative real numbers and are related to how much the corresponding eigenfunctions vary over the domain of the function [59]. All of these properties have analogues in the case of $\mathcal{L}$, which justifies the term given to this object. To see this, we start with the following result [7]:

Theorem 4.2.10. Let G be an undirected graph with weight matrix W and normalized Laplacian $\mathcal{L}$. Then for any vector x defining values on the vertices of G:
$$x^T \mathcal{L} x = \frac{1}{2} \sum_{i,j=1}^{N} W_{ij} \left( \frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}} \right)^2, \tag{112}$$
which is guaranteed to produce a real non-negative number since $\mathcal{L}$ is positive semi-definite (proof: see Appendix A).
In Equation (112), the quadratic term measures the difference between the entries of a vector that is related to x by normalization with $D^{-\frac{1}{2}}$, i.e. the vector $x_R$. Since this is minimized when the entries of this vector are similar for pairs of vertices with large edge weight $W_{ij}$, it can be interpreted to describe the smoothness of $x_R$ with respect to the connectivity of G. Furthermore, since this measure describes the smoothness of $x_R$ as opposed to that of x, it is useful for dealing with graphs that have a very heterogeneous degree structure.
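The identity of Theorem 4.2.10 is easy to verify numerically. The following sketch, which assumes a hypothetical three-vertex weighted graph and a randomly drawn vector `x`, compares the quadratic form of the normalized Laplacian with the weighted sum of squared differences of the degree-normalized entries.

```python
import numpy as np

# Hypothetical symmetric weight matrix with one self loop (illustrative only).
W = np.array([[1.0, 2.0, 0.0],
              [2.0, 0.0, 3.0],
              [0.0, 3.0, 0.0]])
d = W.sum(axis=1)
Lap = np.eye(3) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # arbitrary values on the vertices
x_R = x / np.sqrt(d)            # right coordinate transformation of x

quadratic_form = x @ Lap @ x
diffs = x_R[:, None] - x_R[None, :]          # pairwise differences of x_R
edge_sum = 0.5 * np.sum(W * diffs ** 2)      # right-hand side of the identity

print(quadratic_form, edge_sum)
```

The self loop contributes nothing to the sum because the corresponding difference is zero, which is consistent with the diagonal entries of the normalized Laplacian.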
If we evaluate the left-hand side of Theorem 4.2.10 using a normalized eigenvector $y_\omega$ of $\mathcal{L}$, we get:
$$y_\omega^T \mathcal{L} y_\omega = \lambda_\omega y_\omega^T y_\omega = \lambda_\omega,$$
meaning that $\lambda_\omega$ describes precisely this notion of smoothness for $y_\omega$. Now assume that we have an orthonormal set of eigenvectors of $\mathcal{L}$. From Definition 4.2.9, we know that the corresponding eigenvalues are real numbers in the interval [0, 2], and that there is guaranteed to be at least one eigenvalue λ = 0. Thus, they can be ordered as follows:
$$0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_N \leq 2. \tag{115}$$
Putting these findings together, we can say that given an orthonormal set of eigenvectors of $\mathcal{L}$, there is a natural way to order them based on their smoothness across G, according to Equation (112). For example, taking an eigenvector with λ = 0 (Definition 4.2.9) and plugging it into Equation (112) indeed yields zero. These are therefore the eigenvectors of $\mathcal{L}$ that have the lowest smoothness score in Equation (112). Similarly, the eigenvectors of $\mathcal{L}$ with second-smallest eigenvalue are those that, subject to orthonormality, score the second lowest in Equation (112). As indicated by Equation (115), we can apply this ordering to all eigenvalues of $\mathcal{L}$, meaning that the resulting set of eigenvectors gets progressively less smooth, or equivalently more oscillatory, as $\lambda_\omega$ increases. Since this basis is orthonormal, it can express any signal $x \in \mathbb{R}^N$ on G, with the resulting components describing the projection of x onto each of these oscillatory modes. Due to its correspondence with Fourier analysis, this decomposition has been referred to as the Graph Fourier Transform of x [60]. Moreover, a useful result in spectral graph theory is that a basis of eigenvectors corresponding to the k smallest eigenvalues of $\mathcal{L}$ is one which, subject to orthonormality, has the lowest possible total score in Equation (112). 
In other words, if $Y \in \mathbb{R}^{N \times k}$ represents such a basis of eigenvectors, it solves the following optimization problem [3,4,55]:
$$\min_{Y \in \mathbb{R}^{N \times k}} \operatorname{Tr}\left( Y^T \mathcal{L} Y \right) \tag{116}$$
$$\text{subject to } Y^T Y = I. \tag{117}$$
This property underlies several Laplacian-based methods in unsupervised learning [17,19], where typically the eigenvectors of $\mathcal{L}$ with the smallest eigenvalues are those that capture the most important information relating to the learning objective, and so discarding eigenvectors with higher eigenvalues can provide a description of a data set that has lower dimensionality, while also being more interpretable.

These observations can be related to the random walk on G. Since $\mathcal{L}$ shares an eigenbasis with K, and the eigenbasis of K is related to that of P by Theorem 4.2.5, we can relate the eigenvectors of $\mathcal{L}$ to those of P as follows:

Proposition 4.2.11. If $y_\omega$ is an eigenvector of $\mathcal{L}$ with eigenvalue $\lambda_\omega$, then $l_\omega = D^{\frac{1}{2}} y_\omega = y_{\omega,L}$ and $r_\omega = D^{-\frac{1}{2}} y_\omega = y_{\omega,R}$ are left and right eigenvectors of P, respectively, with eigenvalue $1 - \lambda_\omega$.
Because of this, the eigenvectors of P have a similar form to those of L, but subject to the coordinate transformations of Definition 4.2.8. Therefore, they can also be interpreted using the notion of smoothness, with λ = 1 in some sense being the smoothest case.
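The variational characterization of the smoothest eigenvectors, i.e. that the eigenvectors with the k smallest eigenvalues minimize the trace of the quadratic form over orthonormal bases, can be explored with a small experiment. The sketch below assumes a hypothetical random symmetric weight matrix and compares the score of the k smoothest eigenvectors with that of an arbitrary orthonormal set of k vectors; the seed and sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 6, 2

# Hypothetical undirected graph with random positive weights (illustrative only).
A = rng.random((N, N))
W = A + A.T
d = W.sum(axis=1)
Lap = np.eye(N) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian

lam, Y = np.linalg.eigh(Lap)     # eigenvalues ascending, i.e. ordered by smoothness
Y_k = Y[:, :k]                   # the k smoothest eigenvectors
optimal_score = np.trace(Y_k.T @ Lap @ Y_k)

# An arbitrary orthonormal set of k vectors, obtained from a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(N, k)))
random_score = np.trace(Q.T @ Lap @ Q)

print(optimal_score, lam[:k].sum(), random_score)
```

The optimal score equals the sum of the k smallest eigenvalues, and any other orthonormal set scores at least as high.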
To illustrate the relationship between the eigenvectors of $\mathcal{L}$ and P, we consider a simple Markov chain consisting of a linear arrangement of 100 states, where only nearest neighbor transitions are allowed (Figure 16(a)). Transitions to the left and right occur with probability 0.48 and 0.52, respectively, and the self loops at $s_1$ and $s_{100}$ mean that staying fixed is allowed in these states. Take a moment to verify that this chain is both ergodic and reversible (hint: in the latter case, try Theorem 2.6.5). Once we know the stationary probabilities $\pi_i$ of this chain, we can calculate the corresponding matrices K and $\mathcal{L}$ by using Equation (102) and then Equation (108). In Figure 16(b), we plot the left and right eigenvectors of P with λ = 1, which are the stationary distribution π and $\eta = (1, 1, ..., 1)^T$, respectively. The stationary probabilities are larger for states near the right end of the chain by virtue of the tendency for rightward transitions in this chain. Additionally, we plot the corresponding eigenvector of $\mathcal{L}$ with λ = 0, which is $y = (\pi_1^{\frac{1}{2}}, \pi_2^{\frac{1}{2}}, ..., \pi_{100}^{\frac{1}{2}})^T$. This vector is in some sense an intermediary between π and η since, in agreement with Proposition 4.2.11, $\pi = \Pi^{\frac{1}{2}} y = D^{\frac{1}{2}} y = y_L$ and $\eta = \Pi^{-\frac{1}{2}} y = D^{-\frac{1}{2}} y = y_R$ (as indicated by the black arrows in Figure 16(b)). In Figure 16(c), we show four more eigenvectors of $\mathcal{L}$ having eigenvalues closest to 0. In agreement with our foregoing discussion, the eigenvectors get less smooth as λ increases, becoming more oscillatory and resembling trigonometric functions over the state space. In Figure 16(d, e), we do the same for the left and right eigenvectors of P, respectively. The smoothness of these eigenvectors also depends on the size of λ; however, this time we are interested in those with eigenvalues closest to 1, and the smoothness decreases as λ decreases. Furthermore, in comparison to Figure 16(c) these eigenvectors have an additional weighting effect across the state space. 
The left eigenvectors have larger amplitudes for states on the right-hand side, which is intuitive since these states have higher stationary probabilities $\pi_i$, and the eigenvectors can be obtained from those of $\mathcal{L}$ by the left coordinate transformation of Definition 4.2.8. For the right eigenvectors, the weighting is the opposite, with states on the left-hand side having larger amplitudes. This is explained by an equivalent argument, except that this time we apply the right coordinate transformation to the eigenvectors of $\mathcal{L}$. It should be noted that for the purpose of visualization, all eigenvectors in Figure 16(c-e) are normalized to have Euclidean norm 1, even though for the eigenvectors of P this is not the natural normalization (Theorem 4.2.6).

The example chain in Figure 16(a) is somewhat artificial since the transition probabilities are homogeneous, which is why the stationary distribution in Figure 16(b) appears to be so smooth visually. This in turn means that the underlying graph G has a degree structure that varies somewhat smoothly from left to right, which is partly responsible for the other eigenfunctions also being so smooth. To see how this generalizes as we make the transition probabilities less homogeneous, we consider a modified chain which we show in Figure 17(a). In this model, left transitions from a state $s_i$ happen with probability $0.52 + \xi_i$ and right transitions with probability $0.48 - \xi_i$, where $\xi_i$ is a number sampled from a uniform distribution on the interval [−0.1, 0.1]. Therefore, this chain is like the previous example, but with a random perturbation at each state. In Figure 17(b-e), we show the corresponding versions of Figure 16(b-e). As one can see, the eigenvectors of $\mathcal{L}$ as well as the left eigenvectors of P have similar shapes to the previous case, but they appear more rugged due to the inhomogeneity of the transition probabilities. 
Given that the perturbations made to the previous model are small in size, this is an illustration of the fact that these sets of eigenvectors are very sensitive to local information. Rather interestingly, the same is not true for the right eigenvectors of P: in Figure 17(b) the right eigenvector with λ = 1 is the same as in the previous example, and in Figure 17(e) the other right eigenvectors are partially distorted but still appear somewhat smooth visually. This can be understood in the following way. Since the right eigenvectors of P are the right transformations of the eigenvectors of $\mathcal{L}$, we can reformulate the optimization problem in Equations (116) and (117) in terms of the basis $Y_R = D^{-\frac{1}{2}} Y$:
$$\min_{Y_R \in \mathbb{R}^{N \times k}} \operatorname{Tr}\left( Y_R^T L \, Y_R \right) \quad \text{subject to } Y_R^T D \, Y_R = I,$$
where we have used the fact that $D^{\frac{1}{2}} \mathcal{L} D^{\frac{1}{2}} = D - W = L$. Therefore, for right eigenvectors of P, the relevant smoothness measure is related to the unnormalized Laplacian $L$ as opposed to the normalized Laplacian $\mathcal{L}$. Furthermore, writing Equation (112) in terms of $L$ and $x_R$ gives:
$$x_R^T L \, x_R = \frac{1}{2} \sum_{i,j=1}^{N} W_{ij} \left( x_{R,i} - x_{R,j} \right)^2,$$
which we can interpret to mean that the vectors which optimize this measure, i.e. the right eigenvectors of P with eigenvalue close to 1, have entries that are close for vertices with large $W_{ij}$. In the context of the example in Figure 17, this means that such eigenvectors have similar values for neighboring vertices, and a quick check reveals that this is indeed the case in Figure 17(e). Conversely, for the eigenvectors in Figure 17(c,d) the values on average vary more between neighboring vertices.

Lastly, we note that the ordering of eigenvalues used in this section is somewhat different to that in Section 2.3. In the current section, the eigenvalues of $\mathcal{L}$ are ordered from 0 up to 2, which corresponds to the eigenvalues of P being ordered from 1 to −1. To reflect our interpretation, we call this ordering by smoothness. In Section 2, however, the eigenvalues of a transition matrix are ordered by their absolute value, which describes how long the contribution from the corresponding eigenvector persists as the chain evolves. We therefore call this choice ordering by persistence. 
The suitability of either of these types of ordering depends on the problem domain in which a Markov chain is being used. For example, in the case of spectral clustering they correspond to distinct objectives, as demonstrated in [12].
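The linear chain example above can be reproduced numerically. The sketch below is a minimal version under stated assumptions: the self loops at the two end states are assumed to carry the probability of the blocked transition (0.48 at the left end and 0.52 at the right end), the stationary distribution is obtained from detailed balance, and the flow graph is used as the underlying graph so that the degree matrix coincides with Π. It then checks that the λ = 0 eigenvector of the normalized Laplacian is the elementwise square root of π, and that the coordinate transformations yield eigenvectors of P with eigenvalue 1 − λ.

```python
import numpy as np

N, p_left, p_right = 100, 0.48, 0.52

# Nearest-neighbor chain; the self loops at the ends are an assumption of this sketch.
P = np.zeros((N, N))
for i in range(N):
    if i > 0:
        P[i, i - 1] = p_left
    else:
        P[i, i] = p_left
    if i < N - 1:
        P[i, i + 1] = p_right
    else:
        P[i, i] = p_right

# Detailed balance gives pi_{i+1} / pi_i = p_right / p_left.
pi = (p_right / p_left) ** np.arange(N)
pi /= pi.sum()
flow = pi[:, None] * P
is_stationary = np.allclose(pi @ P, pi)
is_reversible = np.allclose(flow, flow.T)

s = np.sqrt(pi)
K = s[:, None] * P / s[None, :]
K = 0.5 * (K + K.T)                       # remove rounding asymmetry
lam, Y = np.linalg.eigh(np.eye(N) - K)    # spectrum of the normalized Laplacian

y0 = Y[:, 0] * np.sign(Y[:, 0].sum())     # fix the sign of the lambda = 0 eigenvector
y0_is_sqrt_pi = np.allclose(y0, s)
pi_recovered = np.allclose(s * y0, pi)        # left transformation gives pi
eta_recovered = np.allclose(y0 / s, 1.0)      # right transformation gives the ones vector

R = Y / s[:, None]                        # right eigenvectors of P
L_vecs = Y * s[:, None]                   # left eigenvectors of P
right_ok = np.allclose(P @ R, R * (1.0 - lam))
left_ok = np.allclose(L_vecs.T @ P, (1.0 - lam)[:, None] * L_vecs.T)
print(is_stationary, is_reversible, y0_is_sqrt_pi, right_ok, left_ok)
```

Plotting the first few columns of `Y`, `L_vecs` and `R` against the state index gives curves of the kind described for Figure 16(c-e), up to sign and normalization.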
This concludes our treatment of random walks on undirected graphs. As a summary of the results given, in Table 2 we compare the mathematical properties and relationships of L, K and P . The material presented in this section is particularly important for applications in machine learning and data mining in which the data set can be formulated as a graph. In particular, it underlies work that has been done on problems such as spectral clustering [8,9,10,11,12], manifold learning/graph embedding [15,16], graph-based classification [20,21,22], and value function approximation in reinforcement learning [24,25,29,27,28]. In the next section we consider how, if at all, the material presented in this section generalizes to directed graphs.

Random walks on directed graphs
Broadening our consideration to directed graphs is necessary if we want to describe non-reversible Markov chains. However, since many of the guarantees established in Section 4.2 do not hold for directed graphs, this case is a lot harder to treat analytically. In Section 4.3.1 we explore the main challenges that occur when applying spectral graph theory to the transition matrices of non-reversible Markov chains, and in Section 4.3.2 we describe methods for circumventing these issues. Finally, in Section 4.3.3 we define a generalization of L to directed graphs, and in Section 4.3.4 we present a method for enforcing ergodicity on random walks of directed graphs.
                      P                     K                 $\mathcal{L}$
Eigenvectors          lin. indep.           orthogonal        orthogonal
Left eigenvectors     $y_{\omega,L}$        $y_\omega$        $y_\omega$
Right eigenvectors    $y_{\omega,R}$        $y_\omega$        $y_\omega$

Table 2: A summary of the properties established for the three matrices associated to random walks on undirected graphs: P, K, and $\mathcal{L}$.

Key difficulties
Because many of the guarantees established in Section 4.2 do not hold for directed graphs, random walks on such graphs are a lot harder to treat analytically. Perhaps most importantly, the transition matrices of non-reversible chains are neither guaranteed to be diagonalizable, nor to have real eigenvalues [61], which can be observed even for simple cases such as those shown in Figure 18(a-c). There are nonetheless some cases for which both properties hold [61] (as shown by the example in Figure 18(d)), but it is still not fully understood to what degree, if at all, the transition structure of a non-reversible chain determines either the diagonalizability of P or whether its eigenvalues are real/complex. When P is non-diagonalizable, it does not have a set of N linearly independent eigenvectors, which can cause numerical issues since in this case some matrix operations are computationally more expensive or not well-defined. Moreover, if the eigenvalues of P are complex, then so are the corresponding eigenvectors. As in the real case, we can choose to order these eigenvectors based on persistence, or smoothness. In the former case, the generalization is somewhat straightforward since |λ| still describes how long each eigenvector typically persists. In the latter case, however, the question of how to generalize the concept of smoothness to complex eigenvectors is non-trivial, and is still an actively researched topic in the literature [65,66]. These factors make analyzing the transition matrices of non-reversible Markov chains more challenging than in the reversible case.
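Both difficulties can be seen in very small chains. The sketch below uses two hypothetical transition matrices, which are not necessarily the ones shown in Figure 18: a deterministic three-state cycle, whose eigenvalues are the complex cube roots of unity, and a three-state chain with an absorbing state that has a repeated eigenvalue with only one independent eigenvector and is therefore not diagonalizable.

```python
import numpy as np

# Hypothetical example 1: a deterministic 3-cycle (non-reversible).
P_cycle = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])
eig_cycle = np.linalg.eigvals(P_cycle)
has_complex = bool(np.any(np.abs(eig_cycle.imag) > 1e-12))

# Hypothetical example 2: an upper triangular chain with an absorbing state.
P_def = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])
# The eigenvalues are the diagonal entries, so 0.5 has algebraic multiplicity 2.
algebraic_mult = int(np.sum(np.isclose(np.diag(P_def), 0.5)))
# Geometric multiplicity = dimension of the null space of (P - 0.5 I).
geometric_mult = 3 - np.linalg.matrix_rank(P_def - 0.5 * np.eye(3))
is_defective = geometric_mult < algebraic_mult

print(np.round(eig_cycle, 3), has_complex, algebraic_mult, geometric_mult, is_defective)
```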

Alternative methods
One general technique for treating a non-diagonalizable matrix X is to add a perturbation so that it becomes diagonalizable. This is based on the notion that diagonalizable matrices densely fill the set of all matrices [67], meaning that it is always possible to find some nearby matrix $\tilde{X}$ that is diagonalizable. In one study, a method along these lines was developed for dealing with non-diagonalizable transition matrices [68]. In particular, for a starting transition matrix P, a perturbation matrix E is found such that $\tilde{P} = P + E$ preserves a number of the spectral properties of P and is diagonalizable. However, two limitations of this method are (i) that it has computational complexity $O(N^8)$ for N × N matrices, and (ii) that the resulting transition matrix can still have complex eigenvalues.
Other lines of work have attempted to circumvent these issues by using alternative matrix decompositions, with a prominent example being the real Schur decomposition [31,34,61,69,70]. This decomposition provides a set of real orthogonal basis vectors, known as Schur vectors, that spans the eigenspaces of X [67]. This basis is not unique and corresponds to some ordering of the eigenvalues of X, with the first k Schur vectors spanning the eigenspaces of the first k eigenvalues in this ordering. Therefore, given some Schur decomposition of X, if $U_k = (u_1, u_2, ..., u_k)$ is the set of first k Schur vectors, then for any linear combination $\tilde{u} = \sum_{\gamma=1}^{k} c_\gamma u_\gamma$ it is guaranteed that $X\tilde{u} \in \mathrm{span}(U_k)$, provided that k does not split a pair of complex conjugate eigenvalues. For this reason, $\mathrm{span}(U_k)$ is said to be an invariant subspace of X [67], and when $k \ll N$ it provides a low dimensional description of the transformation that X represents. The real Schur decomposition is therefore a useful alternative to the eigendecomposition. However, in order that the basis captures the most important information about X, a reordering algorithm is needed to specify which eigenspaces of X it should span, and various methods for this have been developed [71,72,73,74,75]. Furthermore, it is worth emphasizing that in contrast to the eigendecomposition, the real Schur decomposition is guaranteed to exist for any real square matrix X, meaning that it sidesteps the issues of non-diagonalizability and complex feature spaces that can occur with transition matrices of non-reversible Markov chains. In the field of machine learning, the real Schur decomposition of transition matrices has been used as a tool for clustering [69] as well as for building state representations in reinforcement learning [70].

Directed Normalized Graph Laplacian
In Section 4.2.2, the normalized graph Laplacian L was introduced as a way to get a more precise description of the left and right eigenvectors belonging to transition matrices of reversible Markov chains. Generalizing L to directed graphs is challenging since two of its defining features are that it is symmetric and positive semi-definite, neither of which can be satisfied by Equation (110) if W is non-symmetric. However, various definitions for directed graphs exist, and while some loosen the constraint that L should be positive semidefinite [76,77,78,79], others strictly enforce this via a type of symmetrization [80]. We here focus on the latter type, and demonstrate connections that this has to some of the material in Section 2.
Perhaps the simplest method along these lines is to symmetrize the weight matrix W of a directed graph G to get an alternative weight matrix $W_{sym}$, e.g. $W_{sym} = W + W^T$ or $W_{sym} = W^T W$, and then use the regular definition of the normalized Laplacian using this new matrix, i.e. $I - D_{sym}^{-\frac{1}{2}} W_{sym} D_{sym}^{-\frac{1}{2}}$, where $D_{sym}$ is the degree matrix of $W_{sym}$. Since $W_{sym}$ describes an undirected graph, the resulting object can be interpreted in the same way as in Section 4.2.2. However, a major drawback of this approach is that the graphs described by W and $W_{sym}$ can have very different structural properties. For instance, there is no guarantee that the random walks on these two graphs have stationary distributions that bear any resemblance to one another. Indeed, various studies in machine learning have indicated that symmetrizing W leads to a significant erasure of structural information from a directed graph [81,82,13,26].
A more principled approach was given in [80], in which the directed normalized graph Laplacian is defined as:
$$\mathcal{L}_{dir} := I - \frac{1}{2} \left( \Pi^{\frac{1}{2}} P \, \Pi^{-\frac{1}{2}} + \Pi^{-\frac{1}{2}} P^T \Pi^{\frac{1}{2}} \right), \tag{121}$$
where P and Π are defined as usual. In the original paper, restrictions are placed on the underlying graph so that P is ergodic. Among other things, this means that the stationary probabilities are all non-zero, which is needed for $\Pi^{-\frac{1}{2}}$ to be well-defined. In the next section, we present a method for enforcing this property for any directed graph G, but for now we simply assume that π > 0. Equation (121) can be simplified as follows:
$$\mathcal{L}_{dir} = I - \Pi^{\frac{1}{2}} \left( \frac{P + \Pi^{-1} P^T \Pi}{2} \right) \Pi^{-\frac{1}{2}}$$
$$= I - \Pi^{\frac{1}{2}} P_A \, \Pi^{-\frac{1}{2}}, \tag{123}$$
where $P_A$ is the additive reversibilization of the random walk (Definition 2.6.6). By comparing Equation (123) to Equation (109), we see that $\mathcal{L}_{dir}$ is a variant of $\mathcal{L}$ but where P has been exchanged for $P_A$. This corresponds to a symmetrization of the stationary flow between states (Equation (80)). While this transformation still destroys some information about the underlying graph G, it has been observed empirically that this effect is less severe in comparison to methods that symmetrize W itself [81,82,13,26]. For example, note that P and $P_A$ describe random walks that have the same stationary distributions (Definition 2.6.6). The only time that symmetrizing W and using $P_A$ yields the same result is in the special case of a balanced directed graph. In this case, $\Pi = D / \mathrm{vol}(G)$, so that:
$$\mathcal{L}_{dir} = I - D^{-\frac{1}{2}} \left( \frac{W + W^T}{2} \right) D^{-\frac{1}{2}},$$
meaning that $\mathcal{L}_{dir}$ is equivalent to an additive symmetrization of W. However, in most practical applications directed graphs are not balanced, and it is in these cases that using $P_A$ preserves more information about a directed graph than simply symmetrizing W.

The directed normalized Laplacian has been used in various contexts of machine learning as a way to generalize methods that are restricted to undirected graphs. It has been applied to problems such as spectral clustering [13,14], graph embedding [83,84], graph-based classification [23], and value function approximation in reinforcement learning [26]. 
In most of these applications, ergodicity is enforced on the random walk described by P , and in the next section we introduce the standard method for doing this.
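The construction of the directed normalized Laplacian can be sketched in a few lines. The example below assumes a hypothetical three-vertex directed graph that is strongly connected and aperiodic, so that the stationary distribution is unique and strictly positive. It computes the Laplacian once as a symmetrized scaling of P and once through the additive reversibilization, taking the time reversal to be $\Pi^{-1} P^T \Pi$, and checks that both agree and that the result is symmetric and positive semi-definite.

```python
import numpy as np

# Hypothetical directed graph: edges 0->1, 1->2, 2->0, 2->1 (illustrative only).
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [1.0, 1.0, 0.0]])
N = W.shape[0]
P = W / W.sum(axis=1, keepdims=True)

# Stationary distribution: left eigenvector of P with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()
s = np.sqrt(pi)

# First form: identity minus the symmetric part of Pi^{1/2} P Pi^{-1/2}.
A = s[:, None] * P / s[None, :]
L_dir = np.eye(N) - 0.5 * (A + A.T)

# Second form: via the additive reversibilization P_A = (P + P_rev) / 2.
P_rev = (P.T * pi[None, :]) / pi[:, None]     # time reversal Pi^{-1} P^T Pi
P_A = 0.5 * (P + P_rev)
L_alt = np.eye(N) - s[:, None] * P_A / s[None, :]

forms_agree = np.allclose(L_dir, L_alt)
is_symmetric = np.allclose(L_dir, L_dir.T)
min_eig = np.linalg.eigvalsh(L_dir).min()
same_stationary = np.allclose(pi @ P_A, pi)
print(np.round(pi, 3), forms_agree, is_symmetric, min_eig, same_stationary)
```

For larger graphs the stationary distribution would more commonly be computed by power iteration or a sparse eigensolver; the dense eigendecomposition is used here only for brevity.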

Random surfer model
As explained in Section 2.5, ergodic Markov chains have the useful property that they are guaranteed to converge to a unique stationary distribution, and various reasons were given for why this is desirable in a general context. For directed graphs, it is a particularly beneficial property, since without it a random walk can get trapped in a small cluster of states, or even a single absorbing state. Because of this, sometimes ergodicity is enforced for random walks on directed graphs [85,23,14,13,26]. We remind readers that one effect this has is that π > 0, meaning that the directed normalized Laplacian is well-defined.
The typical method used for enforcing ergodicity can be thought of as a variation on the random walk process. Given a directed graph G with weight matrix W, at each time step there are two possible outcomes: either a regular random walk is performed with probability α, or the process teleports randomly to any vertex with probability 1 − α. This is known as a random surfer model or teleporting random walk, and it appeared for the first time in the PageRank algorithm [85]. If P and $P_{tel}$ represent the transition probabilities of the two possible outcomes at each time step, then the overall process is described by the following transition matrix:
$$\tilde{P} := \alpha P + (1 - \alpha) P_{tel}, \tag{128}$$
which is sometimes referred to as a Google matrix [86]. It is worth noting that there exist many variants of the random surfer model that differ in the assumptions they make about $P_{tel}$ [87]. For example, teleporting transitions can either be uniformly random, or biased towards certain vertices through a set of weights. Furthermore, the parameter α ∈ [0, 1] is known as the damping factor and determines how close the process is to a regular random walk. Typically, it is set close to 1, so that the process still accurately reflects the structure of the underlying graph G.
In the case of uniform teleportation, it is straightforward to specify the teleportation probabilities since they are all equal to $\frac{1}{N}$. Therefore, Equation (128) becomes:
$$\tilde{P} = \alpha P + \frac{1 - \alpha}{N} \mathbf{1},$$
where $\mathbf{1} \in \mathbb{R}^{N \times N}$ denotes a matrix of ones. While P represents the random walk on G, we can interpret $\frac{1}{N} \mathbf{1}$ as the transition matrix of a random walk on a graph $G_C$ that has the same number of vertices as G but where each vertex $v_i$ is connected to all others including itself. An example is shown in Figure 19, with (a) showing the starting graph G and (b) showing $G_C$. In (c, d) we show the transition matrices P and $P_{tel}$ of the random walks on G and $G_C$, respectively. Finally, in (e) we show the transition matrix of the overall teleporting random walk with α = 0.85.

Figure 19: (a) a directed graph G, (b) the graph $G_C$, (c-d) the corresponding transition matrices P and $P_{tel}$, respectively, (e) the random surfer transition matrix $\tilde{P}$ for α = 0.85, with transition probabilities accurate to 2 decimal places.
Due to the teleportation term $P_{tel}$, for any α < 1 there is always a non-zero probability to access any other state from a given state $s_i$, or to stay in the same state. By virtue of this, such processes are guaranteed to be both irreducible and aperiodic, and therefore ergodic.
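The effect of teleportation can be illustrated with a chain that is clearly not ergodic on its own. The sketch below assumes a hypothetical three-state directed chain with an absorbing state, uniform teleportation, and a damping factor of 0.85. It builds the teleporting transition matrix, confirms that all of its entries are positive, and approximates the stationary distribution by power iteration; the name `P_surfer` is illustrative only.

```python
import numpy as np

# Hypothetical chain with an absorbing state: 0 -> 1 -> 2 -> 2 (illustrative only).
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
N = P.shape[0]
alpha = 0.85                                     # damping factor

P_tel = np.full((N, N), 1.0 / N)                 # uniform teleportation
P_surfer = alpha * P + (1.0 - alpha) * P_tel     # teleporting random walk

rows_sum_to_one = np.allclose(P_surfer.sum(axis=1), 1.0)
all_positive = bool(np.all(P_surfer > 0))

# Power iteration; the error shrinks roughly by a factor alpha at each step.
pi = np.full(N, 1.0 / N)
for _ in range(500):
    pi = pi @ P_surfer

is_stationary = np.allclose(pi @ P_surfer, pi)
print(np.round(P_surfer, 2))
print(np.round(pi, 4), rows_sum_to_one, all_positive, is_stationary)
```

Without teleportation the only stationary distribution of this chain is concentrated on the absorbing state, whereas the teleporting walk assigns positive probability to every state.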

Summary
This concludes our treatment of random walks. The material presented in this section forms a useful framework that connects Markov chains and graphs. On the one hand, describing a Markov chain as a process taking place on a graph is a useful interpretation since it provides intuition about underlying relationships between states. Furthermore, it allows one to apply the toolkit of spectral graph theory to Markov chains. On the other hand, graphs by themselves represent only static relationships between entities, and performing a random walk is one way to describe a graph in terms that are dynamic/temporal. Moreover, the fact that a transition matrix can easily be raised to a power, i.e. $P^k$, means that a random walk provides information about a graph G at multiple time scales, which is a property that has been exploited in the field of manifold learning [15,16]. As a final note, in the current section many concepts and results from linear algebra are required, for which we recommend [35] as a general resource and [31] as a more specific summary of the application to transition matrices.

Conclusion
The key motivation of this tutorial is to provide a single introductory text on the spectral theory of Markov chains. By bringing together concepts and results from different areas of mathematics, this work is a useful resource for readers hoping to gain a broad, yet concise, overview of this topic. Since we only assume minimal exposure to concepts from linear algebra and probability theory, and since focus is placed on providing intuition rather than rigorous results, the material of this tutorial is accessible to researchers and students in a variety of quantitative disciplines. For those working in fields related to machine learning and data mining, this work is particularly relevant due to the applications discussed at various points. Although the material mostly consists of known results, two novel contributions are the categorization of eigenvalues given in Table 1 and Figure 3, as well as the notion of random walk sets (Definition 4.1.1).
Our presentation involved two different paradigms for interpreting and analyzing Markov chains. In Section 2 we presented a categorization based on the transition structure and asymptotic behavior, and in Section 3 we instead formalized the idea of a Markov chain as a type of graph. In Section 4, we connected these two perspectives by introducing the idea of a random walk, and in doing so provided a number of parallels between some categories of Markov chains and certain types of graphs. In particular, one theme that aligns the two perspectives is the distinction between reversible/non-reversible Markov chains on the one hand, and undirected/directed graphs on the other, where in both cases the former option is easier to treat than the latter. With the additional use of results from linear algebra, this provided us with an in-depth description of the eigenvalues and eigenvectors of transition matrices in the reversible case. Finally, we discussed various attempts that have been made to generalize spectral methods to the non-reversible case.

Acknowledgements
We would like to thank Jonathan Hermon for some useful discussions on Markov chain theory, as well as Josué Tonelli-Cueto for his insight on the Perron-Frobenius theorem and various concepts in linear algebra.

Appendices Appendix A Proofs
A.1 Proposition 2.6.2
Proof. Assume that P is the transition matrix of a reducible recurrent Markov chain with r communicating classes. Therefore, for a suitable indexing of states in S the transition matrix of this chain has a block diagonal form:
$$P = \begin{pmatrix} P_1 & & \\ & \ddots & \\ & & P_r \end{pmatrix},$$
where $P_k$ is a transition matrix composed of the transition probabilities for states in the k-th class. Furthermore, if π > 0 is one of its stationary distributions, it can be written as $\pi = \sum_{k=1}^{r} \alpha_k \pi_k$ where each $\pi_k$ has non-zero entries only in the k-th class and $\alpha_k > 0$ (Proposition 2.4.9). Thus, given the same indexing of states the matrix Π and its inverse $\Pi^{-1}$ have the following forms:
$$\Pi = \begin{pmatrix} \alpha_1 \Pi_1 & & \\ & \ddots & \\ & & \alpha_r \Pi_r \end{pmatrix}, \qquad \Pi^{-1} = \begin{pmatrix} \alpha_1^{-1} \Pi_1^{-1} & & \\ & \ddots & \\ & & \alpha_r^{-1} \Pi_r^{-1} \end{pmatrix}, \tag{131}$$
where $\Pi_k$ is the diagonal block matrix formed from the non-zero entries of $\pi_k$. Hence, because $P_{rev} = \Pi^{-1} P^T \Pi$ is a product of three block diagonal matrices, it too has the same form:
$$P_{rev} = \begin{pmatrix} P_{rev,1} & & \\ & \ddots & \\ & & P_{rev,r} \end{pmatrix},$$
where $P_{rev,k}$ contains the transition probabilities of states in the k-th class for the time reversed Markov chain. Furthermore, we can easily evaluate these block matrices:
$$P_{rev,k} = \left( \alpha_k \Pi_k \right)^{-1} P_k^T \left( \alpha_k \Pi_k \right) = \Pi_k^{-1} P_k^T \Pi_k.$$
Thus, the matrix $P_{rev}$ does not depend on the $\alpha_k$ terms that parameterize the stationary distribution, meaning that the time reversed Markov chain is the same regardless of which distribution is considered.
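
A.2 Proposition 2.6.3
Proof. Assume that $P$ is the transition matrix of a recurrent Markov chain and $\pi > 0$ is one of its stationary distributions. First note that summing over the $j$-th row or column of the flow matrix $F_\pi = \Pi P$ gives the same result:
\begin{align}
\sum_i (F_\pi)_{ji} &= \sum_i \pi_j P_{ji} = \pi_j, \tag{138} \\
\sum_i (F_\pi)_{ij} &= \sum_i \pi_i P_{ij} = \pi_j, \tag{139}
\end{align}
which in vector notation can be written as $\pi^T = \mathbf{1}^T F_\pi = \mathbf{1}^T (F_\pi)^T$, where the symbol $\mathbf{1}$ denotes a vector of ones. Using this we see that:
$$\pi^T P_{\mathrm{rev}} = \pi^T \Pi^{-1} P^T \Pi = \mathbf{1}^T (F_\pi)^T = \pi^T = \pi^T P. \tag{140}$$
Therefore, $\pi$ is a stationary distribution of $X$ if and only if it is a stationary distribution of the time reversed chain.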
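
The row and column sums used in this proof can be checked on a concrete case. The sketch below is a minimal illustration, assuming a hypothetical three-state non-reversible chain of our own choosing with stationary distribution $\pi = (1/5, 2/5, 2/5)^T$. It forms the flow matrix, compares its row and column sums with $\pi$, and confirms that $\pi$ is also stationary for the time reversal $P_{\mathrm{rev}} = \Pi^{-1} P^T \Pi$.

```python
from fractions import Fraction as F

# Hypothetical non-reversible chain and its stationary distribution.
P = [
    [F(0), F(1), F(0)],
    [F(0), F(1, 2), F(1, 2)],
    [F(1, 2), F(0), F(1, 2)],
]
pi = [F(1, 5), F(2, 5), F(2, 5)]
n = len(P)

# Flow matrix F_pi = Pi P, with entries pi_i * P_ij.
flow = [[pi[i] * P[i][j] for j in range(n)] for i in range(n)]

# Sum over the j-th row and over the j-th column of the flow matrix.
row_sums = [sum(flow[j][i] for i in range(n)) for j in range(n)]
col_sums = [sum(flow[i][j] for i in range(n)) for j in range(n)]

# Time reversal: P_rev[i][j] = pi_j * P_ji / pi_i = flow[j][i] / pi[i].
P_rev = [[flow[j][i] / pi[i] for j in range(n)] for i in range(n)]

# Left-multiply pi onto P_rev, i.e. compute pi^T P_rev.
pi_after_reversal = [sum(pi[i] * P_rev[i][j] for i in range(n)) for j in range(n)]

print(row_sums == pi, col_sums == pi)
print(pi_after_reversal == pi)
print(P_rev != P)  # the chain differs from its time reversal
```

The row sums equal $\pi$ because $P$ is stochastic, while the column sums equal $\pi$ because $\pi$ is stationary, which is exactly the distinction between Equations (138) and (139).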

A.3 Theorem 4.1.2
Proof. (⇒) Assume that $P$ is the transition matrix of a recurrent Markov chain. Then, for any stationary distribution $\pi > 0$, the flow matrix $F_\pi = \Pi P$ corresponds to one of the allowed graphs in $\mathrm{RW}(X)$. By the same argument given in the proof of Proposition 2.6.3, the row and column sums of $F_\pi$ are the same, meaning that this matrix describes a balanced graph.
(⇐) We adapt this direction of the proof from [48], where the unweighted case is considered. Assume that $G$ is a balanced graph with weight matrix $W$, and that $X$ is the Markov chain realised by a random walk on this graph. Now consider the distribution $\pi = \frac{1}{z}(d_1, d_2, \dots, d_N)^T$, where the normalization $z = \sum_{j=1}^{N} d_j = \mathrm{vol}(G)$ is needed in order for $\pi$ to sum to 1. We can then easily verify that the equations of global balance hold for this distribution (Equation (19)):
$$\sum_{i=1}^{N} \pi_i P_{ij} = \sum_{i=1}^{N} \frac{d_i}{z} P_{ij} = \sum_{i=1}^{N} \frac{d_i}{z} \frac{W_{ij}}{d_i} = \frac{1}{z} \sum_{i=1}^{N} W_{ij} = \frac{d_j}{z} = \pi_j,$$
and so $\pi^T P = \pi^T$, which means that $\pi$ is a stationary distribution of the chain. Note that the summation in the fourth expression defines the in-degree of vertex $v_j$, but since the graph $G$ is balanced we only have one degree $d_j$ associated to each vertex. Lastly, since isolated vertices are not allowed, each degree must be bigger than zero, meaning that $\pi_i > 0$ for all $s_i \in S$. Hence, there are no transient states, and $X$ is recurrent.
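
To illustrate the role played by balance in this direction of the proof, the following sketch tests whether the normalized degree vector is stationary for the random walk on two small directed graphs. Both weight matrices are hypothetical examples of our own, and we assume the convention that $W_{ij}$ is the weight of the edge from $v_i$ to $v_j$, so that $P_{ij} = W_{ij} / d_i$ with $d_i$ the out-degree.

```python
from fractions import Fraction


def out_degrees(W):
    """Row sums of the weight matrix."""
    return [sum(row) for row in W]


def in_degrees(W):
    """Column sums of the weight matrix."""
    return [sum(row[j] for row in W) for j in range(len(W))]


def degree_distribution_is_stationary(W):
    """Check whether pi = d / vol(G) satisfies pi^T P = pi^T for P = D^{-1} W."""
    d = out_degrees(W)
    z = sum(d)
    n = len(W)
    pi = [Fraction(d_i, z) for d_i in d]
    P = [[Fraction(W[i][j], d[i]) for j in range(n)] for i in range(n)]
    return all(sum(pi[i] * P[i][j] for i in range(n)) == pi[j] for j in range(n))


# Directed but balanced: in-degree equals out-degree at every vertex.
balanced = [
    [0, 2, 0],
    [1, 0, 2],
    [1, 1, 0],
]

# Directed and unbalanced: vertex 1 has in-degree 2 but out-degree 1.
unbalanced = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]

print(out_degrees(balanced) == in_degrees(balanced))
print(degree_distribution_is_stationary(balanced))
print(degree_distribution_is_stationary(unbalanced))
```

For the unbalanced graph the check fails at the fourth expression of the derivation above, since the column sum of $W$ no longer equals the out-degree used to define $\pi$.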

A.4 Theorem 4.1.3
Proof. Assume that $X$ is a recurrent Markov chain and $G$ is one of the balanced graphs in $\mathrm{RW}(X)$. Using the same argument from the second part of the proof of Theorem 4.1.2, the degrees of $G$ are related to a stationary distribution $\pi > 0$ of the chain via $\pi_i = \frac{d_i}{z}$. From Theorem 2.6.4 we know that $X$ is reversible if and only if the flow matrix associated with $\pi$ is symmetric, and it is straightforward to show that this is equivalent to $G$ being undirected:
$$\pi_i P_{ij} = \pi_j P_{ji} \iff \frac{d_i}{z} \frac{W_{ij}}{d_i} = \frac{d_j}{z} \frac{W_{ji}}{d_j} \iff W_{ij} = W_{ji},$$
and since $G$ was chosen arbitrarily, this applies to any balanced graph in $\mathrm{RW}(X)$.
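
The equivalence can be made concrete by computing the flow matrix of a random walk directly from the weights. In the sketch below, which uses two hypothetical balanced graphs of our own choosing, the flow matrix reduces to $W / \mathrm{vol}(G)$, so it is symmetric exactly when the weight matrix is. The helper names `flow_matrix` and `is_symmetric` are illustrative.

```python
from fractions import Fraction


def flow_matrix(W):
    """Flow matrix Pi P of the random walk on a balanced graph, with pi_i = d_i / z."""
    d = [sum(row) for row in W]
    z = sum(d)
    n = len(W)
    pi = [Fraction(d_i, z) for d_i in d]
    P = [[Fraction(W[i][j], d[i]) for j in range(n)] for i in range(n)]
    return [[pi[i] * P[i][j] for j in range(n)] for i in range(n)]


def is_symmetric(A):
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


# Undirected graph: symmetric weights.
undirected = [
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
]

# Balanced directed graph: equal in- and out-degrees, asymmetric weights.
directed = [
    [0, 2, 0],
    [1, 0, 2],
    [1, 1, 0],
]

print(is_symmetric(flow_matrix(undirected)))  # detailed balance holds
print(is_symmetric(flow_matrix(directed)))    # detailed balance fails
```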

A.5 Proposition 4.1.4
Proof. See the second part of the proof of Theorem 4.1.2.

A.6 Theorem 4.2.4
Proof. (⇒) Let $x, x' \in \mathbb{R}^n$ be two arbitrary vectors. Then, if $P$ is the transition matrix of a reversible chain and $\pi > 0$ is one of its stationary distributions, it is guaranteed that $\Pi P = P^T \Pi$ (Theorem 2.6.4). Using this, we can show that $P$ satisfies Equation (100):
$$\langle x, P x' \rangle_\Pi = x^T \Pi P x' = x^T P^T \Pi x' = (P x)^T \Pi x' = \langle P x, x' \rangle_\Pi. \tag{157}$$
(⇐) Let $P$ be the transition matrix of a Markov chain with a stationary distribution $\pi > 0$, and assume that $P$ satisfies Equation (100), which can be written as:
$$x^T \Pi P x' = x^T P^T \Pi x'.$$
If we choose $x = e_i$ to be a vector with $i$-th entry equal to 1 and zeros elsewhere, this reduces to:
$$\pi_i \sum_k P_{ik} x'_k = \sum_k P_{ki} \pi_k x'_k.$$
If we then choose $x' = e_j$ to be a vector with $j$-th entry equal to 1 and zeros elsewhere, we get:
$$\pi_i P_{ij} = \pi_j P_{ji}. \tag{161}$$
Since the indices $i$ and $j$ were chosen arbitrarily, Equation (161) holds between any pair of states. Therefore, the chain is reversible (Theorem 2.6.4).
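
The proof reduces self-adjointness to a check on pairs of standard basis vectors, and the sketch below follows the same route numerically. It assumes the weighted inner product $\langle x, y \rangle_\Pi = x^T \Pi y$, and the two example chains (one reversible, one not) together with the helper names are hypothetical choices of our own.

```python
from fractions import Fraction as F


def inner(x, y, pi):
    """Weighted inner product <x, y>_Pi = x^T Pi y."""
    return sum(p * a * b for p, a, b in zip(pi, x, y))


def matvec(P, x):
    """Matrix-vector product P x."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]


def is_self_adjoint(P, pi):
    """Test <e_i, P e_j>_Pi == <P e_i, e_j>_Pi for all pairs of basis vectors."""
    n = len(P)
    basis = [[F(int(i == k)) for k in range(n)] for i in range(n)]
    return all(
        inner(e_i, matvec(P, e_j), pi) == inner(matvec(P, e_i), e_j, pi)
        for e_i in basis
        for e_j in basis
    )


# Reversible chain: random walk on an undirected graph with degrees (3, 5, 4).
P_reversible = [
    [F(0), F(2, 3), F(1, 3)],
    [F(2, 5), F(0), F(3, 5)],
    [F(1, 4), F(3, 4), F(0)],
]
pi_reversible = [F(1, 4), F(5, 12), F(1, 3)]

# Non-reversible chain with stationary distribution (1/5, 2/5, 2/5).
P_nonreversible = [
    [F(0), F(1), F(0)],
    [F(0), F(1, 2), F(1, 2)],
    [F(1, 2), F(0), F(1, 2)],
]
pi_nonreversible = [F(1, 5), F(2, 5), F(2, 5)]

# Arbitrary test vectors for the reversible case.
x = [F(1), F(-2), F(3)]
x_prime = [F(4), F(0), F(-1)]
lhs = inner(x, matvec(P_reversible, x_prime), pi_reversible)
rhs = inner(matvec(P_reversible, x), x_prime, pi_reversible)

print(lhs == rhs)
print(is_self_adjoint(P_reversible, pi_reversible))
print(is_self_adjoint(P_nonreversible, pi_nonreversible))
```

Checking basis vectors is sufficient because both sides of Equation (100) are bilinear in $x$ and $x'$, and on basis vectors the condition is precisely Equation (161).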

A.7 Theorem 4.2.5
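Proof. Let $X$ be a reversible Markov chain with transition matrix $P$ and a stationary distribution $\pi > 0$, and let $K = \Pi^{1/2} P \Pi^{-1/2}$. Then the $(i, j)$-th element of $K$ is:
$$\begin{aligned}
K_{ij} &= \sqrt{\pi_i} \, P_{ij} \, \frac{1}{\sqrt{\pi_j}} \\
&= \frac{1}{\sqrt{\pi_i \pi_j}} \, \pi_i P_{ij} \\
&= \frac{1}{\sqrt{\pi_i \pi_j}} \, \pi_j P_{ji} \\
&= \sqrt{\pi_j} \, P_{ji} \, \frac{1}{\sqrt{\pi_i}} = K_{ji},
\end{aligned}$$
meaning that $K$ is a symmetric matrix. Note that in the third line we made use of detailed balance, which holds for $\pi > 0$ if and only if $X$ is reversible. Therefore, this concludes both directions of the proof.
In order to establish the uniqueness of $K$, we make use of the fact that a reversible Markov chain must be recurrent, meaning that any stationary distribution $\pi > 0$ is of the form $\pi = \sum_{k=1}^{r} \alpha_k \pi_k$ with $\alpha_k > 0$ (Proposition 2.4.9). Since all entries of $\pi$ are greater than zero, $K_{ij} = \sqrt{\pi_i} \, P_{ij} \, \frac{1}{\sqrt{\pi_j}} \neq 0$ if and only if $P_{ij} \neq 0$. For a recurrent chain, the latter can only be true for pairs of states in the same communicating class. If $s_i$ and $s_j$ belong to the $k$-th communicating class, then:
$$K_{ij} = \sqrt{\alpha_k \pi_{k,i}} \, P_{ij} \, \frac{1}{\sqrt{\alpha_k \pi_{k,j}}} = \sqrt{\pi_{k,i}} \, P_{ij} \, \frac{1}{\sqrt{\pi_{k,j}}},$$
where $\pi_{k,i}$ denotes the $i$-th component of the stationary distribution associated to the $k$-th class. Hence, the non-zero entries of $K$ do not depend on the values of $\alpha_k$, i.e. they are independent of the stationary distribution used.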
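
Both parts of this proof can be checked in floating point. The sketch below is a minimal illustration under our own assumptions: a hypothetical reducible reversible chain with two communicating classes, two arbitrary mixtures of the class-wise stationary distributions, and a hypothetical non-reversible chain for contrast. Comparisons use a small absolute tolerance because square roots are not exact.

```python
import math


def symmetrized(P, pi):
    """K = Pi^{1/2} P Pi^{-1/2}, i.e. K[i][j] = sqrt(pi[i]) * P[i][j] / sqrt(pi[j])."""
    n = len(P)
    return [
        [math.sqrt(pi[i]) * P[i][j] / math.sqrt(pi[j]) for j in range(n)]
        for i in range(n)
    ]


def transpose(A):
    return [list(col) for col in zip(*A)]


def all_close(A, B, tol=1e-12):
    """Entrywise comparison of two matrices up to an absolute tolerance."""
    return all(
        math.isclose(a, b, abs_tol=tol)
        for row_a, row_b in zip(A, B)
        for a, b in zip(row_a, row_b)
    )


# Reversible chain with communicating classes {0, 1} and {2, 3}.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.75, 0.0, 0.0],
    [0.0, 0.0, 2 / 3, 1 / 3],
    [0.0, 0.0, 0.5, 0.5],
]
pi_1 = [1 / 3, 2 / 3, 0.0, 0.0]
pi_2 = [0.0, 0.0, 3 / 5, 2 / 5]


def mix(alpha_1, alpha_2):
    """Form pi = alpha_1 * pi_1 + alpha_2 * pi_2."""
    return [alpha_1 * a + alpha_2 * b for a, b in zip(pi_1, pi_2)]


K_a = symmetrized(P, mix(0.5, 0.5))
K_b = symmetrized(P, mix(0.1, 0.9))

# Non-reversible chain with stationary distribution (0.2, 0.4, 0.4).
P_non = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
K_non = symmetrized(P_non, [0.2, 0.4, 0.4])

print(all_close(K_a, transpose(K_a)))      # K is symmetric
print(all_close(K_a, K_b))                 # K does not depend on the alpha_k
print(all_close(K_non, transpose(K_non)))  # symmetry fails without reversibility
```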

A.8 Theorem 4.2.6
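Proof. Let $X$ be a reversible Markov chain with transition matrix $P$. The proof of Theorem 4.2.5 establishes that this is equivalent to $K = \Pi^{1/2} P \Pi^{-1/2}$ being symmetric, where $\pi > 0$ is a stationary distribution of the chain. By virtue of Theorem 4.2.2, this in turn is equivalent to the existence of an orthogonal matrix $Y$ and a real diagonal matrix $\Lambda$ for which:
\begin{align}
K &= Y \Lambda Y^T \\
\iff \Pi^{1/2} P \Pi^{-1/2} &= Y \Lambda Y^T \\
\iff P &= \left(\Pi^{-1/2} Y\right) \Lambda \left(Y^T \Pi^{1/2}\right). \tag{170}
\end{align}
Equation (170) says that $P$ has the same eigenvalues as $K$, which are real. It also tells us that $P$ is diagonalizable by sets of right and left eigenvectors that are related to the eigenvectors of $K$ by $\Pi^{-1/2}$ and $\Pi^{1/2}$, respectively. Hence, if $y_\omega$ is an eigenvector of $K$ with eigenvalue $\lambda_\omega$, then $r_\omega = \Pi^{-1/2} y_\omega$ and $l_\omega = \Pi^{1/2} y_\omega$ are a pair of corresponding right and left eigenvectors of $P$ with the same eigenvalue. Using this we see that if $r_\omega$ and $r_\gamma$ are a pair of right eigenvectors of $P$, then:
$$\langle r_\omega, r_\gamma \rangle_\Pi = \langle \Pi^{-1/2} y_\omega, \Pi^{-1/2} y_\gamma \rangle_\Pi = y_\omega^T \Pi^{-1/2} \, \Pi \, \Pi^{-1/2} y_\gamma = y_\omega^T y_\gamma = \delta_{\omega\gamma},$$
where we have used the fact that the basis $Y$ is orthonormal. Similarly, if $l_\omega$ and $l_\gamma$ are a pair of left eigenvectors of $P$, then:
$$\langle l_\omega, l_\gamma \rangle_{\Pi^{-1}} = \langle \Pi^{1/2} y_\omega, \Pi^{1/2} y_\gamma \rangle_{\Pi^{-1}} = y_\omega^T \Pi^{1/2} \, \Pi^{-1} \, \Pi^{1/2} y_\gamma = y_\omega^T y_\gamma = \delta_{\omega\gamma}.$$

Proof. From the proof of Theorem 4.2.6, we know that if $Y$ is an orthonormal eigenbasis of $K$ and $y_\omega$ is a basis vector with eigenvalue $\lambda_\omega$, then $r_\omega = \Pi^{-1/2} y_\omega$ and $l_\omega = \Pi^{1/2} y_\omega$ are right and left eigenvectors of $P$, respectively, with the same eigenvalue. Therefore:
$$l_\omega = \Pi^{1/2} y_\omega = \Pi^{1/2} \, \Pi^{1/2} r_\omega = \Pi r_\omega.$$

Proof. Assume that $L$ is the normalized Laplacian of an undirected graph $G$ and $x \in \mathbb{R}^N$. Then: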