Abstract
Markov chains are a class of probabilistic models that have achieved widespread application in the quantitative sciences. This is in part due to their versatility, but is compounded by the ease with which they can be probed analytically. This tutorial provides an in-depth introduction to Markov chains and explores their connection to graphs and random walks. We use tools from linear algebra and graph theory to describe the transition matrices of different types of Markov chains, with a particular focus on exploring properties of the eigenvalues and eigenvectors corresponding to these matrices. The results presented are relevant to a number of methods in machine learning and data mining, which we describe at various stages. Rather than being a novel academic study in its own right, this text presents a collection of known results, together with some new concepts. Moreover, the tutorial focuses on offering intuition to readers rather than formal understanding and only assumes basic exposure to concepts from linear algebra and probability theory. It is therefore accessible to students and researchers from a wide variety of disciplines.
1 Introduction
Markov chains are a versatile tool for modeling stochastic processes and have been applied in a wide variety of scientific disciplines, such as biology, computer science, and finance (Pardoux, 2010). This is unsurprising considering the number of practical advantages they offer: (1) they are easy to describe analytically, (2) in many domains they make complex computations tractable, and (3) they are a well-understood model type, meaning that they offer some level of interpretability when used as a component of an algorithm. Furthermore, as we show in this tutorial, Markov chains are temporal processes that take place on graphs. This makes them particularly suitable for modeling data-generating processes that underlie time series and graph data sets, both of which have received much attention in the fields of machine learning and data mining (Aggarwal, 2015).
The application of Markov chains requires the assumption that at least some aspect of the process being modeled has no memory. An important consequence of this assumption is that the process can be described in detail using a transition matrix. Furthermore, there exists a rich framework for describing distinct features of such processes based on the eigenvalues and eigenvectors of this matrix. This tutorial provides an in-depth exploration of this framework, making use of tools from probability theory, linear algebra, and graph theory. Since the work is intended for readers from diverse academic backgrounds, we concentrate on providing intuition for the tools used rather than strict mathematical formalism.
The material presented underlies multiple methods from different areas of machine learning, and instead of exploring these methods individually, we focus on the general properties that make Markov chains useful across these domains. Nonetheless, so that readers can appreciate the scope of the tutorial, we briefly summarize the methods that it is relevant to. In graph-based unsupervised learning, it is related to nonlinear dimensionality reduction techniques such as Laplacian eigenmaps (Belkin & Niyogi, 2001, 2003) and spectral clustering (Weiss, 1999; Ng et al., 2001; von Luxburg, 2007). These two closely related methods aim to represent data sets in a way that preserves local geometry and are traditionally formulated using graph Laplacians. However, one line of work on spectral clustering instead uses Markov chains (Meilă & Shi, 2000, 2001; Tishby & Slonim, 2001; Saerens et al., 2004; Liu, 2011; Meilă & Pentney, 2007; Huang et al., 2006; Weinan et al., 2008). Furthermore, the method of diffusion maps (Coifman et al., 2005a, 2005b; Coifman & Lafon, 2006) is a generalization of Laplacian eigenmaps that is based on Markov chains and can be tuned to different length scales in a graph, thereby allowing a multiscale geometric analysis of data sets. An in-depth survey of Laplacian eigenmaps, spectral clustering, diffusion maps, and other related methods can be found in Ghojogh et al. (2021). In the domain of time series analysis, the tutorial is relevant to slow feature analysis (SFA) (Wiskott & Sejnowski, 2002), a dimensionality-reduction technique that is based on the notion of temporal coherence and is conceptually related to Laplacian eigenmaps (Sprekeler, 2011). The ideas underlying Laplacian eigenmaps and spectral clustering have also been extended to classification problems, both for labeled (Kamvar et al., 2003) and partially labeled data sets (Szummer & Jaakkola, 2001; Joachims, 2003; Zhou et al., 2005). Finally, the material presented in this tutorial also forms the basis of various approaches to value function approximation in reinforcement learning, such as Mahadevan’s proto-value functions (Mahadevan, 2005; Mahadevan & Maggioni, 2007; Johns & Mahadevan, 2007), Stachenfeld’s work on successor representation (Stachenfeld et al., 2014, 2017), and other closely related methods (Petrik, 2007; Wu et al., 2019). Something common to many of the applications mentioned thus far is that they assume all underlying graphs to be undirected or, equivalently, that the corresponding Markov chain is reversible. This provides a number of guarantees that are crucial for these methods to work, and we explore these guarantees in depth in this tutorial. In most cases, the extension to the directed/nonreversible setting faces a number of challenges and is still actively researched. We discuss these challenges and present various solutions that have been suggested in the literature.
The rest of the text is organized as follows. In section 2, we give a general introduction to discrete-time, stationary Markov chains on finite state spaces and explore some specific types of chains in detail. Section 3 then gives a formal introduction to graphs in order to provide a more detailed description of Markov chains. In section 4, random walks are presented as a canonical transformation that turns any graph into a Markov chain, and the undirected and directed cases are considered separately to better understand the types of Markov chains that they typically give rise to. Finally, in section 5, we explore applications of the material in earlier sections in two areas of computer science literature.
2 Markov Chains
2.1 Definition
Markov chains can also be depicted visually in the form of a graph, with the state space drawn as a collection of circles and labeled arrows between these circles representing the nonzero transition probabilities Pij. We call this diagram the transition graph of a Markov chain. A formal introduction to the mathematics of graphs is given in section 3, but until then, transition graphs are simply used as an illustrative tool.
In Figures 2a to 2d we do this for n = 10 trajectories of length 4, each starting with X0 = S. Each trajectory is depicted in a specific color, and consists of points in a transition graph plotted across four time points, so that the position of each point indicates a state that one of the trajectories is in at time t. Other than a slight bias toward studying (S) at each time point, it is hard to pick out any clear patterns using only these 10 trajectories. Figures 2e to 2h show similar plots for n = 100, but with all points colored black and the relative occupation of each state for t > 0 indicated by a percentage. Finally, we increase the number of trajectories to n = 1000 in Figures 2i to 2l. Percentages are again used to indicate the relative state occupations, but instead of representing the trajectories with dots, we color each state in gray-scale based on the percentage values. Comparing all the plots in this figure, one can note that in the first row, it is possible to track each of the individual trajectories, whereas in the second and third rows, the focus is instead on approximating the relative probability of doing each activity at each time.
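To make the trajectory-sampling procedure behind these figures concrete, the following is a minimal sketch (using a hypothetical 3-state transition matrix rather than the one used in Figure 2) of how n trajectories of length 4 can be sampled and the relative occupation of each state estimated at each time point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state transition matrix (rows sum to 1); state 0 plays the
# role of the starting state S in the figure.
P = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

def sample_trajectories(P, n, length, start):
    """Sample n trajectories of the given length, all starting in `start`."""
    n_states = P.shape[0]
    trajectories = np.empty((n, length + 1), dtype=int)
    trajectories[:, 0] = start
    for t in range(length):
        for i in range(n):
            trajectories[i, t + 1] = rng.choice(n_states, p=P[trajectories[i, t]])
    return trajectories

# Relative occupation of each state at each time point, estimated from n trajectories.
for n in (10, 100, 1000):
    traj = sample_trajectories(P, n=n, length=4, start=0)
    occupation = np.stack([np.bincount(traj[:, t], minlength=3) / n
                           for t in range(traj.shape[1])])
    print(f"n={n}:\n{np.round(occupation, 2)}")
```

As in the figure, the estimated occupation percentages become smoother as n grows.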
2.2 Evolution via Matrix-Vector Multiplication
The structure and action of P, just like any other matrix, can be evaluated using tools from linear algebra. In particular, P can either multiply column vectors from the left or row vectors from the right. While typical conventions formulate matrix multiplication in the former way, the latter is more common for right stochastic matrices due to its semantic interpretation. Nonetheless, as both operations offer their own insight into the descriptive capacity of Markov chains, we outline both in this section.
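As a small illustration of these two conventions, the sketch below uses a hypothetical 3-state transition matrix (not an example from the text): a distribution over states is a row vector multiplied by P from the right, while a column vector of state values is multiplied by P from the left.

```python
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# Row-vector convention: a distribution over states is a row vector that is
# multiplied by P from the right, giving the distribution one time step later.
pi_t = np.array([1.0, 0.0, 0.0])
pi_next = pi_t @ P

# Column-vector convention: P multiplies a column vector from the left; if f
# assigns a number to each state, (P @ f)[i] is the expected value of f one
# step after leaving state s_i.
f = np.array([0.0, 1.0, 2.0])
expected_f = P @ f

# A distribution that is invariant under the evolution (a stationary
# distribution, anticipating the discussion below).
pi = np.array([0.25, 0.5, 0.25])
print(pi_next, expected_f, np.allclose(pi @ P, pi))
```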
A particularly special type of distribution for a Markov chain is one that is invariant under its evolution: a stationary distribution π, satisfying πP = π.
We can also ask how much probability mass moves from one state to another when the chain is described by a given stationary distribution π. This is captured by a flow matrix whose entry (i, j) is the probability mass πi Pij that flows from si to sj in one step.
2.3 Eigenvalues and Eigenvectors
Every real matrix can be interpreted as a linear transformation. A central task of linear algebra is to shed light on the relationship between the numerical properties of a matrix and various aspects of the transformation that it represents. Often a single matrix represents a combination of several distinct transformations; for example, an object in two or more dimensions can simultaneously be rotated and stretched. Finding the eigenvalues and eigenvectors of a matrix is one way to partition a linear transformation into its component parts and reveal their relative magnitudes. In this section, we give a brief and informal summary of how this works and apply this to transition matrices.
An eigenvector of a matrix P is a vector that is simply scaled by some number λ when multiplied by P. Like all other vectors, eigenvectors can either be rows, satisfying xP = λx, or columns, satisfying Py = λy, which are known as left eigenvectors and right eigenvectors, respectively. In both cases, the number λ is called the eigenvalue of the respective eigenvector. It is worth noting that real matrices, including transition matrices, can have complex eigenvalues, λ ∈ ℂ, in which case the corresponding eigenvector is also complex. However, such solutions can occur only in complex conjugate pairs, meaning that λ* and the conjugated eigenvector are also guaranteed to be an eigenvalue-eigenvector pair, where * denotes the complex conjugate.
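Numerically, both kinds of eigenvectors can be computed directly from P. The sketch below uses a hypothetical transition matrix; left eigenvectors are obtained here as right eigenvectors of the transpose, and the output also shows that complex eigenvalues appear in conjugate pairs.

```python
import numpy as np

# Hypothetical transition matrix with a rotational component, so that some
# of its eigenvalues are complex.
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])

# Right eigenvectors: P v = lambda v.
evals, V = np.linalg.eig(P)
# Left eigenvectors: x P = lambda x, obtained as right eigenvectors of P^T.
evals_left, X = np.linalg.eig(P.T)

print(evals)  # complex eigenvalues occur in conjugate pairs

# lambda = 1 is always an eigenvalue of a transition matrix; normalizing its
# left eigenvector to sum to 1 gives a stationary distribution.
k = np.argmin(np.abs(evals_left - 1.0))
pi = np.real(X[:, k])
print(pi / pi.sum())
```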
where m is the number of eigenvalues of P with |λω| = 1. In the long-time limit (k → ∞), the terms with |λω| = 1 survive, whereas those with |λω| < 1 die off, with |λ| measuring the rate of decay in the latter case. This allows us to interpret the first and second sums in equation 2.54 as representing the persistent and transient behavior of the Markov chain, respectively.
In the persistent case, we can partition the terms into three types based on the eigenvalues: (i) λ = 1, (ii) λ = −1, and (iii) complex λ with |λ| = 1. We have already seen that a stationary distribution π and the all-ones vector 1 are left and right eigenvectors of type i, and in both cases, these eigenvectors represent fixed structures on the state space that persist when acted on by P. To capture this property, we call such eigenvectors persistent structures. In case ii, eigenvectors flip their sign when acted on by the transition matrix and return to themselves after two steps. Eigenvectors of this type therefore correspond to permanent oscillations of probability mass between states in 𝒮, and we therefore refer to them as persistent oscillations. Eigenvalues of type iii are explored in more depth in section 3.5, where we show that they are always complex roots of unity, λ^k = 1 for some k > 2, which occur at equally spaced locations on the unit circle. Therefore, when their corresponding eigenvectors are acted on repeatedly by P, they return to themselves after k steps. Thus, in analogy to case ii, they describe permanent cycles of probability mass through the state space 𝒮, and we therefore call such eigenvectors persistent cycles.
In the transient case, an analogous categorization based on the eigenvalues can be applied, consisting of the following three types: (i') λ ∈ [0, 1), (ii') λ ∈ (−1, 0), and (iii') complex λ with |λ| < 1. For type i', the corresponding eigenvectors represent perturbations to the persistent behavior that decay over time at a rate determined by |λ|. We therefore call such eigenvectors transient structures. When |λ| ≈ 1, these structures describe sets of states that, on average, a chain spends a long time in before converging; these are known as metastable sets (Conrad et al., 2016). Furthermore, λ = 0 can be thought of as a limiting case of type i' in the sense that any corresponding eigenvector decays infinitely quickly (i.e., it vanishes after a single application of P) and does not exhibit oscillatory or cyclic behavior. Eigenvalues of type ii' also decay over time, but as in case ii, their negative sign means that the corresponding eigenvectors exhibit oscillatory behavior when acted on by P. We refer to eigenvectors of this type as transient oscillations. Eigenvalues of type iii' generalize those of type iii to the transient case since they are also complex and represent cycles of probability mass. In contrast to type iii, though, they do not lie on the unit circle and need not be equally spaced. Since |λ| < 1, these cycles decay over time, and we therefore call them transient cycles. When |λ| ≈ 1, these cycles can persist for a long time and have been referred to as dominant cycles (Conrad et al., 2016).
The above categorizations are summarized in Table 1, where structures, oscillations, and cycles are colored in green, blue, and red, respectively, and the persistent and transient cases are shaded bright and pale, respectively. In Figure 3, we visualize the six different types of eigenvalue using this color scheme by shading the respective regions of the unit circle in which they can occur.
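This categorization can also be expressed as a small helper function. The sketch below classifies an eigenvalue by whether |λ| is (numerically) equal to 1 or smaller, and by whether λ is nonnegative real, negative real, or complex; the tolerance and example chain are illustrative assumptions.

```python
import numpy as np

def classify_eigenvalue(lam, tol=1e-10):
    """Assign an eigenvalue of a transition matrix to one of the six
    categories described above."""
    lam = complex(lam)
    kind = "persistent" if abs(abs(lam) - 1.0) < tol else "transient"
    if abs(lam.imag) > tol:
        return kind + " cycle"
    if lam.real < -tol:
        return kind + " oscillation"
    return kind + " structure"

# Example: a deterministic two-state flip has eigenvalues 1 and -1, that is,
# a persistent structure and a persistent oscillation.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
for lam in np.linalg.eigvals(P):
    print(lam, "->", classify_eigenvalue(lam))
```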
Before moving on, a couple of details are worth pointing out. First, the above analysis is possible only if the transition matrix is diagonalizable. One case in which this is guaranteed to hold is for symmetric matrices. However, given that transition probabilities are rarely pairwise symmetric (i.e., Pij = Pji), this restriction is clearly too strong. Thus, a more detailed investigation is needed in order to identify the conditions under which the above decomposition can be made, which we do in sections 3 and 4. For a more in-depth account of the theory of diagonalizable matrices, we recommend Meyer (2000). Second, a quick check reveals that all terms in equation 2.54 depend on the components wj that describe the starting distribution (see equation 2.47). Because of this, in the most general case, both the persistent and transient behavior of a Markov chain can be sensitive to initial conditions. In a later section, we consider a particular type of Markov chain for which this is not the case and provide a simplified analysis of its evolution over time.
2.4 Classification of States
In the following sections, we explore three types of Markov chains. In order to describe each type in detail, we first define various properties that apply to individual states, or sets thereof, in a state space 𝒮. This is the focus of this section.
2.4.1 Communicating Classes
We start by making the following definitions related to how states in 𝒮 are connected:
(Accessibility). For two states si, sj ∈ 𝒮, we say that sj is accessible from si, denoted si → sj, when it is possible to reach sj from si in some number of steps k ≥ 0, that is, when (P^k)ij > 0.
(Communication). Two states si, sj ∈ 𝒮 are said to communicate if si → sj and sj → si, which is denoted si ↔ sj.
Communication is a useful property for describing states in a Markov chain, as exemplified by the following result:
(Communicating Class). Communication is an equivalence relation, meaning that:
for all si ∈ 𝒮, si ↔ si, since by definition each state can reach itself in 0 steps,
if si ↔ sj, then sj ↔ si,
if si ↔ sj and sj ↔ sk, then si ↔ sk,
Furthermore, we can make a useful categorization of Markov chains based on the number of communicating classes they have:
(Number of Communicating Classes). Let n be the number of communicating classes of a Markov chain. When n = 1, the chain is said to be irreducible; otherwise it is reducible.
In words, an irreducible Markov chain is one in which for any pair of states, there exists a connecting path in both directions. In Figure 4, three example Markov chains are shown, with the communicating classes indicated by the dashed boxes. Take a moment to double-check why each state belongs to its communicating class, and verify that the examples in Figures 4a and 4b are reducible, whereas the one in Figure 4c is irreducible. Finally, observe from Figure 4b that it is possible for states in one communicating class to be accessible from states in another class (e.g., s3 → s4).
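Computationally, communicating classes can be found by treating the chain as a graph (anticipating section 3): they are the strongly connected components of the transition graph. Below is a minimal sketch using scipy, with a hypothetical reducible chain rather than the examples of Figure 4.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def communicating_classes(P):
    """Communicating classes are the strongly connected components of the
    directed graph whose edges are the nonzero entries of P."""
    n_classes, labels = connected_components((P > 0).astype(float),
                                             directed=True, connection='strong')
    return n_classes, labels

# A hypothetical reducible chain: s0 and s1 communicate, while s2 can only be
# reached from them (and never left), giving two communicating classes.
P = np.array([[0.5, 0.4, 0.1],
              [0.6, 0.4, 0.0],
              [0.0, 0.0, 1.0]])
n_classes, labels = communicating_classes(P)
print(n_classes, labels)   # 2 classes; labels group the states by class
print("irreducible" if n_classes == 1 else "reducible")
```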
Irreducible Markov chains feature in a number of subsequent sections of this tutorial due to the following result:
A Markov chain is irreducible if and only if it has a unique stationary distribution π with strictly positive elements, i.e., πi > 0 for all si ∈ 𝒮.
This result can be applied to the example of Figure 4c by using the observation of the previous section that stationary distributions of a given Markov chain are eigenvectors of the associated transition matrix with eigenvalue 1. If we were to find the transition matrix of the chain in Figure 4c and compute its eigenvectors, we would indeed find a single left eigenspace of P corresponding to eigenvalue 1, spanned by a vector with strictly positive elements. When normalized such that its elements sum to 1, the resulting vector is the unique stationary distribution, which we encourage readers to check for themselves. For reducible Markov chains, the guarantees of uniqueness and positivity of the stationary probabilities no longer hold. In order to explore the stationary distributions of such chains, we introduce some new concepts for describing states.
2.4.2 Recurrence and Transience
Each state can be categorized based on how likely it is to be revisited, given that it is currently occupied. This is formalized by the following definition, based on the return probability fi, the probability that a chain currently occupying state si returns to si at some future time step: a state is called recurrent if fi = 1 and transient if fi < 1.
This is a useful way to characterize states in 𝒮, since it generalizes to all states within a communicating class, that is, for any communicating class C: either all states in C are recurrent or all states in C are transient. Transience and recurrence are therefore examples of class properties, and we henceforth use the terms recurrent class and transient class for communicating classes that contain recurrent and transient states, respectively. Furthermore, for finite state spaces, a Markov chain is guaranteed to have at least one recurrent class. Building on this, we can then apply the notion of recurrence to a Markov chain as a whole:
(Recurrent Chains). A Markov chain that contains only recurrent classes is called a recurrent Markov chain.
As an illustration, we apply these concepts to the reducible chains depicted in Figure 4. In Figure 4a, there are two communicating classes, and in each one, there is no possibility to exit. This means that if a state in one of these classes is occupied, it is guaranteed to be revisited at some future time step, that is, fi = 1 for all states in each class. Therefore, both classes are recurrent and the chain as a whole is recurrent. In Figure 4b, the main difference is that there is now the possibility to exit the blue class without returning. For example, assuming s1 is occupied, although there is a possibility that s1 can be visited again later (e.g., s1 → s2 → s1), as soon as a transition s1 → s4 takes place, this will no longer be possible. This is why fi < 1 for states in the blue class. Hence, while the red class is recurrent, the blue class is transient, and because of this, the chain as a whole is not recurrent.
For an irreducible Markov chain, such as the example shown in Figure 4c, every state is guaranteed to be recurrent, leading to the following proposition:
(Recurrence of Irreducible Markov Chains). All irreducible Markov chains are recurrent.
However, from the example in Figure 4a, it is clear that the converse does not hold, since it is also possible for a reducible chain to be recurrent. In fact, whether a reducible chain is recurrent or not determines certain features of the stationary distributions belonging to the chain. This is outlined by the following proposition:
(Stationary Distributions of Reducible Chains). For a reducible Markov chain with r recurrent classes and t transient classes:
Any stationary distribution has probability zero for states belonging to a transient class.
For the kth recurrent class, there exists a unique stationary distribution with nonzero probabilities only for states in that class.
When the number of recurrent classes r is bigger than 1, stationary distributions can be formed via convex combinations of the per-class distributions, that is, by weighting them with nonnegative coefficients that sum to 1 (equation 2.56), meaning that there is an infinite number of stationary distributions.
Furthermore, when the number of transient classes t is zero or, in other words, when the chain is recurrent, performing the procedure above with nonzero coefficients always yields distributions that are strictly positive.
We can apply these results to the two reducible chains in Figure 4. In the example of Figure 4a, there is one stationary distribution associated with the blue class and one associated with the red class. We can then generate an arbitrary number of extra stationary distributions by taking convex combinations of these two distributions with coefficients α1 and α2, for example, α1 = 0.25 and α2 = 0.75. Furthermore, the last bullet point in proposition 3 tells us that since this chain is recurrent, we can be sure that any convex combination with positive coefficients yields a distribution with strictly positive entries. This property of recurrent chains is particularly relevant to our treatment of both reversible chains in section 2.6 and random walks in section 4, and stationary distributions with this property are given their own notation from here on. In the example of Figure 4b, the red class is the only recurrent class, meaning that the stationary distribution associated with this class is the only stationary distribution of the chain. Since the transition probabilities for states in this class are the same as in the example of Figure 4a, this stationary distribution is the same as the one found there. Furthermore, in agreement with proposition 3, we see that the transient states in this chain have a stationary probability of 0.
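As a numerical check of this reasoning, the following minimal sketch uses a hypothetical reducible, recurrent chain with two recurrent classes (not the chain of Figure 4a) and verifies that a convex combination of the per-class stationary distributions is again stationary.

```python
import numpy as np

# Hypothetical reducible, recurrent chain: states {s0, s1} and {s2, s3} form
# two recurrent classes with no transitions between them.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.8],
              [0.0, 0.0, 0.8, 0.2]])

pi_1 = np.array([0.5, 0.5, 0.0, 0.0])   # stationary distribution of the first class
pi_2 = np.array([0.0, 0.0, 0.5, 0.5])   # stationary distribution of the second class

# Any convex combination is again a stationary distribution, and it is
# strictly positive whenever both coefficients are nonzero.
alpha = 0.25
pi = alpha * pi_1 + (1 - alpha) * pi_2
print(np.allclose(pi @ P, pi), (pi > 0).all())   # True True
```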
2.4.3 Periodicity
The notion that states can be revisited is also meaningful in another sense. We define the following quantity, which describes how frequently such revisits can take place: the period of a state is the greatest common divisor of all numbers of steps after which a return to that state is possible, and a state with period 1 is called aperiodic.
Like transience and recurrence, period is also a class property, and we use d to refer to the period of a whole class. This in turn allows us to define the period of a Markov chain:
(Periodicity). When all communicating classes in 𝒮 have period d > 1, the Markov chain is said to be periodic, with period d.
(Aperiodicity). When all communicating classes in 𝒮 are aperiodic, the Markov chain is said to be aperiodic.
(Mixed Periodicity). If 𝒮 contains communicating classes with different periods, the Markov chain is said to have mixed periodicity.
Consider the example shown in Figure 5a. This chain has two communicating classes, the red one being a transient class with d = 2 and the blue one being a recurrent class with d = 1, meaning that this chain has mixed periodicity. In Figures 5b–5d we show three irreducible Markov chains, with panels b and c having period 3 and 2, respectively, and panel d being aperiodic.
Markov processes are a broad class of models, and even under the restricted settings considered in this tutorial (discrete time, homogeneous, and finite state spaces), there are many distinct types of chains. In the following sections, we concentrate on three particular types that are relevant in applied domains.
2.5 Ergodic Chains
When modeling a system that evolves over time, it is important to ask what can be said, if anything, about its long-term behavior. For a Markov chain, this question can be phrased in two ways. On one hand, we can sample a single trajectory starting from some initial state and ask what the average behavior is over time: How often is it found in each state for a trajectory of length t? On the other hand, we can describe our starting conditions as a distribution and ask what this evolves to in the future: What is the probability of being in each state si at a later time t? We refer to these two notions of long-term behavior as the trajectory and distribution perspectives, respectively. While the analyses given so far predominantly use the latter perspective, we remind readers that in section 2.1, we introduced the idea of a distribution over by taking the limit of an infinitely large ensemble of trajectories, meaning that the two concepts are closely related.
This two-way view originates from the field of statistical physics, where physical processes can be analyzed either with temporal averages (i.e., the trajectory perspective) or with ensemble averages (i.e., the distribution perspective). One class of systems that has received a lot of study in this field comprises those for which these two types of averaging yield the same result as t → ∞. Such systems are known as ergodic systems, and this equivalence means that the statistics of their long-term behavior can be obtained from a single, sufficiently long sample. One implication of this is that initial conditions are forgotten over time, which makes ergodic systems particularly attractive from a simulation or modeling perspective. Finally, with this in mind, we can define an ergodic Markov chain as follows:
(Ergodic Markov Chain). An ergodic Markov chain is one that is guaranteed to converge to a unique stationary distribution.
Clearly, in order for a chain to be ergodic, it must have a unique stationary distribution. Therefore, by virtue of theorem 1, a necessary condition for a chain to be ergodic is that it is irreducible. However, there is no guarantee that an irreducible chain converges, which is the second condition of definition 12. The convergence of a Markov chain is related to its periodicity, as explained by the following result:
We can apply this result to the irreducible Markov chain in Figure 5c. This chain has a unique stationary distribution π and a period of d = 2. Therefore, there is no guarantee that the chain converges to π, since it can get trapped in persistent oscillations. To observe this, we can try out different initial conditions and iteratively apply the update rule in equation 2.12. For example, starting from a suitable initial distribution, we obtain a persistent oscillation between two distributions. However, this is not the only persistent oscillation possible for this chain, which can be observed by trying out different initial conditions. Finally, in section 3.5, we gain more insight into theorem 2 by using tools from graph theory to describe the eigenvectors of P with |λ| = 1.
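The sketch below reproduces this behavior on a hypothetical irreducible chain of period 2 (not the chain of Figure 5c): after the transient terms die off, repeatedly applying the update rule leaves the distribution alternating between two points rather than converging.

```python
import numpy as np

# Hypothetical irreducible chain with period 2: probability mass alternates
# between the two halves of the state space.
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])

pi_t = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(200):       # let the transient behavior die off
    pi_t = pi_t @ P
for _ in range(4):         # what remains is a persistent oscillation of period 2
    print(np.round(pi_t, 3))
    pi_t = pi_t @ P
```

The printed distributions alternate between two fixed vectors instead of settling on the unique stationary distribution of the chain.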
A key insight from theorem 2 is that a Markov chain is guaranteed to converge only if it is aperiodic. Together with irreducibility, this therefore provides the conditions under which a chain is ergodic:
(Conditions for Ergodicity). A Markov chain is ergodic if and only if it is both irreducible and aperiodic, which respectively ensure that there is a unique stationary distribution and that the chain always converges to this distribution. Furthermore, this distribution is said to be the limiting distribution of the chain.
2.6 Reversible Chains
It is worth emphasizing that since no assumption is made in definition 13 about the number of communicating classes, it also applies in the case of reducible recurrent chains, where there is an infinite number of stationary distributions to choose from. However, in such cases, the choice of stationary distribution makes no difference:
(Time Reversal of Reducible Chains). For a reducible recurrent Markov chain, the time reversal is uniquely defined, with its transition matrix being independent of which stationary distribution is used (for the proof, see the appendix).
Moreover, the set of stationary distributions belonging to a recurrent Markov chain is the same as the set belonging to the corresponding time reversal:
(Stationary Distributions of Time Reversal). Let P be the transition matrix of a recurrent Markov chain. Then π is a stationary distribution of P if and only if it is a stationary distribution of the time reversal.² (For the proof, see the appendix.)
A number of observations can be made about this definition. First, the left (right) terms represent the flow of probability from si to sj (sj to si), given that the chain is described by the distribution π. Thus, for a reversible Markov chain in one of its stationary distributions, the flow from one state to another is completely balanced by the flow in the reverse direction, meaning that the flow matrix is always symmetric for such chains. By comparison with equation 2.19, we see that detailed balance is a stronger condition than global balance, since in the latter case, there is only an equivalence between the total flow in and out of each state. Second, since πi and πj are nonzero, it follows that Pij ≠ 0 if and only if Pji ≠ 0. Thus, the transition structure of a reversible Markov chain always permits the return to the previous state, and because of this, the period of a reversible Markov chain can be at most 2. Third, while some sources assume irreducibility as a precondition of reversibility, we instead base our definition on the weaker condition of recurrence (Porod, 2021). This is due to the fact that we only need recurrence in order to define the time reversal of a Markov chain. Furthermore, with this convention, theorem 4 applies more broadly to reducible Markov chains, which lets us make a closer comparison between reversible Markov chains and undirected graphs in section 4. Finally, theorem 4 implies that there are two distinct ways in which Markov chains can be nonreversible: (1) they can be recurrent without satisfying detailed balance, or (2) they can be nonrecurrent. In case 1, detailed balance is violated for any stationary distribution π, meaning that the flow matrix is asymmetric, and in case 2, no strictly positive stationary distribution exists. Note, however, that for a nonrecurrent chain, there exists the possibility that the flow matrix is symmetric for all stationary distributions despite none of those distributions being strictly positive. For such chains, removing all transient states from 𝒮 produces a reversible chain. We therefore refer to such chains as semireversible.
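Detailed balance is easy to check numerically: compute a stationary distribution and test whether the flow matrix with entries πi Pij is symmetric. The sketch below does this for two hypothetical chains (not the examples of Figure 6); the helper assumes a single recurrent class so that the stationary distribution used is unambiguous.

```python
import numpy as np

def stationary_distribution(P):
    """A left eigenvector of P with eigenvalue 1, normalized to sum to 1
    (assumes a single recurrent class)."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def is_reversible(P, tol=1e-10):
    """Detailed balance: pi_i * P_ij == pi_j * P_ji for all i, j, i.e. the
    stationary flow matrix diag(pi) P is symmetric."""
    pi = stationary_distribution(P)
    F = pi[:, None] * P
    return np.allclose(F, F.T, atol=tol)

# A random-walk-like chain on a path: reversible.
P_path = np.array([[0.50, 0.50, 0.00],
                   [0.25, 0.50, 0.25],
                   [0.00, 0.50, 0.50]])
print(is_reversible(P_path))   # True

# A chain with a preferred direction of circulation around a ring: nonreversible.
P_ring = np.array([[0.0, 0.9, 0.1],
                   [0.1, 0.0, 0.9],
                   [0.9, 0.1, 0.0]])
print(is_reversible(P_ring))   # False
```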
To illustrate some of these points, we consider four Markov chains in Figure 6, with panels a, d, g, and j showing the transition graphs of each example, and panels b, e, h, and k showing the respective transition matrix, stationary distribution, and flow matrix (for simplicity, each example has a single recurrent class, so that both the stationary distribution and the associated flow matrix are unique). As a visual illustration of the pairwise stationary flow between states in each example, in panels c, f, i, and l, the stationary distribution is represented as a bar plot, with the portions of probability mass flowing in (left) and out (right) of each state shown as portions of each bar. The example in panel a is reversible, as can be seen from the symmetry of the flow matrix in panel b, or equivalently from the matching between left and right portions of all bars in panel c. The example in panel d is almost equivalent to the one in panel a, except that the outgoing transition probabilities from s1 have been slightly modified (indicated by the colored arrows in panels a and d and the colored matrix entries in panels b and e). This modification is enough to violate detailed balance, as can be seen from the asymmetry of the flow matrix or the bar plot in panel f. Finally, the chains depicted in panels g and j are nonrecurrent since in both cases, state s3 has only outgoing transitions. Therefore, π3 = 0, and both examples are nonreversible. However, a quick check of the flow matrix or the bar plot in panel i reveals that the example in panel g is semireversible. Conversely, the example in panel j is identical to the one in panel g except for the outgoing transitions from s4 (again indicated by the colored arrows in panels g and j and the colored matrix entries in panels h and k), which leads to an asymmetric stationary flow between the recurrent states.
It is worth pointing out that in our analysis above, we check for reversibility by inspecting the stationary distributions and the corresponding flow matrices of each example. However, since reversibility is a property associated with Markov chains and not with distributions, one might wonder whether there is an alternative way to formalize it based purely on the transition probabilities Pij. Clearly, equation 2.71 prohibits one-way transitions (i.e., Pij > 0 and Pji = 0), but this is only a necessary condition of reversibility. Can we offer anything more precise? Fortunately, the answer is yes, and it is given by Kolmogorov’s criterion (Kolmogoroff, 1936):
One way to understand this theorem is that for reversible Markov chains, the probability of traversing any closed path in the state space is independent of the direction of traversal. Hence, reversible Markov chains can be thought of as having zero net circulation. By contrast, recurrent Markov chains that are nonreversible have at least one path that violates equation 2.72, over which there is a higher probability to traverse in one direction than the other. For the example in Figure 6a, the relevant closed paths are (up to a cyclic permutation): (1) s1 ↔ s2 ↔ s4 ↔ s3 ↔ s1, (2) s1 ↔ s2 ↔ s4 ↔ s1, and (3) s1 ↔ s4 ↔ s3 ↔ s1. In any of these cases, going around clockwise is equally probable as going around counterclockwise, which is to be expected since this Markov chain is reversible. The example in Figure 6d has the same closed paths available, except that the outgoing transition probabilities from s1 have been changed. This small adjustment is enough to introduce circulation on all the closed paths: for both paths 1 and 3 the counterclockwise direction is more probable since P13P34P42P21 > P12P24P43P31 and P13P34P41 > P14P43P31, respectively, and for path 2, the clockwise direction is more probable since P12P24P41 > P14P42P21. Therefore, by virtue of having at least one path with net circulation, equation 2.72 confirms that this chain is indeed nonreversible.
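Kolmogorov's criterion can likewise be checked numerically by comparing the probability of traversing a closed path in the two directions. A minimal sketch, reusing the hypothetical ring chain from above rather than the chains of Figure 6:

```python
import numpy as np

P_ring = np.array([[0.0, 0.9, 0.1],
                   [0.1, 0.0, 0.9],
                   [0.9, 0.1, 0.0]])

def path_probability(P, path):
    """Product of transition probabilities along a path, e.g. [0, 1, 2, 0]."""
    return np.prod([P[i, j] for i, j in zip(path[:-1], path[1:])])

cycle = [0, 1, 2, 0]
forward = path_probability(P_ring, cycle)         # 0.9 * 0.9 * 0.9 = 0.729
backward = path_probability(P_ring, cycle[::-1])  # 0.1 * 0.1 * 0.1 = 0.001
print(forward, backward)   # unequal products: net circulation, so nonreversible
```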
These analyses illustrate how theorems 4 and 5 provide two alternative but equivalent definitions of reversibility. Something common to both of these interpretations is that reversible Markov chains satisfy a type of equilibrium, either between the exchange of probability mass between pairs of states or in the circulation along closed paths, respectively. In fact, the concept of detailed balance stems from early work in the field of statistical mechanics aimed at formalizing the notion of thermodynamic equilibrium on a microscopic level (Gorban, 2014). More recently, Markov chain Monte Carlo methods, which are predominantly based on reversible ergodic chains (see section 5.1), have received widespread application in the natural sciences as a way to model systems that are in thermodynamic equilibrium (Richey, 2010). Conversely, Markov chains that violate detailed balance, or equivalently those with net circulation, have been applied to the less-well-understood case of systems which are out of equilibrium (Jiang et al., 2004; Zhang et al., 2012; Ge et al., 2012). Furthermore, their stationary distributions have been referred to as nonequilibrium steady states (NESS) (Jiang et al., 2004; Zhang et al., 2012; Ge et al., 2012; Conrad et al., 2016; Witzig et al., 2018), which reflects the fact that such distributions are kept fixed over time via unequal flows of probability mass between states (the stationary distribution of the example in Figure 6d is a NESS, as can be seen in the bar plot in Figure 6f).
Reversible Markov chains are significantly easier to treat both analytically and numerically than nonreversible chains. Because of this, there exist various procedures for modifying a nonreversible Markov chain so that it becomes reversible, which is sometimes referred to as reversibilization (Fill, 1991; Brémaud, 1999). For a recurrent chain, this can be done by taking an average of the forward and backward transition probabilities that describe the chain and its time reversal, respectively. This averaging process can be either additive or multiplicative, leading to the following two definitions:
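A sketch of what these two constructions typically look like, following the standard conventions (e.g., Fill, 1991): the backward probabilities are those of the time reversal P̃, and the additive and multiplicative reversibilizations are taken here to be (P + P̃)/2 and P P̃, respectively. The exact definitions used in this tutorial may differ in detail, and the example chain is hypothetical.

```python
import numpy as np

def time_reversal(P, pi):
    """Transition matrix of the time-reversed chain:
    P_rev[i, j] = pi[j] * P[j, i] / pi[i]."""
    return (pi[None, :] * P.T) / pi[:, None]

def additive_reversibilization(P, pi):
    return 0.5 * (P + time_reversal(P, pi))

def multiplicative_reversibilization(P, pi):
    return P @ time_reversal(P, pi)

# Nonreversible ring chain with uniform stationary distribution.
P = np.array([[0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9],
              [0.9, 0.1, 0.0]])
pi = np.full(3, 1 / 3)

for R in (additive_reversibilization(P, pi), multiplicative_reversibilization(P, pi)):
    F = pi[:, None] * R
    print(np.allclose(F, F.T))   # True: both constructions satisfy detailed balance
```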
2.7 Absorbing Chains
Finally, one concept in the theory of Markov chains that is particularly relevant to applied domains is absorption. A state si is called absorbing if it is possible to transition into the state but not out of it, meaning that Pii = 1 and the chain stays in si for all future time steps. An absorbing Markov chain is one in which, from every state si ∈ 𝒮, there exists some path to an absorbing state. Since it is possible to start in a nonabsorbing state and never return, all nonabsorbing states are transient, and the presence of such states means that absorbing chains can be neither reversible nor ergodic. Absorbing chains often occur in Markov decision processes (MDPs), which are central to the field of reinforcement learning (Sutton & Barto, 2018).
The possible transitions in an absorbing chain can be partitioned into three types: (1) transient → transient, (2) transient → absorbing, and (3) absorbing → absorbing. Although the assignment of indices to states in 𝒮 is arbitrary, an assignment based on this partitioning simplifies the analysis of absorbing Markov chains.
We depict this partitioning of transition probabilities in Figure 7a. An absorbing chain with one absorbing state is shown, with the transitions of the three types colored black, red, and blue, respectively. Furthermore, in Figure 7b, we show the corresponding submatrices of the transition matrix.
Since any transient state can reach an absorbing state in a finite number of steps, the probability that the chain ends up in an absorbing state at some future time is 1. For this reason, in the infinite time limit, we can expect to see no transitions taking place between transient states; that is, the powers of the transient-to-transient block of the transition matrix tend to zero. This is an advantageous property, since it means that if we sum up all powers of this block, known as its Neumann series, then the contributions of larger powers get progressively smaller and the sum converges to the inverse of the identity matrix minus this block (see Meyer, 2000, p. 618). Calculating this sum leads to the following useful quantity, the fundamental matrix, which relates the transient states in 𝒮 (Porod, 2021):
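A minimal sketch of this construction on a hypothetical absorbing chain; the transient-to-transient block is denoted Q here purely for illustration, and the fundamental matrix is the inverse described above, whose entries give the expected number of visits to each transient state before absorption.

```python
import numpy as np

# Hypothetical absorbing chain: states 0 and 1 are transient, state 2 is absorbing.
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.0, 0.0, 1.0]])

Q = P[:2, :2]   # transient -> transient block
R = P[:2, 2:]   # transient -> absorbing block

# Fundamental matrix: N = I + Q + Q^2 + ... = (I - Q)^{-1}.
N = np.linalg.inv(np.eye(2) - Q)

print(N)               # N[i, j]: expected visits to transient state j starting from i
print(N @ np.ones(2))  # expected number of steps before absorption from each state
print(N @ R)           # probability of ending up in each absorbing state (here, 1)
```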
When analyzing an absorbing chain, it is very handy to have access to the fundamental matrix. By taking into account all nonnegative powers of the transient-to-transient block, it contains information about all possible paths available between pairs of transient states. Because of this, it is a useful predictive tool that allows several properties of the Markov chain to be deduced (Porod, 2021). Furthermore, in the field of reinforcement learning, it is closely related to the successor representation (Dayan, 1993) (see section 5.2).
2.8 Summary
This concludes our exploration of different types of Markov chains. In Figure 8, we provide a summary of the material presented in this section in the form of a Venn diagram. In this diagram, each type of Markov chain is drawn as a circle or ellipse, with defining properties and results listed in each case. Take a moment to look at this image, and pay attention to the overlapping regions, which indicate how different types of chains are related. For a more in-depth presentation of the material in this section, we recommend Porod (2021) and Brémaud (1999).
In the next section, we introduce graphs as an alternative way to describe Markov chains and summarize insights that emerge from this description. Then, in section 4, we explore the connection between graphs and Markov chains in more depth using the notion of random walks, which allows various relationships to be established between specific types of graphs and the types of Markov chains introduced in this section.
3 Graphs
So far, we have implicitly been interpreting Markov chains as graphs whenever we draw a transition graph. In this section, we formally introduce the concept of graphs, which provides a foundation to the material on random walks in section 4. Readers should note that definitions in graph theory often vary among different sources. Here we use a convention that can encompass a wider variety of graphs, thereby offering greater generality.
3.1 Definition
A graph G = (V, E) is a set of N vertices V = {v1, v2, . . . , vN} together with an edge set E containing pairs of vertices in V. Conceptually, V might represent a collection of objects and E a specification of how some pairs in this collection are related to one another. A natural way to categorize graphs is based on the way in which edges are defined. For instance, in an undirected graph, each edge has no direction and is typically denoted as (vi, vj) ∈ E, whereas in a directed graph, each edge has a specified starting and ending vertex and is usually denoted as (vi → vj) ∈ E. Examples of undirected and directed graphs can be seen in the first and second rows of Figure 9, respectively. Unless otherwise stated, we depict undirected edges as straight lines and directed edges as curved lines with arrowheads indicating the direction. A second distinction we can make is between unweighted graphs, in which one only cares about whether two vertices are related or not, and weighted graphs, in which each edge has a positive weight wij describing the strength of the relationship.³ In Figure 9, the examples in the first column are unweighted and all other graphs are weighted, with weights indicated by numbers next to each edge. The type of edges that a graph has is often chosen based on the type of relationship that one wants to describe. For example, assume that we have a graph G where vertices represent PhD students. Then, if we want to represent the relationship of being in the same research group, undirected, unweighted edges are a natural choice (such as Figure 9a). Conversely, if we want edges to describe whether one student has participated in a main project of another student, then this clearly requires directed unweighted edges (such as Figure 9d). If we now consider variants of the first and second examples, instead focusing on how similar the research topics of two students are or how much work one student has contributed to another student's project, then we need undirected weighted and directed weighted edges, respectively (such as Figures 9b and 9c and Figures 9e and 9f). It is worth noting that in order to assign weights to edges, one needs to specify a scale on which to measure the strength of relationship between vertices.
One can also describe graphs based on their connectivity. In an undirected graph, if there exists a path between each pair of vertices, then the graph is said to be connected; otherwise it is disconnected. The notion of connectivity can generalize to disconnected graphs if we instead consider subsets of vertices in G, which are known as subgraphs. Any subgraph that is connected but is not part of any larger connected subgraph is called a connected component. Both of the undirected graphs in Figures 9a and 9b are connected, whereas Figure 9c shows an example that is disconnected, with two connected components. In particular, this latter example even has a vertex that does not have any edges at all; it is known as an isolated vertex. For a directed graph, if there are directed paths running from vi to vj and from vj to vi for all pairs of vertices vi, vj ∈ V, then the graph is said to be strongly connected. Alternatively, a directed graph is weakly connected if for all pairs of vertices vi, vj ∈ V, it is possible to get from vi to vj and from vj to vi by any path, ignoring the direction of the edges. Clearly, a directed graph is weakly connected if it is strongly connected, but not vice versa. Furthermore, strongly or weakly connected subgraphs that are not part of any larger such subgraphs are referred to as strongly or weakly connected components, respectively. The directed graphs in Figures 9d and 9e are strongly connected, whereas the one in Figure 9f is only weakly connected and has two strongly connected components (take a moment to verify this).
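Representing a graph by its weight matrix (formalized in the next subsection), these connectivity notions can be checked with standard routines; a minimal sketch on a hypothetical directed graph:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# Hypothetical directed graph on 4 vertices: edges v0->v1, v1->v2, v2->v0, v2->v3.
W = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

n_strong, strong_labels = connected_components(W, directed=True, connection='strong')
n_weak, weak_labels = connected_components(W, directed=True, connection='weak')

# Two strongly connected components ({v0, v1, v2} and {v3}), but only one
# weakly connected component: the graph is weakly but not strongly connected.
print(n_strong, strong_labels)
print(n_weak, weak_labels)
```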
3.2 Matrix Representation
For the remainder of this tutorial, we assume that the graphs we deal with are both weighted and directed. The reason we choose this convention is that it is more general. On the one hand, any unweighted graph can be considered as a special case of a weighted graph where the weights are all set to 1. Thus, we henceforth only talk about weight matrices as opposed to adjacency matrices when describing graphs numerically. On the other hand, there is one sense in which directed graphs can be thought of as a generalization of undirected graphs. If vi and vj are two distinct vertices that share an edge in an undirected graph, then the weight of this edge is guaranteed to appear twice in the weight matrix W by virtue of W being symmetric. If instead these vertices belong to a directed graph and there is an edge (vi → vj), then this edge appears only once in W. Therefore, it is possible to interpret an undirected edge between vi and vj as being equivalent to a pair of directed edges of the same weight, with one connecting vi to vj and the other connecting vj to vi. This is the interpretation we use throughout the rest of the tutorial whenever we refer to undirected graphs. As an example, in Figure 10, we show two equivalent depictions of an undirected graph, with the top image drawn in the usual way and the bottom image drawn using pairs of directed edges. Below these two drawings, the weight matrix of this graph is shown. One must note that interpreting undirected graphs in this way is somewhat atypical; however, it allows us a greater level of generality when dealing with different types of graphs in section 4. Furthermore, this interpretation only applies to edges between distinct vertices; edges that connect vertices to themselves are discussed in section 3.4.
3.3 Vertex Degrees
Once the weight matrix of a graph is known, it is easy to calculate the total weight coming in and out of each vertex. The total incoming weight of a vertex vi can be found by summing over the ith column of W and is known as the in-degree of vi. Conversely, the total outgoing weight of vi is calculated by summing over the ith row of W and is known as the out-degree of vi. Since undirected graphs always have bidirectional edges and symmetric weight matrices, the in- and out-degrees of such graphs are always equal and are simply referred to as vertex degrees, denoted by di.⁴ For example, in the graph of Figure 10 the degrees are d1 = 3, d2 = 4, and d3 = 1. In the more general case of directed graphs, there is no guarantee that the in- and out-degrees of a vertex are equal. However, summing over all in- or out-degrees for any graph always produces the same number, which is sometimes referred to as the volume of G, vol(G). As an example, consider the directed graph in Figure 9e: each vertex has different in- and out-degrees, but summing over either of the degree types yields vol(G) = 6.3. Nonetheless, some directed graphs do have equal in- and out-degrees for each vertex, and such cases are known as balanced graphs (Banderier & Dobrow, 2000; Aldous & Fill, 2002). In keeping with the notation of undirected graphs, we denote the vertex degrees of a balanced graph as di. An example of a balanced graph along with its corresponding weight matrix is shown in Figure 11, and a quick check reveals that summing over the rows or columns of its weight matrix indeed yields the same values. Just as a balanced graph is a special case of a directed graph, we can similarly say that an undirected graph is a special case of a balanced graph, and this interpretation is important in section 4.
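In matrix terms, these quantities are just row and column sums; a minimal sketch with a hypothetical weight matrix:

```python
import numpy as np

# Hypothetical weight matrix of a directed, weighted graph: W[i, j] is the
# weight of the edge v_i -> v_j (zero means no edge).
W = np.array([[0.0, 2.0, 1.0],
              [0.5, 0.0, 1.5],
              [1.0, 0.0, 0.0]])

d_out = W.sum(axis=1)   # out-degrees: row sums
d_in = W.sum(axis=0)    # in-degrees: column sums
vol = d_out.sum()       # volume of the graph; equal to d_in.sum()

print(d_out, d_in, vol)
print(np.allclose(d_out, d_in))   # True only for balanced (e.g., undirected) graphs
```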
3.4 Self-Loops
In the examples considered so far, all edges connect pairs of vertices that are distinct, that is, vi ≠ vj. While this is sometimes enforced as a rule, some conventions also allow edges to connect vertices to themselves, which are known as self-loops. For undirected graphs, the standard convention is that a self-loop at vertex vi counts doubly to the vertex degree di, while other edges only count singly, that is, di = 2 × Wii + ∑j≠i Wij. This somewhat counterintuitive property is typically demonstrated using the degree sum formula (West, 2001). For undirected graphs, this states that each edge contributes twice its weight to the volume. Since self-loops only involve a single vertex, the only way that this rule can be respected is if they count twice as much to the vertex degrees as other edges. A property that we require when dealing with undirected graphs in section 4 is that the vertex degrees are calculated by the row sums of W. Clearly, this property is violated by the factor of 2 that applies to undirected self-loops. As a result, in this tutorial we assume that self-loops are always directed, regardless of whether they occur in undirected or directed graphs. This is an atypical definition, since undirected graphs typically are not allowed to have directed edges. However, as can be seen from the examples in Figure 12, this preserves the fact that undirected graphs have symmetric weight matrices, whereas directed graphs have nonsymmetric weight matrices, which is sufficient for the scope of this tutorial.
We close this section by noting some similarities between our definitions of Markov chains and graphs. First, the transition matrices of Markov chains, like the weight matrices of graphs, are nonnegative. Second, in a directed graph, any entry Wij ≠ 0 of the weight matrix describes an outgoing edge from vertex vi to vj, and analogously, any entry Pij ≠ 0 of a transition matrix describes an outgoing transition probability from si to sj. Putting these together, we see that in the most general sense, any Markov chain can be thought of as a directed graph, with P being the associated weight matrix. Indeed, this interpretation is precisely what justifies us in visualizing a Markov chain by its transition graph. In the next section, we present some useful results that emerge as a result of this way of thinking about a Markov chain. Finally, for a comprehensive text on graph theory that covers much of the material in this section, we recommend West (2001).
3.5 Eigenspaces of Transition Matrices
Nonnegative matrices have received widespread attention in mathematics, and in particular their eigenvalues and eigenvectors are the focus of spectral graph theory (Chung, 1997). In this section, we apply some results from this field to transition matrices, considering first irreducible chains and subsequently exploring the generalization to reducible chains.
3.5.1 Irreducible Chains
A fundamental result used in spectral graph theory is the Perron-Frobenius theorem, and while a full treatment of it is beyond the scope of this tutorial, we now summarize its key implications for transition matrices of irreducible Markov chains.
(Perron-Frobenius Theorem for Irreducible Markov Chains). If P is the transition matrix of an irreducible Markov chain, then:
λ = 1 is guaranteed to be an eigenvalue.
λ = 1 is a simple eigenvalue, meaning that it occurs only once.
Upon suitable normalization, the eigenvalue λ = 1 has a left eigenvector equal to the unique stationary distribution π and a right eigenvector equal to the all-ones vector 1.
All other eigenvalues have |λ| ≤ 1, where |·| is the complex modulus, meaning that the spectral radius of P is 1.⁵
To illustrate the above theorem, in Figures 13a to 13c, we show the transition graphs and eigenvalue plots of three irreducible Markov chains. The first observation to make is that, in agreement with theorem 6, λ = 1 is an eigenvalue in each case and occurs only once. Furthermore, as a quick exercise, we encourage readers to find the eigenvectors of λ = 1 for each example and normalize them to obtain the stationary distribution π and the all-ones vector 1. Finally, the eigenvalue plots show that in each example, all eigenvalues indeed lie either on the unit circle (|λ| = 1) or within it (|λ| < 1).
Using our terminology from section 2.3, eigenvalues within the unit circle represent transient structures, transient oscillations, and transient cycles. Of the irreducible chains in Figure 13, only panels a and b have eigenvalues of this type, and in both cases, they are complex conjugate pairs describing transient cycles. Looking at the transition probabilities in each example, it is clear that these transient cycles flow clockwise around the state space.
On the other hand, eigenvalues on the unit circle other than λ = 1 represent persistent oscillations and persistent cycles. Theorem 2 tells us that these are possible only when a chain is periodic. The following result sheds light on this by relating the eigenvalues with |λ| = 1 to the period of a chain (Gebali, 2008):
In simple terms, proposition 6 says that the eigenvalues of P with modulus 1 are always dth roots of unity. We can verify this by checking the periodic examples in Figures 13b and 13c. In both cases, the number of eigenvalues on the unit circle is indeed equal to the period of the chain, and they are also equally spaced. Furthermore, proposition 6 offers an alternative perspective on how the periodicity affects the persistent behavior of a Markov chain. For example, the chain in Figure 13a has only a single eigenvalue on the unit circle, corresponding to its unique stationary distribution. It is therefore guaranteed to end up in this distribution since all other eigenvalues have |λ| < 1. This is equivalent to the statement that this chain is ergodic, which a quick check of the transition graph confirms. Conversely, the chain in Figure 13b has an additional eigenvalue λ = eπi = −1 on the unit circle by virtue of the fact that it has period 2. Therefore, its persistent behavior can only be fully described using both the unique stationary distribution and the eigenvector associated with λ = −1. For example, we know from theorem 2 that such a chain can get trapped in a persistent oscillation. Any such oscillation can always be expressed as a linear combination of the stationary distribution and the eigenvector associated with λ = −1, meaning that the sequence indeed oscillates between two points in the space spanned by these eigenvectors. While this example only involves real eigenvalues and therefore only real eigenvectors, the interpretation extends to d > 2, for which proposition 6 tells us that there must be complex eigenvalues with |λ| = 1. For example, the chain in Figure 13c has period d = 3, and it has the following three eigenvalues on the unit circle: λ1 = 1, λ2 = e2πi/3, and λ3 = e−2πi/3. Analogous to the d = 2 case, any persistent cycle of this chain can be expressed using the three corresponding eigenvectors, which in the case of λ2 and λ3 must have complex entries. Rather interestingly, this means that for chains with period d > 2, persistent cycles are cycles in a complex space despite being sequences of real-valued distributions.
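A quick numerical check of this statement on a hypothetical deterministic 3-cycle (similar in spirit to, but not the same as, the chain in Figure 13c): its eigenvalues of modulus 1 are exactly the cube roots of unity.

```python
import numpy as np

# Deterministic cycle through three states: an irreducible chain with period 3.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

evals = np.linalg.eigvals(P)
print(np.round(evals, 3))                # 1, e^{2*pi*i/3}, e^{-2*pi*i/3}
print(np.allclose(np.abs(evals), 1.0))   # all eigenvalues lie on the unit circle
print(np.allclose(evals**3, 1.0))        # each one is a cube (3rd) root of unity
```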
3.5.2 Reducible Chains
Applying spectral graph theory to the transition matrices of reducible Markov chains produces a weaker set of results. For example, the generalization of theorem 6 to reducible chains is the following:
(Perron-Frobenius Theorem for Reducible Markov Chains). If P is the transition matrix of a reducible Markov chain, then:
λ = 1 is guaranteed to be an eigenvalue.
The number of linearly independent eigenvectors with λ = 1 is equal to the number r of recurrent communicating classes in the Markov chain.
There are many choices of left and right eigenvectors for λ = 1. However, a convenient choice that mirrors the irreducible case is to take, for each recurrent communicating class, a pair of left and right eigenvectors given by the unique stationary distribution associated with that class and an indicator vector with entry 1 for states in this class and zeros elsewhere.
All other eigenvalues have |λ| ≤ 1, where |·| is the complex modulus, meaning that the spectral radius of P is 1.
To understand theorem 7 in more depth, consider the example in Figure 13d. This Markov chain has two recurrent communicating classes, which means that λ = 1 has a multiplicity of 2. We indicate this on the eigenvalue plot by a larger circle. Furthermore, we color this circle half red and half blue to reflect the fact that we can choose the eigenvectors for λ = 1 based on the two recurrent communicating classes: the stationary distribution and indicator vector of the red class as one pair of left and right eigenvectors, and those of the blue class as a second pair. The two stationary distributions then together span the left eigenspace of λ = 1 (which includes all possible stationary distributions), whereas the two indicator vectors span the right eigenspace of λ = 1. It is worth emphasizing that while there is an infinite number of other ways to choose the basis vectors for λ = 1, this is a convenient choice since it is the only one for which all basis vectors have strictly nonnegative entries. For example, if we take the stationary distribution and indicator vector of one class as a first pair of eigenvectors and then choose a second pair that satisfies biorthogonality (see equation 2.46) within this space, the second pair is guaranteed to contain negative entries. Thus, the choice of eigenvectors stated in theorem 7 is in some sense special since it is the only one that preserves our intuition that left eigenvectors with λ = 1 correspond to distributions over the state space 𝒮. For this reason, we henceforth assume this convention when referring to eigenvectors with λ = 1.
Unfortunately, there is no general analogue of equation 3.3 for reducible chains. One reason for this is that, as in the λ = 1 case, there are many choices of eigenvectors for λ ≠ 1. However, for a recurrent reducible chain, the transition matrix can be written in block diagonal form (see the proof of proposition 4), which means we can mirror the λ = 1 case by choosing eigenvectors with λ ≠ 1 to have nonzero entries only in a single recurrent class. If the kth recurrent class has n states, then the corresponding block is an n × n matrix, meaning that there are n pairs of left and right eigenvectors with nonzero entries for states in this class: one pair with λ = 1 (the stationary distribution of the class and its indicator vector) and another n − 1 pairs with λ ≠ 1. We can therefore apply an equivalent argument to equation 3.3 for the kth class, but using the indicator vector of this class instead. Thus, for each recurrent class, the left eigenvectors with λ ≠ 1 can also be chosen such that they all sum to zero. Conversely, nonrecurrent chains cannot be written in block diagonal form, which means that this argument cannot be applied. Therefore, some eigenvectors will not sum to zero for such chains, although to our knowledge, this case has not received significant attention in the literature so far.
Looking at the eigenvalue plot of Figure 13d, we see that there are two eigenvalues with |λ| < 1, both of which are real and negative. Using the terminology from section 2.3, they therefore correspond to transient oscillations of the chain. Furthermore, since the chain is recurrent, we can apply the procedure described above and choose one eigenvector to have nonzero entries only in the red class and the other eigenvector to have nonzero entries only in the blue class. With this choice, we see that each transient oscillation takes place on a distinct communicating class, which we indicate on the plot by coloring the eigenvalues red and blue.
In the case of proposition 6, the extension to reducible chains is straightforward since one can simply apply this result individually to each recurrent communicating class of a reducible chain. Therefore, for each class of period d, there are d eigenvalues of modulus 1 that satisfy the same properties as in the irreducible case.
Finally, a couple of similarities between theorems 6 and 7 can be pointed out. First, in both theorems, λ = 1 is guaranteed to be an eigenvalue. Since every eigenvalue has at least one eigenvector, this means that we can always find a left eigenvector with λ = 1. Provided that we choose an eigenvector with nonnegative entries and normalize it to sum to one, it is a stationary distribution of the chain. This therefore justifies our claim from section 2.2 that every finite Markov chain has at least one stationary distribution. Second, in both theorems, the eigenvalues cannot have absolute value greater than one, which is one of the assumptions we made when studying the evolution of a chain in terms of its eigenvectors and eigenvalues in section 2.3, and which justified our partitioning of equation 2.54 into persistent and transient terms.
The results of this section emerge by treating Markov chains as graphs. However, in most graphs, the weights of the outgoing edges from each vertex do not sum to one, meaning that they cannot be interpreted as transition probabilities. Because of this, Markov chains can more precisely be interpreted as a type of normalized graph. This idea is formalized in the next section, where we introduce a well-known method for transforming any graph G into a Markov chain.
4 Random Walks
4.1 Definition
Qualitatively, we can say that the transformation in equation 4.2 is useful when we have a starting graph G that we would like to describe in probabilistic and/or temporal terms. Conversely, if we have a starting Markov chain and its transition matrix, knowing a weight matrix for which equation 4.2 holds can offer insight into the type of relationships between states that give rise to the chain. However, the latter of these two perspectives is partially complicated by the fact that the mapping from transition matrices back to weight matrices is one-to-many: there are in fact an infinite number of different graphs G that produce the same Markov chain. As an example, in Figure 14, we show two distinct weight matrices that get transformed to the same transition matrix by equation 4.2. How can we describe the infinite set of graphs corresponding to a single Markov chain? In principle, it involves undoing the row normalization of equation 4.2. Thus, for a given Markov chain and transition matrix, we consider all possible scalings of the rows of the transition matrix by positive constants. Every such scaling can be described by a diagonal matrix with positive diagonal entries that multiplies the transition matrix from the left to produce a single corresponding weight matrix. As an illustration, in Figures 14d and 14e, we show the two scaling matrices that undo the row normalization of the transition matrix in Figure 14c and transform it back into the two original weight matrices, respectively. The following definition generalizes this to the set of all such weight matrices that can be realized in this way:
the set of all weight matrices obtained by multiplying the transition matrix from the left by a diagonal matrix with strictly positive diagonal entries, which we call the random walk set of the Markov chain.
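As a minimal illustration of definition 18, assuming equation 4.2 is the usual row normalization of a weight matrix by its out-degrees, the following sketch (with a made-up weight matrix) builds a transition matrix and then undoes the normalization with a positive diagonal scaling:

```python
import numpy as np

# Hypothetical weight matrix of a small directed graph (rows: outgoing edge weights).
W = np.array([[0.0, 2.0, 1.0],
              [4.0, 0.0, 4.0],
              [3.0, 3.0, 0.0]])

d_out = W.sum(axis=1)                  # out-degrees
P = W / d_out[:, None]                 # row normalization (the transformation of equation 4.2)
print(np.round(P, 3), P.sum(axis=1))   # rows now sum to 1

# Undoing the normalization: any positive diagonal scaling D gives a weight matrix
# D @ P in the same random walk set; choosing D = diag(d_out) recovers W exactly.
D = np.diag(d_out)
print(np.allclose(D @ P, W))           # True
```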
A few details are worth noting about definition 18. First, since the trivial scaling by the identity matrix is allowed, the transition matrix itself belongs to the random walk set of any Markov chain. Second, the fact that the scaling matrices have strictly positive diagonal entries means that each of them is invertible, with the inverse also being diagonal and having entries equal to the reciprocals of the original diagonal entries. Consequently, if two weight matrices belong to the random walk set of a given Markov chain, each can always be written as a diagonal scaling of the other, meaning that they are also related simply by a row scaling with positive constants. Therefore, the row scaling defined in equation 4.3 effectively partitions the set of all weight matrices into equivalence classes. Finally, since definition 18 allows any positive scaling of the rows of the transition matrix, the random walk set of any Markov chain predominantly consists of graphs that are neither undirected nor balanced. In fact, only certain types of Markov chains have random walk sets that contain undirected or balanced graphs, as explained by the following two results:
A Markov chain is recurrent if and only if its random walk set contains balanced graphs (for the proof, see the appendix).
A recurrent Markov chain is reversible if and only if the balanced graphs in its random walk set are undirected (for the proof, see the appendix).
As an illustration, in Figures 15a–15c, we show the random walk sets for the three Markov chains that were studied in Figures 6a, 6d, and 6j of section 2.6. In each case, the Markov chain is located in the center and is colored in black. Other graphs in the random walk sets are colored based on the graph type (undirected = green, balanced directed = blue, unbalanced = red) and are depicted as miniature graphs without edge weights (except for one representative example of each type). The Markov chains of Figures 15a and 15b are both recurrent, and we indeed see that their random walk sets contain balanced graphs (see theorem 8). In both figures, notice that more unbalanced graphs are drawn to reflect the fact that they are more numerous than the balanced cases. Furthermore, the chain in Figure 15a is reversible, whereas the chain in Figure 15b is nonreversible, meaning that in the former case all balanced graphs in the random walk set are undirected, and in the latter case they are directed (see theorem 9). The Markov chain in Figure 15c is nonrecurrent, and in agreement with theorem 8, its random walk set contains only unbalanced graphs. Finally, note that we do not include a corresponding diagram for the semireversible example in Figure 6g. However, since such chains can be made reversible by removing nonrecurrent states, a simple extension of theorem 9 ensures that for such chains, there exist graphs in the random walk set for which the edges between recurrent states are undirected.
To summarize these observations, in Figures 15d and 15e we show two Venn diagrams that illustrate the relationships between the different types of Markov chains and graphs considered. In Figure 15d, graphs are shown as an outer circle, with balanced graphs as a particular case, and undirected graphs as a special type of balanced graph, graphs ⊃ balanced graphs ⊃ undirected graphs (colored red, blue and green, respectively). In Figure 15e, Markov chains are organized in a similar way, Markov chains ⊃ recurrent chains ⊃ reversible chains. Moreover, the colors in the Markov chain diagram are based on the types of graphs allowed in for each type of chain. For example, reversible chains are shaded in red and green since they correspond to random walks on either undirected or unbalanced graphs.
The balanced graphs belonging to a recurrent Markov chain's random walk set are in some sense special, since the vertex degrees have a simple relationship to the stationary probabilities: dividing each vertex degree by the sum of all degrees yields one of the chain's stationary distributions.
For example, evaluating this expression for v1 in the balanced graphs in Figures 15a and 15b yields the stationary probability of state s1 for each of the corresponding Markov chains (see Figures 6b and 6e). This highlights something useful about balanced graphs, which is that the weight matrix allows direct calculation of one of the stationary distributions without having to simulate the random walk. Conversely, for an unbalanced graph, there is no universally valid expression relating the stationary probabilities of the random walk to the vertex degrees.
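The degree-based expression can be verified numerically. In the sketch below (a made-up undirected, and hence balanced, graph), the stationary distribution obtained from the vertex degrees agrees with the one computed from the leading left eigenvector of the random walk's transition matrix:

```python
import numpy as np

# Hypothetical undirected (hence balanced) graph on three vertices.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])

d = W.sum(axis=1)                      # vertex degrees
pi_from_degrees = d / d.sum()          # stationary distribution read directly off the graph

P = W / d[:, None]                     # random walk on the graph
w, v = np.linalg.eig(P.T)              # left eigenvectors of P
pi_from_eig = np.real(v[:, np.argmax(w.real)])
pi_from_eig /= pi_from_eig.sum()

print(np.round(pi_from_degrees, 3))
print(np.allclose(pi_from_degrees, pi_from_eig))  # True
```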
Perhaps the most important conclusion to draw from this section is that one can always describe a reversible Markov chain as a random walk on some undirected graph. Since undirected graphs have symmetric weight matrices, and since matrices of this type have been studied extensively in mathematics, this interpretation provides a number of tools for describing reversible chains in more detail. This is the focus of the next section. Directed graphs are as yet far less well understood, meaning that the same level of description is not possible for nonreversible chains. However, in section 4.3, we explore some cases where concepts can be extended to the directed/nonreversible case. Since balanced directed graphs are less common objects in graph theory, we do not dedicate a section to them and instead consider them briefly as a special case in section 4.3.
4.2 Random Walks on Undirected Graphs
4.2.1 Relationship to Symmetric Matrices
In this section, we explore in more depth the connections between real symmetric matrices and the transition matrices of reversible chains. We start by providing the following two results for real symmetric matrices (Meyer, 2000):
A couple of details can be pointed out about theorem 11. First, by comparing equation 4.6 to our analysis of section 2.3, we see that the columns of the orthogonal matrix in the decomposition form a basis of right eigenvectors and the rows of its transpose form the corresponding dual basis of left eigenvectors. Second, both of these bases are orthonormal since the matrix is orthogonal. Third, the diagonal matrix in the decomposition contains the eigenvalues, which are guaranteed to be real since all other matrices in equations 4.6 and 4.7 are real. Finally, it is worth emphasizing the existential nature of theorem 11, since not all choices of eigenvectors of a symmetric matrix obey this result. On the one hand, equation 4.6 requires that the sets of left and right eigenvectors are chosen together to form a biorthogonal system. On the other hand, even if we assume this property, when a symmetric matrix has repeated eigenvalues, the corresponding eigenvectors can be nonorthogonal. However, even in this case, it is always possible to apply the Gram-Schmidt procedure to make them orthonormal, and provided that we find such a basis, we can relate the left and right eigenvectors in the following way:
Let a real symmetric matrix be given. Then, if we choose an orthonormal basis of right eigenvectors together with its corresponding dual basis of left eigenvectors, for each eigenvalue λω the left and right eigenvectors are equal.
Clearly, for any Markov chain, reversible or otherwise, the transition matrix is itself rarely symmetric. Therefore, the results above do not directly apply to it. However, reversible Markov chains can be related in a number of ways to symmetric matrices, and by virtue of these relations, they satisfy variants of the results above (Brémaud, 1999). For example, from section 2.6, we know that any flow matrix of a reversible Markov chain is symmetric. This fact allows us to establish the following analogue of theorem 10:
Furthermore, theorem 9 tells us that there exist certain row scalings that transform the transition matrix of a reversible Markov chain into a symmetric matrix. In fact, this is not the only scaling operation that produces a matrix of this type. The next result shows that a symmetric matrix is also formed when we scale both the rows and columns of the transition matrix as follows:
The first thing to note about theorem 13 is that the output of the scaling operation is somewhat different from that in theorem 9: here there is only a single symmetric matrix, whereas in theorem 9 there are an infinite number. Additionally, equation 4.10 implies that the transition matrix and this symmetrized matrix are similar matrices (section 2.3). Since similar matrices have the same eigenvalues and related sets of eigenvectors, we can use this to establish the following generalizations of theorem 11 and proposition 8:
Consider a Markov chain with a given transition matrix. Then the chain is reversible if and only if its transition matrix is diagonalizable with real eigenvalues and there exists a basis of right eigenvectors that are orthogonal with regard to an inner product weighted by the stationary probabilities, together with a corresponding dual basis of left eigenvectors that are orthogonal with regard to an inner product weighted by the reciprocals of these probabilities, where the weights are given by a stationary distribution of the chain (for the proof, see the appendix).
Let a reversible Markov chain with transition matrix and one of its stationary distributions be given. Furthermore, let a set of right eigenvectors of the transition matrix and its dual basis of left eigenvectors be chosen such that both obey the orthogonality relations of theorem 14. Then, if a right and a left eigenvector belong to the same eigenvalue λω, they are related by entrywise multiplication with the stationary distribution; that is, the left eigenvector is obtained from the right eigenvector by multiplying each entry by the corresponding stationary probability (for the proof, see the appendix).
Theorem 14 is particularly important for practical reasons. First, the diagonalizability of the transition matrix implies a full set of linearly independent eigenvectors. Linearly independent feature spaces are often desirable from a computational perspective since they (1) reduce the overall redundancy, (2) can express any function on the state space, and (3) ensure that certain matrix operations are well defined. Furthermore, as already explained in section 2, having a diagonalizable transition matrix means that evolving a Markov chain becomes computationally cheaper. Second, since the transition matrix is a real matrix with real eigenvalues, we can always choose an eigenbasis consisting only of real-valued vectors. This property is useful because real spaces are often more intuitive to deal with than complex spaces. Furthermore, in many applications of Markov chains, the underlying vector space is required to be real, either due to the semantic nature of the problem or because the algorithm being used is not suited to complex spaces.
While a linearly independent set of eigenvectors is useful, having a basis that is pairwise orthogonal with regard to the standard Euclidean inner product offers further analytical and numerical benefits. From theorem 14, it is clear that this is not the case for the eigenvectors of transition matrices of reversible Markov chains. However, the matrix of theorem 13 is a normalized symmetric version of the transition matrix, and from the proof of theorem 14, we know that the eigenvalues of these two matrices are the same and that their eigenvectors are related simply by a multiplication with a diagonal matrix formed from the square roots of the stationary probabilities. Thus, the two matrices contain similar information regarding the relationships between pairs of states, and the symmetric version can be used as a surrogate for the transition matrix in situations where orthogonal eigenvectors are required.
This expression appears in the next section, where we use it to define a positive semidefinite matrix that has the same eigenvectors.
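Since the displayed form of the scaling in theorem 13 is not reproduced here, the sketch below assumes it is the standard symmetrization diag(π)^{1/2} P diag(π)^{−1/2} of a reversible transition matrix P with stationary distribution π. It checks that the result is symmetric and shares the eigenvalues of the transition matrix:

```python
import numpy as np

# Reversible chain: random walk on a small made-up undirected graph.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])
d = W.sum(axis=1)
P = W / d[:, None]
pi = d / d.sum()                       # stationary distribution of the random walk

# Assumed form of the scaling in theorem 13: S = diag(pi)^(1/2) P diag(pi)^(-1/2).
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))

print(np.allclose(S, S.T))             # S is symmetric
print(np.allclose(np.sort(np.linalg.eigvals(P).real),
                  np.sort(np.linalg.eigvalsh(S))))   # same (real) eigenvalues as P
```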
Collectively, the results of this section demonstrate that transition matrices of reversible chains satisfy similar properties to symmetric matrices, but subject to a different type of normalization. This alternative normalization is important for the next section, and in order to make our analyses there more concise, we end this section by defining the following two coordinate transformations (Liu, 2011):
4.2.2 Normalized Graph Laplacian
A symmetric matrix is positive semidefinite if its associated quadratic form is nonnegative for every vector or, equivalently, if all its eigenvalues are nonnegative. Such matrices have a number of numerical properties that make them useful for solving optimization problems, and because of this, they often appear in computational applications. The symmetrized matrix of theorem 13 is not positive semidefinite, since it has the same eigenvalues as the transition matrix (which can be negative). However, it is straightforward to define a variant that does have this property while also having the same eigenvectors:
The normalized graph Laplacian is symmetric and positive semidefinite, with eigenvalues in the interval [0, 2].
λ = 0 is guaranteed to be an eigenvalue, and its multiplicity is equal to the number of connected components in G.
If C is a connected component of G, then there is an eigenvector with eigenvalue λ = 0 whose entries are proportional to the square roots of the vertex degrees for vi ∈ C and are 0 otherwise or, equivalently, the vector obtained by applying the square root of the degree matrix to an indicator vector for the vertices in C.
Therefore, for fully connected graphs, the vector whose entries are the square roots of the vertex degrees (that is, the square root of the degree matrix applied to a vector of ones) is the unique eigenvector with λ = 0.
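These properties can be checked numerically. The sketch below assumes the standard definition of the normalized graph Laplacian as the identity minus the degree-normalized weight matrix and uses a made-up graph with two connected components:

```python
import numpy as np

# Undirected graph with two connected components: {v1, v2, v3} and {v4, v5}.
W = np.array([[0.0, 1.0, 2.0, 0.0, 0.0],
              [1.0, 0.0, 3.0, 0.0, 0.0],
              [2.0, 3.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0]])

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(5) - D_inv_sqrt @ W @ D_inv_sqrt   # assumed definition of the normalized Laplacian

lam, U = np.linalg.eigh(L)                    # symmetric => real eigenvalues, orthonormal eigenvectors
print(np.round(lam, 3))                       # all in [0, 2]; lambda = 0 appears twice (two components)

# Square-root-degree indicator of the first component is an eigenvector with lambda = 0.
print(np.allclose(L @ (np.sqrt(d) * np.array([1, 1, 1, 0, 0])), 0))   # True
```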
There exist other types of graph Laplacians for undirected graphs, such as the unnormalized Laplacian and the random walk Laplacian. Many useful properties of a graph G can be obtained from graph Laplacians, and they form the basis of spectral graph theory (Chung, 1997). Due to its close relationship to the symmetrized transition matrix, in this tutorial we predominantly focus on the normalized Laplacian. The unnormalized Laplacian has a weaker connection to the transition matrix, which we discuss only briefly. Furthermore, to avoid redundancy, we do not discuss the random walk Laplacian at all, since up to a sign it is simply the transition matrix shifted by the identity. For a concise review of each of these objects, see Wiskott and Schönfeld (2019).
As the name suggests, graph Laplacians are analogues of the Laplace operator. In particular, a number of studies have found links between graph Laplacians and a variant of the Laplace operator on manifolds, known as the Laplace-Beltrami operator (Belkin & Niyogi, 2008; Hein et al., 2007). Broadly speaking, Laplace operators measure how much the value of a function at a point varies from its local average, which is informally related to the notion of mean curvature (Reilly, 1982). Furthermore, the eigenvalues of this operator are nonnegative real numbers and are related to how much the corresponding eigenfunctions vary over the domain of the function (Grebenkov & Nguyen, 2013). All of these properties have analogues in the case of the normalized graph Laplacian, which justifies the name given to this object. To see this, we start with the following result (von Luxburg, 2007):
which is guaranteed to produce a real nonnegative number since the normalized Laplacian is positive semidefinite (for the proof, see the appendix).
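For reference, a standard identity of this form for the normalized Laplacian, written in assumed notation (weight matrix W, vertex degrees d_i, and the Laplacian denoted here as L_sym), is

```latex
\mathbf{x}^{\top} \mathbf{L}_{\mathrm{sym}} \, \mathbf{x}
  \;=\; \frac{1}{2} \sum_{i,j} W_{ij}
        \left( \frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}} \right)^{2} ,
```

which is the quadratic form discussed in the next paragraph.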
In equation 4.20, the quadratic term measures the differences between the entries of a vector that is related to the original one by normalization with the square roots of the vertex degrees. Since this quantity is minimized when the entries of the normalized vector are similar for pairs of vertices with large edge weight Wij, it can be interpreted as describing the smoothness of the normalized vector with respect to the connectivity of G. Furthermore, since this measure describes the smoothness of the normalized vector as opposed to that of the original one, it is useful for dealing with graphs that have a very heterogeneous degree structure.
These observations can be related to the random walk on G. Since the normalized Laplacian shares an eigenbasis with the symmetrized matrix, and since the eigenbasis of the symmetrized matrix is related to that of the transition matrix by theorem 13, we can relate the eigenvectors of the Laplacian to those of the transition matrix as follows:
If a vector is an eigenvector of the normalized Laplacian with eigenvalue λω, then applying the left and right coordinate transformations of definition 19 to it yields left and right eigenvectors of the transition matrix, respectively, with eigenvalue 1 − λω.
Because of this, the eigenvectors of the transition matrix have a similar form to those of the Laplacian, but subject to the coordinate transformations of definition 19. Therefore, they can also be interpreted using the notion of smoothness, with λ = 1 in some sense being the smoothest case.
To illustrate the relationship between the eigenvectors of the Laplacian and those of the transition matrix, we consider a simple Markov chain consisting of a linear arrangement of 100 states, where only nearest-neighbor transitions are allowed (see Figure 16a). Transitions to the left and right occur with probability 0.48 and 0.52, respectively, and the self-loops at s1 and s100 mean that staying fixed is allowed in these states. Take a moment to verify that this chain is both ergodic and reversible (hint: in the latter case, try theorem 5). Once we know the stationary probabilities πi of this chain, we can calculate the corresponding symmetrized matrix and Laplacian by using equation 4.10 and then equation 4.16. In Figure 16b, we plot the left and right eigenvectors of the transition matrix with λ = 1, which are the stationary distribution and the constant vector of ones, respectively. The stationary probabilities are larger for states near the right end of the chain by virtue of the tendency for rightward transitions in this chain. Additionally, we plot the corresponding eigenvector of the Laplacian with λ = 0, whose entries are the square roots of the stationary probabilities. This vector is in some sense an intermediary between the other two since, in agreement with proposition 10, applying the two coordinate transformations of definition 19 to it recovers the stationary distribution and the vector of ones (as indicated by the black arrows in Figure 16b).
In Figure 16c, we show four more eigenvectors of the Laplacian, namely those with eigenvalues closest to 0. In agreement with our discussion, the eigenvectors get less smooth as λ increases, becoming more oscillatory and resembling trigonometric functions over the state space. In Figures 16d and 16e, we do the same for the left and right eigenvectors of the transition matrix, respectively. The smoothness of these eigenvectors also depends on the size of λ; however, this time we are interested in those with eigenvalues closest to 1, and the smoothness decreases as λ decreases. Furthermore, in comparison to Figure 16c, these eigenvectors have an additional weighting effect across the state space. The left eigenvectors have larger amplitudes for states on the right-hand side, which is intuitive since these states have higher stationary probabilities πi and the left eigenvectors can be obtained from those of the Laplacian by the left coordinate transformation of definition 19. For the right eigenvectors, the weighting is the opposite, with states on the left-hand side having larger amplitudes. This is explained by an equivalent argument, except that this time we apply the right coordinate transformation to the eigenvectors of the Laplacian. It should be noted that for the purpose of visualization, all eigenvectors in Figures 16c to 16e are normalized to have Euclidean norm 1, even though for the eigenvectors of the transition matrix this is not the natural normalization (see theorem 14).
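The chain of Figure 16a and the eigenvector relations just described can be reproduced with a short script. The sketch below assumes that the blocked transitions at the two boundary states become self-loops, and that equations 4.10 and 4.16 take the standard forms diag(π)^{1/2} P diag(π)^{−1/2} and the identity minus that matrix, respectively:

```python
import numpy as np

N = 100
p_left, p_right = 0.48, 0.52

# Nearest-neighbor chain on a line; assumed boundary behavior: the blocked transition
# probability becomes a self-loop at s1 and s100.
P = np.zeros((N, N))
for i in range(N):
    if i > 0:     P[i, i - 1] = p_left
    else:         P[i, i]    += p_left
    if i < N - 1: P[i, i + 1] = p_right
    else:         P[i, i]    += p_right

# Stationary distribution from detailed balance: pi_{i+1} / pi_i = p_right / p_left.
pi = (p_right / p_left) ** np.arange(N)
pi /= pi.sum()
print(np.allclose(pi @ P, pi))                          # True: pi is stationary

# Symmetrized matrix and normalized Laplacian (assumed forms of equations 4.10 and 4.16).
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
L = np.eye(N) - S
lam, U = np.linalg.eigh(L)                              # eigenvalues in ascending order

# The eigenvector of L with lambda = 0 is sqrt(pi); the two coordinate transformations
# recover the lambda = 1 left/right eigenvectors of P (pi and the vector of ones).
u0 = U[:, 0] * np.sign(U[0, 0])
print(np.allclose(u0, np.sqrt(pi)))                     # True
print(np.allclose(u0 * np.sqrt(pi), pi))                # left eigenvector of P
print(np.allclose(u0 / np.sqrt(pi), np.ones(N)))        # right eigenvector of P
```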
Finally, we note that the ordering of eigenvalues used in this section is somewhat different from that in section 2.3. In this section, the eigenvalues of the Laplacian are ordered from 0 up to 2, which corresponds to the eigenvalues of the transition matrix being ordered from 1 down to −1. To reflect our interpretation, we call this ordering by smoothness. In section 2, however, the eigenvalues of a transition matrix are ordered by their absolute value, which describes how long the contribution from the corresponding eigenvector persists as the chain evolves. We therefore call this choice ordering by persistence. The suitability of either type of ordering depends on the specific problem domain in which a Markov chain is being used. For example, in the case of spectral clustering, they correspond to distinct objectives, as demonstrated in Liu (2011). Furthermore, these two types of ordering appear in Creutzig and Sprekeler (2008), in which the authors compare slowness and predictability as objectives for dimensionality reduction of time series data.
This concludes our treatment of random walks on undirected graphs. As a summary of the results given, in Table 2 we compare the mathematical properties and relationships of the transition matrix, its symmetrized version, and the normalized graph Laplacian. The material presented in this section is particularly important for applications in machine learning and data mining in which the data set can be formulated as a graph. In particular, it underlies work that has been done on problems such as spectral clustering (Meilă & Shi, 2000, 2001; Tishby & Slonim, 2001; Saerens et al., 2004; Liu, 2011; Weinan et al., 2008), manifold learning/graph embedding (Coifman et al., 2005a; Coifman & Lafon, 2006), graph-based classification (Kamvar et al., 2003; Szummer & Jaakkola, 2001; Joachims, 2003), and value function approximation in reinforcement learning (Mahadevan, 2005; Mahadevan & Maggioni, 2007; Petrik, 2007; Stachenfeld et al., 2014, 2017). In the next section, we consider how, if at all, the material presented in this section generalizes to directed graphs.
| | Transition matrix | Symmetrized matrix | Normalized Laplacian |
|---|---|---|---|
| Diagonalizable | ✔ | ✔ | ✔ |
| Symmetric | ✗ | ✔ | ✔ |
| Positive semidefinite | ✗ | ✗ | ✔ |
| Eigenvalues | λω ∈ [−1, 1] | λω ∈ [−1, 1] | λω ∈ [0, 2] |
| Eigenvectors | lin. indep. | orthogonal | orthogonal |
4.3 Random Walks on Directed Graphs
Broadening our consideration to directed graphs is necessary if we want to describe nonreversible Markov chains. However, since many of the guarantees established in section 4.2 do not hold for directed graphs, this case is a lot harder to treat analytically. In section 4.3.1, we explore the main challenges that occur when applying spectral graph theory to the transition matrices of nonreversible Markov chains, and in section 4.3.2 we describe methods for circumventing these issues. Finally, in section 4.3.3 we define a generalization of the normalized graph Laplacian to directed graphs, and in section 4.3.4 we present a method for enforcing ergodicity on random walks on directed graphs.
4.3.1 Key Difficulties
Many of the guarantees established in section 4.2 do not hold for directed graphs, which makes the corresponding Markov chains a lot harder to treat analytically. Perhaps most important, the transition matrices of nonreversible chains are neither guaranteed to be diagonalizable nor to have real eigenvalues (Weber, 2017), which can be observed even for simple cases such as those shown in Figures 18a–18c. There are nonetheless some cases for which both properties hold (Weber, 2017), as shown by the example in Figure 18d, but it is still not fully understood to what degree, if at all, the transition structure of a nonreversible chain determines either the diagonalizability of its transition matrix or whether its eigenvalues are real or complex. When the transition matrix is nondiagonalizable, it does not have a set of N linearly independent eigenvectors, which can cause numerical issues since some matrix operations then become computationally more expensive or are not well defined. Moreover, if the eigenvalues of the transition matrix are complex, then so are its eigenvectors. As in the real case, we can choose to order these eigenvectors based on persistence or smoothness. In the former case, the generalization is somewhat straightforward since |λ| still describes how long each eigenvector typically persists. In the latter case, however, the question of how to generalize the concept of smoothness to complex eigenvectors is nontrivial and is still an actively researched topic in the literature (Sevi et al., 2023; Marques et al., 2020). These factors make analyzing the transition matrices of nonreversible Markov chains more challenging than in the reversible case.
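As a small illustration of complex eigenvalues arising from nonreversibility, the following made-up 3-state chain (not one of the examples in Figure 18) has a strong tendency to cycle, and its transition matrix has a complex-conjugate pair of eigenvalues:

```python
import numpy as np

# Nonreversible 3-state chain with a strong tendency to cycle s1 -> s2 -> s3 -> s1.
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])

eigvals = np.linalg.eigvals(P)
print(np.round(eigvals, 3))   # one eigenvalue equal to 1, plus a complex-conjugate pair
```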
4.3.2 Alternative Methods
One general technique for treating a nondiagonalizable matrix is to add a perturbation so that it becomes diagonalizable. This is based on the fact that diagonalizable matrices are dense in the set of all matrices (Golub & Van Loan, 2013), meaning that it is always possible to find some arbitrarily close matrix that is diagonalizable. Pauwelyn and Guerry (2021) develop a method along these lines for dealing with nondiagonalizable transition matrices. In particular, for a starting transition matrix, a perturbation matrix is found such that the perturbed transition matrix preserves a number of the spectral properties of the original and is diagonalizable. However, two limitations of this method are that its computational cost grows quickly with the size N of the N × N matrix and that the resulting transition matrix can still have complex eigenvalues.
Other lines of work have attempted to circumvent these issues by using alternative matrix decompositions, with a prominent example being the real Schur decomposition (Stewart, 1994; Conrad et al., 2016; Weber, 2017; Fackeldey et al., 2018; Ghosh & Bellemare, 2020). This decomposition provides a set of real orthogonal basis vectors, known as Schur vectors, that span the eigenspaces of the transition matrix (Golub & Van Loan, 2013). This basis is not unique and corresponds to some ordering of the eigenvalues, with the first k Schur vectors spanning the eigenspaces of the first k eigenvalues in this ordering. Therefore, given some Schur decomposition of the transition matrix, any linear combination of the first k Schur vectors is mapped by the transition matrix to another vector in their span. For this reason, the subspace Uk that they span is said to be an invariant subspace of the transition matrix (Golub & Van Loan, 2013), and when k ≪ N it provides a low-dimensional description of the transformation that the matrix represents. The real Schur decomposition is therefore a useful alternative to the eigendecomposition. However, so that the basis captures the most important information about the transition matrix, a reordering algorithm is needed to specify which eigenspaces it should span, and various methods for this have been developed (Ng & Parlett, 1987; Dongarra et al., 1992; Bai & Demmel, 1993; Granat et al., 2009; Brandts, 2002). Furthermore, it is worth emphasizing that in contrast to the eigendecomposition, the real Schur decomposition is guaranteed to exist for any real square matrix, meaning that it sidesteps the issues of nondiagonalizability and complex feature spaces that can occur with transition matrices of nonreversible Markov chains. In the field of machine learning, the real Schur decomposition of transition matrices has been used as a tool for clustering (Fackeldey et al., 2018) as well as for building state representations in reinforcement learning (Ghosh & Bellemare, 2020).
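A minimal sketch of the real Schur decomposition applied to a transition matrix is shown below, using SciPy and the same made-up cyclic chain as above. It checks the orthogonality of the Schur vectors, the quasi-upper-triangular form, and the invariant-subspace property of a leading set of Schur vectors:

```python
import numpy as np
from scipy.linalg import schur

# Nonreversible transition matrix with complex eigenvalues.
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])

# Real Schur decomposition: P = Z T Z^T, with Z real orthogonal and T quasi-upper-triangular.
T, Z = schur(P, output='real')
print(np.allclose(P, Z @ T @ Z.T))        # True
print(np.allclose(Z.T @ Z, np.eye(3)))    # Schur vectors are real and orthonormal
print(np.allclose(np.tril(T, k=-2), 0))   # at most 2x2 blocks on the diagonal of T

# The leading Schur vectors span an invariant subspace of P, provided the cut
# does not split a 2x2 block of T.
k = 1 if np.isclose(T[1, 0], 0.0) else 2
print(np.allclose(P @ Z[:, :k], Z[:, :k] @ T[:k, :k]))   # True
```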
4.3.3 Directed Normalized Graph Laplacian
In section 4.2.2, the normalized graph Laplacian was introduced as a way to get a more precise description of the left and right eigenvectors belonging to transition matrices of reversible Markov chains. Generalizing the normalized Laplacian to directed graphs is challenging since two of its defining features are that it is symmetric and positive semidefinite, neither of which can be satisfied by equation 4.18 if the weight matrix is nonsymmetric. However, various definitions of Laplacians for directed graphs exist, and while some loosen the constraint that the Laplacian should be positive semidefinite (Agaev & Chebotarev, 2005; Caughman & Veerman, 2006; Li & Zhang, 2012; Singh et al., 2016), others strictly enforce it via a type of symmetrization (Chung, 2005). We here focus on the latter type and demonstrate connections that this has to some of the material in section 2.
Perhaps the simplest method along these lines is to symmetrize the weight matrix of a directed graph G, for example by averaging it with its transpose, and then to apply the regular definition of the normalized Laplacian to this new matrix. Since the symmetrized matrix describes an undirected graph, the resulting object can be interpreted in the same way as in section 4.2.2. However, a major drawback of this approach is that the graphs described by the original and symmetrized weight matrices can have very different structural properties. For instance, there is no guarantee that the random walks on these two graphs have stationary distributions that bear any resemblance to one another. Indeed, various studies in machine learning have indicated that symmetrizing the weight matrix leads to a significant erasure of structural information from a directed graph (Pentney & Meilă, 2005; Mahadevan et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007).
The directed normalized Laplacian has been used in various machine learning contexts as a way to generalize methods that are otherwise restricted to undirected graphs. It has been applied to problems such as spectral clustering (Meilă & Pentney, 2007; Huang et al., 2006; Liu, 2011), graph embedding (Chen et al., 2007; Perrault-Joncas & Meilă, 2011), graph-based classification (Zhou et al., 2005), and value function approximation in reinforcement learning (Johns & Mahadevan, 2007). In most of these applications, ergodicity is enforced on the underlying random walk, and in the next section, we introduce the standard method for doing this.
4.3.4 Random Surfer Model
As explained in section 2.5, ergodic Markov chains have the useful property that they are guaranteed to converge to a unique stationary distribution, and various reasons were given for why this is desirable in a general context. For directed graphs, it is a particularly beneficial property, since without it, a random walk can get trapped in a small cluster of states, or even in a single absorbing state. Because of this, ergodicity is sometimes enforced for random walks on directed graphs (Page et al., 1999; Zhou et al., 2005; Huang et al., 2006; Meilă & Pentney, 2007; Johns & Mahadevan, 2007). We remind readers that one effect this has is that all stationary probabilities are strictly positive, meaning that the directed normalized Laplacian is well defined.
It is worth noting that there exist many variants of the random surfer model, which differ in the assumptions they make about the teleporting transitions (Berkhin, 2005). For example, teleporting transitions can either be uniformly random or biased toward certain vertices through a set of weights. Furthermore, the parameter α ∈ [0, 1] is known as the damping factor and determines how close the process is to a regular random walk. Typically, it is set close to 1 so that the process still accurately reflects the structure of the underlying graph G.
Due to the teleportation term, from any given state si there is always a nonzero probability of reaching any other state or of staying in the same state. By virtue of this, such processes are guaranteed to be both irreducible and aperiodic, and therefore ergodic.
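A minimal sketch of this construction with uniform teleportation is given below; the example graph and the damping factor of 0.85 are illustrative choices, not values taken from the text:

```python
import numpy as np

# Random walk on a directed graph that is NOT ergodic (s3 is absorbing).
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])

N, alpha = P.shape[0], 0.85   # damping factor (a commonly used illustrative value)

# Random surfer: with probability alpha follow the walk, otherwise teleport uniformly.
P_surfer = alpha * P + (1.0 - alpha) * np.ones((N, N)) / N

# All entries are now positive, so the chain is irreducible and aperiodic (ergodic)
# and has a unique stationary distribution.
w, v = np.linalg.eig(P_surfer.T)
pi = np.real(v[:, np.argmax(w.real)])
pi /= pi.sum()
print(np.round(pi, 3))
```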
4.4 Summary
This concludes our treatment of random walks. The material presented in this section forms a useful framework that connects Markov chains and graphs. On the one hand, describing a Markov chain as a process taking place on a graph is a useful interpretation since it provides intuition about the underlying relationships between states. Furthermore, it allows one to apply the tool kit of spectral graph theory to Markov chains. On the other hand, graphs by themselves represent only static relationships between entities, and performing a random walk is one way to describe a graph in dynamic or temporal terms. Moreover, the fact that a transition matrix can easily be raised to integer powers means that a random walk provides information about a graph G at multiple timescales, which is a property that has been exploited in the field of manifold learning (Coifman et al., 2005a, 2005b; Coifman & Lafon, 2006). This section requires many concepts and results from linear algebra, for which we recommend Meyer (2000) as a general resource and Stewart (1994) as a more specific summary of the application to transition matrices. Moreover, we recommend Spielman (2019) as a text on spectral graph theory, where readers can find a more in-depth exploration of graph Laplacians.
5 Related Applications
While this tutorial focuses on mathematical concepts, the material has a number of applications in machine learning, as well as computer science more generally. A number of these are mentioned in section 1 and various parts of the main text. In order to provide some concrete examples, this section explores two of the most actively researched areas of application.
5.1 Markov Chain Monte Carlo
where σ² is the variance of ϕ(x). Therefore, Monte Carlo methods provide an unbiased estimate of q with a standard deviation that scales like σ/√k, where k is the number of samples.
where ncircle is the number of raindrops observed to fall inside the circle. Clearly, generating truly uniform rainfall in a physical context is not feasible. However, such a model can easily be simulated with the help of pseudo-random number generators. Figure 20a depicts the result of such a simulation for k = 100. While the resulting Monte Carlo approximation is not very accurate, equations 5.3 and 5.4 tell us that the approximation should improve on average as k increases. We visualize this in Figure 20b, where k ranges from 1 to 500, and for each value of k, the approximation is carried out 100 times. The blue line shows the mean value of the approximation found at each k, which gets closer to the true value as k increases, and the gray area shows a single standard deviation, which indeed appears to scale like 1/√k and indicates that fluctuations from the true value get smaller on average as k increases.
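A sketch of this experiment is given below, assuming the usual setup in which the target quantity is π, the circle is inscribed in a square, and the estimate is four times the fraction of points landing inside the circle:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_pi(k):
    # k uniform "raindrops" in the square [-1, 1]^2; count those inside the unit circle.
    x, y = rng.uniform(-1, 1, k), rng.uniform(-1, 1, k)
    n_circle = np.sum(x**2 + y**2 <= 1)
    return 4.0 * n_circle / k          # area ratio circle/square equals pi/4

print(estimate_pi(100))                # crude estimate from k = 100 samples
print(estimate_pi(100_000))            # fluctuations shrink roughly like 1/sqrt(k)
```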
In the example, the Monte Carlo approximation is particularly straightforward because the distribution we needed to sample from was very simple. However, often this is not the case. Indeed, a large portion of Monte Carlo techniques are specifically designed to deal with situations in which sampling directly from the target distribution is difficult or impossible.
In particular, Markov chain Monte Carlo (MCMC) involves the construction of a Markov chain, defined over the sample space of the problem being studied, which is guaranteed to converge to the target distribution. Thus, by initializing such a chain and waiting until it converges, one can eventually generate samples to use for a Monte Carlo approximation. We have already seen that in order to have guarantees of convergence to a unique stationary distribution, a chain needs to be ergodic, and so the goal of MCMC is to find such an ergodic chain for a given target stationary distribution. A wide variety of MCMC methods exist, but by far the most famous is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). Below we summarize this algorithm, and to maintain consistency with the rest of the tutorial, we focus on the case where the relevant state space is discrete. However, the ideas presented can be generalized to the continuous case, and indeed, most interesting applications of the algorithm involve continuous state spaces.
Two points can be made about the expression above. First, although equation 5.8 involves stationary probabilities, the MH algorithm does not require that the stationary distribution is known explicitly. Instead, it is only assumed that some positive function proportional to the stationary probabilities is known, that is, πi = fi/Z, where Z is a normalizing constant that is typically unknown or otherwise intractable. This means that ratios of stationary probabilities in equation 5.8 can be evaluated without needing to take care of Z, since πj/πi = fj/fi. Second, the denominator on the right-hand side of equation 5.8 is always positive because the proposal probability is positive for any proposed transition and πi ∝ fi > 0 for all i, meaning that the fraction is always well defined.
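For concreteness, here is a minimal sketch of the MH algorithm for a discrete state space; the unnormalized target f and the symmetric nearest-neighbor proposal are made up for illustration (with a symmetric proposal, the proposal terms in the acceptance ratio cancel):

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target over 5 states (the normalizing constant Z is never needed).
f = np.array([1.0, 4.0, 2.0, 8.0, 1.0])

# Symmetric proposal: move to a uniformly chosen neighbor on a ring.
def propose(i, n=5):
    return (i + rng.choice([-1, 1])) % n

samples, i = [], 0
for t in range(200_000):
    j = propose(i)
    # MH acceptance probability; the ratio f[j]/f[i] equals pi_j/pi_i.
    if rng.random() < min(1.0, f[j] / f[i]):
        i = j
    samples.append(i)

counts = np.bincount(samples[10_000:], minlength=5)   # discard an initial burn-in period
print(np.round(counts / counts.sum(), 3))             # approx. f / f.sum() = [0.0625, 0.25, 0.125, 0.5, 0.0625]
```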
Two key challenges facing the application of this algorithm, as well as other MCMC methods more generally, are the following (Johansen & Ludger, 2007). First, the time required for a Markov chain to get close to its stationary distribution, known as the mixing time, can be very long. Because of this, MCMC methods typically involve an initial burn-in period in which samples are discarded. One issue with this procedure is that it is rarely straightforward to judge how long is sufficient for a given target distribution, proposal chain, and initialization. However, in the case of an ergodic and reversible chain, such as in the MH algorithm, some guidance on this can be taken from a well-known result in Markov chain theory, which says that the mixing time for such chains is upper- and lower-bounded by values related to the spectral gap γ = 1 − λ2, where λ2 is the second largest eigenvalue of the associated transition matrix (Levin et al., 2009). Second, samples from a Markov chain that occur close together in time are often highly autocorrelated, which clearly violates the i.i.d. property and reduces the effective sample size. A typical method for treating this, known as thinning, is to use only every mth sample generated from the chain. Something common to both of the issues described above is that they often make MCMC methods very slow in practice. Because of this, techniques for acceleration have received a lot of attention in the literature and are still an active area of research (Robert et al., 2018).
5.2 Reinforcement Learning
Reinforcement learning (RL) is a framework for studying how agents can learn behaviors that maximize reward signals by interacting with their environment. The canonical paradigm for this type of learning is the Markov decision process (MDP), which is based on Markov chains. In this section, we introduce MDPs and outline how the material presented so far can be applied to these models.
MDPs are stochastic control processes, whereby an agent is in a state s at each time point t, chooses an action a from those available in s, then finds itself in a state s′ at time t + 1 and receives a scalar reward rt+1. Formally, an MDP can be defined as a 5-tuple consisting of a state space, an action space, a transition model p(s′|s, a) describing the probability of moving to s′ when taking action a in state s, a reward function r(s, a) describing the instantaneous reward received when taking action a in state s, and a discount factor γ ∈ [0, 1) (Sutton & Barto, 2018). Together, p and r define the dynamics of the MDP, and both can be deterministic or stochastic, as long as they respect the Markov property, which requires that rt+1 and st+1 depend only on st and at. Furthermore, while in the most general case the state space, the action space, and the time index t can be either discrete or continuous, in order to maintain consistency we here consider the simplest setting where all are discrete.
In this section, we outline a few of the key ways in which Markov chains are important in RL, focusing in particular on the relationships between value functions and transition matrices. One of the assumptions underlying our analysis is that the environment's transition probabilities and reward function are known a priori. However, in virtually all practical applications this is not the case, meaning that such quantities must instead be estimated from sampled interactions with the environment. Even in these settings, concepts and results from Markov chain theory remain highly relevant, and we recommend Sutton and Barto (2018) as a general text for exploring this further.
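As one concrete example of such a relationship, for a fixed policy the induced transition matrix and reward vector determine the policy's value function through the Bellman equation, which can be solved in closed form. The numbers below are made up for illustration:

```python
import numpy as np

# Transition matrix and reward vector induced by a fixed policy (hypothetical values).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
r = np.array([0.0, 1.0, 5.0])
gamma = 0.95

# Bellman equation for the policy's value function: v = r + gamma * P v,
# which the Markov property lets us solve by inverting (I - gamma * P).
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(np.round(v, 2))
print(np.allclose(v, r + gamma * P @ v))   # True
```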
6 Conclusion
The key motivation of this tutorial is to provide a single introductory text on the spectral theory of Markov chains. By bringing together concepts and results from different areas of mathematics, we hope this work is a useful resource for readers aiming to gain a broad, yet concise, overview of the topic. Our presentation involves two different paradigms for interpreting and analyzing Markov chains. Section 2 presents a categorization based on the transition structure and asymptotic behavior, and section 3 formalizes Markov chains as a type of graph. In section 4, these two perspectives are connected by introducing the idea of a random walk, which provides a number of parallels between some categories of Markov chains and certain types of graphs. In particular, one theme that aligns the two perspectives is the distinction between reversible and nonreversible Markov chains on the one hand, and undirected and directed graphs on the other hand, where in both cases, the former option is easier to treat than the latter. With the additional use of results from linear algebra, we arrive at an in-depth description of the eigenvalues and eigenvectors of transition matrices in the reversible case. Furthermore, we discuss various attempts that have been made to generalize spectral methods to the nonreversible case. Finally, section 5 explores two areas of computer science literature in which various concepts and results from the foregoing sections are used. Although the material mostly consists of known results, two novel contributions are the categorization of eigenvalues given in Table 1 and Figure 3, as well as the notion of random walk sets (see definition 18). Since we only assume minimal exposure to concepts from linear algebra and probability theory, and since focus is placed on providing intuition rather than rigorous results, the material of this tutorial is accessible to researchers and students in a variety of quantitative disciplines. For those working in fields related to machine learning and data mining, this work is particularly relevant due to the applications discussed at various points.
Appendix: Proofs
(⇒) Assume that we are given the transition matrix of a recurrent Markov chain. Then, for any stationary distribution of the chain, the corresponding flow matrix belongs to the chain's random walk set. Therefore, using the same argument given in the proof of proposition 5, the row and column sums of this flow matrix are the same, meaning that it describes a balanced graph.
and so the vector of degrees, normalized to sum to one, satisfies the stationarity condition, which means that it is a stationary distribution of the chain. Note that the summation in the fourth expression defines the in-degree of vertex vj, but since the graph G is balanced, we have only one degree dj associated with each vertex. Finally, since isolated vertices are not allowed, each degree must be bigger than zero, meaning similarly that πi > 0 for all i. Hence, there are no transient states, and the chain is recurrent.
See the second part of the proof of theorem 8.
Acknowledgments
We thank Jonathan Hermon for useful discussions on Markov chain theory, as well as Josué Tonelli-Cueto for his insight into the Perron-Frobenius theorem and various concepts in linear algebra.
Notes
In component notation, the outer product between two vectors is the matrix whose (i, j)th entry is the product of the ith entry of the first vector and the jth entry of the second.
For continuous time Markov chains, this result is known as Kelly’s lemma.
We restrict edge weights to be positive in order to maintain this notion of strength; however, it is worth noting that some conventions in graph theory allow negative weights.
Since in graph theory unweighted graphs are more commonly studied than weighted graphs, the degree quantities we have defined are sometimes referred to as weighted degrees (Chapman & Mesbahi, 2011), but for simplicity we just use the term degree.
We remind readers that the spectral radius of a square matrix is its largest eigenvalue in absolute value and is denoted . The name relates to the fact that all eigenvalues are contained within a disk of radius centered at the origin of the complex plane.
For this definition to work, we require that all vertices have a strictly positive out-degree, meaning that isolated vertices or vertices with only incoming edges are forbidden.
While the examples here each have a single communicating class, the analysis would apply equally to chains with multiple classes.
It should be noted that most of these studies primarily consider nonreversible chains that are ergodic.
Sometimes objects like GC are referred to as complete graphs or fully connected networks. However, such terms typically do not include the possibility of self-loops, which we by definition need since we consider uniform teleportation.
Note that some references make the milder assumption of a nonnegative function, but this can be made positive by removing states for which πi = 0.
References