## Abstract

Stochastic kernel-based dimensionality-reduction approaches have become popular in the past decade. The central component of many of these methods is a symmetric kernel that quantifies the similarity between pairs of data points, together with a kernel-induced Markov chain on the data. Typically, the Markov chain is fully specified by the kernel through row normalization. However, in many cases, it is desirable to impose user-specified stationary-state and dynamical constraints on the Markov chain. Unfortunately, no systematic framework exists to impose such user-defined constraints. Here, based on our previous work on inference of Markov models, we introduce a path-entropy-maximization-based approach to derive the transition probabilities of Markov chains using a kernel and additional user-specified constraints. We illustrate the usefulness of these Markov chains with examples.

## 1  Introduction

Recent technological advances allow the collection of large amounts of high-dimensional data across a range of scientific fields. Examples include gene expression levels in individual cells measured using single-cell RNA sequencing (Moon, Stanley et al., 2017), pixel intensities in handwritten images (LeCun, 1998), and collective neuronal firing data (Sejnowski, Churchland, & Movshon, 2014). Quite often, the high-dimensional data are generated from a small set of underlying factors. As a result, it is possible to embed the data in a manifold of much lower dimension. A central task of dimensionality-reduction methods is to identify these manifolds from sampled data points.

An important class of recently developed dimensionality-reduction methods relies on a stochastic kernel-based approach. Here, one starts with a positive and symmetric affinity kernel that reflects the proximity between pairs of data points. The kernel forms the basis of a Markov chain on the data. Finally, a lower-dimensional representation is sought that preserves local neighborhoods of data points as quantified by the Markov chain. Popular examples of stochastic kernel-based methods include diffusion maps (Coifman & Lafon, 2006), Laplacian eigenmaps (Belkin & Niyogi, 2003), and t-distributed stochastic neighbor embedding (t-SNE) (Maaten & Hinton, 2008).

The Markov chain employed in these methods is obtained by row normalization of the kernel and is “local”: the probability of transitioning from point a to another point b depends only on the local neighborhood of point a. Moreover, the kernel fully specifies both the stationary distribution and the diffusion dynamics of the Markov chain. However, in many applications, it is desirable to impose user-specified constraints on the Markov chain—for example, a prescribed stationary distribution. Here are some examples.

All-atom molecular dynamics (MD) simulations are regularly used to explore the high-dimensional conformational landscape of complex biomolecules. In many cases, the important dynamics can be projected onto a lower-dimensional reaction coordinate (Hänggi, Talkner, & Borkovec, 1990). Diffusion maps have been used to identify these reaction coordinates from MD simulations (Nadler, Lafon, Coifman, & Kevrekidis, 2006; Ferguson, Panagiotopoulos, Debenedetti, & Kevrekidis, 2010; Ferguson, Panagiotopoulos, Kevrekidis, & Debenedetti, 2011). In many cases, the ensemble of conformations obtained using an MD simulation may fail to reproduce experimental observables (Pitera & Chodera, 2012; Olsson, Wu, Paul, Clementi, & Noé, 2017; Dixit & Dill, 2018). Thus, computational approaches are required to appropriately bias the Markov chain in order to obtain reaction coordinates that are consistent with experimental information.

Another example is from analysis of single cell gene expression data (Moon, Stanley et al., 2017). In any particular tissue of a multicellular organism, cells are found in different stages of differentiation, ranging from stem cells to fully differentiated cells. In recent years, diffusion maps have been introduced to extract patterns of lineage differentiation from single cell gene expression profiles (Haghverdi, Buettner, & Theis, 2015). Recent work has stipulated that the gene expression profiles of stem-like cells are more varied compared to the fully differentiated cells (Teschendorff & Enver, 2017) and assigning expression-dependent stationary probabilities to cells leads to a clearer identification of cell states and dynamics among those states (Jin, MacLean, Peng, & Nie, 2018).

Notably, in these and other examples, there does not exist a systematic framework to manipulate the Markov chains used in stochastic kernel-based dimensionality-reduction methods according to user-prescribed constraints. Previously, we have introduced a path-entropy-based approach to modify Markov chains according to user-specified constraints, focusing on applications in biophysics and statistical physics (Dixit et al., 2018). In those works, one maximizes the entropy of the distribution over stochastic trajectories of the Markov chain subject to user-specified constraints. We have used these path-entropy-maximized Markov chains (PNMCs) to model statistical dynamics of biomolecular conformations (Dixit & Dill, 2014; Dixit, Jain, Stock, & Dill, 2015) and biochemical reaction networks (Dixit, 2018a), and to quantify inconsistencies in multicriteria decision-making problems (Dixit, 2018b).

The objective of this letter is as follows. First, we show that the transition probabilities associated with a row-normalized Markov chain commonly used in stochastic dimensionality reduction constitute a local maximum entropy probability distribution. Next, we implement the previously derived maximum path entropy Markov processes as an alternative to the row-normalized Markov chain. Specifically, based on our previous work, we illustrate how to obtain the transition probabilities of a global path-entropy-maximized Markov chain subject to user-specified constraints. We illustrate the advantages of these path-entropy-maximized Markov chains using two examples: constructing the phase transition associated reaction coordinate of a near-critical Ising model and constructing the lineage differentiation tree of a cell population from single cell data.

## 2  Background and Notation

We give a brief description of the traditionally used Markov chains in stochastic kernel-based methods.

### 2.1  Row Normalized Markov Chain (RNMC) on the Data

Consider $N$ data points $\{a, b, \dots\}$ in $\mathbb{R}^n$. A positive and symmetric kernel $\Delta(a,b) > 0$ is defined on all pairs of points $a$ and $b$ in $\mathbb{R}^n$. A popular choice is the gaussian kernel (Coifman & Lafon, 2006),
$\Delta(a,b) = \exp\left(-\frac{d(a,b)^2}{2\varepsilon^2}\right),$
(2.1)
where $d(a,b)$ is the pairwise $L_2$ distance. In equation 2.1, $\varepsilon$ is the bandwidth of the kernel: $d(a,b) \gg \varepsilon \Rightarrow \Delta(a,b) \to 0$. Alternative forms of the kernel have also been proposed (see, e.g., Coifman & Hirn, 2014; Haghverdi et al., 2015; Bermanis, Wolf, & Averbuch, 2016; Moon, van Dijk et al., 2017).
Typically, given a kernel, the Markov chain on data points is constructed as follows. First, we define $D(a) = \sum_b \Delta(a,b)$. $D(a)$ can be seen as a $\Delta$-kernel-based density estimator at point $a$. Next, an $\alpha$-dependent family of anisotropic kernels is introduced:
$\Delta^{(\alpha)}(a,b) = \frac{\Delta(a,b)}{D(a)^\alpha D(b)^\alpha}.$
(2.2)
The $\alpha$-parameterized kernel is row-normalized to obtain transition probabilities $q_{ab}^{(\alpha)}$,
$q_{ab}^{(\alpha)} = \frac{\Delta^{(\alpha)}(a,b)}{Z^{(\alpha)}(a)},$
(2.3)
where $Z^{(\alpha)}(a) = \sum_b \Delta^{(\alpha)}(a,b)$ is the local partition function. The stationary distribution of the Markov chain is given by
$p_a^{(\alpha)} = \frac{Z^{(\alpha)}(a)}{\sum_a Z^{(\alpha)}(a)}.$
(2.4)

The parameter $\alpha \in [0,1]$ tunes the relative importance of the geometry of the lower-dimensional manifold and the density statistics of data points on it (Nadler et al., 2006). For concreteness, consider data points $\bar x$ generated according to a Fokker-Planck equation on some domain $\Omega \subset \mathbb{R}^n$ with an equilibrium distribution $p(\bar x)$ $(\bar x \in \Omega)$. In the limit of infinitely many data points $N \to \infty$, the limiting diffusion process corresponding to the backward operator of equation 2.3 approaches a Fokker-Planck equation as well (Nadler et al., 2006). Notably, $\alpha$ controls its stationary distribution $\pi^{(\alpha)}(\bar x)$. Specifically, $\alpha = 0$ corresponds to a Fokker-Planck equation with a stationary distribution $\pi^{(0)}(\bar x) \propto p(\bar x)^2$, and $\alpha = 1/2$ corresponds to a Fokker-Planck equation with a stationary distribution $\pi^{(1/2)}(\bar x) = p(\bar x)$. In contrast, $\alpha = 1$ corresponds to a Fokker-Planck equation with a constant stationary distribution on $\Omega$ (Nadler et al., 2006). We note that the correspondence holds only in the limit of infinite data. For example, the discrete-time, discrete-state Markov chain defined by the transition probabilities in equation 2.3 at $\alpha = 1$ does not have a uniform stationary distribution over the data points. For the rest of the letter, we omit the $\alpha$-dependence of the kernel for brevity and specify the value of $\alpha$ whenever necessary.
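The pipeline of equations 2.1 to 2.4 can be sketched in a few lines. The following is a minimal illustration on synthetic data; the data set and the bandwidth choice are assumptions, not taken from the letter:

```python
import numpy as np

# Illustrative sketch: the alpha-family RNMC of equations 2.1-2.4
# built on a small synthetic data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))             # 50 data points in R^3

# Pairwise L2 distances d(a, b).
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

eps = np.percentile(d[d > 0], 10)        # bandwidth: 10th percentile of distances
K = np.exp(-d**2 / (2 * eps**2))         # gaussian kernel, equation 2.1

alpha = 0.5
D = K.sum(axis=1)                        # kernel density estimate D(a)
K_alpha = K / (np.outer(D, D) ** alpha)  # anisotropic kernel, equation 2.2

Z = K_alpha.sum(axis=1)                  # local partition function Z(a)
q = K_alpha / Z[:, None]                 # row-normalized chain, equation 2.3
p = Z / Z.sum()                          # stationary distribution, equation 2.4
```

Because $\Delta^{(\alpha)}$ is symmetric, one can verify numerically that $p$ is indeed stationary for $q$.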

This row normalization procedure to obtain Markov chains from affinity kernels (or its closely related variants) is used in most stochastic kernel-based methods, including diffusion maps (Coifman & Lafon, 2006), Laplacian eigenmaps (Belkin & Niyogi, 2003), and t-SNE (Maaten & Hinton, 2008).

### 2.2  RNMC Is a Local Maximum Entropy Markov Chain

The row-normalized Markov chain described by the transition probabilities in equation 2.3 is in fact a local entropy-maximized Markov chain. Consider that for each data point $a$, we want to find the transition probabilities $q_{ab}$ such that the average squared distance traversed per time step equals a specified value, $\bar d^2(a)$. We maximize the entropy (conditioned on starting at data point $a$) (Dixit et al., 2018),
$S_a = -\sum_b q_{ab} \log q_{ab},$
(2.5)
subject to constraints
$\sum_b q_{ab} = 1 \quad \text{and} \quad \sum_b q_{ab}\, d(a,b)^2 = \bar d^2(a).$
(2.6)
Entropy maximization subject to the constraints in equation 2.6 yields the transition probabilities in equation 2.3, where $1/2\varepsilon^2$ is the Lagrange multiplier associated with the distance constraint.
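This statement can be checked numerically. For $\alpha = 0$, if equation 2.3 is the maximum entropy solution, then $\log q_{ab} + d(a,b)^2/2\varepsilon^2$ must be constant along each row, equal to $-\log Z(a)$. A sketch on synthetic data (the data and bandwidth are assumptions):

```python
import numpy as np

# Numerical check: for alpha = 0, the row-normalized chain has the maximum
# entropy form q_ab ∝ exp(-d^2 / 2 eps^2), so log q_ab + d^2 / (2 eps^2)
# equals -log Z(a), a constant along each row.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
eps = np.percentile(d[d > 0], 10)
K = np.exp(-d**2 / (2 * eps**2))
q = K / K.sum(axis=1, keepdims=True)     # equation 2.3 with alpha = 0

row_const = np.log(q) + d**2 / (2 * eps**2)   # should be -log Z(a) per row
```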

## 3  Path-Normalized Markov Chains

How do we incorporate user-specified constraints in addition to the constraint in equation 2.6 on the Markov chain? In this section, we provide a brief overview of the solution to this problem based on our previous work (Dixit & Dill, 2014, 2018; Dixit et al., 2015; Dixit, 2015, 2018a). For concreteness, we denote by ${qab}$ the transition probabilities of the sought Markov chain and ${pa}$ its stationary distribution. We may either de novo infer a Markov chain from specified constraints (Dixit & Dill, 2014; Dixit et al., 2015; Dixit, 2015) or obtain a least-deformed updated Markov chain with respect to a given prior Markov chain ${kab}$ (with stationary distribution ${pa}$) (Dixit & Dill, 2018; Dixit, 2018a).

Consider a long stationary-state path $\Gamma \equiv \cdots \to a_1 \to a_2 \to \cdots$ of duration $T \gg 1$ time steps of the Markov chain with hitherto unknown transition probabilities $\{q_{ab}\}$. The first constraint we introduce (similar to equation 2.6) is the path-ensemble average $\bar d^2$ of the squared distance traversed by a random walker on the data points. We have
$\bar d^2 = \frac{1}{T}\left[\cdots + d(a_1,a_2)^2 + d(a_2,a_3)^2 + \cdots\right] \approx \sum_a p_a \sum_b q_{ab}\, d(a,b)^2.$
(3.1)
The second approximation holds in the limit $T \to \infty$. In equation 3.1, $\{p_a\}$ is the stationary distribution of the Markov chain. In addition, other constraints of the form $\bar r = \sum_{a,b} p_a q_{ab} r_{ab}$ can be introduced.
The maximum entropy Markov chain, or the path-normalized Markov chain (PNMC), is found as follows. The path entropy (Dixit & Dill, 2014, 2018; Dixit et al., 2015; Dixit, 2015),
$S = \sum_a p_a S_a = -\sum_{a,b} p_a q_{ab} \log q_{ab}$
(3.2)
is maximized subject to user-specified constraints using the method of Lagrange multipliers. We have the following constraints on the transition probabilities and the stationary distribution:
$\sum_b p_a q_{ab} = p_a, \quad \sum_{a,b} p_a q_{ab} = 1, \quad \sum_a p_a q_{ab} = p_b$
(3.3)
and
$\sum_{a,b} p_a q_{ab}\, d(a,b)^2 = \langle d(a,b)^2 \rangle = \bar d^2.$
(3.4)
In addition, we also impose detailed balance:
$p_a q_{ab} = p_b q_{ba} \quad \forall\, a \text{ and } b.$
(3.5)

At this stage, we have the option of constraining the stationary distribution (Dixit & Dill, 2014; Dixit et al., 2015). Alternatively, we can maximize the entropy with respect to both the transition probabilities and the stationary distribution (Dixit, 2015). Notably, these two choices lead to qualitatively different Markov chains.

### 3.1  Unknown Stationary Distribution

When the stationary distribution is not constrained, the entropy is maximized with respect to both the transition probabilities and the stationary distribution. In this case, the transition probabilities are given by (see section A.1 in the appendix; Dixit, 2015),
$q_{ab} = \frac{\nu_{1b}}{\eta_1 \nu_{1a}} \Delta(a,b),$
(3.6)
where $\eta_1$ is the Perron-Frobenius eigenvalue of $\Delta$ and $\bar\nu_1$ is the corresponding Perron-Frobenius eigenvector. Note that since $\Delta$ is symmetric, the left and the right eigenvectors are identical. The stationary distribution resembles the ground state of Schrödinger's equation and is given by (Dixit, 2015)
$p_a \propto \nu_{1a}^2.$
(3.7)
We note that finding the Perron eigenvector of the kernel in constructing the PNMC in equation 3.6 does not add extra computational burden since estimation of the diffusion map also requires eigendecomposition of a matrix of the same size.
Notably, the PNMC with an unknown stationary distribution (see equation 3.6) can be recast as an RNMC corresponding to a modified symmetric and anisotropic kernel,
$\Delta^{(\bar\nu_1)}(a,b) = \nu_{1a}\, \Delta(a,b)\, \nu_{1b}.$
(3.8)
From equation 3.8, it is apparent that the PNMC defined by the transition probabilities in equation 3.6 prefers to traverse regions that are closely connected to each other, as quantified by the eigenvector centrality (Newman, 2008).
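Equations 3.6 and 3.7 can be sketched directly from an eigendecomposition (synthetic data; the bandwidth choice is an assumption). Since $\Delta$ is symmetric and entrywise positive, its largest eigenvalue is the Perron-Frobenius eigenvalue:

```python
import numpy as np

# Sketch of the PNMC with unknown stationary distribution (equations 3.6-3.7):
# the chain is built from the Perron-Frobenius eigenpair of the symmetric kernel.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
eps = np.median(d[d > 0])               # bandwidth (illustrative choice)
K = np.exp(-d**2 / (2 * eps**2))        # symmetric, positive kernel

w, V = np.linalg.eigh(K)                # symmetric matrix -> eigh (ascending)
eta1 = w[-1]                            # Perron-Frobenius eigenvalue (largest)
nu1 = np.abs(V[:, -1])                  # Perron vector, positive up to sign

q = (K * nu1[None, :]) / (eta1 * nu1[:, None])   # equation 3.6
p = nu1**2 / np.sum(nu1**2)                      # equation 3.7
```

The eigenvector relation $\sum_b \Delta(a,b)\nu_{1b} = \eta_1 \nu_{1a}$ guarantees that the rows of $q$ sum to one, and $p_a \propto \nu_{1a}^2$ satisfies detailed balance with $q$.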

### 3.2  Imposing User-Prescribed Stationary Distribution

The maximum entropy Markov chain with a user-prescribed stationary distribution ${pa}$ and a constrained path-ensemble average $d¯2$ is given by (see section A.1; Dixit & Dill, 2014; Dixit et al., 2015)
$q_{ab} = \frac{\rho_a \rho_b}{p_a} \Delta(a,b),$
(3.9)
where $\Delta(a,b)$ is the same as in equation 2.1. The constants $\{\rho_a\}$ are the fixed point of the nonlinear matrix equation,
$R \Delta R \bar 1 = \bar p,$
(3.10)
where $R$ is the diagonal matrix with $R_{aa} = \rho_a$ and $\bar 1$ is the column vector of ones. When the stationary probabilities are constrained to be equal, the transition probability matrix is symmetric and doubly stochastic. As above, we denote by $1/2\varepsilon^2$ the Lagrange multiplier associated with $\bar d^2$. Interestingly, the scaling procedure that enforces a uniform stationary distribution on the Markov chain is sometimes termed Sinkhorn normalization (Idel, 2016). Indeed, it is known that converting a matrix into a doubly stochastic form may lead to better clustering performance (Zass & Shashua, 2007). Moreover, various fast numerical algorithms with well-established complexity bounds have been proposed to solve equation 3.10 (see Idel, 2016, for a review).
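A minimal sketch of solving equation 3.10 for the scaling factors $\rho_a$. The damped fixed-point update $\rho \leftarrow \sqrt{\rho \cdot p / (\Delta \rho)}$ is an assumption here (a Sinkhorn-like solver); the letter does not prescribe a specific algorithm:

```python
import numpy as np

# Sketch of equations 3.9-3.10: a damped Sinkhorn-like fixed-point iteration
# for rho. The data, bandwidth, and target distribution p are synthetic.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
eps = np.median(d[d > 0])
K = np.exp(-d**2 / (2 * eps**2))         # gaussian kernel, equation 2.1

p = rng.random(40)
p /= p.sum()                             # user-prescribed stationary distribution

rho = np.ones(40)
for _ in range(5000):                    # fixed point of R K R 1 = p
    rho = np.sqrt(rho * p / (K @ rho))

q = np.outer(rho, rho) * K / p[:, None]  # transition matrix, equation 3.9
```

At the fixed point, $\rho_a (K\rho)_a = p_a$, so the rows of $q$ sum to one and $p$ is stationary with detailed balance, since $p_a q_{ab} = \rho_a \rho_b \Delta(a,b)$ is symmetric.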

We note that the path entropy-based approach allows us to enforce a uniform stationary distribution $p_a = 1/N$ over data points even when $N$ is finite. We contrast this with the $\alpha$-dependent family of Markov chains with $\alpha = 1$ introduced above (see equations 2.3 and 2.4). The $\alpha = 1$ chain converges to a uniform distribution only in the limit $N \to \infty$. Notably, the $N \to \infty$ limit of the PNMC with a uniform distribution converges to the same Fokker-Planck equation as the $\alpha = 1$ limit (Dixit & Dill, 2018).

### 3.3  Updating a Prior Markov Chain

Entropy maximization allows us to update a prior Markov chain. Consider that we have a prior Markov chain with transition probabilities $\{k_{ab}\}$ and a stationary distribution $\{p_a\}$ (see Dixit & Dill, 2018, and Dixit, 2018a, for more details). Instead of maximizing the path entropy, we minimize the Kullback-Leibler divergence (Rached, Alajaji, & Campbell, 2004),
$S = \sum_{a,b} p_a q_{ab} \log \frac{q_{ab}}{k_{ab}},$
(3.11)
subject to the above constraints (see equations 3.3 and 3.4).
When the stationary distribution of the updated Markov chain is not constrained, its transition probabilities are given by (Dixit, 2018a)
$q_{ab} = \frac{\nu_{1b}}{\eta_1 \nu_{1a}} \Delta^*(a,b),$
(3.12)
where
$\Delta^*(a,b) = \Delta(a,b)\sqrt{k_{ab} k_{ba}}.$
(3.13)
$\bar\nu_1$ is the Perron-Frobenius eigenvector of $\Delta^*$, and $\eta_1$ is the corresponding eigenvalue.
When the stationary distribution is constrained to a user-specified distribution ${pa}$, the transition probabilities of the Markov chain are given by
$q_{ab} = \frac{\rho_a \rho_b}{p_a} \Delta^*(a,b),$
(3.14)
where $\Delta^*(a,b)$ is given by equation 3.13 and $\bar\rho$ is the solution of the nonlinear equation
$R \Delta^* R \bar 1 = \bar p,$
(3.15)
where $R$ is the diagonal matrix with $R_{aa} = \rho_a$.
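The prior-updating construction with a prescribed stationary distribution differs from equation 3.9 only through the modified kernel $\Delta^*$. A sketch (the prior chain, target distribution, and the Sinkhorn-like solver are all illustrative assumptions):

```python
import numpy as np

# Sketch of equations 3.13-3.15: updating a prior chain k_ab toward a
# prescribed stationary distribution p. The prior here is a random
# row-stochastic matrix; the fixed-point solver is an assumption.
rng = np.random.default_rng(4)
n = 30
X = rng.normal(size=(n, 2))
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
eps = np.median(d[d > 0])
K = np.exp(-d**2 / (2 * eps**2))                # kernel Delta

k_prior = rng.random((n, n))
k_prior /= k_prior.sum(axis=1, keepdims=True)   # prior transition matrix k_ab

K_star = K * np.sqrt(k_prior * k_prior.T)       # Delta*, equation 3.13

p = rng.random(n)
p /= p.sum()                                    # prescribed stationary distribution

rho = np.ones(n)
for _ in range(5000):                           # fixed point of R Delta* R 1 = p
    rho = np.sqrt(rho * p / (K_star @ rho))

q = np.outer(rho, rho) * K_star / p[:, None]    # equation 3.14
```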

### 3.4  Connection with Optimal Transport

Finally, we discuss a curious connection with entropy-regularized optimal transport (Peyré & Cuturi, 2017). Optimal transport theory quantifies the distance between two distributions $\{x_a\}$ and $\{y_a\}$ given a cost matrix $M$ as follows. First, one defines a set $U_{x,y}$ of positive matrices $P$ such that
$P \in U_{x,y} \Rightarrow \sum_a P_{ab} = y_b\ \forall\, b \quad \text{and} \quad \sum_b P_{ab} = x_a\ \forall\, a.$
(3.16)
The matrix $P$ can be considered a joint probability matrix whose left and right marginals are $\{x_a\}$ and $\{y_a\}$, respectively. The distance $\delta_M(x,y)$ is then given by
$\delta_M(x,y) := \sum_{a,b} P^\star_{ab} M_{ab}, \quad \text{where } P^\star = \operatorname*{argmin}_{P \in U_{x,y}} \sum_{a,b} P_{ab} M_{ab}.$
(3.17)
Notably, while the problem in equation 3.17 is a linear program, it can be regularized with an entropy function (Cuturi, 2013). Interestingly, the regularized problem is much faster to solve and can lead to better clustering of high-dimensional data (Cuturi, 2013). The optimization problem in equation 3.17 modifies to
$\delta_M^\lambda(x,y) := \sum_{a,b} P^\lambda_{ab} M_{ab}, \quad \text{where } P^\lambda = \operatorname*{argmin}_{P \in U_{x,y}} \sum_{a,b} P_{ab} M_{ab} + \lambda \sum_{a,b} P_{ab} \log P_{ab}.$
(3.18)

Note that if $x_a = y_a = p_a$, $P_{ab} = p_a q_{ab}$, $M_{ab} = d(a,b)^2$, and $\lambda = 2\varepsilon^2$, then the problem in equation 3.18 is identical to the one of finding a Markov chain with a prescribed stationary distribution (see equation 3.9). In the future, it will be important to explore this connection further.
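For reference, equation 3.18 is routinely solved with alternating Sinkhorn updates (Cuturi, 2013): $P^\lambda = \mathrm{diag}(u)\, e^{-M/\lambda}\, \mathrm{diag}(v)$ with $u, v$ fixed by the marginal constraints. A sketch with illustrative cost matrix, marginals, and regularization strength:

```python
import numpy as np

# Sketch of entropy-regularized optimal transport (equation 3.18) via
# Sinkhorn iterations. M, x, y, and lam are illustrative assumptions.
rng = np.random.default_rng(5)
n = 20
M = rng.random((n, n))                 # cost matrix M_ab
x = rng.random(n)
x /= x.sum()                           # left marginal
y = rng.random(n)
y /= y.sum()                           # right marginal

lam = 0.5                              # regularization strength
K = np.exp(-M / lam)                   # Gibbs kernel of the regularized problem

u = np.ones(n)
v = np.ones(n)
for _ in range(2000):                  # alternately match the two marginals
    v = y / (K.T @ u)
    u = x / (K @ v)

P = u[:, None] * K * v[None, :]        # entropic transport plan P^lambda
cost = np.sum(P * M)                   # regularized transport distance
```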

## 4  Illustrations

### 4.1  Constructing the Reaction Coordinate of a Near-Critical Ising Model

The two-dimensional Ising model is a prototypical complex system studied extensively in statistical physics owing to its simplicity, analytical tractability, and rich phenomenology (Chandler, 1987). Briefly, $N^2$ magnetic spins are arranged on an $N \times N$ square lattice. Individual spins $\sigma_i$ can take two values: $\sigma_i = +1$ or $\sigma_i = -1$. The probability of observing any state $\bar\sigma = \{\sigma_1, \sigma_2, \dots, \sigma_{N^2}\}$ is given by the Gibbs-Boltzmann formula,
$p(\bar\sigma; \beta) = \frac{1}{Z(\beta)} \exp\left(-\beta E(\bar\sigma)\right),$
(4.1)
where $\beta = 1/k_B T$ is the inverse temperature and the energy $E(\bar\sigma)$ is given by
$E(\bar\sigma) = -\sum_{nn} \sigma_i \sigma_j.$
(4.2)
The summation in equation 4.2 is taken over all pairs of nearest neighbors on the lattice. In the limit $N \to \infty$, the Ising model exhibits a phase transition at $k_B T \approx 2.26$, below which the average magnetization $\langle\sigma\rangle$ assumes either the value $\langle\sigma\rangle \approx +1$ or $\langle\sigma\rangle \approx -1$, with an infinitely high barrier to flip the sign of all the spins (Chandler, 1987). Consequently, even for a finite number of spins, sampling the conformational space of the Ising model near the critical temperature is numerically challenging. In general, computational tricks that allow the study of physical systems near their critical point without performing simulations near the critical point are sought after (Panagiotopoulos, 1992).
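A minimal single-spin-flip Metropolis sampler for equations 4.1 and 4.2 illustrates how such configurations are generated. The unit coupling, periodic boundaries, small 10x10 lattice, and number of steps are assumptions for this sketch, not the letter's simulation protocol:

```python
import numpy as np

# Illustrative Metropolis sampler for the 2D Ising model (equations 4.1-4.2).
# Assumptions: coupling J = 1, periodic boundaries, single-spin flips.
rng = np.random.default_rng(6)
N = 10                                  # lattice side (the letter uses N = 20)
T = 2.4                                 # temperature in units of J / k_B
spins = rng.choice([-1, 1], size=(N, N))

def energy(s):
    # Nearest-neighbor energy with periodic boundaries: E = -sum sigma_i sigma_j.
    return -np.sum(s * np.roll(s, 1, axis=0)) - np.sum(s * np.roll(s, 1, axis=1))

for _ in range(20000):                  # Metropolis single-spin-flip steps
    i, j = rng.integers(N, size=2)
    # Energy change from flipping spin (i, j): dE = 2 sigma_ij * (sum of neighbors).
    nn = (spins[(i + 1) % N, j] + spins[(i - 1) % N, j]
          + spins[i, (j + 1) % N] + spins[i, (j - 1) % N])
    dE = 2 * spins[i, j] * nn
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        spins[i, j] *= -1               # accept the flip

m = spins.mean()                        # average magnetization per spin
```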

Often in a physical system at equilibrium with its surroundings, the large-scale fluctuations in the high-dimensional configuration space happen along a much smaller dimensional manifold, usually termed the reaction coordinate (Fukui, 1970). A key question then is to infer these reaction coordinates near the critical temperature. To investigate this, we perform two calculations with the Ising model: one away from the critical point at $k_B T = 2.4$ and one near the critical point at $k_B T = 2.25$. The number of spins in the Ising model was $20 \times 20 = 400$. At each temperature, we sampled 2000 spin configurations according to equation 4.1. Next, we constructed a gaussian kernel where the distance $d(a,b)$ between two spin configurations $\bar\sigma_a$ and $\bar\sigma_b$ was defined as the $L_2$ distance between the two configurations. We chose $\alpha = 0$ (see equation 2.3) and set $\varepsilon$ to the 10th percentile of all pairwise distances between spin configurations.

Next, we constructed the RNMC according to equation 2.3 and obtained its first two diffusion maps using the standard approach. Panels a1 and b1 in Figure 1 show these two diffusion maps for the two temperatures. The colors of individual points show the net magnetization per spin ($\langle\sigma\rangle = -1$ in black and $\langle\sigma\rangle = 1$ in red). It is clear that the first diffusion map $D_1$ correlates strongly with the average magnetization $\langle\sigma\rangle$ (see panels a2 and b2 of Figure 1). Interestingly, the second diffusion map correlates with the total energy close to the critical point ($k_B T = 2.25$) but not away from the critical point ($k_B T = 2.4$) (see panels a3 and b3 of Figure 1). In other words, the diffusion map approach identifies the average spin $\langle\sigma\rangle$ and the total energy $E(\bar\sigma)$ as the two important reaction coordinates of the Ising model close to the critical temperature.

Figure 1:

(a1) The first two diffusion maps of an RNMC constructed using 2000 randomly sampled spin configurations of the Ising model at $k_B T = 2.4$. (a2) The dependence of the average magnetization $\langle\sigma\rangle$ on the first diffusion map at $k_B T = 2.4$. (a3) The dependence of the total energy on the second diffusion map at $k_B T = 2.4$. (b1) The first two diffusion maps of an RNMC constructed using 2000 randomly sampled spin configurations of the Ising model at $k_B T = 2.25$. (b2) The dependence of the average magnetization $\langle\sigma\rangle$ on the first diffusion map at $k_B T = 2.25$. (b3) The dependence of the total energy on the second diffusion map at $k_B T = 2.25$. (c1) The first two diffusion maps of a PNMC constructed using 2000 randomly sampled spin configurations of the Ising model at $k_B T = 2.4$ and then biased with the stationary distribution given in equation 4.3. (c2) The dependence of the average magnetization $\langle\sigma\rangle$ on the first diffusion map of the PNMC. (c3) The dependence of the total energy on the second diffusion map of the PNMC.


Next, we examine whether we can reconstruct the reaction coordinate near the critical temperature using the simulation at $k_B T = 2.4$ (away from the critical temperature). The spin configurations sampled at $k_B T = 2.4$ are distributed according to equation 4.1 with $\beta = 1/2.4$. In order to reproduce the near-critical behavior at $k_B T = 2.25$, we constructed a PNMC as follows. As above, we first constructed a gaussian kernel with $\alpha = 0$ and $\varepsilon$ set to the 10th percentile of all pairwise distances. Next, we imposed a stationary distribution on the PNMC given by
$p(\bar\sigma) \propto \exp\left[-\left(\frac{1}{2.25} - \frac{1}{2.4}\right) E(\bar\sigma)\right].$
(4.3)
$p(\bar\sigma)$ in equation 4.3 corresponds to the additional weight each data point must receive in order to reflect the behavior at $k_B T = 2.25$.

Notably, as seen in panel c1 of Figure 1, the biasing captures the shape of the two-dimensional reaction coordinate as estimated using the standard diffusion maps approach. Moreover, panels c2 and c3 show that, in agreement with the simulation near the critical temperature (see panels b1, b2, and b3), the first diffusion map of the PNMC corresponds to the average magnetization, and the second diffusion map corresponds to the total energy. In other words, the biased PNMC is able to reproduce the reaction coordinates close to the critical temperature. As noted in section 1, diffusion maps are often used to explore the lower-dimensional reaction coordinates from conformations sampled from MD simulations (Ferguson et al., 2011). Our work suggests that imposing user-prescribed constraints on the Markov chain allows us to systematically explore the dependence of the reaction coordinates on perturbations to the system, for example, changes in temperature.

### 4.2  Single Cell Gene Expression in Mouse Hematopoietic Stem Cells

Next, we looked at transcription factor abundance profiles at the single-cell level in mouse hematopoietic stem cells (Moignard et al., 2013). Data were collected on 597 cells from five different cell types: hematopoietic stem cell (HSC), lymphoid-primed multipotent progenitor (LMPP), megakaryocyte-erythroid progenitor (PreMegE), common lymphoid progenitor (CLP), and granulocyte-monocyte progenitor (GMP). The known differentiation map (Moignard et al., 2013) of the cell types is given in Figure 2.

Figure 2:

(a) A biologically established differentiation trajectory of HSCs. (b) The differentiation trajectory elucidated using the PNMC-derived diffusion maps. (c) The differentiation trajectory elucidated using the RNMC-derived diffusion maps.


We constructed a symmetric kernel on the data points using the recently introduced PHATE (potential of heat-diffusion for affinity-based transition embedding) approach (Moon, van Dijk et al., 2017). We used $k = 5$ nearest neighbors and shape parameter $\beta = 8$. The kernel is given by
$\Delta^{(k,\beta)}(a,b) = \exp\left[-\left(\frac{d(a,b)}{\varepsilon_k(a)}\right)^\beta\right] + \exp\left[-\left(\frac{d(a,b)}{\varepsilon_k(b)}\right)^\beta\right],$
(4.4)
where $\varepsilon_k(a)$ is the distance of the $k$th nearest neighbor from data point $a$.
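A small sketch of this adaptive-bandwidth kernel, with $k = 5$ and $\beta = 8$ as in the text (the synthetic data set is an assumption):

```python
import numpy as np

# Sketch of the adaptive-bandwidth kernel of equation 4.4 on synthetic data.
rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4))
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

k, beta = 5, 8
# eps_k(a): distance to the k-th nearest neighbor of a
# (column 0 of the sorted rows is the zero self-distance).
eps_k = np.sort(d, axis=1)[:, k]

K = (np.exp(-(d / eps_k[:, None]) ** beta)
     + np.exp(-(d / eps_k[None, :]) ** beta))   # symmetric by construction
```

Each of the two terms is at most 1, so the kernel takes values in $(0, 2]$ with diagonal exactly 2, and the sum of the two one-sided terms makes it symmetric even though $\varepsilon_k(a) \neq \varepsilon_k(b)$ in general.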
We constructed an RNMC using the PHATE kernel with $\alpha = 0$ in equation 2.3. We also constructed a PNMC with a biased stationary distribution. It is argued that the abundance profiles of individual cells are more variable in the stem cell state compared to the fully differentiated state (Teschendorff & Enver, 2017; Jin et al., 2018). Moreover, it is stipulated that assigning an abundance-variability-dependent stationary distribution to cells may lead to a better resolution of cell state dynamics (Jin et al., 2018). Accordingly, we imposed a stationary distribution on the PNMC related to the entropy of the abundance profile of the 18 transcription factors. If $x_{ij}$ is the abundance of factor $j$ in cell $i$, we estimated the entropy
$s_i = -\sum_j x_{ij} \log x_{ij}.$
(4.5)
The stationary distribution for a cell $i$ was set to
$p_i \propto \frac{1}{1 + \exp(-s_i)}.$
(4.6)
This stationary distribution favors cells with higher entropy.
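Equations 4.5 and 4.6 can be sketched as follows, assuming each cell's profile is normalized to sum to one so that the entropy formula applies (the letter does not state the normalization; the synthetic data are an assumption):

```python
import numpy as np

# Sketch of equations 4.5-4.6: assigning higher stationary weight to cells
# with more variable (higher-entropy) expression profiles.
rng = np.random.default_rng(8)
x = rng.random((100, 18))              # 100 cells, 18 transcription factors
x /= x.sum(axis=1, keepdims=True)      # normalize each cell's profile (assumption)

s = -np.sum(x * np.log(x), axis=1)     # per-cell entropy, equation 4.5
p = 1.0 / (1.0 + np.exp(-s))           # logistic weighting, equation 4.6
p /= p.sum()                           # normalize to a probability distribution
```

Because the logistic map is monotone, the ranking of cells by stationary probability coincides with their ranking by entropy.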

In Figure 2b, we show the diffusion maps constructed using the PNMC, and in Figure 2c those constructed using the RNMC. Notably, individual branches of the differentiation profile are better resolved with the PNMC. For example, the PreMegE cell type is better separated from its stem cell HSC (green $\to$ cyan). In the same vein, the average distance between different cell type clusters was higher for the PNMC by about $10\%$ compared to the RNMC (paired $t$-test, $p \sim 2 \times 10^{-7}$).

## 5  Conclusion

In typical stochastic kernel-based approaches, a local Markov chain (RNMC) is constructed on the data points via row normalization of a positive and symmetric kernel. In this letter, we introduced a global path-entropy-maximization-based alternative normalization approach. Notably, both the stationary distribution and the diffusive properties of the path-entropy-normalized Markov chain (PNMC) can be explicitly controlled by the user, allowing much greater flexibility with respect to the long-time properties of the Markov chain compared to the typical row-normalized Markov chain. We showed how imposition of user-prescribed constraints on the PNMC can be leveraged to gain more insights into the data—for example, in predicting reaction coordinates in complex systems or deciphering the differentiation trajectory of cells from single cell data.

On the one hand, the Markov chains introduced here maximize the path entropy over very long stationary-state trajectories. This may induce attraction between distant data points that have a high connectivity (see equation 3.8). On the other hand, the row-normalized Markov chain typically used in diffusion maps represents a maximum entropy Markov chain over a single time step. A straightforward generalization of the current work is to consider entropy maximization over a finite number of steps (Frank & Galinsky, 2014). Notably, recent work suggests that incorporating finite path statistics may improve the quality of dimensionality reduction. For example, Little, Maggioni, and Murphy (2017) have shown that modifying the definition of the pairwise distance to include the connectivity between data points can lead to better embedding properties, specifically for clusters of variable shapes. Steinerberger (2016) showed that optimally choosing transition probabilities from Markov chains over multiple path lengths effectively filters out unconnected data points.

## Appendix:  Derivation of Transition Probabilities

### A.1  When the Stationary Probabilities Are Constrained

We maximize the trajectory entropy,
$S = \sum_a p_a S_a = -\sum_{a,b} p_a q_{ab} \log q_{ab},$
(A.1)
subject to constraints
$\sum_b p_a q_{ab} = p_a, \quad \sum_{a,b} p_a q_{ab} = 1, \quad \sum_a p_a q_{ab} = p_b$
(A.2)
and
$\sum_{a,b} p_a q_{ab}\, d(a,b)^2 = \langle d(a,b)^2 \rangle = \bar d^2.$
(A.3)
We solve the constrained optimization problem using the method of Lagrange multipliers. We write the unconstrained optimization function,
$C = S + \sum_a l_a \left(\sum_b p_a q_{ab} - p_a\right) + \sum_b m_b \left(\sum_a p_a q_{ab} - p_b\right) - \frac{1}{2\varepsilon^2} \left(\sum_{a,b} p_a q_{ab}\, d(a,b)^2 - \bar d^2\right).$
(A.4)
Differentiating with respect to $q_{ab}$ and setting the derivative to zero,
$0 = -(\log q_{ab} + 1) + l_a + m_b - \frac{d(a,b)^2}{2\varepsilon^2}$
(A.5)
$\Rightarrow q_{ab} = \frac{\rho_a \lambda_b}{p_a} \exp\left(-\frac{d(a,b)^2}{2\varepsilon^2}\right),$
(A.6)
where $e^{l_a - 1} = \rho_a / p_a$ and $e^{m_b} = \lambda_b$. Notably, since $p_a q_{ab} = p_b q_{ba}$, we also have $\rho_a = \lambda_a\ \forall\, a$. Thus, the transition probabilities are given by (Dixit & Dill, 2014)
$q_{ab} = \frac{\rho_a \rho_b}{p_a} \exp\left(-\frac{d(a,b)^2}{2\varepsilon^2}\right).$
(A.7)

### A.2  When the Stationary Probabilities Are Not Constrained

When the stationary probabilities are not constrained, we maximize the path entropy with respect to $q_{ab}$ as well as $p_a$. Moreover, we have an additional constraint:
$\sum_{a,b} p_a q_{ab} = 1.$
(A.8)
We write the unconstrained optimization function as above:
$C = S + \sum_a l_a \left(\sum_b p_a q_{ab} - p_a\right) + \sum_b m_b \left(\sum_a p_a q_{ab} - p_b\right) - \frac{1}{2\varepsilon^2} \left(\sum_{a,b} p_a q_{ab}\, d(a,b)^2 - \bar d^2\right) + \delta \left(\sum_{a,b} p_a q_{ab} - 1\right).$
(A.9)
Differentiating with respect to $qab$,
$0 = -(\log q_{ab} + 1) + l_a + m_b + \delta - \frac{d(a,b)^2}{2\varepsilon^2}$
(A.10)
$\Rightarrow q_{ab} = \frac{\rho_a \lambda_b}{p_a} \exp\left(-\frac{d(a,b)^2}{2\varepsilon^2} + \delta\right).$
(A.11)
Differentiating with respect to $pa$,
$0 = -\sum_b q_{ab} \log q_{ab} + l_a \sum_b q_{ab} - l_a + \sum_b m_b q_{ab} - m_a - \frac{1}{2\varepsilon^2} \sum_b q_{ab}\, d(a,b)^2 + \delta \sum_b q_{ab}.$
(A.12)
From equations A.11 and A.12, we have
$l_a + m_a = 1.$
(A.13)
Thus,
$q_{ab} = \frac{\nu_{1b}}{\eta_1 \nu_{1a}} \exp\left(-\frac{d(a,b)^2}{2\varepsilon^2}\right),$
(A.14)
where $\nu_{1a} = e^{-l_a}$ and $\eta_1 = e^{-\delta}$. Imposing the normalization condition $\sum_b q_{ab} = 1$ identifies $\bar\nu_1$ as the Perron-Frobenius eigenvector of $\Delta$ and $\eta_1$ as the corresponding eigenvalue (Dixit, 2015).

## Acknowledgments

I thank Manas Rachh and Stefan Steinerberger for fruitful discussions about this letter. I also thank Ronald Coifman for pointing out the analogy with optimal transport.

## References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.

Bermanis, A., Wolf, G., & Averbuch, A. (2016). Diffusion-based kernel methods on Euclidean metric measure spaces. Applied and Computational Harmonic Analysis, 41, 190–213.

Chandler, D. (1987). Introduction to modern statistical mechanics. New York: Oxford University Press.

Coifman, R. R., & Hirn, M. J. (2014). Diffusion maps for changing data. Applied and Computational Harmonic Analysis, 36, 79–107.

Coifman, R. R., & Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21, 5–30.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 2292–2300). Red Hook, NY: Curran.

Dixit, P. D. (2015). Stationary properties of maximum-entropy random walks. Physical Review E, 92, 042149.

Dixit, P. D. (2018a). Communication: Introducing prescribed biases in out-of-equilibrium Markov models. Journal of Chemical Physics, 148, 091101.

Dixit, P. D. (2018b). Entropy production rate as a criterion for inconsistency in decision theory. Journal of Statistical Mechanics: Theory and Experiment, 5, 053408.

Dixit, P. D., & Dill, K. A. (2014). Inferring microscopic kinetic rates from stationary state distributions. Journal of Chemical Theory and Computation, 10, 3002–3005.

Dixit, P. D., & Dill, K. (2018). Caliber corrected Markov modeling (C2M2): Correcting equilibrium Markov models. Journal of Chemical Theory and Computation, 14(2), 1111–1119.

Dixit, P. D., Jain, A., Stock, G., & Dill, K. A. (2015). Inferring transition rates of networks from populations in continuous-time Markov processes. Journal of Chemical Theory and Computation, 11, 5464–5472.

Dixit, P. D., Wagoner, J., Weistuch, C., Pressé, S., Ghosh, K., & Dill, K. A. (2018). Perspective: Maximum caliber is a general variational principle for dynamical systems. Journal of Chemical Physics, 148, 010901.

Ferguson, A. L., Panagiotopoulos, A. Z., Debenedetti, P. G., & Kevrekidis, I. G. (2010). Systematic determination of order parameters for chain dynamics using diffusion maps. Proceedings of the National Academy of Sciences, 107, 13597–13602.

Ferguson, A. L., Panagiotopoulos, A. Z., Kevrekidis, I. G., & Debenedetti, P. G. (2011). Nonlinear dimensionality reduction in molecular simulation: The diffusion map approach. Chemical Physics Letters, 509, 1–11.

Frank, L. R., & Galinsky, V. L. (2014). Information pathways in a disordered lattice.
.
Physical Review, E
,
89
,
032142
.
Fukui
,
K.
(
1970
).
Formulation of the reaction coordinate
.
Journal of Physical Chemistry
,
74
,
4161
4163
.
Haghverdi
,
L.
,
Buettner
,
F.
, &
Theis
,
F. J.
(
2015
).
Diffusion maps for high-dimensional single-cell analysis of differentiation data
.
Bioinformatics
,
31
,
2989
2998
.
Hänggi
,
P.
,
Talkner
,
P.
, &
Borkovec
,
M.
(
1990
).
Reaction-rate theory: Fifty years after Kramers
.
Reviews of Modern Physics
,
62
,
251
.
Idel
,
M.
(
2016
).
A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps
.
arXiv:1609.06349
.
Jin
,
S.
,
MacLean
,
A. L.
,
Peng
,
T.
, &
Nie
,
Q.
(
2018
).
Scepath: Energy landscape-based inference of transition probabilities and cellular trajectories from single-cell transcriptomic data
.
Bioinformatics
,
1
,
10
.
LeCun
,
Y.
(
1998
).
The MNIST database of handwritten digits
. http://yann.lecun.com/exdb/mnist/
Little
,
A.
,
Maggioni
,
M.
, &
Murphy
,
J. M.
(
2017
).
Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms
.
arXiv:1712.06206
.
Maaten
,
L. v. d.
, &
Hinton
,
G.
(
2008
).
Visualizing data using t-SNE
.
Journal of Machine Learning Research
,
9
,
2579
2605
.
Moignard
,
V.
,
Macaulay
,
I. C.
,
Swiers
,
G.
,
Buettner
,
F.
,
Schütte
,
J.
,
Calero-Nieto
,
Göttgens
,
B.
(
2013
).
Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis
.
Nature Cell Biology
,
15
,
363
.
Moon
,
K. R.
,
van Dijk
,
D.
,
Wang
,
Z.
,
Burkhardt
,
D.
,
Chen
,
W.
,
van den Elzen
,
A.
, …
Krishna Swamy
,
S.
(
2017
).
Visualizing transitions and structure for high dimensional data exploration
.
bioRxiv
, p.
120378
.
Moon
,
K. R.
,
Stanley
,
J.
,
Burkhardt
,
D.
,
van Dijk
,
D.
,
Wolf
,
G.
, &
Krishnaswamy
,
S.
(
2017
).
Manifold learning-based methods for analyzing single-cell RNA-sequencing data
.
Current Opinion in Systems Biology
,
7
,
36
46
.
,
B.
,
Lafon
,
S.
,
Coifman
,
R. R.
, &
Kevrekidis
,
I. G.
(
2006
).
Diffusion maps, spectral clustering and reaction coordinates of dynamical systems
.
Applied and Computational Harmonic Analysis
,
21
,
113
127
.
Newman
,
M. E.
(
2008
).
The mathematics of networks
. In
The new Palgrave encyclopedia of economics
(
vol. 2
, pp.
1
12
).
London
:
Palgrave Macmillan
.
Olsson
,
S.
,
Wu
,
H.
,
Paul
,
F.
,
Clementi
,
C.
, &
Noé
,
F.
(
2017
).
Combining experimental and simulation data of molecular processes via augmented Markov models
.
Proceedings of the National Academy of Sciences
,
114
,
8265
8270
.
Panagiotopoulos
,
A. Z.
(
1992
).
Direct determination of fluid phase equilibria by simulation in the Gibbs ensemble: A review
.
Molecular Simulation
,
9
,
1
23
.
Peyré
,
G.
, &
Cuturi
,
M.
(
2017
).
Computational optimal transport
(
Technical Report
).
Pitera
,
J. W.
, &
Chodera
,
J. D.
(
2012
).
On the use of experimental observations to bias simulated ensembles
.
Journal of Chemical Theory and Computation
,
8
,
3445
3451
.
Rached
,
Z.
,
Alajaji
,
F.
, &
Campbell
,
L. L.
(
2004
).
The Kullback-Leibler divergence rate between Markov sources
.
IEEE Transactions on Information Theory
,
50
,
917
921
.
Sejnowski
,
T. J.
,
Churchland
,
P. S.
, &
Movshon
,
J. A.
(
2014
).
Putting big data to good use in neuroscience
.
Nature Neuroscience
,
17
,
1440
.
Steinerberger
,
S.
(
2016
).
A filtering technique for Markov chains with applications to spectral embedding
.
Applied and Computational Harmonic Analysis
,
40
,
575
587
.
Teschendorff
,
A. E.
, &
Enver
,
T.
(
2017
).
Single-cell entropy for accurate estimation of differentiation potency from a cell's transcriptome
.
Nature Communications
,
8
,
15599
.
Zass
,
R.
, &
Shashua
,
A.
(
2007
). Doubly stochastic normalization for spectral clustering. In
B.
Schölkopf
,
J. C.
Platt
, &
T.
Hoffman
(Eds.),
Advances in neural information processing systems
,
19
(pp.
1569
1576
).
Cambridge, MA
:
MIT Press
.