Stochastic kernel-based dimensionality-reduction approaches have become popular in the past decade. The central component of many of these methods is a symmetric kernel that quantifies the proximity between pairs of data points, together with a kernel-induced Markov chain on the data. Typically, the Markov chain is fully specified by the kernel through row normalization. However, in many cases, it is desirable to impose user-specified stationary-state and dynamical constraints on the Markov chain. Unfortunately, no systematic framework exists to impose such user-defined constraints. Here, based on our previous work on inference of Markov models, we introduce a path-entropy-maximization-based approach to derive the transition probabilities of Markov chains using a kernel and additional user-specified constraints. We illustrate the usefulness of these Markov chains with examples.
Recent technological advances allow the collection of large amounts of high-dimensional data across a range of scientific fields. Examples include gene expression levels in individual cells measured using single-cell RNA sequencing (Moon, Stanley et al., 2017), pixel intensities in handwritten images (LeCun, 1998), and collective neuronal firing data (Sejnowski, Churchland, & Movshon, 2014). It is quite often the case that the high-dimensional data are generated from a small set of underlying factors. As a result, it is possible to embed the data in a manifold of much lower dimension. A central task of dimensionality-reduction methods is to identify these manifolds from sampled data points.
An important class of recently developed dimensionality-reduction methods relies on a stochastic kernel-based approach. Here, one starts with a positive and symmetric affinity kernel that reflects the proximity between pairs of data points. The kernel forms the basis of a Markov chain on the data. Finally, a lower-dimensional representation is sought that preserves local neighborhoods of data points as quantified by the Markov chain. Popular examples of stochastic kernel-based methods include diffusion maps (Coifman & Lafon, 2006), Laplacian eigenmaps (Belkin & Niyogi, 2003), and t-distributed stochastic neighbor embedding (t-SNE) (Maaten & Hinton, 2008).
The Markov chain employed in these methods is obtained by row normalization of the kernel and is “local”: the probability of transitioning from point a to another point b depends only on the local neighborhood of point a. Moreover, the kernel fully specifies both the stationary distribution and the diffusion dynamics of the Markov chain. However, in many applications, it is desirable to impose user-specified constraints on the Markov chain, for example, a prescribed stationary distribution. Here are some examples.
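As a concrete illustration of the row normalization described above, here is a minimal NumPy sketch; the gaussian kernel, the synthetic data, and the bandwidth are our own illustrative choices, not values taken from this letter.

```python
import numpy as np

def rnmc(X, eps):
    """Row-normalized Markov chain (RNMC) from a gaussian kernel.

    X   : (n, d) array of data points
    eps : kernel bandwidth (an illustrative free parameter)
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-sq / eps)                  # positive, symmetric kernel
    P = K / K.sum(axis=1, keepdims=True)   # row normalization: each row sums to one
    return P

# The transition probability P[a, b] depends only on the neighborhood
# of point a, which is what makes the chain "local."
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
P = rnmc(X, eps=1.0)
```

Because the kernel is symmetric, this chain is reversible with respect to the distribution proportional to the kernel row sums, which is what fixes its stationary state without any user input.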
All-atom molecular dynamics (MD) simulations are regularly used to explore the high-dimensional conformational landscape of complex biomolecules. In many cases, the important dynamics can be projected on a lower-dimensional reaction coordinate (Hänggi, Talkner, & Borkovec, 1990). Diffusion maps have been used to identify these reaction coordinates from MD simulations (Nadler, Lafon, Coifman, & Kevrekidis, 2006; Ferguson, Panagiotopoulos, Debenedetti, & Kevrekidis, 2010; Ferguson, Panagiotopoulos, Kevrekidis, & Debenedetti, 2011). In many cases, the ensemble of conformations obtained using an MD simulation may fail to reproduce experimental observables (Pitera & Chodera, 2012; Olsson, Wu, Paul, Clementi, & Noé, 2017; Dixit & Dill, 2018). Thus, computational approaches are required to appropriately bias the Markov chain in order to obtain reaction coordinates that are consistent with experimental information.
Another example is from analysis of single cell gene expression data (Moon, Stanley et al., 2017). In any particular tissue of a multicellular organism, cells are found in different stages of differentiation, ranging from stem cells to fully differentiated cells. In recent years, diffusion maps have been introduced to extract patterns of lineage differentiation from single cell gene expression profiles (Haghverdi, Buettner, & Theis, 2015). Recent work has stipulated that the gene expression profiles of stem-like cells are more varied compared to the fully differentiated cells (Teschendorff & Enver, 2017) and assigning expression-dependent stationary probabilities to cells leads to a clearer identification of cell states and dynamics among those states (Jin, MacLean, Peng, & Nie, 2018).
Notably, in these and other examples, there does not exist a systematic framework to manipulate the Markov chains used in stochastic kernel-based dimensionality-reduction methods according to user-prescribed constraints. Previously, we have introduced a path-entropy-based approach to modify Markov chains subject to constraints, focusing on applications in biophysics and statistical physics (Dixit et al., 2018). In those works, one maximizes the entropy of the distribution over stochastic trajectories of the Markov chain subject to user-specified constraints. We have used these path-entropy-maximized Markov chains (PNMCs) to model statistical dynamics of biomolecular conformations (Dixit & Dill, 2014; Dixit, Jain, Stock, & Dill, 2015) and biochemical reaction networks (Dixit, 2018a), and to quantify inconsistencies in multicriteria decision-making problems (Dixit, 2018b).
The objective of this letter is as follows. First, we show that the transition probabilities of the row-normalized Markov chain commonly used in stochastic dimensionality reduction constitute a local maximum entropy probability distribution. Next, we implement the previously derived maximum path entropy Markov processes as an alternative to the row-normalized Markov chain. Specifically, based on our previous work, we illustrate how to obtain the transition probabilities of a global path-entropy-maximized Markov chain subject to user-specified constraints. We illustrate the advantages of these path-entropy-maximized Markov chains using two examples: constructing the reaction coordinate associated with the phase transition of a near-critical Ising model and constructing the lineage differentiation tree of a cell population from single cell data.
2 Background and Notation
We give a brief description of the traditionally used Markov chains in stochastic kernel-based methods.
2.1 Row Normalized Markov Chain (RNMC) on the Data
The parameter $\alpha$ tunes the relative importance of the geometry of the lower-dimensional manifold and the density statistics of data points on it (Nadler et al., 2006). For concreteness, consider that the data points are generated according to a Fokker-Planck equation on some domain with an equilibrium distribution $p(x)$. In the limit of infinitely many data points, the limiting diffusion process corresponding to the backward operator of equation 2.3 approaches a Fokker-Planck equation as well (Nadler et al., 2006). Notably, $\alpha$ controls its stationary distribution. Specifically, $\alpha = 0$ corresponds to a Fokker-Planck equation with a stationary distribution $\propto p^2(x)$, and $\alpha = 1/2$ corresponds to a Fokker-Planck equation with a stationary distribution $p(x)$. In contrast, $\alpha = 1$ corresponds to a Fokker-Planck equation with a constant stationary distribution on the domain (Nadler et al., 2006). We note that the correspondence holds only in the limit of infinite data. For example, the discrete-time, discrete-state Markov chain defined by the transition probabilities in equation 2.3 at $\alpha = 1$ does not have a uniform stationary distribution over the data points. For the rest of the letter, we omit the $\alpha$ dependence of the kernel for brevity and specify the value of $\alpha$ whenever necessary.
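The anisotropic normalization underlying this family of chains (Coifman & Lafon, 2006) can be sketched as follows in NumPy; the kernel construction in the helper below is generic, and any specific data set or bandwidth used with it is an illustrative assumption.

```python
import numpy as np

def alpha_normalized_chain(K, alpha):
    """Anisotropic (alpha) normalization used in diffusion maps.

    K     : (n, n) positive, symmetric kernel matrix
    alpha : density normalization parameter in [0, 1]
    """
    q = K.sum(axis=1)                            # kernel density estimate at each point
    K_a = K / np.outer(q ** alpha, q ** alpha)   # renormalized kernel, still symmetric
    P = K_a / K_a.sum(axis=1, keepdims=True)     # row-normalized Markov chain
    pi = K_a.sum(axis=1) / K_a.sum()             # stationary distribution (chain is reversible)
    return P, pi
```

Running this on a nonuniformly sampled data set illustrates the point made above: at finite $n$, the $\alpha = 1$ chain generally does not have an exactly uniform stationary distribution.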
2.2 RNMC Is a Local Maximum Entropy Markov Chain
3 Path-Normalized Markov Chains
How do we incorporate user-specified constraints in addition to the constraint in equation 2.6 on the Markov chain? In this section, we provide a brief overview of the solution to this problem based on our previous work (Dixit & Dill, 2014, 2018; Dixit et al., 2015; Dixit, 2015, 2018a). For concreteness, we denote by $k_{ab}$ the transition probabilities of the sought Markov chain and by $p_a$ its stationary distribution. We may either infer a Markov chain de novo from specified constraints (Dixit & Dill, 2014; Dixit et al., 2015; Dixit, 2015) or obtain a least-deformed updated Markov chain with respect to a given prior Markov chain with its own stationary distribution (Dixit & Dill, 2018; Dixit, 2018a).
At this stage, we have the option of constraining the stationary distribution (Dixit & Dill, 2014; Dixit et al., 2015). Alternatively, we can maximize the entropy with respect to both the transition probabilities and the stationary distribution (Dixit, 2015). Notably, these two choices lead to qualitatively different Markov chains.
3.1 Unknown Stationary Distribution
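The closed form for this case is derived in appendix A.2; our reading is that it takes the maximal-entropy-random-walk form familiar from Burda, Duda, Luck, and Waclaw (2009), with transition probabilities built from the dominant eigenpair of the kernel matrix. The sketch below is written under that assumption and is not a verbatim transcription of the appendix.

```python
import numpy as np

def pnmc_free_stationary(K):
    """Global path-entropy-maximized chain when the stationary
    distribution is left unconstrained (maximal-entropy-random-walk form).

    Transition probabilities: k_ab = K_ab * v_b / (lam * v_a), where
    (lam, v) is the dominant (Perron) eigenpair of the symmetric kernel K.
    """
    w, V = np.linalg.eigh(K)           # eigenvalues in ascending order
    lam, v = w[-1], np.abs(V[:, -1])   # Perron eigenpair; entries share one sign
    k = K * v[None, :] / (lam * v[:, None])
    p = v ** 2 / (v ** 2).sum()        # stationary distribution of the chain
    return k, p
```

Here the stationary probability of a point is proportional to the squared Perron eigenvector entry, so highly connected regions of the data receive more stationary weight, consistent with the "global" character of the chain.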
3.2 Imposing User-Prescribed Stationary Distribution
We note that the path entropy–based approach allows us to enforce a uniform stationary distribution over data points even when the number of data points is finite. We contrast this with the $\alpha$-dependent family of Markov chains introduced above (see equations 2.3 and 2.4): the $\alpha = 1$ chain converges to a uniform distribution only in the limit of infinitely many data points. Notably, the infinite-data limit of the PNMC with a uniform stationary distribution converges to the same Fokker-Planck equation as the $\alpha = 1$ limit (Dixit & Dill, 2018).
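Consistent with the optimal-transport connection discussed in section 3.4, our reading of the prescribed-stationary-distribution case is that it amounts to a symmetric Sinkhorn-type diagonal scaling of the kernel: find a positive vector x with x_a * sum_b K_ab x_b = p_a and set k_ab = x_a K_ab x_b / p_a. The sketch below, including the fixed-point iteration and the example data, is ours and written under that assumption.

```python
import numpy as np

def pnmc_prescribed_stationary(K, p, n_iter=2000):
    """Markov chain built from kernel K with user-prescribed stationary
    distribution p via symmetric diagonal scaling (Sinkhorn-type).

    Solves x_a * sum_b K_ab x_b = p_a for positive x, then sets
    k_ab = x_a K_ab x_b / p_a; the result is reversible with stationary p.
    """
    x = np.ones(len(p))
    for _ in range(n_iter):
        x = np.sqrt(x * p / (K @ x))     # damped fixed-point update for the scaling
    W = x[:, None] * K * x[None, :]      # symmetric scaled kernel; row sums approach p
    return W / p[:, None]

# Enforcing a uniform stationary distribution at finite n:
rng = np.random.default_rng(1)
X = rng.uniform(size=(40, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)
p = np.full(40, 1.0 / 40)
k = pnmc_prescribed_stationary(K, p)
```

This makes concrete the contrast drawn above: the uniform stationary distribution holds at finite n, not only in the infinite-data limit.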
3.3 Updating a Prior Markov Chain
3.4 Connection with Optimal Transport
4.1 Constructing the Reaction Coordinate of a Near-Critical Ising Model
Often in a physical system at equilibrium with its surroundings, the large-scale fluctuations in the high-dimensional configuration space happen along a manifold of much smaller dimension, usually termed the reaction coordinate (Fukui, 1970). A key question then is to infer these reaction coordinates near the critical temperature. To investigate this, we perform two calculations with the Ising model: one away from the critical point and one near it. At each temperature, we sampled 2000 spin configurations using equation 4.1. Next, we constructed a gaussian kernel based on the distance between pairs of spin configurations. We fixed $\alpha$ (see equation 2.3) and chose the kernel bandwidth to be the 10th percentile of all pairwise distances between spin configurations.
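For readers who want to reproduce the qualitative setup, here is a sketch of the sampling and kernel construction. The lattice size, temperature, and sweep counts below are placeholders rather than the values used in this letter, and we assume equation 4.1 refers to standard Metropolis sampling of the Boltzmann distribution.

```python
import numpy as np

def metropolis_ising(L, T, n_samples, sweeps_between, rng):
    """Sample spin configurations of an L x L Ising model (J = 1, no field)
    with Metropolis single-spin flips and periodic boundaries."""
    s = rng.choice([-1, 1], size=(L, L))
    samples = []
    for _ in range(n_samples):
        for _ in range(sweeps_between * L * L):
            i, j = rng.integers(L, size=2)
            # energy change for flipping spin (i, j)
            nb = s[(i + 1) % L, j] + s[(i - 1) % L, j] + s[i, (j + 1) % L] + s[i, (j - 1) % L]
            dE = 2 * s[i, j] * nb
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                s[i, j] *= -1
        samples.append(s.flatten().copy())
    return np.array(samples, dtype=float)

rng = np.random.default_rng(2)
S = metropolis_ising(L=8, T=3.0, n_samples=100, sweeps_between=5, rng=rng)  # placeholder values
# Gaussian kernel with bandwidth set to the 10th percentile of pairwise distances,
# mirroring the bandwidth rule described above.
sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
eps = np.percentile(np.sqrt(sq[np.triu_indices(len(S), 1)]), 10) ** 2
K = np.exp(-sq / eps)
```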
Next, we constructed the RNMC according to equation 2.3 and obtained its first two diffusion maps using the standard approach. Panels a1 and b1 in Figure 1 show these two diffusion maps for the two temperatures. The colors of individual points indicate the net magnetization per spin. It is clear that the first diffusion map correlates strongly with the average magnetization (see panels a2 and b2 of Figure 1). Interestingly, the second diffusion map correlates with the total energy close to the critical point but not away from it (see panels a3 and b3 of Figure 1). In other words, the diffusion map approach identifies the average spin and the total energy as the two important reaction coordinates of the Ising model close to the critical temperature.
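The standard approach for extracting diffusion maps from the RNMC is an eigendecomposition; a minimal sketch follows, exploiting the similarity of the RNMC to a symmetric matrix for numerical stability.

```python
import numpy as np

def diffusion_maps(K, n_coords=2):
    """First nontrivial diffusion-map coordinates of the RNMC P = D^{-1} K.

    P is similar to the symmetric matrix S = D^{-1/2} K D^{-1/2},
    so its spectrum can be computed stably with eigh.
    """
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    w, V = np.linalg.eigh(S)
    order = np.argsort(w)[::-1]            # sort eigenvalues in descending order
    w, V = w[order], V[:, order]
    psi = V / np.sqrt(d)[:, None]          # right eigenvectors of P
    # Skip the trivial constant eigenvector (eigenvalue 1); scale by eigenvalue.
    return w[1:n_coords + 1], psi[:, 1:n_coords + 1] * w[1:n_coords + 1]
```

The leading eigenvector is constant (eigenvalue 1) and carries no information, so the embedding starts from the second eigenvector; the first two nontrivial coordinates are the kind of quantities plotted in Figure 1.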
Notably, as seen in panel c1 of Figure 1, the biased Markov chain captures the shape of the two-dimensional reaction coordinate as estimated using the standard diffusion maps approach. Moreover, panels c2 and c3 show that in agreement with the simulation near the critical temperature (see panels b1, b2, and b3), the first diffusion map of the PNMC corresponds to the average magnetization, and the second diffusion map corresponds to the total energy. In other words, the biased PNMC is able to reproduce the reaction coordinates close to the critical temperature. As noted in section 1, diffusion maps are often used to explore the lower-dimensional reaction coordinates from conformations sampled from MD simulations (Ferguson et al., 2011). Our work suggests that imposing user-prescribed constraints on the Markov chain allows us to systematically explore the dependency of the reaction coordinates on perturbations to the system, such as changes in temperature.
4.2 Single Cell Gene Expression in Mouse Hematopoietic Stem Cells
Next, we looked at transcription factor abundance profiles at the single cell level in mouse hematopoietic stem cells (Moignard et al., 2013). Data were collected on 597 cells from five different cell types: hematopoietic stem cell (HSC), lymphoid-primed multipotent progenitor (LMPP), megakaryocyte-erythroid progenitor (PreMegE), common lymphoid progenitor (CLP), and granulocyte-monocyte progenitor (GMP). The known differentiation map (Moignard et al., 2013) of the cell types is given in Figure 2.
In Figure 2a, we show the diffusion maps constructed using the PNMC and in Figure 2b those constructed using the RNMC. Notably, individual branches of the differentiation profile are better resolved with the PNMC. For example, the PreMegE cell type is better separated from its stem cell type HSC. In the same vein, the average distance between different cell type clusters was higher for the PNMC than for the RNMC (paired $t$-test).
In typical stochastic kernel-based approaches, a local Markov chain (RNMC) is constructed on the data points via row normalization of a positive and symmetric kernel. In this letter, we introduced a global path-entropy-maximization-based alternative normalization approach. Notably, both the stationary distribution and the diffusive properties of the path-entropy-normalized Markov chain (PNMC) can be explicitly controlled by the user, allowing much greater flexibility with respect to the long-time properties of the Markov chain compared to the typical row-normalized Markov chain. We showed how imposition of user-prescribed constraints on the PNMC can be leveraged to gain more insights into the data—for example, in predicting reaction coordinates in complex systems or deciphering the differentiation trajectory of cells from single cell data.
On the one hand, the Markov chains introduced here maximize the path entropy over very long stationary state trajectories. This may induce attraction between distant data points that have a high connectivity (see equation 3.8). On the other hand, the row-normalized Markov chain typically used in diffusion maps represents a maximum entropy Markov chain over a single time step. A straightforward generalization of the current work is to consider entropy maximization over a finite number of steps (Frank & Galinsky, 2014). Notably, recent work suggests that incorporating finite path statistics may improve the quality of dimensionality reduction. For example, Little, Maggioni, and Murphy (2017) have shown that modifying the definition of the pairwise distance to include the connectivity between data points can lead to better embedding properties, specifically for clusters of variable shapes. Steinerberger (2016) showed that optimally choosing transition probabilities from Markov chains with multiple path lengths effectively filters out unconnected data points.
Appendix: Derivation of Transition Probabilities
A.1 When the Stationary Probabilities Are Constrained
A.2 When the Stationary Probabilities Are Not Constrained
I thank Manas Rachh and Stefan Steinerberger for fruitful discussions about this letter. I also thank Ronald Coifman for pointing out the analogy with optimal transport.