## Abstract

We propose a new divergence on the manifold of probability distributions, building on the entropic regularization of optimal transportation problems. As Cuturi (2013) showed, regularizing the optimal transport problem with an entropic term brings several computational benefits. However, because of that regularization, the resulting approximation of the optimal transport cost does not define a proper distance or divergence between probability distributions. We recently tried to introduce a family of divergences connecting the Wasserstein distance and the Kullback-Leibler divergence from an information geometry point of view (see Amari, Karakida, & Oizumi, 2018). However, that proposal was not able to retain key intuitive aspects of the Wasserstein geometry, such as translation invariance, which plays a key role in the more general problem of computing optimal transport barycenters. The divergence we propose in this work retains such properties and admits an intuitive interpretation.

## 1 Introduction

Two major geometrical structures have been introduced on the probability simplex, the manifold of discrete probability distributions. The first one is based on the principle of invariance, which requires that the geometry between probability distributions must be invariant under invertible transformations of random variables. That viewpoint is the cornerstone of the theory of information geometry (Amari, 2016), which acts as a foundation for statistical inference. The second direction is grounded on the theory of optimal transport, which exploits prior geometric knowledge on the base space in which random variables are valued (Villani, 2003). Computing optimal transport amounts to obtaining a coupling between two random variables that is optimal in the sense that it has a minimal expected transportation cost between the first and second variables.

However, computing that solution can be challenging and is usually carried out by solving a linear program. Cuturi (2013) considered a relaxed formulation of optimal transport, in which the negative entropy of the coupling is used as a regularizer. We call that approximation of the original optimal transport cost the $C$ function. Entropic regularization provides two major advantages. First, the resolution of regularized optimal transport problems, which relies on Sinkhorn's algorithm (Sinkhorn, 1964), can be trivially parallelized and is usually faster by several orders of magnitude than the exact solution of the linear program. Second, unlike the original optimal transport geometry, regularized transport distances are differentiable functions of their inputs, even when the latter are discrete, a property that can be exploited in problems arising from pattern classification and clustering (Cuturi & Doucet, 2014; Cuturi & Peyré, 2016).

The Wasserstein distance between two distributions reflects the metric properties of the base space on which a pattern is defined. Conventional information-theoretic divergences such as the Kullback-Leibler (KL) divergence and the Hellinger divergence fail to capture the geometry of the base space. Therefore, the Wasserstein geometry is useful for applications where the geometry of the base space plays an important role, notably in computer vision, where an image pattern is defined on the two-dimensional base space $\mathbb{R}^2$. Cuturi and Doucet (2014) is a pioneering work showing that the Wasserstein barycenter of various patterns can summarize the common shape of a database of exemplar patterns. More advanced inference tasks also use the $C$ function as an output loss (Frogner, Zhang, Mobahi, Araya, & Poggio, 2015), a model fitting loss (Antoine, Marco, & Gabriel, 2016), or a way to learn mappings (Courty, Flamary, Tuia, & Rakotomamonjy, 2017). The Wasserstein geometry was also shown to be useful to estimate generative models, under the name of the Wasserstein GAN (Arjovsky, Chintala, & Bottou, 2017) and related contributions (Aude, Gabriel, & Marco, 2018; Salimans, Zhang, Radford, & Metaxas, 2018). It was also applied to economics and ecological systems (Muzellec, Nock, Patrini, & Nielsen, 2016). All in all, the field of “Wasserstein statistics” is becoming an interesting research topic for exploration.

The $C$ function suffers, however, from a few issues. It is neither a distance nor a divergence, notably because comparing a probability measure with itself does not result in a null discrepancy: if $p$ belongs to the simplex, then $C(p,p) \neq 0$. More worrisome, the minimizer of $C(p,q)$ with respect to $q$ is not reached at $q=p$. To solve these issues, we proposed a first attempt at unifying the information and optimal transport geometrical structures in Amari, Karakida, and Oizumi (2018). However, the information-geometric divergence introduced in that previous work loses some of the beneficial properties inherent in the $C$-function. For example, the $C$-function can be used to extract a common shape as the barycenter of a number of patterns (Cuturi & Doucet, 2014), which our former proposal was not able to do. Therefore, it is desirable to define a new divergence from $C$, in the rigorous sense that it is minimized when comparing a measure with itself and, preferably, convex in both arguments, while still retaining the attractive properties of optimal transport.
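To see why the minimizer is displaced from $q=p$, it may help to write out the first-order condition when only the first marginal is fixed. The sketch below assumes the convention $C_\lambda(p,q)=\min_P \langle M,P\rangle-\lambda H(P)$ with $H(P)=-\sum_{i,j}P_{ij}\log P_{ij}$ and ground cost matrix $M$; the exact scaling of $\lambda$ used in this article may differ, and the multipliers $\alpha_i$ are introduced here only for illustration.

$$
\min_{q}\, C_\lambda(p,q) \;=\; \min_{P \ge 0,\ \sum_j P_{ij} = p_i}\ \sum_{i,j} P_{ij} M_{ij} + \lambda \sum_{i,j} P_{ij} \log P_{ij}.
$$

Setting the derivative of the Lagrangian with respect to $P_{ij}$ to zero gives $M_{ij} + \lambda(\log P_{ij} + 1) - \alpha_i = 0$, so that, after enforcing the row constraints,

$$
P^*_{ij} \;=\; p_i\, \frac{e^{-M_{ij}/\lambda}}{\sum_k e^{-M_{ik}/\lambda}}, \qquad q^*_j \;=\; \sum_i P^*_{ij}.
$$

Under this convention, each unit of mass at $i$ is spread over all $j$ with Gibbs weights, so $q^*$ is a diffused version of $p$ and in general differs from $p$.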

We propose in this article such a new divergence between probability distributions $p$ and $q$ that is inspired by optimal transport while incorporating elements of information geometry. Its basic ingredient remains the entropic regularization of optimal transport. We show that the barycenters obtained with that new divergence are more sharply defined than those obtained with the original $C$-function, while still keeping the shape-location decomposition property. This article focuses on theoretical aspects, with only very preliminary numerical simulations. Detailed characteristics and simulations of the proposed divergence will be given in the future.

## 2 $C$-Function: Entropy-Regularized Optimal Transportation Plan

The optimal transportation plan is given in the following theorem (Cuturi & Peyré, 2016; Amari et al., 2018).

In our previous paper (Amari et al., 2018), we studied the information geometry of the manifold of optimal transportation plans. We proposed a family of divergences that combines the KL divergence and the Wasserstein distance. However, these divergences are closer in spirit to the KL divergence and therefore lose some crucial properties of the $C$-function. In this work, we define a new family of divergences directly from the $C$-function.
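As a concrete illustration of the $C$-function, the following minimal sketch computes an entropy-regularized plan with Sinkhorn's scaling iterations and evaluates the regularized cost. It assumes the convention $\langle M,P\rangle-\lambda H(P)$ with kernel $\exp(-M_{ij}/\lambda)$, which may differ from this article's scaling of $\lambda$; the function names `sinkhorn` and `regularized_cost` are hypothetical and chosen for this example only.

```python
import math

def sinkhorn(p, q, M, lam, n_iter=500):
    """Approximate the entropy-regularized optimal plan by Sinkhorn scaling."""
    n, m = len(p), len(q)
    # Gibbs kernel K_ij = exp(-M_ij / lam); this scaling of lam is an assumption.
    K = [[math.exp(-M[i][j] / lam) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Rescale rows so that the row sums of diag(u) K diag(v) equal p.
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the column sums equal q.
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def regularized_cost(P, M, lam):
    """Evaluate <M, P> - lam * H(P), with H(P) = -sum P log P."""
    transport = sum(P[i][j] * M[i][j]
                    for i in range(len(P)) for j in range(len(P[0])))
    neg_entropy = sum(x * math.log(x) for row in P for x in row if x > 0.0)
    return transport + lam * neg_entropy

# Three points on a line with squared-distance ground cost.
M = [[float((i - j) ** 2) for j in range(3)] for i in range(3)]
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
P = sinkhorn(p, q, M, lam=1.0)
cost_pq = regularized_cost(P, M, lam=1.0)
cost_pp = regularized_cost(sinkhorn(p, p, M, lam=1.0), M, lam=1.0)
```

In this toy setting, `cost_pp` is not zero, which illustrates the issue raised in the introduction that comparing a measure with itself does not give a null discrepancy.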

## 3 Divergence Derived from $C$-Function

$\square$

We define a new family of divergences $D_\lambda(p,q)$ that also depend on $\lambda$.

Figure 2 compares $C_\lambda$ and $D_\lambda$ in $S_{n-1}$. The following theorem is obtained (the proof is given in appendix A).

$D_\lambda[p:q]$ is a convex function with respect to $p$ and $q$, satisfying the constraints of equation 3.1. It converges to the Wasserstein distance as $\lambda \to 0$.

## 4 Behavior of $\tilde{K}_\lambda$

The divergence $D_\lambda$ is defined through $\tilde{K}_\lambda$. We study properties of $\tilde{K}_\lambda$, including the two limiting cases $\lambda \to 0$ and $\lambda \to \infty$.
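The two limiting cases can be checked numerically on a toy example. The sketch below assumes that $\tilde{K}_\lambda$ acts as a row-normalized Gibbs kernel $\exp(-M_{ij}/\lambda)$ built from a ground cost $M$ with zero diagonal; this form and the function names `diffusion_operator` and `diffuse` are assumptions made for illustration, not definitions taken from this article.

```python
import math

def diffusion_operator(M, lam):
    """Row-normalized Gibbs kernel: Kt[i][j] = exp(-M_ij/lam) / sum_k exp(-M_ik/lam)."""
    Kt = []
    for row in M:
        weights = [math.exp(-c / lam) for c in row]
        total = sum(weights)
        Kt.append([w / total for w in weights])
    return Kt

def diffuse(Kt, p):
    """Return q with q_j = sum_i p_i * Kt[i][j]."""
    return [sum(p[i] * Kt[i][j] for i in range(len(p)))
            for j in range(len(Kt[0]))]

# Three points on a line with squared-distance ground cost (zero diagonal).
M = [[float((i - j) ** 2) for j in range(3)] for i in range(3)]
p = [1.0, 0.0, 0.0]

q_small = diffuse(diffusion_operator(M, 0.01), p)  # small lam: almost no diffusion
q_large = diffuse(diffusion_operator(M, 1e6), p)   # large lam: nearly uniform
```

With this choice, `q_small` stays close to `p`, while `q_large` is close to the uniform distribution, matching the qualitative picture of a diffusion whose strength grows with $\lambda$.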

## 5 Right Barycenter of Patterns

We consider the barycenter of image patterns represented as probability measures $p=p(\xi)$ on the plane $\xi=(x,y) \in \mathbb{R}^2$ using divergence $D_\lambda$. The plane is discretized into a grid of $n \times m$ pixels, and therefore $p$ is a probability vector of size $nm$.
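The discretization can be made concrete as follows. This sketch flattens an $n \times m$ image into a probability vector of size $nm$ and builds a ground cost matrix between pixels, assuming the quadratic Euclidean cost on integer pixel coordinates; the helper names `grid_cost_matrix` and `flatten_image` are hypothetical.

```python
def grid_cost_matrix(n, m):
    """Squared Euclidean cost between all pairs of pixels of an n-by-m grid."""
    # Pixel (x, y) is stored at flat index x * m + y.
    coords = [(x, y) for x in range(n) for y in range(m)]
    return [[float((x1 - x2) ** 2 + (y1 - y2) ** 2) for (x2, y2) in coords]
            for (x1, y1) in coords]

def flatten_image(image):
    """Turn a nonnegative n-by-m array into a probability vector of size n*m."""
    flat = [float(v) for row in image for v in row]
    total = sum(flat)
    return [v / total for v in flat]

M = grid_cost_matrix(2, 3)                   # 6 x 6 cost matrix
p = flatten_image([[0, 1, 1], [0, 2, 0]])    # probability vector of size 6
```

The cost matrix grows as $(nm)^2$, so this dense construction is only practical for small grids.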

The image $\tilde{K}_\lambda S_{n-1}$ is a simplex sitting inside $S_{n-1}$. Since the $C$-barycenter $q_C^*$ is not necessarily inside $\tilde{K}_\lambda S_{n-1}$, we need to solve the $D$-barycenter problem, equation 5.3, under the constraint that $\tilde{q}$ lies inside $\tilde{K}_\lambda S_{n-1}$. When the $C$-barycenter $q_C^*$ is inside $\tilde{K}_\lambda S_{n-1}$, the $D$-barycenter $q_D^*$ is simply given by equation 5.5, which is more localized or sharper than $q_C^*$. When $q_C^*$ is not inside $\tilde{K}_\lambda S_{n-1}$, the solution of equation 5.3 is close to the boundary of the simplex $\tilde{K}_\lambda S_{n-1}$, giving a sharper version.

The right $D$-barycenter $q_D^*(S)$ of $S$ is a sharper (more localized) version of the $C$-barycenter.

Agueh and Carlier (2011) showed that when the ground metric is the quadratic Euclidean distance, the $W$-barycenter has the property that its shape is determined from the shapes of each element $p_i$ in $S$, but does not depend on their location; namely, it is translation invariant. The $C$-barycenter also inherits this property, and we show that so does the right $D$-barycenter.

(shape-location separation theorem). The barycenters $q_C^*(\xi)$ and $q_D^*(\xi)$ of $p_1(\xi), \ldots, p_n(\xi)$ are located at the barycenter of $\xi_1, \ldots, \xi_n$, and their shapes are given by the respective barycenters of $\bar{p}_1(\xi), \ldots, \bar{p}_n(\xi)$.

$\square$

We show a simple example where the $p_i$ are shifted copies of a single pattern $p$, a cat shape (see Figure 3a). Its $C$-barycenter has a blurred shape $K_\lambda p$ (see Figure 3b), tending to the uniform distribution as $\lambda \to \infty$. However, the shape of the right $D$-barycenter is exactly the same as $p$ (see Figure 3c).

## 6 Left Barycenter of Patterns

Since we use the asymmetric divergence $D_\lambda$ to define a barycenter, we may consider another barycenter by putting the unknown barycenter in the left argument of $D_\lambda$.

*entropic sharpening*, without proving that the resulting problem was convex. Our work shows that up to a given strength, the entropic sharpening of regularized Wasserstein barycenters remains a convex problem.

The shape-location separation theorem, theorem 6, also holds for the left barycenter. We note that $q_i$ and $\tilde{K}_\lambda q_i$ have the same center. So we shift all $\tilde{q}_i=\tilde{K}_\lambda q_i$ to a common location $\xi$. Since the additional term $C_\lambda(p, K_\lambda \tilde{p})$ is shift invariant, the terms concerning the shapes and locations are separated, as shown in equation 6.1. An example of the left $D$-barycenter of four patterns is shown in Figures 4 and 5.

## 7 Preliminary Simulations

It is important to study properties of the two $D$-barycenters, theoretically as well as by simulations. However, large-scale simulations require substantial computational resources. Here, as a preliminary study, we use the simple exponentiated gradient descent algorithm to obtain the $D$-barycenters. We will present a more detailed study in forthcoming papers.
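For readers unfamiliar with exponentiated gradient descent, the following generic sketch shows the multiplicative update on the probability simplex. The objective is a toy quadratic used as a stand-in; it is not the $D$-barycenter objective, and the function name, step size, and iteration count are assumptions chosen for illustration.

```python
import math

def exponentiated_gradient(grad, q0, step=0.5, n_iter=2000):
    """Minimize over the probability simplex with multiplicative updates."""
    q = list(q0)
    for _ in range(n_iter):
        g = grad(q)
        # Multiplicative update, then renormalization back onto the simplex.
        w = [qi * math.exp(-step * gi) for qi, gi in zip(q, g)]
        total = sum(w)
        q = [wi / total for wi in w]
    return q

# Toy objective (a stand-in only): f(q) = 0.5 * ||q - target||^2.
target = [0.2, 0.3, 0.5]

def toy_gradient(q):
    return [qi - ti for qi, ti in zip(q, target)]

q_star = exponentiated_gradient(toy_gradient, q0=[1 / 3, 1 / 3, 1 / 3])
```

Because the update is multiplicative, iterates stay strictly positive and sum to one without an explicit projection step, which is the main appeal of this scheme on the simplex.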

We study a toy model searching for the barycenters of the two patterns: a large square and a small square, shown in Figure 6a. We compare the $C$-, right $D$-, and left $D$-barycenters shown in Figure 6b. Both $D$-barycenters give a sharper result compared to the $C$-barycenter for various $\lambda $. However, the right and left $D$-barycenters are very similar. Further studies are necessary to compare the two.

## 8 Conclusion

We defined new divergences between two probability distributions based on the $C$-function, which is the entropy-regularized cost function (Cuturi, 2013). Although it is useful in many applications, it does not satisfy the criterion of a distance or divergence. We defined a new divergence function $D_\lambda$ derived from $C_\lambda$, which works better than the original $C_\lambda$ for some problems, in particular the barycenter problem. We proved that the minimizer of $C_\lambda(p,q)$ is given by $\tilde{K}_\lambda p$, where $\tilde{K}_\lambda$ is a diffusion operator depending on the base metric $M$. We studied properties of $\tilde{K}_\lambda$, showing how it changes as $\lambda$ increases and elucidating properties of $D_\lambda$.

We applied $D_\lambda$ to obtain the barycenter of a cluster of image patterns. We proved that the right and left $D$-barycenters retain the property that the shapes and locations of patterns are separated, a merit of the $C$-function-based barycenter. Moreover, the $D$-barycenters give an even sharper shape than the $C$-barycenter.

Computing $D$-barycenters poses a few novel numerical challenges. The algorithm we have proposed, essentially a simple multiplicative gradient descent, is not as efficient as that proposed by Benamou et al. (2014) for the $C$-barycenter. We believe there is room to improve our approach and leave this for future work.

## Appendix A: Proof of Convexity of $D_\lambda$

By using equation 3.5 and $C_0(p,p)=0$, we can confirm that $D_\lambda$ converges to the Wasserstein distance as $\lambda \to 0$.