Abstract

We propose a new divergence on the manifold of probability distributions, building on the entropic regularization of optimal transportation problems. As Cuturi (2013) showed, regularizing the optimal transport problem with an entropic term brings several computational benefits. However, because of that regularization, the resulting approximation of the optimal transport cost does not define a proper distance or divergence between probability distributions. We recently introduced a family of divergences connecting the Wasserstein distance and the Kullback-Leibler divergence from an information geometry point of view (see Amari, Karakida, & Oizumi, 2018). However, that proposal was not able to retain key intuitive aspects of the Wasserstein geometry, such as translation invariance, which plays a key role when used in the more general problem of computing optimal transport barycenters. The divergence we propose in this work retains such properties and admits an intuitive interpretation.

1  Introduction

Two major geometrical structures have been introduced on the probability simplex, the manifold of discrete probability distributions. The first one is based on the principle of invariance, which requires that the geometry between probability distributions be invariant under invertible transformations of random variables. That viewpoint is the cornerstone of the theory of information geometry (Amari, 2016), which acts as a foundation for statistical inference. The second direction is grounded in the theory of optimal transport, which exploits prior geometric knowledge of the base space in which random variables are valued (Villani, 2003). Computing optimal transport amounts to obtaining a coupling between two random variables, distributed according to the two distributions being compared, that is optimal in the sense that it has a minimal expected transportation cost between the first and second variables.

However, computing that solution can be challenging and is usually carried out by solving a linear program. Cuturi (2013) considered a relaxed formulation of optimal transport, in which the negative entropy of the coupling is used as a regularizer. We call that approximation of the original optimal transport cost the C-function. Entropic regularization provides two major advantages. First, the resolution of regularized optimal transport problems, which relies on Sinkhorn's algorithm (Sinkhorn, 1964), can be trivially parallelized and is usually faster by several orders of magnitude than the exact solution of the linear program. Second, unlike the original optimal transport cost, regularized transport costs are differentiable functions of their inputs, even when the latter are discrete, a property that can be exploited in problems arising from pattern classification and clustering (Cuturi & Doucet, 2014; Cuturi & Peyré, 2016).

The Wasserstein distance between two distributions reflects the metric properties of the base space on which a pattern is defined. Conventional information-theoretic divergences such as the Kullback-Leibler (KL) divergence and the Hellinger divergence fail to capture the geometry of the base space. The Wasserstein geometry is therefore useful for applications where the geometry of the base space plays an important role, notably in computer vision, where an image pattern is defined on the two-dimensional base space R². Cuturi and Doucet (2014) is a pioneering work in this direction, showing that the Wasserstein barycenter of various patterns can summarize the common shape of a database of exemplar patterns. More advanced inference tasks use the C-function as an output loss (Frogner, Zhang, Mobahi, Araya, & Poggio, 2015), a model-fitting loss (Antoine, Marco, & Gabriel, 2016), or a way to learn mappings (Courty, Flamary, Tuia, & Rakotomamonjy, 2017). The Wasserstein geometry was also shown to be useful to estimate generative models, under the name of the Wasserstein GAN (Arjovsky, Chintala, & Bottou, 2017) and related contributions (Aude, Gabriel, & Marco, 2018; Salimans, Zhang, Radford, & Metaxas, 2018), and was applied to economics and ecological systems (Muzellec, Nock, Patrini, & Nielsen, 2016). All in all, the field of "Wasserstein statistics" is becoming an interesting research topic for exploration.

The C-function suffers, however, from a few issues. It is neither a distance nor a divergence, notably because comparing a probability measure with itself does not result in a null discrepancy: if p belongs to the simplex, then C(p,p) ≠ 0. More worrisome, the minimum of C(p,q) with respect to q is not reached at q = p. To solve these issues, we proposed a first attempt at unifying the information and optimal transport geometrical structures in Amari, Karakida, and Oizumi (2018). However, the information-geometric divergence introduced in that previous work loses some of the beneficial properties inherent in the C-function. For example, the C-function can be used to extract a common shape as the barycenter of a number of patterns (Cuturi & Doucet, 2014), which our former proposal was not able to do. Therefore, it is desirable to define a new divergence from C, in the rigorous sense that it is minimized when comparing a measure with itself and, preferably, is convex in both arguments, while still retaining the attractive properties of optimal transport.

We propose in this article such a new divergence between probability distributions p and q, inspired by optimal transport while incorporating elements of information geometry. Its basic ingredient remains the entropic regularization of optimal transport. We show that the barycenters obtained with the new divergence are more sharply defined than those obtained with the original C-function, while still keeping the shape-location decomposition property. This article focuses on theoretical aspects, with only very preliminary numerical simulations. Detailed characteristics and simulations of the proposed divergence will be given in future work.

2  C-Function: Entropy-Regularized Optimal Transportation Plan

The general transportation problem is concerned with the optimal transportation of commodities, initially distributed according to a distribution p, so that they end up being distributed as another distribution q. In its full generality, such a transport can be carried out on a metric manifold, and both p and q can be continuous measures on that manifold. We consider in this article the discrete case, where the metric space is of finite size n, namely, X = {1, 2, ..., n}. We normalize the total mass of commodities to 1; p and q are therefore probability vectors of size n in the (n-1)-dimensional simplex,
\[ S_{n-1} = \Bigl\{ p \in \mathbb{R}^n \,\Big|\, \sum_i p_i = 1,\ p_i \ge 0 \Bigr\}. \]
(2.1)
We leave out extensions to the continuous case for future work. Let M_{ij} be the cost of transporting a unit of commodity from bin i to bin j, usually defined as a distance (or a suitable power thereof) between points i and j in X. In what follows, we assume that M_{ij} > 0 for i ≠ j and M_{ii} = 0. A transportation plan \(P = (P_{ij}) \in \mathbb{R}_+^{n \times n}\) is a joint (probability) distribution over X × X that describes at each entry P_{ij} the amount of commodities sent from bin i to bin j. Given a source distribution p and a target distribution q, the set of transport plans allowing a transfer from p to q is defined as
\[ U(p,q) = \Bigl\{ P \in \mathbb{R}_+^{n \times n} : \forall i \le n,\ \sum_{j=1}^n P_{ij} = p_i;\ \forall j \le n,\ \sum_{i=1}^n P_{ij} = q_j \Bigr\}. \]
(2.2)
Note that if a matrix is in U(p,q), then its transpose is in U(q,p).
The transportation cost of a plan P is defined as the dot product of P with the cost matrix \(M = (M_{ij})\),
\[ \langle M, P \rangle = \sum_{ij} M_{ij} P_{ij}. \]
(2.3)
Its minimum among all feasible plans,
\[ W(p,q) = \min_{P \in U(p,q)} \langle M, P \rangle, \]
(2.4)
is the Wasserstein distance on S_{n-1} parameterized by the ground metric M. Cuturi (2013) studied the regularization of that problem using the entropy
\[ H(P) = -\sum_{ij} P_{ij} \log P_{ij}, \]
(2.5)
to consider the problem of minimizing
\[ L(P) = \langle M, P \rangle - \lambda H(P), \]
(2.6)
where λ>0 is a regularization strength. We call its minimum the C-function:
\[ C_\lambda(p,q) = \min_{P \in U(p,q)} L(P). \]
(2.7)
The C_λ function is a useful proxy for the Wasserstein distance, with favorable computational properties, and has appeared in several applications as an alternative to information-geometric divergences such as the KL divergence.

The optimal transportation plan is given in the following theorem (Cuturi & Peyré, 2016; Amari et al., 2018).

Theorem 1.
The optimal transportation plan Pλ* is given by
\[ P^*_\lambda = \bigl( c\, a_i b_j K_{ij} \bigr)_{ij}, \]
(2.8)
\[ K = \Bigl( \exp\Bigl( -\frac{M_{ij}}{\lambda} \Bigr) \Bigr)_{ij}, \]
(2.9)
where c is a normalization constant, and the vectors a, b ∈ S_{n-1} are determined from p and q such that the sender and receiver conditions (namely, the marginal conditions) are satisfied.
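To make the construction concrete, the following minimal sketch (ours, in Python/NumPy; all function and variable names are our own, not from the original) computes a plan of the form of equation 2.8 by Sinkhorn's alternating marginal scaling and evaluates the C-function of equation 2.7:

```python
import numpy as np

def sinkhorn_plan(p, q, M, lam, n_iter=2000):
    # Theorem 1: the optimal plan has the form diag(u) K diag(v) with
    # K = exp(-M / lam); u and v are fixed by the two marginal conditions.
    K = np.exp(-M / lam)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)   # match the receiver (column) marginal
        u = p / (K @ v)     # match the sender (row) marginal
    return u[:, None] * K * v[None, :]

def c_function(p, q, M, lam):
    # C_lambda(p, q) = <M, P*> - lam * H(P*), equations 2.6 and 2.7.
    P = sinkhorn_plan(p, q, M, lam)
    return np.sum(M * P) + lam * np.sum(P * np.log(P + 1e-300))

# Toy example on X = {0, 1, 2} with squared-distance costs.
x = np.arange(3, dtype=float)
M = (x[:, None] - x[None, :]) ** 2
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(c_function(p, q, M, lam=0.5))
print(c_function(p, p, M, lam=0.5))  # nonzero: C is not a divergence
```

Note that the second call returns a nonzero value, which is precisely the defect discussed in section 3.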

In our previous paper (Amari et al., 2018), we studied the information geometry of the manifold of optimal transportation plans. We proposed a family of divergences that combines the KL divergence and the Wasserstein distance. However, these divergences are closer in spirit to the KL divergence and therefore lose some crucial properties of the C-function. In this work, we define a new family of divergences directly from the C-function.

3  Divergence Derived from C-Function

The C-function C_λ(p,q) does not satisfy the requirements for a distance or divergence. Such a function Δ should satisfy, for any p, q ∈ S_{n-1},
\[ \Delta(p,q) \ge \Delta(p,p) = 0. \]
(3.1)
In order to find the minimizer q* of C_λ(p,q) for a given p, we use the exponentiated version K_λ of the cost M given in equation 2.9, which depends on λ. We further define its conditional (column-normalized) version:
\[ \tilde{K}_\lambda = \Bigl( \frac{K_{ji,\lambda}}{K_{j\cdot}} \Bigr)_{ij}, \]
(3.2)
\[ K_{j\cdot} = \sum_i K_{ji}. \]
(3.3)
K̃_λ is a linear operator from S_{n-1} into S_{n-1}, and we will use for convenience the notation
\[ \tilde{q} = \tilde{K}_\lambda q. \]
(3.4)
K̃_λ is a monotonic shrinkage operator mapping S_{n-1} into its interior (see Figure 1), and \(\tilde{K}_\lambda S_{n-1} \subset \tilde{K}_{\lambda'} S_{n-1}\) for λ > λ'. When λ = 0, K̃_0 is the identity mapping,
\[ \tilde{K}_0 q = q. \]
(3.5)
As λ tends to infinity, \(\tilde{K}_\lambda S_{n-1}\) converges to a single point, the center \(\frac{1}{n}\mathbf{1}\) of S_{n-1}, where \(\mathbf{1} = (1,1,\ldots,1)^T\), and
\[ \tilde{K}_\infty = \frac{1}{n} \mathbf{1}\mathbf{1}^T. \]
(3.6)
Hence, for any q, K̃_∞ q is the uniform distribution 1/n.
Figure 1: Shrinkage operator K̃_λ.
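As a small illustration, here is a sketch (ours) of the shrinkage operator of equations 3.2 to 3.4. For a symmetric cost M, it reduces to normalizing each column of K_λ to sum to 1:

```python
import numpy as np

def k_tilde(M, lam):
    # Columns of K_tilde sum to 1, so each column is a conditional
    # distribution and K_tilde maps the simplex into itself.
    # (For symmetric M, K[j, i] / K_{j.} equals this column normalization.)
    K = np.exp(-M / lam)
    return K / K.sum(axis=0, keepdims=True)

x = np.arange(5, dtype=float)
M = (x[:, None] - x[None, :]) ** 2
q = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
for lam in (0.01, 1.0, 100.0):
    print(lam, k_tilde(M, lam) @ q)  # shrinks q toward the uniform 1/n
```

Running this shows the behavior stated above: for small λ the output is essentially q, and for large λ it approaches the uniform distribution.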

Theorem 2.
The minimizer of Cλ(p,q) with respect to q, given p, is
\[ q^* = \tilde{K}_\lambda p. \]
(3.7)
Proof.
By differentiation, we have the equation
\[ \nabla_q C_\lambda(p, q^*) = 0 \]
(3.8)
to determine the minimizer q*. This gives the condition
\[ b_i = 1 \]
(3.9)
for q* (see the duality theorem in Amari et al., 2018). Hence, the optimal transportation plan from p to q* is given by
\[ P^*_{ij} = c\, a_i K_{ij}, \]
(3.10)
where the suffix λ is omitted to alleviate notation. From
\[ \sum_j c\, a_i K_{ij} = p_i, \]
(3.11)
we have
\[ c\, a_i = \frac{p_i}{K_{i\cdot}}. \]
(3.12)
So
\[ q^* = \tilde{K}_\lambda p, \]
(3.13)
proving the theorem.

We define a new family of divergences D_λ[p:q] that also depend on λ.

Definition 1.
\[ D_\lambda[p:q] = (1+\lambda)\bigl\{ C_\lambda(p, \tilde{K}_\lambda q) - C_\lambda(p, \tilde{K}_\lambda p) \bigr\}. \]
(3.14)
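Combining the Sinkhorn sketch of section 2 with the shrinkage operator gives a direct, if naive, way to evaluate definition 1 numerically. The following self-contained sketch (ours) checks that D_λ[p:p] is zero up to convergence error while D_λ[p:q] is positive:

```python
import numpy as np

def sinkhorn_cost(p, q, M, lam, n_iter=3000):
    # C_lambda(p, q) of equation 2.7, computed by Sinkhorn scaling.
    K = np.exp(-M / lam)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    P = u[:, None] * K * v[None, :]
    return np.sum(M * P) + lam * np.sum(P * np.log(P + 1e-300))

def d_lambda(p, q, M, lam):
    # Definition 1: D_lambda[p:q] = (1 + lam) * {C(p, Kt q) - C(p, Kt p)}.
    K = np.exp(-M / lam)
    Kt = K / K.sum(axis=0, keepdims=True)       # shrinkage operator, eq. 3.2
    return (1 + lam) * (sinkhorn_cost(p, Kt @ q, M, lam)
                        - sinkhorn_cost(p, Kt @ p, M, lam))

x = np.arange(4, dtype=float)
M = (x[:, None] - x[None, :]) ** 2
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])
print(d_lambda(p, p, M, lam=1.0))   # ~ 0
print(d_lambda(p, q, M, lam=1.0))   # > 0
```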

Figure 2 compares C_λ and D_λ on S_{n-1}. The following theorem is obtained (the proof is given in appendix A).

Figure 2: Comparison of C_λ and D_λ. (a) C_λ as a function of q; C_λ is minimized when q = K̃_λ p. (b) D_λ as a function of q; D_λ is minimized when q = p.

Theorem 3.

D_λ[p:q] is a convex function with respect to p and q, satisfying the constraints of equation 3.1. It converges to the Wasserstein distance as λ → 0.

4  Behavior of K̃_λ

The divergence D_λ is defined through K̃_λ. We study properties of K̃_λ, including the two limiting cases λ → 0 and λ → ∞.

We first consider the case of small λ to see how K̃_λ behaves, assuming that X has a graphical structure. We assume that M_{ii} = 0, M_{ij} = 1 when i is a nearest neighbor of j, and M_{ij} > 1 otherwise. By putting
\[ \varepsilon = \exp\Bigl( -\frac{1}{\lambda} \Bigr), \qquad \lambda = -\frac{1}{\log \varepsilon}, \]
(4.1)
we have
\[ K_{ij} = \exp\Bigl( -\frac{M_{ij}}{\lambda} \Bigr) = \varepsilon^{M_{ij}}, \]
(4.2)
\[ \tilde{K}_{i|j} = \frac{\varepsilon^{M_{ji}}}{\sum_i \varepsilon^{M_{ji}}}. \]
(4.3)
Then, by neglecting higher-order terms of ε, we have
\[ \tilde{K}_{i|j} = \begin{cases} 1 - |N(j)|\,\varepsilon, & i = j, \\ \varepsilon, & i \in N(j), \\ 0, & \text{otherwise}, \end{cases} \]
(4.4)
where N(j) is the set of nearest neighbors of j. When X is a plane consisting of n × n pixels, each pixel has four neighbors, |N(j)| = 4, except for boundary pixels. We see that K̃ is approximated in terms of the discrete Laplacian operator Δ,
\[ \tilde{K} = (1-\varepsilon) I + \varepsilon \Delta. \]
(4.5)
This shows that K̃ is a diffusion operator, flattening the pattern q, that is, shifting q toward the uniform distribution 1/n.
The inverse of K̃ is
\[ \tilde{K}^{-1} = (1+\varepsilon) I - \varepsilon \Delta. \]
(4.6)
This is the inverse diffusion operator, which sharpens q by emphasizing larger components.
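As a quick numerical check (ours), the following sketch builds K̃ on a one-dimensional path with squared-distance costs, so that nearest neighbors have cost 1 and all other off-diagonal costs exceed 1, and compares it with the first-order prediction of equation 4.4:

```python
import numpy as np

n, lam = 6, 0.2
eps = np.exp(-1.0 / lam)                  # equation 4.1
x = np.arange(n, dtype=float)
M = (x[:, None] - x[None, :]) ** 2        # M_ij = |i - j|^2
K = eps ** M                              # equation 4.2
K_tilde = K / K.sum(axis=0, keepdims=True)

# First-order prediction of equation 4.4: 1 - |N(j)| eps on the diagonal,
# eps for nearest neighbors, 0 elsewhere.
approx = np.zeros((n, n))
for j in range(n):
    nbrs = [i for i in (j - 1, j + 1) if 0 <= i < n]
    approx[j, j] = 1 - len(nbrs) * eps
    for i in nbrs:
        approx[i, j] = eps

print(np.abs(K_tilde - approx).max())     # small, of order eps^2
```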
In order to make clear the character of diffusion without assuming that λ is small, we consider a continuous pattern p = p(ξ), ξ ∈ R^n, and the metric
\[ M(\xi, \xi') = \|\xi - \xi'\|^2. \]
(4.7)
Then,
\[ K_\lambda(\xi, \xi') = \exp\Bigl( -\frac{\|\xi - \xi'\|^2}{\lambda} \Bigr), \]
(4.8)
and we easily have
\[ \tilde{K}_\lambda(\xi \mid \xi') = \frac{1}{(\pi\lambda)^{n/2}} \exp\Bigl( -\frac{\|\xi - \xi'\|^2}{\lambda} \Bigr). \]
(4.9)
This is a diffusion kernel. When p is a gaussian distribution,
\[ p(\xi) = \exp\Bigl( -\frac{1}{2\sigma^2} |\xi|^2 \Bigr), \]
(4.10)
we have
\[ \tilde{K}_\lambda p = \exp\Bigl( -\frac{1}{2\tau^2} |\xi|^2 \Bigr), \]
(4.11)
with
\[ \tau^2 = \sigma^2 + \frac{\lambda}{2}. \]
(4.12)
Hence, K˜λp is gaussian, where the variance τ2 is increased by λ/2, blurring the original p.
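The variance relation of equation 4.12 is easy to verify numerically. The sketch below (ours) discretizes a one-dimensional gaussian on a fine grid and measures the variance after one application of the shrinkage operator:

```python
import numpy as np

sigma2, lam = 1.0, 3.0
xs = np.linspace(-20, 20, 2001)
p = np.exp(-xs ** 2 / (2 * sigma2))
p /= p.sum()

M = (xs[:, None] - xs[None, :]) ** 2
K = np.exp(-M / lam)
K_tilde = K / K.sum(axis=0, keepdims=True)
q = K_tilde @ p

var = np.sum(xs ** 2 * q) - np.sum(xs * q) ** 2
print(var, sigma2 + lam / 2)   # both close to 2.5, as in equation 4.12
```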
Finally, we study the limiting behavior as λ → ∞. When λ is large, we expand K_λ as
\[ K_{ij,\lambda} = \exp\Bigl( -\frac{M_{ij}}{\lambda} \Bigr) \approx 1 - \frac{M_{ij}}{\lambda}, \]
(4.13)
\[ K_{j\cdot,\lambda} \approx n \Bigl( 1 - \frac{\bar{m}_{j\cdot}}{\lambda} \Bigr), \qquad \bar{m}_{j\cdot} = \frac{1}{n} \sum_i M_{ji}, \]
(4.14)
obtaining
\[ \tilde{K}_{i|j,\lambda} = \frac{K_{ji,\lambda}}{K_{j\cdot,\lambda}} \approx \frac{1}{n} \Bigl( 1 - \frac{\tilde{m}_{ij}}{\lambda} \Bigr), \qquad \tilde{m}_{ij} = M_{ji} - \bar{m}_{j\cdot}. \]
(4.15)
Hence,
\[ \bigl( \tilde{K}_\lambda (q-p) \bigr)_i = -\frac{1}{n\lambda} \sum_j \tilde{m}_{ij} \bigl( q_j - p_j \bigr), \]
(4.16)
showing that this term is of order 1/λ (we used \(\sum_j (q_j - p_j) = 0\)). Let M̃ be the moment matrix defined by
\[ \tilde{M}_{jk} = \frac{1}{n} \sum_i \tilde{m}_{ij} \tilde{m}_{ik}. \]
(4.17)
Then, we have the following theorem (the proof is given in appendix B).
Theorem 4.
When λ is large enough,
\[ \lim_{\lambda \to \infty} D_\lambda[p:q] = \frac{1}{2} (q-p)^T \tilde{M} (q-p), \]
(4.18)
which is a squared energy distance defined by the moment matrix M˜.
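The limiting quadratic form is cheap to evaluate. The sketch below (ours) builds the moment matrix of equation 4.17 directly from the cost matrix and computes the right-hand side of equation 4.18:

```python
import numpy as np

x = np.arange(5, dtype=float)
M = (x[:, None] - x[None, :]) ** 2
n = len(x)

m_bar = M.mean(axis=1)                # m_bar_j = (1/n) sum_i M_ji (eq. 4.14)
m_tilde = M.T - m_bar[None, :]        # m_tilde_ij = M_ji - m_bar_j (eq. 4.15)
M_tilde = (m_tilde.T @ m_tilde) / n   # equation 4.17

p = np.array([0.4, 0.25, 0.15, 0.1, 0.1])
q = np.array([0.1, 0.15, 0.2, 0.25, 0.3])
d = q - p
print(0.5 * d @ M_tilde @ d)          # large-lambda limit of D_lambda[p:q]
```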

5  Right Barycenter of Patterns

We consider the barycenter of image patterns represented as probability measures p = p(ξ) on the plane ξ = (x,y) ∈ R², using the divergence D_λ. The plane is discretized into a grid of n × m pixels, and therefore p is a probability vector of size nm.

Let us consider a family S of patterns, S = (p_i)_{i=1,…,N}. A right D-barycenter q_D^*(S) of these patterns is the minimizer of
\[ F_D^r(S, q) = \sum_i D_\lambda[p_i : q]. \]
(5.1)
Cuturi and Doucet (2014) used C_λ(p,q) to define a barycenter as the minimizer of
\[ F_C(S, q) = \sum_i C_\lambda(p_i, q). \]
(5.2)
We call such a minimizer the C-barycenter q_C^*(S). Cuturi and Doucet (2014) showed that the C-barycenter can extract a common shape for some pattern families S. In particular, they used a family S of deformed double rings of various sizes and positions, whose barycenter q_C^*(S) turns out to be a standard double-ring pattern. Other information-theoretic divergences such as the KL divergence or the Hellinger divergence are not able to recover such a common shape. We will show that the right D-barycenter q_D^*(S) exhibits the same property, but with a sharper solution.
The right D-barycenter minimizes
\[ \frac{1}{1+\lambda} \sum_i D_\lambda[p_i : q] = \sum_i \bigl\{ C_\lambda(p_i, \tilde{K}_\lambda q) - C_\lambda(p_i, \tilde{K}_\lambda p_i) \bigr\}. \]
(5.3)
The second term on the right-hand side of equation 5.3 does not depend on q, so it can be dropped for the minimization. Let us put
\[ \tilde{q} = \tilde{K}_\lambda q. \]
(5.4)
Then the D-barycenter is derived from the C-barycenter by
\[ q_C^* = \tilde{K}_\lambda q_D^*, \]
(5.5)
provided equation 5.5 is solvable. In this case,
\[ q_D^* = \tilde{K}_\lambda^{-1} q_C^* \]
(5.6)
is a sharpened version of qC*. However, equation 5.5 might not always be solvable.

The image K̃_λ S_{n-1} is a simplex sitting inside S_{n-1}. Since the C-barycenter q_C^* is not necessarily inside K̃_λ S_{n-1}, we need to solve the D-barycenter problem, equation 5.3, under the constraint that q̃ lies inside K̃_λ S_{n-1}. When the C-barycenter q_C^* is inside K̃_λ S_{n-1}, the D-barycenter q_D^* is simply given by equation 5.6, which is more localized, or sharper, than q_C^*. When q_C^* is not inside K̃_λ S_{n-1}, the solution of equation 5.3 is close to the boundary of the simplex K̃_λ S_{n-1}, again giving a sharper version.
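When the inversion in equation 5.6 is feasible, it can be carried out directly. The sketch below (ours, with a stand-in vector in place of an actual C-barycenter) illustrates the sharpening and the feasibility check:

```python
import numpy as np

x = np.arange(5, dtype=float)
M = (x[:, None] - x[None, :]) ** 2
lam = 1.0
K = np.exp(-M / lam)
K_tilde = K / K.sum(axis=0, keepdims=True)

q_C = np.array([0.15, 0.2, 0.3, 0.2, 0.15])   # stand-in for a C-barycenter
q_D = np.linalg.solve(K_tilde, q_C)           # candidate right D-barycenter (eq. 5.6)

if np.all(q_D >= 0):
    print(q_D)   # sharper than q_C: mass concentrates toward the center
else:
    print("q_C lies outside K_tilde S_{n-1}: solve eq. 5.3 with the constraint")
```

Since the columns of K̃_λ sum to 1, the solution automatically sums to 1; only nonnegativity can fail, which is exactly the case where q_C^* falls outside K̃_λ S_{n-1}.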

Theorem 5.

The right D-barycenter qD*(S) of S is a sharper (more localized) version of the C-barycenter.

Agueh and Carlier (2011) showed that when the ground metric is the squared Euclidean distance, the W-barycenter has the property that its shape is determined by the shapes of the elements p_i of S but does not depend on their locations; namely, it is translation invariant. The C-barycenter also inherits this property, and we show that so does the right D-barycenter.

Let us consider a pattern p(ξ)=p(x,y) on the (x,y)-plane, where ξ=(x,y). The center ξp of p(x,y) is defined by
\[ \xi_p = \int \xi\, p(\xi)\, d\xi. \]
(5.7)
Given ξ̄ = (x̄, ȳ), we define a shift operator T_ξ̄, which shifts p(x,y) by ξ̄:
\[ T_{\bar{\xi}}\, p(\xi) = p(\xi - \bar{\xi}). \]
(5.8)
Let P(ξ,ξ') be a transportation plan from p(ξ) to q(ξ). When q(ξ) is shifted as Tξ¯q(ξ), we naturally define the transportation plan Tξ¯P(ξ,ξ'):
\[ T_{\bar{\xi}}\, P(\xi, \xi') = \bar{P}(\xi, \xi') = P(\xi, \xi' - \bar{\xi}), \]
(5.9)
which transports p(ξ) to Tξ¯q(ξ).
We study how the transportation cost changes under a shift. As above, we use the squared Euclidean distance as the ground cost:
\[ m(\xi, \xi') = \|\xi - \xi'\|^2. \]
(5.10)
Then the direct cost of transportation is
\[ \langle M, P \rangle = \int m(\xi, \xi')\, P(\xi, \xi')\, d\xi\, d\xi'. \]
(5.11)
The cost of the shifted plan is
\[ \langle M, T_{\bar{\xi}} P \rangle = \int m(\xi, \xi')\, P(\xi, \xi' - \bar{\xi})\, d\xi\, d\xi' \]
(5.12)
\[ = \int m(\xi, \xi'' + \bar{\xi})\, P(\xi, \xi'')\, d\xi\, d\xi'' \]
(5.13)
\[ = \int \bigl\{ \|\xi - \xi''\|^2 + \|\bar{\xi}\|^2 - 2\bar{\xi} \cdot (\xi - \xi'') \bigr\}\, P(\xi, \xi'')\, d\xi\, d\xi'' \]
(5.14)
\[ = \langle M, P \rangle + \|\bar{\xi}\|^2 - 2\bar{\xi} \cdot (\xi_p - \xi_q). \]
(5.15)
Note that a shift does not change the entropy,
\[ H\{P(\xi, \xi')\} = H\{T_{\bar{\xi}}\, P(\xi, \xi')\}. \]
(5.16)
When ξ_p = ξ_q, the two patterns p and q are said to be co-centered. For two co-centered p, q, we have
\[ L(P_{p,q}) \ge L(P_{p, \tilde{K}_\lambda p}), \]
(5.17)
where P_{p,q} is a transportation plan sending p to q. Hence, q* = K̃_λ p minimizes the transportation cost among all co-centered q.
We fix ξq and search for the optimal shape q*(ξ) located at ξq that minimizes the transportation cost Cλ(Pp,q) from p to q. In order to derive the optimal shape, we shift q by Tξ¯,
\[ \bar{\xi} = \xi_p - \xi_q, \]
(5.18)
such that p and q̄ = T_ξ̄ q are co-centered. Then, to any plan P_{p,q̄} corresponds the plan P_{p,q} obtained by shifting P_{p,q̄} by −ξ̄:
\[ P_{p,q} = T_{-\bar{\xi}}\, P_{p,\bar{q}}, \]
(5.19)
\[ q = T_{-\bar{\xi}}\, \bar{q}. \]
(5.20)
We have
\[ C_\lambda(T_{\bar{\xi}} P) = C_\lambda(P) + \|\bar{\xi}\|^2 - 2\bar{\xi} \cdot (\xi_p - \xi_q). \]
(5.21)
We thus have an important relation: C_λ(P_{p,q}) decomposes into the sum of the shape deformation cost among co-centered patterns and the positional transportation cost,
\[ C_\lambda(P_{p,q}) = C_\lambda(P_{p,\bar{q}}) + \|\xi_p - \xi_q\|^2, \]
(5.22)
because
\[ \xi_p = \xi_{\bar{q}}. \]
(5.23)
Lemma 1.
Given p, the C_λ-optimal pattern q*(ξ) located at ξ_q is a shifted and blurred version of p(ξ),
\[ q^*(\xi) = T_{\xi_q - \xi_p}\, \tilde{K}_\lambda\, p(\xi), \]
(5.24)
whose shape does not depend on the locations ξ_p and ξ_q.
Let p_1(ξ), …, p_n(ξ) be n local patterns. We search for their C- and right D-barycenters q(ξ), which minimize
\[ F_C(q) = \sum_{i=1}^n C_\lambda(p_i, q), \qquad F_D(q) = \sum_{i=1}^n D_\lambda[p_i : q]. \]
(5.25)
The center of p_i(ξ) is denoted by ξ_i = ξ_{p_i}. Before solving the problem, let us shift each p_i(ξ) from ξ_i to a fixed common location ξ_0, such that all shifted p̄_i(ξ) are co-centered. Let q_C^*(ξ) and q_D^*(ξ) be the barycenters of the co-centered p̄_i(ξ), which depend not on the locations of the p_i(ξ) but only on their shapes.
Theorem 6

(shape-location separation theorem). The barycenters q_C^*(ξ) and q_D^*(ξ) of p_1(ξ), …, p_n(ξ) are located at the barycenter of ξ_1, …, ξ_n, and their shapes are given by the respective barycenters of p̄_1(ξ), …, p̄_n(ξ).

Proof.
From equation 5.22, we have, up to terms independent of q (and, for F_D, the overall factor 1+λ),
\[ F_C(q) = \sum_i \bigl\{ C_\lambda(P_{p_i, \bar{q}_i}) + \|\xi_{p_i} - \xi_q\|^2 \bigr\}, \]
(5.26)
\[ F_D(q) = \sum_i \bigl\{ C_\lambda(P_{p_i, \tilde{K}_\lambda \bar{q}_i}) + \|\xi_{p_i} - \xi_q\|^2 \bigr\}, \]
(5.27)
where q̄_i is the version of q shifted to the center of p_i. Here, the shape and location of q are separated. Minimizing the first terms gives q*, the respective barycenters of the shapes of the co-centered p̄_1, …, p̄_n. The second terms are minimized when ξ_q is the barycenter of the locations ξ_1, …, ξ_n.

Corollary 1.
When pi are shifts of an identical p, their right D-barycenter has the same shape as the original p, whereas the C-barycenter is a blurred version of p:
\[ q_C^* = \tilde{K}_\lambda\, p. \]
(5.28)

We show a simple example where the p_i are shifted copies of p, a cat shape (see Figure 3a). Its C-barycenter is the blurred shape K̃_λ p (see Figure 3b), tending to the uniform distribution as λ → ∞. However, the shape of the right D-barycenter is exactly the same as p (see Figure 3c).

Figure 3: (a) Cat images. (b) The C-barycenter of panel a. The C-barycenter has a blurred shape K̃_λ p, more blurred as λ → ∞. (c) The right D-barycenter of panel a. Its shape is exactly the same as the original shape p.

6  Left Barycenter of Patterns

Since we use the asymmetric divergence D_λ to define a barycenter, we may consider another barycenter by putting the unknown barycenter in the left argument of D_λ.

We consider again a family S of patterns, S = (q_i)_{i=1,…,N}. A barycenter p of these patterns based on the divergence D_λ is defined as the minimizer of
\[ F_D^l(S, p) = \sum_i D_\lambda[p : q_i] \]
(6.1)
\[ = (1+\lambda) \sum_i \bigl\{ C_\lambda(p, \tilde{K}_\lambda q_i) - C_\lambda(p, \tilde{K}_\lambda p) \bigr\} \]
(6.2)
and is called the left D-barycenter. We propose to solve that problem using the accelerated gradient descent approach outlined in Cuturi and Doucet (2014), with two differences. First, all examples q_i must be smoothed beforehand by an application of K̃_λ. Second, the gradient now incorporates not only terms resulting from C_λ(p, K̃_λ q_i) but also terms from −C_λ(p, K̃_λ p), which, as shown in equation A.3, is simply minus the KL divergence between p and the vector K̃_λ1: the entropy of p plus the dot product between p and log(K̃_λ1). As a result, the gradient of −C_λ(p, K̃_λ p) is equal, up to a constant term, to −log(p) + log(K̃_λ1), which tends to sharpen the C_λ barycenter further. Note that this approach, namely, adding the entropy to the regularized Wasserstein barycenter problem, was used in a heuristic way by Solomon et al. (2015), who called it entropic sharpening, without proving that the resulting problem was convex. Our work shows that up to a given strength, the entropic sharpening of regularized Wasserstein barycenters remains a convex problem.
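A minimal sketch (ours) of one gradient step for the left D-barycenter follows. It uses the standard duality fact that the gradient of C_λ(p, q) in p is the Sinkhorn potential λ log u up to an additive constant (which we fix by centering); the multiplicative update and the step size η are our choices, not the authors' exact accelerated scheme, and the overall factor 1 + λ is absorbed into η:

```python
import numpy as np

def dual_potential(p, q, M, lam, n_iter=2000):
    # Gradient of C_lambda(p, q) in p: the dual potential lam * log(u),
    # centered to fix the additive constant.
    K = np.exp(-M / lam)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    g = lam * np.log(u)
    return g - g.mean()

def left_barycenter_step(p, qs, M, lam, eta=0.05):
    K = np.exp(-M / lam)
    Kt = K / K.sum(axis=0, keepdims=True)
    # Terms from C_lambda(p, K_tilde q_i), one per pattern q_i.
    grad = sum(dual_potential(p, Kt @ q, M, lam) for q in qs)
    # Sharpening term from -C_lambda(p, K_tilde p): by the closed form of
    # equation A.3, its gradient is -lam * (log p - log K_row_sums) up to
    # a constant, and it appears once per pattern.
    grad -= len(qs) * lam * (np.log(p) - np.log(K.sum(axis=1)))
    p_new = p * np.exp(-eta * grad)   # multiplicative (mirror) update
    return p_new / p_new.sum()

# Tiny demo: two shifted bumps on a line.
x = np.arange(8, dtype=float)
M = (x[:, None] - x[None, :]) ** 2
qs = [np.roll(np.array([.4, .3, .2, .1, 0, 0, 0, 0]) + 1e-6, k) for k in (0, 3)]
qs = [q / q.sum() for q in qs]
p = np.full(8, 1 / 8)
for _ in range(50):
    p = left_barycenter_step(p, qs, M, lam=1.0)
print(p)
```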

The shape-location separation theorem, theorem 6, also holds for the left barycenter. We note that q_i and K̃_λ q_i have the same center, so we shift all q̃_i = K̃_λ q_i to a common location ξ_0. Since the additional term C_λ(p, K̃_λ p) is shift invariant, the terms concerning shapes and locations separate, as in equation 6.2. An example of the left D-barycenter of four patterns is shown in Figures 4 and 5.

Figure 4: Four shapes considered in our experiments to compute barycenters according to C_λ or D_λ.

Figure 5: We consider in this example the squared Euclidean distance on grids as the ground metric cost. We study the iso-barycenter of the four shapes presented in Figure 4 using two different discrepancy functions, C_λ and D_λ. (Left) Barycenter obtained using algorithm 2 of Solomon et al. (2015), itself a specialized version of Benamou, Carlier, Cuturi, Nenna, and Peyré's (2014) algorithm. We used a 10^{-10} tolerance for the ℓ1 norm between two successive iterates as a stopping criterion. (Right) D_λ left barycenter obtained with our accelerated gradient approach. We use a jet color map to highlight differences in the support of the barycenters. As expected, the entropy of the C_λ barycenter is higher than that of the D_λ left barycenter, since the latter optimization incorporates a penalization term, −C_λ(p, K̃_λ p), which is equivalent to penalizing the entropy of the solution p. This difference in entropy results in subtle yet visible differences between the two solutions, with sharper edges for the D_λ left barycenter.

7  Preliminary Simulations

It is important to study the properties of the two D-barycenters, theoretically as well as by simulations; large-scale simulations, however, require computational facilities beyond the scope of this article. Here, as a preliminary study, we use a simple exponentiated gradient descent algorithm to obtain the D-barycenters. We will report a more detailed study in forthcoming papers.

The exponentiated gradient method updates the current iterate q (respectively p, for the left D-barycenter) as follows:
\[ \text{Update:}\quad q_i' \leftarrow q_i \exp\{ -\eta\, \nabla_i D \}, \]
(7.1)
\[ \text{Normalize:}\quad q_i^* = \frac{q_i'}{\sum_j q_j'}, \]
(7.2)
giving the next candidate q* in the case of the right D-barycenter, where ∇_i is the gradient operator ∂/∂q_i and η is a learning constant. (The same holds with p in place of q for the left D-barycenter.) As the initial q (or p), we used the C-barycenter, which is easy to compute using Benamou et al. (2014).
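The update of equations 7.1 and 7.2 is easy to state in code. The sketch below (ours) applies it to a simple stand-in objective whose gradient is available in closed form, just to show the mechanics; for the D-barycenters, the gradient of equation 5.1 or 6.1 would be plugged in instead:

```python
import numpy as np

def exponentiated_gradient_step(q, grad, eta):
    q_new = q * np.exp(-eta * grad)   # equation 7.1
    return q_new / q_new.sum()        # equation 7.2

# Demo on a stand-in objective: KL(q || target), whose gradient is
# log(q) - log(target) up to an additive constant.
q = np.full(4, 0.25)
target = np.array([0.1, 0.2, 0.3, 0.4])
for _ in range(200):
    q = exponentiated_gradient_step(q, np.log(q) - np.log(target), eta=0.5)
print(q)   # converges to target; the iterate stays in the simplex throughout
```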

We study a toy model, searching for the barycenters of two patterns: a large square and a small square, shown in Figure 6a. We compare the C-, right D-, and left D-barycenters, shown in Figure 6b. Both D-barycenters give a sharper result than the C-barycenter for various λ; the right and left D-barycenters, however, are very similar. Further studies are necessary to compare the two.

Figure 6: (a) Input images of two squares. A large square and a small square are located in a 50 × 50 pixel space. (b) The barycenters of the squares. The numerical value on each figure is the entropy of the barycenter.

8  Conclusion

We defined new divergences between two probability distributions based on the C-function, which is the entropy-regularized cost function (Cuturi, 2013). Although it is useful in many applications, it does not satisfy the criterion of a distance or divergence. We defined a new divergence function Dλ derived from Cλ, which works better than the original Cλ for some problems—in particular, the barycenter problem. We proved that the minimizer of Cλ(p,q) is given by K˜λp, where K˜λ is a diffusion operator depending on the base metric M. We studied properties of K˜λ showing how it changes as λ increases, elucidating properties of Dλ.

We applied D_λ to obtain the barycenter of a cluster of image patterns. We proved that the right and left D-barycenters retain a desirable property of the C-function-based barycenter: the shapes and locations of the patterns are separated. Moreover, the D-barycenters give an even sharper shape than the C-barycenter.

Computing D-barycenters provides a few novel numerical challenges. The algorithm we have proposed, essentially a simple multiplicative gradient descent, is not as efficient as that proposed by Benamou et al. (2014) for the C-barycenter. We believe there is room to improve our approach and leave this for future work.

Appendix A:  Proof of Convexity of Dλ

Let us put the Hessian of the cost function as follows:
\[ G_\lambda = \frac{\partial^2 C_\lambda(p, \tilde{K}_\lambda q)}{\partial \eta\, \partial \eta^T} = \begin{pmatrix} X & Y \\ Y^T & Z \end{pmatrix}, \]
(A.1)
expressed by block matrices X, Y, Z, where η = (p, q)^T. Because C_λ is strictly convex (Amari et al., 2018), G_λ is positive definite, and the block component Z is also regular and positive definite.
By using
\[ P^*_{ij} = \frac{p_i K_{ij}}{K_{i\cdot}}, \]
(A.2)
we get
\[ C_\lambda(p, \tilde{K}_\lambda p) = \lambda \sum_{i=1}^n p_i \ln \frac{p_i}{K_{i\cdot}}. \]
(A.3)
Therefore, its Hessian becomes
\[ G_\lambda' = \frac{\partial^2 C_\lambda(p, \tilde{K}_\lambda p)}{\partial \eta\, \partial \eta^T} = \begin{pmatrix} \lambda \bigl( \mathrm{diag}\bigl( \frac{1}{p_i} \bigr) + \frac{1}{p_n} \mathbf{1}\mathbf{1}^T \bigr) & O \\ O & O \end{pmatrix}, \]
(A.4)
where O is the zero matrix and diag(p_i) represents a diagonal matrix whose (i,j) component is p_i δ_{ij}. Let us put R = G_λ − G_λ'. The determinant of R is given by
\[ \det(R) = \det(Z)\, \det(R'), \]
(A.5)
where we put
\[ R' = X - YZ^{-1}Y^T - \lambda \Bigl( \mathrm{diag}\Bigl( \frac{1}{p_i} \Bigr) + \frac{1}{p_n} \mathbf{1}\mathbf{1}^T \Bigr). \]
(A.6)
Because Z is positive definite, the positive definiteness of R is equivalent to that of R'. As derived in Amari et al. (2018),
\[ G_\lambda^{-1} = \frac{1}{1+\lambda} \begin{pmatrix} p_i \delta_{ij} - p_i p_j & P_{ij} - p_i q_j \\ P_{ji} - q_i p_j & q_i \delta_{ij} - q_i q_j \end{pmatrix}, \]
(A.7)
and we can represent the (p,p)-block part of G_λ^{-1} by using the block components of G_λ as follows:
\[ (X - YZ^{-1}Y^T)^{-1} = \frac{1}{1+\lambda} \bigl( \mathrm{diag}(p_i) - p p^T \bigr). \]
(A.8)
By using the Sherman-Morrison formula, we get
\[ X - YZ^{-1}Y^T = (1+\lambda) \Bigl( \mathrm{diag}\Bigl( \frac{1}{p_i} \Bigr) + \frac{1}{p_n} \mathbf{1}\mathbf{1}^T \Bigr). \]
(A.9)
Finally, we obtain
\[ R' = \mathrm{diag}\Bigl( \frac{1}{p_i} \Bigr) + \frac{1}{p_n} \mathbf{1}\mathbf{1}^T. \]
(A.10)
Because this R' is positive definite, R is also positive definite and Dλ is strictly convex.

By using equation 3.5 and C_0(p,p) = 0, we can confirm that D_λ converges to the Wasserstein distance as λ → 0.

Appendix B:  Proof of Dλ Approaching an Energy Function for Large λ

We expand D_λ[p:q] in terms of K̃_λ(q − p) as
\[ \frac{1}{1+\lambda} D_\lambda[p:q] = C_\lambda\bigl( p, \tilde{K}_\lambda p + \tilde{K}_\lambda (q-p) \bigr) - C_\lambda\bigl( p, \tilde{K}_\lambda p \bigr) \]
(B.1)
\[ = \nabla_q C_\lambda\bigl( p, \tilde{K}_\lambda p \bigr) \cdot \tilde{K}_\lambda (q-p) + \frac{1}{2} \nabla_q \nabla_q C_\lambda\bigl( p, \tilde{K}_\lambda p \bigr) \cdot \tilde{K}_\lambda (q-p) \otimes \tilde{K}_\lambda (q-p) \]
(B.2)
\[ = \frac{1}{2} \nabla_q \nabla_q C_\lambda\bigl( p, \tilde{K}_\lambda p \bigr) \cdot \tilde{K}_\lambda (q-p) \otimes \tilde{K}_\lambda (q-p), \]
(B.3)
because of
\[ \nabla_q C_\lambda\bigl( p, \tilde{K}_\lambda p \bigr) = 0, \]
(B.4)
where ⊗ is the tensor product. Higher-order terms are neglected.
When λ is large, we expand K_λ as
\[ K_{ij,\lambda} = \exp\Bigl( -\frac{M_{ij}}{\lambda} \Bigr) \approx 1 - \frac{M_{ij}}{\lambda}, \]
(B.5)
\[ K_{j\cdot,\lambda} \approx n \Bigl( 1 - \frac{\bar{m}_{j\cdot}}{\lambda} \Bigr), \qquad \bar{m}_{j\cdot} = \frac{1}{n} \sum_i M_{ji}, \]
(B.6)
obtaining
\[ \tilde{K}_{i|j,\lambda} = \frac{K_{ji,\lambda}}{K_{j\cdot,\lambda}} \approx \frac{1}{n} \Bigl( 1 - \frac{\tilde{m}_{ij}}{\lambda} \Bigr), \qquad \tilde{m}_{ij} = M_{ji} - \bar{m}_{j\cdot}. \]
(B.7)
Hence,
\[ \bigl( \tilde{K}_\lambda (q-p) \bigr)_i = -\frac{1}{n\lambda} \sum_j \tilde{m}_{ij} \bigl( q_j - p_j \bigr). \]
(B.8)
We have
\[ \nabla_q \nabla_q C_\lambda\bigl( p, \tilde{K}_\lambda p \bigr) = \nabla_q \nabla_q C_\lambda\Bigl( p, \frac{1}{n} \mathbf{1} \Bigr) + O\Bigl( \frac{1}{\lambda} \Bigr). \]
(B.9)
So we calculate \(\nabla_q \nabla_q C_\lambda(p, \tilde{q})\) when
\[ \tilde{q}_i = \bigl( \tilde{K}_\lambda p \bigr)_i = \frac{1}{n} + \epsilon_i \]
(B.10)
and expand it up to O(ε²).
The optimal transportation plan for p → q is
\[ P^*_{ij} = c\, a_i b_j \]
(B.11)
when λ → ∞, because K_{ij,λ} → 1. Hence,
\[ P^*_{ij} = p_i q_j. \]
(B.12)
We have already obtained the inverse of \(\nabla_\eta \nabla_\eta C_\lambda(p,q)\) in Amari et al. (2018, equation 45):
\[ G_\lambda^{-1} = \frac{1}{1+\lambda} \begin{pmatrix} p_i \delta_{ij} - p_i p_j & P^*_{ij} - p_i q_j \\ P^*_{ji} - q_i p_j & q_i \delta_{ij} - q_i q_j \end{pmatrix}. \]
(B.13)
Hence, it is block-diagonal (P^*_{ij} − p_i q_j = 0), and the (q,q)-part of G_λ is
\[ G_{\lambda, qq} = (1+\lambda) \bigl( q_i \delta_{ij} - q_i q_j \bigr)^{-1} \]
(B.14)
\[ = (1+\lambda) \Bigl( \mathrm{diag}\Bigl( \frac{1}{q_i} \Bigr) + \frac{1}{q_n} \mathbf{1}\mathbf{1}^T \Bigr). \]
(B.15)
In our case of \(q = \frac{1}{n}\mathbf{1}\),
\[ G = \nabla_q \nabla_q C_\lambda = (1+\lambda) \bigl( n\, \delta_{ij} + n \bigr). \]
(B.16)
We finally calculate
\[ G \cdot \tilde{K}_\lambda (q-p) \otimes \tilde{K}_\lambda (q-p) = \frac{(1+\lambda)\, n}{n^2 \lambda^2} \sum_{ijk} \tilde{m}_{ij} \tilde{m}_{ik} (q_j - p_j)(q_k - p_k) + \frac{(1+\lambda)\, n}{n^2 \lambda^2} \Bigl( \sum_{ij} \tilde{m}_{ij} (q_j - p_j) \Bigr)^2 = \frac{1+\lambda}{\lambda^2} (q-p)^T \tilde{M} (q-p), \]
where the second term vanishes because \(\sum_i \tilde{m}_{ij} = 0\). Substituting this into equation B.3 and letting λ → ∞ yields equation 4.18.

References

Agueh, M., & Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2), 904-924.
Amari, S. (2016). Information geometry and its applications. Berlin: Springer.
Amari, S., Karakida, R., & Oizumi, M. (2018). Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem. Information Geometry, 1(1), 13-57. doi:10.1007/s41884-018-0002-8
Antoine, R., Marco, C., & Gabriel, P. (2016). Fast dictionary learning with a smoothed Wasserstein loss. In A. Gretton & C. C. Robert (Eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR, 51, 630-638. Brookline, MA: Microtome.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv:1701.07875v3.
Aude, G., Gabriel, P., & Marco, C. (2018). Learning generative models with Sinkhorn divergences. In A. Storkey & F. Perez-Cruz (Eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR, 84, 1608-1617. Brookline, MA: Microtome.
Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2014). Iterative Bregman projections for regularized transportation problems. arXiv:1412.5154.
Courty, N., Flamary, R., Tuia, D., & Rakotomamonjy, A. (2017). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9), 1853-1865. doi:10.1109/TPAMI.2016.2615921
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 2292-2300). Red Hook, NY: Curran.
Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (pp. 685-693).
Cuturi, M., & Peyré, G. (2016). A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1), 320-343.
Frogner, C., Zhang, C., Mobahi, H., Araya, M., & Poggio, T. (2015). Learning with a Wasserstein loss. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 2053-2061). Red Hook, NY: Curran.
Muzellec, B., Nock, R., Patrini, G., & Nielsen, F. (2016). Tsallis regularized optimal transport and ecological inference. arXiv:1609.04495v1.
Salimans, T., Zhang, H., Radford, A., & Metaxas, D. (2018). Improving GANs using optimal transport. arXiv:1803.05573.
Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35(2), 876-879.
Solomon, J., de Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., & Guibas, L. (2015). Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (Proc. SIGGRAPH 2015). New York: ACM.
Villani, C. (2003). Topics in optimal transportation. Providence, RI: American Mathematical Society.