## Abstract

In many areas of neural computation, like learning, optimization, estimation, and inference, suitable divergences play a key role. In this note, we study the conjecture presented by Amari (2009) and give a counterexample showing that the conjecture does not hold in general. Moreover, we investigate two classes of divergence within the framework of Zhang (2004), the weighted f-divergence and the weighted α-divergence, and prove that if a divergence is both a weighted f-divergence and a Bregman divergence, then it is a weighted α-divergence. This result reduces in form to the main theorem established by Amari (2009) when all weights equal 1.

## 1  Introduction

In many areas of neural computation, like learning, optimization, estimation, and inference, various divergences, which measure the discrepancy between two points, two probability distributions, or two positive measures, play a key role. Divergences are therefore fundamental objects in information theory, statistics, mathematical programming, computational vision, and neural networks. Well-known important divergences include the Kullback-Leibler divergence, the f-divergence (Csiszár, 1963), the Bregman divergence (Bregman, 1967), the α-divergence (Amari, 1985), the Jensen difference (Rao, 1987), the divergence of Zhang (2004), and the U-divergence (Eguchi, 2008). So far, the theory of divergences and its applications have been well developed (see the references just mentioned, and Ackley, Hinton, & Sejnowski, 1985; Amari, 2007, 2009, 2016; Amari & Nagaoka, 2000; Cichocki, Zdunek, Phan, & Amari, 2009; Csiszár, 1974; Eguchi & Copas, 2002; Jiao, Courtade, No, Venkat, & Weissman, 2015; Murata, Takenouchi, Kanamori, & Eguchi, 2004; Nielsen, 2009; Nielsen & Nock, 2009; Taneja & Kumar, 2004).

Inspired by these significant works, especially those of Amari (2009) and Zhang (2004), in this note we study the relationships between some divergences in the space of positive measures. As shown in previous literature (Cichocki et al., 2009; Murata et al., 2004; Nielsen, 2009), in order to deal with more complex problems from the real world, the usual constraint that a probability distribution has total mass 1 needs to be relaxed in many cases. A typical example is a visual signal, which is normally a two-dimensional array with nonnegative elements. Therefore, we choose the space of positive measures as the basic setting of this work. We first study the following conjecture, put forward by Amari (2009):

Conjecture: When a divergence satisfies information monotonicity, it is a function of an f-divergence.

We find a counterexample to show that this conjecture does not hold generally.

Second, we investigate two classes of divergence within the framework of Zhang (2004): the weighted f-divergence and the weighted α-divergence, which generalize the well-known f-divergence and α-divergence, respectively. We prove that if a divergence is both a weighted f-divergence and a Bregman divergence, then it is a weighted α-divergence. This result reduces in form to the main theorem established by Amari (2009) when all weights equal 1.

## 2  Several Classical Divergences and Their Basic Properties

In this section, based on the work cited in section 1, we recall several classical divergences and present their basic properties and the corresponding proofs.

### 2.1  Divergence

Let $X = \{x_1, \ldots, x_n\}$ be a set of $n$ elements on which a positive measure $p = (p_1, \ldots, p_n)$, $p_i > 0$, is defined. When
$$\sum_{i=1}^{n} p_i = 1,$$
it is a probability measure.
We use $\mathcal{M}_+$ and $\mathcal{P}$ to stand for the set of all positive measures on $X$ and the set of all probability measures on $X$, respectively. It is clear that $\mathcal{P} \subset \mathcal{M}_+$.

A function $D\colon \mathcal{M}_+ \times \mathcal{M}_+ \to \mathbb{R}$, $(p, q) \mapsto D(p \| q)$, is called a divergence when it satisfies the following conditions:

1. $D(p \| q) \ge 0$, with equality if and only if $p = q$.
2. For small $dp$, $D(p \| p + dp) = \frac{1}{2} \sum_{i,j} g_{ij}(p)\, dp_i\, dp_j + O(\|dp\|^3)$, where the matrix $(g_{ij}(p))$ is positive definite.

We refer readers to the important work of Zhang (2004) for another more general definition of divergences.

A basic example of a divergence is the square of the Euclidean distance,
$$D(p \| q) = \sum_{i=1}^{n} (p_i - q_i)^2.$$
Another basic example of a divergence is the (squared) Hellinger divergence,
$$D(p \| q) = 2 \sum_{i=1}^{n} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2,$$
for any $p, q \in \mathcal{M}_+$.
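As a quick numerical sanity check, both examples can be evaluated directly. The sketch below (Python; the names and the factor 2 normalization of the Hellinger form are illustrative choices, the latter matching the α = 0 divergence of section 2.3) verifies the two defining divergence properties on small positive measures.

```python
import numpy as np

def sq_euclidean(p, q):
    """Squared Euclidean distance between two positive measures."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) ** 2)

def hellinger(p, q):
    """(Squared) Hellinger divergence, here 2 * sum (sqrt p - sqrt q)^2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.1, 0.4, 0.5])

# Both are nonnegative and vanish exactly when p = q.
assert sq_euclidean(p, q) > 0 and sq_euclidean(p, p) == 0
assert hellinger(p, q) > 0 and hellinger(p, p) == 0
```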

### 2.2  f-Divergence

Let $f$ be a convex and twice continuously differentiable function on $(0, \infty)$ satisfying $f(1) = 0$ and $f''(1) > 0$. For every $p, q \in \mathcal{M}_+$, define
$$D_f(p \| q) = \sum_{i=1}^{n} q_i\, f\!\left( \frac{p_i}{q_i} \right).$$
Then $D_f$ is called an f-divergence in $\mathcal{M}_+$.

The f-divergence was introduced by Csiszár (1963). (See also Taneja & Kumar, 2004, for its detailed properties.)
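The definition is straightforward to compute. The following sketch (illustrative names, not code from the paper) evaluates $D_f$ for the standard convex function $f(u) = u \log u - u + 1$ and checks that it reproduces the extended Kullback-Leibler divergence on positive measures.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p||q) = sum_i q_i * f(p_i / q_i) for positive measures p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

# Standard convex function generating the (extended) KL divergence:
# f(1) = 0, f'(1) = 0, f''(1) = 1.
f_kl = lambda u: u * np.log(u) - u + 1.0

p = np.array([0.5, 1.2, 0.3])   # positive measures: no normalization required
q = np.array([0.4, 0.9, 0.7])

# Extended KL divergence for positive measures.
kl = np.sum(p * np.log(p / q) - p + q)
assert np.isclose(f_divergence(p, q, f_kl), kl)
```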

#### 2.2.1  Case: In the Space of Probability Measures

In this case, we see that

1. The convexity of $f$ implies that $D_f(p \| q) = \sum_i q_i f(p_i / q_i) \ge f\!\left( \sum_i p_i \right) = f(1) = 0$, by Jensen's inequality.
2. If $f$ is strictly convex, then $D_f(p \| q) = 0$ if and only if $p = q$.
3. If we take $\tilde{f}(u) = f(u) - f'(1)(u - 1)$, we get $D_{\tilde{f}} = D_f$, since $\sum_i q_i (p_i / q_i - 1) = \sum_i (p_i - q_i) = 0$.
4. If we take $\tilde{f}(u) = f(u) / f''(1)$, we get $D_{\tilde{f}} = D_f / f''(1)$, a rescaling by a positive constant.

Properties 3 and 4 show that without loss of generality, we can require that $f$ satisfies
$$f(1) = 0, \quad f'(1) = 0, \quad f''(1) = 1.$$
As in Amari (2009), we call $f$ a standard convex function on $(0, \infty)$ if $f$ is a strictly convex and twice continuously differentiable function on $(0, \infty)$ satisfying these three conditions.

#### 2.2.2  Case: In the General Space of Positive Measures

Let $f$ be a standard convex function on $(0, \infty)$. Then we have
$$f(u) \ge f(1) + f'(1)(u - 1) = 0 \quad \text{for all } u > 0.$$
Hence,
• For every $p, q \in \mathcal{M}_+$, we obtain $D_f(p \| q) = \sum_i q_i f(p_i / q_i) \ge 0$.
• The strict convexity of $f$ implies that $D_f(p \| q) = 0$ if and only if $p = q$.

### 2.3  α-Divergence in the Space of Positive Measures

The Amari α-divergence (Amari, 1985) is another important parametric family of divergence functionals.

The definition of the α-divergence is as follows. For any $\alpha \in \mathbb{R}$, $\alpha \neq \pm 1$, set
$$f_\alpha(u) = \frac{4}{1 - \alpha^2} \left( \frac{1 - \alpha}{2} + \frac{1 + \alpha}{2}\, u - u^{(1+\alpha)/2} \right),$$
with the limits $f_1(u) = u \log u - u + 1$ and $f_{-1}(u) = u - 1 - \log u$ as $\alpha \to \pm 1$. Then the related f-divergence $D_{f_\alpha}$ is called the α-divergence, and
$$D_\alpha(p \| q) = \frac{4}{1 - \alpha^2} \sum_{i=1}^{n} \left( \frac{1 - \alpha}{2}\, q_i + \frac{1 + \alpha}{2}\, p_i - p_i^{(1+\alpha)/2} q_i^{(1-\alpha)/2} \right).$$
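With the standard parameterization of the α-divergence (valid for α ≠ ±1), the family is easy to evaluate numerically. The sketch below checks two well-known consistency facts: α = 0 recovers the squared Hellinger divergence, and α → 1 approaches the extended KL divergence.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence for positive measures (alpha != +-1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    a, b = (1.0 + alpha) / 2.0, (1.0 - alpha) / 2.0
    return 4.0 / (1.0 - alpha ** 2) * np.sum(b * q + a * p - p ** a * q ** b)

p = np.array([0.5, 1.2, 0.3])
q = np.array([0.4, 0.9, 0.7])

# alpha = 0 recovers the squared Hellinger divergence 2 * sum (sqrt p - sqrt q)^2.
hell = 2.0 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
assert np.isclose(alpha_divergence(p, q, 0.0), hell)

# alpha -> 1 approaches the extended KL divergence.
kl = np.sum(p * np.log(p / q) - p + q)
assert np.isclose(alpha_divergence(p, q, 1.0 - 1e-7), kl, atol=1e-5)
```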

### 2.4  Bregman Divergence

Let $S \subseteq \mathbb{R}^n$ be a convex set, and let $U\colon S \to \mathbb{R}$ be a continuously differentiable, real-valued, and strictly convex function. Recall that the Bregman divergence associated with $U$ for points $p, q \in S$ is defined by
$$D_U(p \| q) = U(p) - U(q) - \langle \nabla U(q),\, p - q \rangle.$$
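A minimal illustration: taking the strictly convex potential $U(p) = \sum_i p_i \log p_i - p_i$ on the positive orthant (a standard textbook choice, not specific to this note), the Bregman divergence reduces to the extended KL divergence.

```python
import numpy as np

def bregman(p, q, U, gradU):
    """D_U(p||q) = U(p) - U(q) - <grad U(q), p - q>."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return U(p) - U(q) - np.dot(gradU(q), p - q)

# U(p) = sum_i p_i log p_i - p_i, strictly convex on the positive orthant.
U = lambda p: np.sum(p * np.log(p) - p)
gradU = lambda p: np.log(p)

p = np.array([0.5, 1.2, 0.3])
q = np.array([0.4, 0.9, 0.7])

# For this potential, the Bregman divergence is the extended KL divergence.
kl = np.sum(p * np.log(p / q) - p + q)
assert np.isclose(bregman(p, q, U, gradU), kl)
```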

### 2.5  Bregman Divergence in the Space of Positive Measures

Let $\rho, \tau\colon (0, \infty) \to \mathbb{R}$ be strictly monotone functions, and write $\rho(p) = (\rho(p_1), \ldots, \rho(p_n))$.
Let $U$ be a twice continuously differentiable strictly convex function, that is, $U'' > 0$, linked to $\rho$ and $\tau$ by $U' = \tau \circ \rho^{-1}$.
For every $p, q \in \mathcal{M}_+$, define
$$D(p \| q) = \sum_{i=1}^{n} \left[ U(\rho(p_i)) - U(\rho(q_i)) - \tau(q_i) \left( \rho(p_i) - \rho(q_i) \right) \right],$$
where $\tau(q_i) = U'(\rho(q_i))$.
Then $D$ is called a Bregman divergence in the space of positive measures, which is associated with $(\rho, \tau)$.

## 3  On the Conjecture Presented by Amari (2009)

In this section, we give a counterexample to show that the conjecture presented by Amari (2009) is not generally true.

First, we recall the concept of information monotonicity.

Let $G_1, \ldots, G_m$ ($m < n$) be subsets of $X$ such that
$$X = \bigcup_{i=1}^{m} G_i, \qquad G_i \cap G_j = \emptyset \quad (i \neq j).$$
Then $\mathcal{G} = \{G_1, \ldots, G_m\}$ is called a partition of $X$, which is a coarsely grained version of $X$.
The partition naturally induces a distribution $\bar{p}$ over $\mathcal{G}$:
$$\bar{p}_i = \sum_{k \in G_i} p_k. \tag{3.1}$$
As a coarsely grained version of $X$, $\mathcal{G}$ loses information by summarizing elements within each subset $G_i$. So it is natural to stipulate a monotonic relation,
$$D(\bar{p} \| \bar{q}) \le D(p \| q),$$
where $\bar{p}$, $\bar{q}$ are the coarsely grained distributions of $p$, $q$, defined as in equation 3.1.
Consider the case
$$\frac{p_k}{q_k} = \frac{\bar{p}_i}{\bar{q}_i} \quad \text{for all } k \in G_i. \tag{3.2}$$
Here $p_k$ and $q_k$ are proportional inside each class $G_i$; that is, the conditional distributions of $p$ and $q$ are equal, conditioned on each $G_i$. Then it is natural to assume that
$$D(\bar{p} \| \bar{q}) = D(p \| q),$$
because the details inside $G_i$ do not give any information distinguishing $p$ from $q$. The equality holds only in this case.

The above properties are called information monotonicity.

As Csiszár (1974) found, every f-divergence has information monotonicity. Indeed, since $f$ is a convex function, Jensen's inequality gives
$$\bar{q}_i\, f\!\left( \frac{\bar{p}_i}{\bar{q}_i} \right) = \bar{q}_i\, f\!\left( \sum_{k \in G_i} \frac{q_k}{\bar{q}_i} \cdot \frac{p_k}{q_k} \right) \le \sum_{k \in G_i} q_k\, f\!\left( \frac{p_k}{q_k} \right),$$
so that $D_f(\bar{p} \| \bar{q}) \le D_f(p \| q)$; and for the case of equation 3.2, the ratios $p_k / q_k$ are constant on each $G_i$, so equality holds. Therefore, the f-divergence has information monotonicity. Moreover, from Csiszár (1974) and Amari (2009), we know that the f-divergence is the only class of decomposable information-monotonic divergences.
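The coarse-graining inequality, and the equality case of equation 3.2, can be verified numerically. The sketch below does so for the KL-generating standard convex function; the partition and the measures are arbitrary illustrative choices.

```python
import numpy as np

def f_divergence(p, q, f):
    return np.sum(q * f(np.asarray(p, float) / np.asarray(q, float)))

def coarse_grain(p, partition):
    """Sum the masses of p inside each block G_i of the partition."""
    return np.array([p[G].sum() for G in partition])

f_kl = lambda u: u * np.log(u) - u + 1.0

partition = [[0, 1], [2, 3]]          # G_1 = {x_1, x_2}, G_2 = {x_3, x_4}
p = np.array([0.1, 0.3, 0.2, 0.4])
q = np.array([0.25, 0.25, 0.3, 0.2])

# Coarse-graining never increases an f-divergence.
assert f_divergence(coarse_grain(p, partition), coarse_grain(q, partition), f_kl) \
       <= f_divergence(p, q, f_kl)

# Equality when p and q are proportional inside each block (equation 3.2).
q_prop = np.array([0.2, 0.6, 0.1, 0.2])   # q_k = c_i * p_k on each G_i
assert np.isclose(
    f_divergence(coarse_grain(p, partition), coarse_grain(q_prop, partition), f_kl),
    f_divergence(p, q_prop, f_kl))
```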

We are now in a position to give a counterexample to show that the conjecture above is not generally true.

Set
3.3
3.4
Then $f_1$ and $f_2$ are two standard convex functions on $(0, \infty)$. Let $D_1$ and $D_2$ be the f-divergences derived from $f_1$ and $f_2$, respectively, and
3.5
where . Then it is clear that
and
Moreover, for small ,
gives a positive-definite quadratic form. Therefore, is a divergence.

Since D1 and D2 are f-divergences, they have an information monotonicity property. Thus, it is easy to see that in equation 3.5 has an information monotonicity property.
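Because the explicit formulas 3.3 and 3.4 for $f_1$ and $f_2$ are specific to the construction, the sketch below illustrates the general mechanism with two hypothetical stand-in standard convex functions: a sum of f-divergences inherits information monotonicity term by term.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_divergence(p, q, f):
    return np.sum(q * f(p / q))

def coarse_grain(p, partition):
    return np.array([p[G].sum() for G in partition])

# Hypothetical stand-ins for f1 and f2 (the paper's explicit formulas are
# not reproduced here): two standard convex functions.
f1 = lambda u: u * np.log(u) - u + 1.0        # generates KL
f2 = lambda u: 2.0 * (np.sqrt(u) - 1.0) ** 2  # generates the alpha = 0 divergence

partition = [[0, 1], [2, 3, 4]]
for _ in range(100):
    p, q = rng.uniform(0.1, 2.0, 5), rng.uniform(0.1, 2.0, 5)
    pc, qc = coarse_grain(p, partition), coarse_grain(q, partition)
    D = lambda a, b: f_divergence(a, b, f1) + f_divergence(a, b, f2)
    # The sum inherits information monotonicity from its summands.
    assert D(pc, qc) <= D(p, q) + 1e-12
```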

Next, we show by contradiction (i.e., the reductio ad absurdum argument) that is not a function of an f-divergence. Suppose this is not true. Then there exists an f-divergence Df and a function such that
3.6
that is, for all with and , we have
3.7
Clearly, equation 3.7 holds for the special case
that is,
Hence,
3.8
Next, we prove that equation 3.8 leads to a contradiction. Since equation 3.6 implies equation 3.8, it follows that equation 3.6 does not hold; that is, the divergence defined in equation 3.5 is not a function of an f-divergence.
Set
Then by equations 3.8, 3.3, and 3.4, we have
3.9

Clearly, there exists a such that . Otherwise, for all , which would mean that the left-hand side of equation 3.9 is a constant . Taking on the right side of equation 3.9, we get . Then, taking on the right side of equation 3.9, we see that , but this is false.

Moreover, we know that
by noting that . For this fixed y0, we set
3.10
Then we obtain
In view of equation 3.9, we get, for ,
3.11
Since f is a standard convex function, that is, f is a twice continuously differentiable convex function on such that
it follows from L’Hospital’s rule that
Therefore,
Thus, by equation 3.11, we have
Hence,
3.12
Moreover, differentiating equation 3.11 twice with respect to $x$ (or comparing the coefficients of $x^2$ in equation 3.11) gives
3.13
Clearly,
So by equation 3.13, we have
3.14
By equations 3.12 and 3.14, we get
This, together with equation 3.10, implies that
So,
that is,

Consequently, the divergence defined in equation 3.5 is not a function of an f-divergence; that is, the conjecture above is not true for .

Remark 1.

For the special case of $n = 2$, Amari's conjecture had already been disproved by Jiao et al. (2015). (See also Amari, 2016.) Jiao et al. (2015) proved that the $n = 2$ case (binary $X$) differs from the general case: when $n = 2$, an information-monotone divergence of the Bregman type is not necessarily a KL-divergence. They also gave a general form of information-monotone divergence, which includes divergences that cannot be written as a function of an f-divergence.

We have just disproved Amari's conjecture in the general case.

## 4  Weighted f-Divergence and Weighted -Divergence

In this section, we are concerned with two classes of divergence within the framework of Zhang (2004): the weighted f-divergence and the weighted α-divergence, which generalize the well-known f-divergence and α-divergence, respectively.

Definition 1.
Let $f$ be a standard convex function on $(0, \infty)$, and let $w = (w_1, \ldots, w_n)$ with $w_i > 0$. For every $p, q \in \mathcal{M}_+$, define
$$D_{f,w}(p \| q) = \sum_{i=1}^{n} w_i\, q_i\, f\!\left( \frac{p_i}{q_i} \right).$$
Then $D_{f,w}$ is called a weighted f-divergence in the space of positive measures.

The weighted f-divergence related to $f_\alpha$ is thus called the weighted α-divergence.

Obviously, the f-divergence (resp. α-divergence) is the special case of the weighted f-divergence (resp. weighted α-divergence) with
$$w_1 = \cdots = w_n = 1.$$
Moreover, since $f$ is a standard convex function on $(0, \infty)$, we have
$$f(u) \ge f(1) + f'(1)(u - 1) = 0 \quad \text{for all } u > 0.$$
Hence,
• For every $p, q \in \mathcal{M}_+$, $D_{f,w}(p \| q) \ge 0$.
• The strict convexity of $f$ implies that $D_{f,w}(p \| q) = 0$ if and only if $p = q$.
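Taking the weighted form $D_{f,w}(p \| q) = \sum_i w_i q_i f(p_i / q_i)$ with positive weights $w_i$ as a working assumption, these properties, and the reduction to the ordinary f-divergence at unit weights, can be checked numerically:

```python
import numpy as np

def weighted_f_divergence(p, q, w, f):
    """D_{f,w}(p||q) = sum_i w_i * q_i * f(p_i / q_i), with weights w_i > 0."""
    p, q, w = (np.asarray(a, float) for a in (p, q, w))
    return np.sum(w * q * f(p / q))

f_kl = lambda u: u * np.log(u) - u + 1.0

p = np.array([0.5, 1.2, 0.3])
q = np.array([0.4, 0.9, 0.7])
w = np.array([2.0, 0.5, 1.5])

# Nonnegative, and zero only at p = q, for any positive weights.
assert weighted_f_divergence(p, q, w, f_kl) > 0
assert np.isclose(weighted_f_divergence(p, p, w, f_kl), 0.0)

# With unit weights it reduces to the ordinary f-divergence.
assert np.isclose(weighted_f_divergence(p, q, np.ones(3), f_kl),
                  np.sum(q * f_kl(p / q)))
```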
Lemma 1.
Let be a Bregman divergence in . If is a decomposable divergence, then there exist such that are strictly convex functions, and for every
4.1
Moreover, in this case, we have, for every ,
4.2
Proof.
Since the divergence is decomposable, we know that there are divergences $D_i$ on $(0, \infty) \times (0, \infty)$ such that, for any
4.3
By the definition of a Bregman divergence in , we have
Hence,
So,
4.4
Therefore,
Thus,
4.5
where , and
For each and , we set
Then by equation 4.5, we see that for every
4.6
and is a strictly convex function. The equality 4.6 implies that equation 4.1 is true.

Moreover, combining equations 4.3, 4.4, and 4.6, we deduce that equation 4.2 holds.

Now, we are in a position to prove that if a divergence is a weighted f-divergence as well as a Bregman divergence, then it is a weighted -divergence. This means that the weighted -divergence is the unique class of divergences sitting at the intersection of the weighted f-divergence and Bregman divergence classes. This result generalizes the main theorem of Amari (2009). Moreover, the approach in our proof is somewhat different from that in Amari (2009).

Theorem 1.

If a divergence is a weighted f-divergence as well as a Bregman divergence, then it is a weighted -divergence.

Proof.
Let the divergence be both a weighted f-divergence and a Bregman divergence in the space of positive measures. Then the definition of a weighted f-divergence says that
for every . Clearly, a weighted f-divergence is decomposable. Therefore, by lemma 1, we know that there exist such that are strictly convex functions, and for every ,
Therefore,
On the other hand
So, we obtain
Hence,
Let
Then we have
Since f is a standard convex function, that is, f is a twice continuously differentiable convex function on such that
we see that for all ,
It follows from a basic theorem on functional equations that
where is a constant.
Therefore, we obtain
Let . Then
This means that the related weighted f-divergence is a weighted -divergence.

## 5  Conclusion

In this note, we have studied the conjecture presented by Amari (2009) and found a counterexample showing that the conjecture does not hold in general. Moreover, we have investigated two classes of divergence within the framework of Zhang (2004), the weighted f-divergence and the weighted α-divergence, and proved that if a divergence is both a weighted f-divergence and a Bregman divergence, then it is a weighted α-divergence. This result reduces in form to the main theorem established by Amari (2009) when all weights equal 1.

## Acknowledgments

I thank the referees very much for helpful comments and suggestions, which led to my revision of section 3. Moreover, I am very grateful to the referees for bringing the references Amari (2016) and Jiao et al. (2015) to my attention.

## References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer.

Amari, S. (2007). Integration of stochastic models by minimizing α-divergence. Neural Computation, 19, 2780–2796.

Amari, S. (2009). α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory, 55, 4925–4931.

Amari, S. (2016). Information geometry and its applications. Berlin: Springer.

Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 200–217.

Cichocki, A., Zdunek, R., Phan, A. H., & Amari, S. (2009). Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. New York: Wiley.

Csiszár, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, 8, 85–108.

Csiszár, I. (1974). Information measures: A critical survey. In Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (pp. 73–86). Prague: Academia.

Eguchi, S. (2008). Information divergence geometry and the application to statistical machine learning. In F. Emmert-Streib & M. Dehmer (Eds.), Information theory and statistical learning (pp. 309–332). Berlin: Springer.

Eguchi, S., & Copas, J. (2002). A class of logistic-type discriminant functions. Biometrika, 89, 1–22.

Jiao, J., Courtade, T. A., No, A., Venkat, K., & Weissman, T. (2015). Information measures: The curious case of the binary alphabet. IEEE Transactions on Information Theory, 60, 7616–7626.

Murata, N., Takenouchi, T., Kanamori, T., & Eguchi, S. (2004). Information geometry of U-boost and Bregman divergence. Neural Computation, 16, 1651–1686.

Nielsen, F. (Ed.). (2009). Emerging trends in visual computing. Berlin: Springer-Verlag.

Nielsen, F., & Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55, 2882–2904.

Rao, C. R. (1987). Differential metrics in probability spaces. In S. Amari, O. Barndorff-Nielsen, R. Kass, S. Lauritzen, & C. R. Rao (Eds.), Differential geometry in statistical inference (pp. 217–240). Hayward, CA: Institute of Mathematical Statistics.

Taneja, I. J., & Kumar, P. (2004). Relative information of type s, Csiszár's f-divergence, and information inequalities. Information Sciences, 166, 105–125.

Zhang, J. (2004). Divergence function, duality, and convex analysis. Neural Computation, 16, 159–195.