## Abstract

In many areas of neural computation, such as learning, optimization, estimation, and inference, suitable divergences play a key role. In this note, we study the conjecture presented by Amari (2009) and give a counterexample showing that the conjecture does not hold in general. Moreover, we investigate two classes of divergence introduced by Zhang (2004), the weighted *f*-divergence and the weighted α-divergence, and prove that if a divergence is both a weighted *f*-divergence and a Bregman divergence, then it is a weighted α-divergence. This result reduces in form to the main theorem established by Amari (2009) when all the weights are equal.

## 1 Introduction

In many areas of neural computation, such as learning, optimization, estimation, and inference, various divergences, which measure the discrepancy between two points, two probability distributions, or two positive measures, play a key role. Divergences are therefore fundamental objects in information theory, statistics, mathematical programming, computational vision, and neural networks. Well-known important divergences include the Kullback-Leibler divergence, *f*-divergence (Csiszár, 1963), Bregman divergence (Bregman, 1967), α-divergence (Amari, 1985), Jensen difference (Rao, 1987), the divergence family of Zhang (2004), and *U*-divergence (Eguchi, 2008). So far, the theory of divergences and its applications have been well developed (see the references just mentioned, and Ackley, Hinton, & Sejnowski, 1985; Amari, 2007, 2009, 2016; Amari & Nagaoka, 2000; Cichocki, Zdunek, Phan, & Amari, 2009; Csiszár, 1974; Eguchi & Copas, 2002; Jiao, Courtade, No, Venkat, & Weissman, 2015; Murata, Takenouchi, Kanamori, & Eguchi, 2004; Nielsen, 2009; Nielsen & Nock, 2009; Taneja & Kumar, 2004).

Inspired by these significant works, especially those of Amari (2009) and Zhang (2004), in this note we study the relationships between some divergences in the space of positive measures. As shown in previous literature (Cichocki et al., 2009; Murata et al., 2004; Nielsen, 2009), in order to deal with more complex problems from the real world, the ordinary constraint that a probability distribution has total mass 1 needs to be relaxed in many cases. A typical example is a visual signal, which is normally a two-dimensional array with nonnegative elements. We therefore choose the space of positive measures as the basic setting of this work. We first study the following conjecture, put forward by Amari (2009):

**Conjecture**: When a divergence satisfies information monotonicity, it is a function of an *f*-divergence.

We find a counterexample showing that this conjecture does not hold in general.

Second, we investigate two classes of divergence introduced by Zhang (2004): the weighted *f*-divergence and the weighted α-divergence, which generalize the well-known *f*-divergence and α-divergence, respectively. We prove that if a divergence is both a weighted *f*-divergence and a Bregman divergence, then it is a weighted α-divergence. This result reduces in form to the main theorem established by Amari (2009) when all the weights are equal.

## 2 Several Classical Divergences and Their Basic Properties

In this section, based on the work cited in section 1, we recall several classical divergences and present their basic properties and the corresponding proofs.

### 2.1 Divergence

A divergence $D(\mathbf{p}:\mathbf{q})$ between two points $\mathbf{p}$ and $\mathbf{q}$ is a smooth function satisfying (1) $D(\mathbf{p}:\mathbf{q}) \ge 0$, with equality if and only if $\mathbf{p} = \mathbf{q}$, and (2) for an infinitesimal displacement $d\mathbf{p}$, the Taylor expansion

$$D(\mathbf{p}:\mathbf{p}+d\mathbf{p}) = \frac{1}{2}\sum_{i,j} g_{ij}(\mathbf{p})\, dp_i\, dp_j + O(\|d\mathbf{p}\|^3)$$

gives a positive-definite quadratic form.
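As a numerical sketch of this local quadratic behavior (our illustration; the extended Kullback-Leibler divergence $D(\mathbf{p}:\mathbf{q}) = \sum_i p_i \log(p_i/q_i) + q_i - p_i$ serves as a concrete divergence, whose quadratic form is $\tfrac{1}{2}\sum_i dp_i^2/p_i$):

```python
import math

def kl_ext(p, q):
    """Extended KL divergence for positive measures (total mass need not be 1)."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p = [0.5, 1.2, 0.3]
eps = 1e-4
dp = [eps, -2 * eps, eps]  # a small perturbation of p

# Quadratic form predicted by the definition of a divergence: (1/2) sum_i dp_i^2 / p_i
quad = 0.5 * sum(d * d / pi for d, pi in zip(dp, p))
actual = kl_ext(p, [pi + d for pi, d in zip(p, dp)])

print(abs(actual - quad) < 1e-9)  # True: agreement up to higher-order terms
```

The positive-definite matrix recovered here is diagonal, $g_{ii} = 1/p_i$, the Fisher-type metric induced by the extended KL divergence.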

We refer readers to the important work of Zhang (2004) for another more general definition of divergences.

### 2.2 *f*-Divergence

#### 2.2.1 Case: In the Probability Simplex

In this case, the *f*-divergence (Csiszár, 1963) between two probability distributions $\mathbf{p}$ and $\mathbf{q}$ is

$$D_f(\mathbf{p}:\mathbf{q}) = \sum_i p_i f\!\left(\frac{q_i}{p_i}\right),$$

where *f* is a convex function satisfying $f(1) = 0$. In this case, we see that $D_f$ is unchanged when $f(u)$ is replaced by $f(u) - c(u-1)$, because $\sum_i p_i (q_i/p_i - 1) = 0$; hence *f* may be normalized so that $f'(1) = 0$ and, by rescaling, $f''(1) = 1$. As in Amari (2009), we call *f* a standard convex function if *f* is a strictly convex and twice continuously differentiable function satisfying $f(1) = 0$, $f'(1) = 0$, and $f''(1) = 1$.
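For instance (our illustration), $f(u) = u - 1 - \log u$ is a standard convex function, and under the convention $D_f(\mathbf{p}:\mathbf{q}) = \sum_i p_i f(q_i/p_i)$ it generates the extended Kullback-Leibler divergence; the normalizations can be checked by finite differences:

```python
import math

def f(u):
    """Candidate standard convex function: f(1) = 0, f'(1) = 0, f''(1) = 1."""
    return u - 1 - math.log(u)

def f_divergence(p, q):
    """D_f(p:q) = sum_i p_i f(q_i / p_i); for this f it is the extended KL divergence."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

h = 1e-5
print(f(1.0))                                                     # 0.0 exactly
print(abs((f(1 + h) - f(1 - h)) / (2 * h)) < 1e-6)                # True: f'(1) = 0
print(abs((f(1 + h) - 2 * f(1.0) + f(1 - h)) / h**2 - 1) < 1e-3)  # True: f''(1) = 1

p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]
print(f_divergence(p, q) > 0)  # True: positivity for p != q
```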

#### 2.2.2 Case: In the General Space of Positive Measures

In the general space of positive measures, where the total mass $\sum_i p_i$ need not equal 1, the *f*-divergence is defined by the same formula, $D_f(\mathbf{p}:\mathbf{q}) = \sum_i p_i f(q_i/p_i)$, but now *f* is required to be a standard convex function, since the normalization term $c(u-1)$ no longer drops out.

### 2.3 α-Divergence in the Space of Positive Measures

The Amari α-divergence (Amari, 1985) is another important parametric family of divergence functionals. On the space of positive measures, it is given by

$$D_\alpha(\mathbf{p}:\mathbf{q}) = \frac{4}{1-\alpha^2} \sum_i \left( \frac{1-\alpha}{2}\, p_i + \frac{1+\alpha}{2}\, q_i - p_i^{\frac{1-\alpha}{2}} q_i^{\frac{1+\alpha}{2}} \right), \qquad \alpha \ne \pm 1,$$

with the limits $\alpha \to \mp 1$ yielding the Kullback-Leibler divergence and its dual.
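As a numerical sketch (our illustration, using the parameterization $D_\alpha(\mathbf{p}:\mathbf{q}) = \frac{4}{1-\alpha^2}\sum_i \big(\frac{1-\alpha}{2}p_i + \frac{1+\alpha}{2}q_i - p_i^{(1-\alpha)/2} q_i^{(1+\alpha)/2}\big)$), the family is nonnegative for every $\alpha \ne \pm 1$, and at $\alpha = 0$ it reduces to $2\sum_i(\sqrt{p_i}-\sqrt{q_i})^2$:

```python
def alpha_divergence(p, q, alpha):
    """Amari's alpha-divergence on positive measures (alpha != +-1)."""
    c = 4.0 / (1.0 - alpha * alpha)
    return c * sum((1 - alpha) / 2 * pi + (1 + alpha) / 2 * qi
                   - pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2)
                   for pi, qi in zip(p, q))

p = [0.4, 1.1, 0.5]
q = [0.6, 0.9, 0.5]

for a in (-0.5, 0.0, 0.5, 3.0):
    assert alpha_divergence(p, q, a) > 0           # positive since p != q
    assert abs(alpha_divergence(p, p, a)) < 1e-12  # zero at p = q

# At alpha = 0 the family reduces to 2 * sum_i (sqrt(p_i) - sqrt(q_i))^2.
hellinger = 2 * sum((pi ** 0.5 - qi ** 0.5) ** 2 for pi, qi in zip(p, q))
print(abs(alpha_divergence(p, q, 0.0) - hellinger) < 1e-12)  # True
```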

### 2.4 Bregman Divergence

The Bregman divergence (Bregman, 1967) induced by a strictly convex, differentiable function $\varphi$ is

$$D_\varphi(\mathbf{p}:\mathbf{q}) = \varphi(\mathbf{p}) - \varphi(\mathbf{q}) - \nabla\varphi(\mathbf{q}) \cdot (\mathbf{p}-\mathbf{q}).$$

### 2.5 Bregman Divergence in the Space of Positive Measures

A Bregman divergence on the space of positive measures is called decomposable when $\varphi(\mathbf{p}) = \sum_i \varphi_i(p_i)$ for strictly convex functions $\varphi_i$.
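As an illustration (our example), the decomposable potential $\varphi(\mathbf{p}) = \sum_i (p_i \log p_i - p_i)$ generates, via $D_\varphi(\mathbf{p}:\mathbf{q}) = \varphi(\mathbf{p}) - \varphi(\mathbf{q}) - \nabla\varphi(\mathbf{q}) \cdot (\mathbf{p}-\mathbf{q})$, exactly the extended Kullback-Leibler divergence:

```python
import math

def phi(p):
    """Strictly convex, decomposable potential: phi(p) = sum_i (p_i log p_i - p_i)."""
    return sum(pi * math.log(pi) - pi for pi in p)

def bregman(p, q):
    """D_phi(p:q) = phi(p) - phi(q) - <grad phi(q), p - q>, with grad phi(q)_i = log q_i."""
    return phi(p) - phi(q) - sum(math.log(qi) * (pi - qi) for pi, qi in zip(p, q))

def kl_ext(p, q):
    """Extended KL divergence on positive measures."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

p = [0.7, 0.2, 1.4]
q = [0.5, 0.4, 1.0]

print(abs(bregman(p, q) - kl_ext(p, q)) < 1e-12)  # True: the two forms coincide
print(bregman(p, q) > 0)                          # True: strict convexity of phi
```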

## 3 On the Conjecture Presented by Amari (2009)

In this section, we give a counterexample to show that the conjecture presented by Amari (2009) is not generally true.

First, we recall the concept of information monotonicity.

Let $G = \{G_1, \dots, G_m\}$ be a partition of the index set $X = \{1, \dots, n\}$ into $m$ disjoint subsets, and for a positive measure $\mathbf{p} = (p_1, \dots, p_n)$ define the coarse-grained measure $\bar{\mathbf{p}} = (\bar{p}_1, \dots, \bar{p}_m)$ by

$$\bar{p}_k = \sum_{i \in G_k} p_i, \qquad k = 1, \dots, m. \tag{3.1}$$

Coarse-graining over $X$ loses information by summarizing elements within each subset $G_k$. So it is natural to stipulate a monotonic relation,

$$D(\bar{\mathbf{p}} : \bar{\mathbf{q}}) \le D(\mathbf{p} : \mathbf{q}),$$

where $\bar{\mathbf{p}}$, $\bar{\mathbf{q}}$ are the coarse-grained measures of $\mathbf{p}$, $\mathbf{q}$ defined as in equation 3.1. Suppose now that $p_i$ and $q_i$ are proportional inside each class $G_k$, that is, the conditional distributions of $\mathbf{p}$ and $\mathbf{q}$ are equal conditioned on $G_k$:

$$\frac{p_i}{\bar{p}_k} = \frac{q_i}{\bar{q}_k}, \qquad i \in G_k. \tag{3.2}$$

Then it is natural to assume that $D(\bar{\mathbf{p}} : \bar{\mathbf{q}}) = D(\mathbf{p} : \mathbf{q})$, because the details within each $G_k$ give no information distinguishing $\mathbf{p}$ from $\mathbf{q}$; the equality holds only in this case. The above properties are called information monotonicity.

The *f*-divergence has information monotonicity. Indeed, since *f* is a convex function, Jensen's inequality gives

$$\sum_{i \in G_k} p_i f\!\left(\frac{q_i}{p_i}\right) \ge \bar{p}_k f\!\left(\frac{\bar{q}_k}{\bar{p}_k}\right),$$

so that $D_f(\bar{\mathbf{p}} : \bar{\mathbf{q}}) \le D_f(\mathbf{p} : \mathbf{q})$; and in the case of equation 3.2, the ratio $q_i/p_i$ is constant on each $G_k$, so equality holds. Therefore, the *f*-divergence has information monotonicity. Moreover, from Csiszár (1974) and Amari (2009), we know that the *f*-divergence is the only class of decomposable information-monotonic divergences.
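The argument above can be checked numerically (a sketch, taking the extended KL divergence as a concrete *f*-divergence): coarse-graining never increases $D_f$, and no information is lost when $\mathbf{p}$ and $\mathbf{q}$ are proportional within each group:

```python
import math

def f_divergence(p, q):
    """f-divergence with standard convex f(u) = u - 1 - log u (extended KL divergence)."""
    return sum(pi * (qi / pi - 1 - math.log(qi / pi)) for pi, qi in zip(p, q))

def coarse_grain(p, groups):
    """Merge the components of p within each group G_k (as in equation 3.1)."""
    return [sum(p[i] for i in g) for g in groups]

groups = [(0, 1), (2, 3)]

# Generic p, q: coarse-graining strictly loses information.
p = [0.1, 0.4, 0.3, 0.2]
q = [0.3, 0.2, 0.1, 0.4]
print(f_divergence(coarse_grain(p, groups), coarse_grain(q, groups))
      < f_divergence(p, q))  # True: the inequality is strict here

# q2 proportional to p inside each group (ratio 2 on G_1, ratio 1.5 on G_2):
q2 = [0.2, 0.8, 0.45, 0.3]
lhs = f_divergence(coarse_grain(p, groups), coarse_grain(q2, groups))
print(abs(lhs - f_divergence(p, q2)) < 1e-12)  # True: equality, nothing is lost
```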

We are now in a position to give a counterexample to show that the conjecture above is not generally true.

Since *D*_{1} and *D*_{2} are *f*-divergences, they have the information monotonicity property. Thus, it is easy to see that the divergence defined in equation 3.5 also has the information monotonicity property.

Next, we prove that the divergence defined in equation 3.5 is not a function of an *f*-divergence. Suppose this is not true. Then the divergence can be written as a function of some *f*-divergence $D_f$, as stated in equation 3.6; that is, for all admissible $\mathbf{p}$ and $\mathbf{q}$, equation 3.7 holds. Clearly, equation 3.7 holds for the special case recorded in equation 3.8. Hence, equation 3.9 follows. Next, we prove that equation 3.8 leads to a contradiction, which means that equation 3.6 leads to a contradiction as well. This implies that equation 3.6 (and even equation 3.8) does not hold; that is, the divergence in equation 3.5 is not a function of an *f*-divergence.

Clearly, there exists a $y_0$ at which the quantity in question is nonzero. Otherwise it would vanish identically, which means that the left side of equation 3.9 would be a constant; evaluating the right side of equation 3.9 at two suitable arguments would then give two different values for this constant, which is false.

For such a $y_0$, we define auxiliary quantities and obtain equation 3.10. In view of equation 3.9, we then get, for the relevant range of $x$, equation 3.11. Since *f* is a standard convex function, that is, a twice continuously differentiable, strictly convex function satisfying $f(1) = 0$, $f'(1) = 0$, and $f''(1) = 1$, it follows from L'Hôpital's rule, applied to equation 3.11, that equation 3.12 holds. Moreover, differentiating equation 3.11 twice with respect to $x$ (or comparing the coefficients of $x^2$ in equation 3.11) gives equation 3.13, and hence equation 3.14. Combining equations 3.12 and 3.14 with equation 3.10, we obtain a contradiction with the choice of $y_0$.

Consequently, the divergence defined in equation 3.5 is not a function of an *f*-divergence; that is, the conjecture above is not true in general.

For the special case of *n* = 2, Amari's conjecture had already been disproved by Jiao et al. (2015); see also Amari (2016). Jiao et al. (2015) proved that the *n* = 2 case (binary *X*) differs from the general case: for *n* = 2, an information-monotone divergence of the Bregman type is not necessarily a KL-divergence, and they gave a general form of information-monotone divergence, which includes divergences that cannot be written as a function of an *f*-divergence.

We have thus disproved Amari's conjecture in the general case.

## 4 Weighted *f*-Divergence and Weighted α-Divergence

In this section, we are concerned with two classes of divergence introduced by Zhang (2004): the weighted *f*-divergence and the weighted α-divergence, which generalize the well-known *f*-divergence and α-divergence, respectively.

Consider a divergence *D* on the space of positive measures such that the stated condition holds for any admissible arguments. By the definition of a Bregman divergence, we obtain the identities in equations 4.3 and 4.4 and, after introducing the quantities they involve, equation 4.6. Moreover, combining equations 4.3, 4.4, and 4.6, we deduce that equation 4.2 holds.

Now we are in a position to prove that if a divergence is both a weighted *f*-divergence and a Bregman divergence, then it is a weighted α-divergence. This means that the weighted α-divergence is the unique class of divergences sitting at the intersection of the classes of weighted *f*-divergences and Bregman divergences. This result generalizes the main theorem of Amari (2009). Moreover, the approach in our proof is somewhat different from that of Amari (2009).
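The unweighted version of this intersection result (Amari, 2009) can be verified numerically (our sketch): the extended KL divergence is at once an *f*-divergence (with the standard convex function $f(u) = u - 1 - \log u$), a Bregman divergence (with potential $\varphi(\mathbf{p}) = \sum_i p_i \log p_i - p_i$), and the $\alpha \to -1$ limit of the α-divergence:

```python
import math

def kl_ext(p, q):
    """Extended KL divergence: sum_i p_i log(p_i/q_i) + q_i - p_i."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def f_div(p, q):
    """f-divergence with the standard convex function f(u) = u - 1 - log u."""
    return sum(pi * (qi / pi - 1 - math.log(qi / pi)) for pi, qi in zip(p, q))

def bregman(p, q):
    """Bregman divergence of phi(p) = sum_i (p_i log p_i - p_i); grad phi(q)_i = log q_i."""
    return sum((pi * math.log(pi) - pi) - (qi * math.log(qi) - qi)
               - math.log(qi) * (pi - qi) for pi, qi in zip(p, q))

def alpha_div(p, q, alpha):
    """Amari's alpha-divergence (alpha != +-1); its alpha -> -1 limit is kl_ext."""
    c = 4.0 / (1.0 - alpha * alpha)
    return c * sum((1 - alpha) / 2 * pi + (1 + alpha) / 2 * qi
                   - pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2)
                   for pi, qi in zip(p, q))

p = [0.6, 0.3, 1.2]
q = [0.4, 0.5, 1.0]
d = kl_ext(p, q)

print(abs(f_div(p, q) - d) < 1e-12)                # True: KL as an f-divergence
print(abs(bregman(p, q) - d) < 1e-12)              # True: KL as a Bregman divergence
print(abs(alpha_div(p, q, -1 + 1e-6) - d) < 1e-4)  # True: KL as a limit of alpha-divergences
```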

**Theorem**: If a divergence is a weighted *f*-divergence as well as a Bregman divergence, then it is a weighted α-divergence.

Let *D* be both a weighted *f*-divergence and a Bregman divergence on the space of positive measures. Then the definition of a weighted *f*-divergence holds for every pair of arguments, and a weighted *f*-divergence is clearly decomposable. Therefore, by the lemma above, there exist strictly convex functions realizing *D* componentwise as a decomposable Bregman divergence. Comparing the two representations of *D* and rearranging, we obtain a functional equation for *f*. Since *f* is a standard convex function, that is, a twice continuously differentiable, strictly convex function satisfying $f(1) = 0$, $f'(1) = 0$, and $f''(1) = 1$, it follows from a basic theorem about functional equations that *f* is determined up to a constant, and hence *D* is a weighted α-divergence.

## 5 Conclusion

In this note, we have studied the conjecture presented by Amari (2009) and found a counterexample showing that the conjecture does not hold in general. Moreover, we have investigated two classes of divergence introduced by Zhang (2004), the weighted *f*-divergence and the weighted α-divergence, and proved that if a divergence is both a weighted *f*-divergence and a Bregman divergence, then it is a weighted α-divergence. This result reduces in form to the main theorem established by Amari (2009) when all the weights are equal.

## Acknowledgments

## References

Amari, S. (2009). α-divergence is unique, belonging to both *f*-divergence and Bregman divergence classes. *IEEE Transactions on Information Theory*, 55(11), 4925–4931.

Murata, N., Takenouchi, T., Kanamori, T., & Eguchi, S. (2004). Information geometry of *U*-boost and Bregman divergence. *Neural Computation*, 16(7), 1437–1481.

Taneja, I. J., & Kumar, P. (2004). Relative information of type *s*, Csiszár's *f*-divergence, and information inequalities. *Information Sciences*, 166, 105–125.