Abstract

Supervised and unsupervised vector quantization methods for classification and clustering traditionally use dissimilarities, frequently taken as Euclidean distances. In this article, we investigate the applicability of divergences instead, focusing on online learning. We deduce the mathematical fundamentals for their utilization in gradient-based online vector quantization algorithms. This relies on the generalized derivatives of the divergences, known as Fréchet derivatives in functional analysis, which reduce to partial derivatives in a natural way for finite-dimensional problems. We demonstrate the application of this methodology to widely applied supervised and unsupervised online vector quantization schemes, including self-organizing maps, neural gas, and learning vector quantization. Additionally, principles for hyperparameter optimization and relevance learning for parameterized divergences in the case of supervised vector quantization are given to achieve improved classification accuracy.

1.  Introduction

Supervised and unsupervised vector quantization for classification and clustering is strongly associated with the concept of dissimilarity, usually judged in terms of distances. The most common choice is the Euclidean metric. Recently, however, alternative dissimilarity measures have become attractive for advanced data processing. Examples are functional metrics like Sobolev distances or kernel-based dissimilarity measures (Villmann & Schleif, 2009; Lee & Verleysen, 2007). These metrics take the functional structure of the data into account (Lee & Verleysen, 2005; Ramsay & Silverman, 2006; Rossi, Delannay, Conan-Guez, & Verleysen, 2005; Villmann, 2007).

Information theory–based vector quantization approaches considering divergences have been proposed for clustering (Banerjee, Merugu, Dhillon, & Ghosh, 2005; Jang, Fyfe, & Ko, 2008; Lehn-Schiøler, Hegde, Erdogmus, & Principe, 2005; Hegde, Erdogmus, Lehn-Schiøler, Rao, & Principe, 2004). For other data processing methods like multidimensional scaling (MDS; Lai & Fyfe, 2009), stochastic neighbor embedding (Maaten & Hinton, 2008), blind source separation (Minami & Eguchi, 2002), or nonnegative matrix factorization (Cichocki, Lee, Kim, & Choi, 2008), divergence-based approaches are also introduced. In prototype-based classification, first information-theoretic approaches have been proposed (Erdogmus, 2002; Torkkola, 2003; Villmann, Hammer, Schleif, Hermann, & Cottrell, 2008).

Yet a systematic analysis of prototype-based clustering and classification relying on divergences has not been given so far. Further, the existing approaches are usually carried out in batch mode for optimization and are not available for online learning, which requires calculating the derivatives of the underlying metrics (i.e., divergences).

In this letter, we offer a systematic approach for divergence-based vector quantization using divergence derivatives. For this purpose, important but general classes of divergences are identified, widely following and extending the scheme introduced by Cichocki, Zdunek, Phan, and Amari (2009). The mathematical framework for functional derivatives of continuous divergences is given by the functional-analytic generalization of common derivatives—the concept of Fréchet derivatives (Frigyik, Srivastava, & Gupta, 2008b; Kantorowitsch & Akilow, 1978). This can be seen as a generalization of partial derivatives for discrete variants of the divergences. The functional approach is here preferred for clarity. Yet it also offers greater flexibility in specific variants of functional data processing (Villmann, Haase, Simmuteit, Haase, & Schleif, 2010).

After characterizing the different classes of divergences and introducing Fréchet derivatives, we apply this framework to several divergences and divergence classes to obtain generalized derivatives, which can be used for online learning in divergence-based methods for supervised and unsupervised vector quantization as well as other gradient-based approaches. We explicitly explore the derivatives to provide examples.

Then we consider some of the most prominent approaches for unsupervised as well as supervised prototype-based vector quantization in the light of divergence-based online learning using Fréchet derivatives, including self-organizing maps (SOM), neural gas (NG), and generalized learning vector quantization (GLVQ). For the supervised GLVQ approach, we also provide a gradient learning scheme for hyperparameter adaptation, optimizing parameters that occur in the case of parameterized divergences.

The focus of the letter is mainly on giving a unified framework for the application of a wide range of divergences and classes thereof in gradient-based online vector quantization, together with their mathematical foundation. We formulate the problem in a functional manner following the approaches in Frigyik et al. (2008b), Csiszár (1967), and Liese and Vajda (2006). This allows a compact description of the mathematical theory based on the concept of Fréchet derivatives. We also note that the functional approach includes a larger class of divergence functionals than the discrete (pointwise) approach, as Frigyik, Srivastava, and Gupta (2008a) point out. Besides these extensions, the functional approach using Fréchet derivatives obviously reduces to partial derivatives in the discrete case. We therefore prefer the functional approach in this letter.

However, as a proof of concept, we demonstrate the utilization of several classes of parameterized divergences in SOM learning for an artificial but illustrative example, in comparison to standard Euclidean distance learning.

2.  Characterization of Divergences

Generally, divergences are functionals designed for determining a similarity between nonnegative integrable measure functions p and ρ with a domain V and the constraints p(x) ⩽ 1 and ρ(x) ⩽ 1 for all x ∈ V. We denote such measure functions as positive measures. The weight of the functional p is defined as
2.1
W(p) = ∫V p(x) dx.
Positive measures p with weight W(p) = 1 are denoted as (probability) density functions.1

Divergences D(p||ρ) are defined as functionals that have to be nonnegative and zero iff p ≡ ρ except on a set of measure zero. Further, D(p||ρ) is required to be convex with respect to the first argument. Yet divergences are neither necessarily symmetric nor have to fulfill the triangle inequality, as is required for metrics. According to the classification given in Cichocki et al. (2009), one can distinguish at least three main classes of divergences emphasizing different properties: Bregman divergences, Csiszár's f-divergences, and γ-divergences. We state some basic properties of these but do not go into detail because this would be outside the scope of the letter. (For detailed property investigations, see Cichocki & Amari, 2010, and Cichocki et al., 2009.)

We generally assume that p and ρ are positive measures (densities) that are not necessarily normalized. In the case of normalized densities, we explicitly refer to these as probability densities.

2.1.  Bregman Divergences.

Bregman divergences are defined by generating convex functions Φ in the following way using a functional interpretation (Bregman, 1967; Frigyik et al., 2008b).

Let Φ be a strictly convex real-valued function whose domain is the space of Lebesgue-integrable functions. Further, Φ is assumed to be twice continuously Fréchet differentiable (Kantorowitsch & Akilow, 1978). A Bregman divergence is defined as
2.2
DBΦ(p||ρ) = Φ(p) − Φ(ρ) − (δΦ(ρ)/δρ)[p − ρ],
whereby δΦ(ρ)/δρ is the Fréchet derivative of Φ with respect to ρ (see section 3.1).

The Bregman divergence DBΦ(p||ρ) can be interpreted as a measure of convexity of the generating function Φ. Taking p and ρ as points in a functional space, DBΦ(p||ρ) plays the role of vertical distance between p and the tangential hyperplane to the graph of Φ at point ρ, which is illustrated in Figure 1.

Figure 1:

Illustration of the Bregman divergence as a vertical distance between p and the tangential hyperplane to the graph of Φ at point ρ, taking p and ρ as points in a functional space.


Bregman divergences are linear in the generating function Φ: for generating functions Φ1, Φ2 and c > 0,
DBΦ1+cΦ2(p||ρ) = DBΦ1(p||ρ) + c · DBΦ2(p||ρ).
Further, DBΦ(p||ρ) is invariant under affine transforms of the generating function, Γ(q) = Φ(q) + Ψg[q] + ξ for positive measures g and q, whereby Ψg is supposed to be a linear operator independent of q (Frigyik et al., 2008a) and ξ is a scalar. In that case,
DBΓ(p||ρ) = DBΦ(p||ρ)
is valid. Further, the generalized Pythagorean theorem holds for any triple p, ρ, τ of positive measures:
DBΦ(p||τ) = DBΦ(p||ρ) + DBΦ(ρ||τ) + (δΦ(ρ)/δρ − δΦ(τ)/δτ)[p − ρ].
The sensitivity of a Bregman divergence at p is defined as
2.3
s(p, τ) = (δ²Φ(p)/δp²)[τ, τ],
with the perturbation direction τ obeying the restriction ∫τ(x) dx = 0 (Santos-Rodríguez, Guerrero-Curieses, Alaiz-Rodríguez, & Cid-Sueiro, 2009). Note that δ²Φ(p)/δp² is the Hessian of the generating function. The sensitivity s(p, τ) measures the velocity of change of the divergence at point p in the direction of τ.

A last property mentioned here is an optimality property (Banerjee et al., 2005). Given a set S of positive measures p with the (functional) mean μ = E[p ∈ S] and the additional restriction that μ lies in the relative interior of S,2 then for given p ∈ S, the unique minimizer of Ep[DBΦ(p||ρ)] is ρ = μ. The inverse direction of this statement is also true: if Ep[DBΦ(p||ρ)] is minimal for ρ = μ, then DBΦ(p||ρ) is a Bregman divergence. This property predestines Bregman divergences for clustering problems.

Finally, we give some important examples:

•
Generalized Kullback-Leibler divergence for non-normalized p and ρ (Cichocki et al., 2009):
2.4
DGKL(p||ρ) = ∫ p(x) · log(p(x)/ρ(x)) dx − ∫ (p(x) − ρ(x)) dx,
with the generating function
Φ(f) = ∫ (f(x) · log f(x) − f(x)) dx.
If p and ρ are normalized densities (probability densities), DGKL(p||ρ) reduces to the usual Kullback-Leibler divergence (Kullback & Leibler, 1951; Kapur, 1994),
2.5
DKL(p||ρ) = ∫ p(x) · log(p(x)/ρ(x)) dx,
which is related to the Shannon entropy (Shannon, 1948),
2.6
H(p) = −∫ p(x) · log p(x) dx,
via
DKL(p||ρ) = Cr(p||ρ) − H(p),
where
Cr(p||ρ) = −∫ p(x) · log ρ(x) dx
is Shannon's cross-entropy.
•
Itakura-Saito divergence (Itakura & Saito, 1973),
2.7
DIS(p||ρ) = ∫ (p(x)/ρ(x) − log(p(x)/ρ(x)) − 1) dx,
based on the Burg entropy,
HB(p) = −∫ log p(x) dx,
which also serves as the generating function
Φ(f) = −∫ log f(x) dx.
The Itakura-Saito divergence is also known as negative cross-Burg entropy and fulfills the scale-invariance property, that is, DIS(c · p||c · ρ) = DIS(p||ρ). So the same relative weight is given to low- and high-energy components of p (Bertin, Fevotte, & Badeau, 2009). Due to this, the Itakura-Saito divergence is frequently applied in image processing and sound processing.
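To make these definitions concrete, the following is a minimal numerical sketch (ours, not from the letter) of the generalized Kullback-Leibler and Itakura-Saito divergences for discretized positive measures, where the integrals over the domain V become sums over vector components:

```python
import numpy as np

def gkl(p, rho):
    """Generalized Kullback-Leibler divergence (eq. 2.4), discretized:
    integrals over the domain V become sums over vector components."""
    return np.sum(p * np.log(p / rho)) - np.sum(p - rho)

def itakura_saito(p, rho):
    """Itakura-Saito divergence (eq. 2.7), discretized."""
    r = p / rho
    return np.sum(r - np.log(r) - 1.0)

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 1.0, size=50)    # positive measures, not normalized
rho = rng.uniform(0.1, 1.0, size=50)

# Both divergences are nonnegative and vanish iff p == rho.
assert gkl(p, rho) > 0 and np.isclose(gkl(p, p), 0.0)
assert itakura_saito(p, rho) > 0 and np.isclose(itakura_saito(p, p), 0.0)

# Scale invariance of Itakura-Saito: D_IS(c*p || c*rho) = D_IS(p || rho).
c = 7.3
assert np.isclose(itakura_saito(c * p, c * rho), itakura_saito(p, rho))
```

The scale-invariance assertion holds exactly up to floating-point precision, mirroring the property stated above.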
•
The Euclidean distance in terms of a Bregman divergence is obtained by the generating function
Φ(f) = ∫ f(x)² dx.
We extend this definition and introduce the parameterized version,
Φη(f) = ∫ f(x)^η dx,
defining the η-divergence, also known as norm-like divergence (Nielsen & Nock, 2009):
2.8
Dη(p||ρ) = ∫ (p(x)^η + (η − 1) · ρ(x)^η − η · p(x) · ρ(x)^(η−1)) dx,
which yields the squared Euclidean distance for η = 2. To ensure the convexity of Φη(f), the restriction η > 1 is required.

If we assume that p and ρ are positive measures, then an important subset of Bregman divergences belongs to the class of β-divergences (Eguchi & Kano, 2001), which are defined, following Cichocki et al. (2009), as
2.9
Dβ(p||ρ) = ∫ p(x) · (p(x)^(β−1) − ρ(x)^(β−1))/(β − 1) dx
2.10
    − ∫ (p(x)^β − ρ(x)^β)/β dx,
with β ≠ 1 and β ≠ 0 and with the generating function
Φ(f) = ∫ (f(x)^β − β · f(x) + β − 1)/(β(β − 1)) dx.
In the limit β → 1, the divergence Dβ(p||ρ) becomes the generalized Kullback-Leibler divergence (see equation 2.4).3 The limit β → 0 gives the Itakura-Saito divergence (see equation 2.7). Further, β-divergences are equivalent, up to reparameterization, to the density power divergences introduced in Basu, Harris, Hjort, and Jones (1998).
Obviously, the η-divergence (see equation 2.8) is a rescaled version of the β-divergence:
Dη(p||ρ) = η(η − 1) · Dβ(p||ρ) for β = η.
Thus, we see that for β = 2, the β-divergence Dβ(p||ρ) becomes (half) the squared Euclidean distance.
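The limit behavior of the β-divergence can be checked numerically; the following sketch (our own, with discretized measures) confirms the three special cases just stated:

```python
import numpy as np

def beta_div(p, rho, beta):
    """Discretized beta-divergence (eqs. 2.9-2.10), beta != 0, 1."""
    t1 = np.sum(p * (p**(beta - 1) - rho**(beta - 1))) / (beta - 1)
    t2 = np.sum(p**beta - rho**beta) / beta
    return t1 - t2

def gkl(p, rho):                       # generalized Kullback-Leibler, eq. 2.4
    return np.sum(p * np.log(p / rho)) - np.sum(p - rho)

def itakura_saito(p, rho):             # Itakura-Saito, eq. 2.7
    r = p / rho
    return np.sum(r - np.log(r) - 1.0)

rng = np.random.default_rng(1)
p = rng.uniform(0.2, 1.0, 30)
rho = rng.uniform(0.2, 1.0, 30)

# beta -> 1 recovers the generalized Kullback-Leibler divergence,
# beta -> 0 the Itakura-Saito divergence, and beta = 2 gives half
# the squared Euclidean distance.
assert np.isclose(beta_div(p, rho, 1 + 1e-7), gkl(p, rho), rtol=1e-4)
assert np.isclose(beta_div(p, rho, 1e-7), itakura_saito(p, rho), rtol=1e-4)
assert np.isclose(beta_div(p, rho, 2.0), 0.5 * np.sum((p - rho)**2))
```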

2.2.  Csiszár's f-Divergences.

Csiszár's f-divergences are defined for real-valued, convex, continuous functions f: [0, ∞) → ℝ with f(1) = 0 (without loss of generality).
The f-divergences Df for positive measures p and ρ are given by
2.11
Df(p||ρ) = ∫ ρ(x) · f(p(x)/ρ(x)) dx,
with the conventions 0 · f(0/0) := 0 and 0 · f(a/0) := a · lim_{u→∞} f(u)/u (Csiszár, 1967; Liese & Vajda, 2006; Taneja & Kumar, 2004). f is called the determining function for Df(p||ρ). It corresponds to a generalized f-entropy (Cichocki et al., 2009) of the form
2.12
Hf(p) = c − ∫ f(p(x)) dx
via
2.13
Df(p||𝟙) = c − Hf(p),
with 𝟙 being the constant function of value 1 and c a divergence-depending constant (Cichocki & Amari, 2010).
The f-divergence Df can be interpreted as an average (with respect to ρ) of the likelihood ratio u = p/ρ describing the change rate of p with respect to ρ, weighted by the determining function f. Df(p||ρ) is jointly convex in both p and ρ. Further, f defines an equivalence class in the sense that Df̃(p||ρ) = Df(p||ρ) iff f̃(x) = f(x) + c · (x − 1), that is, Df(p||ρ) is invariant under a linear shift of the determining function f. For f-divergences, a certain kind of symmetry can be stated. Let f*(x) = x · f(1/x) for x ∈ (0, ∞) be the conjugate function of f. Then the relation Df(p||ρ) = Df(ρ||p) is valid iff the conjugate differs from the original by a linear shift as above: f(x) = f*(x) + c · (x − 1). A symmetric divergence can be obtained for an arbitrary convex function g using its conjugate g* in the definition f = g + g* of the determining function. Further, the conjugate is important for an upper bound of the divergence. Let f(0) = lim_{u↘0} f(u) and p as well as ρ be densities. Then the f-divergence is bounded by
2.14
0 ⩽ Df(p||ρ) ⩽ f(0) + f*(0),
if the limits exist, as was shown in Liese and Vajda (1987). Yet this statement can be extended to p and ρ being positive measures:

Lemma 1.

Let p and ρ be positive measures. Then the bounds given in equation 2.14 are still valid.

Proof.

The proof is given in appendix B.

An important and characterizing property is the monotonicity with respect to coarse graining of the underlying domain of the positive measures p and ρ, which is similar to the monotonicity of the Fisher metric (Amari & Nagaoka, 2000). Let κ(y|x) describe a transition probability density, that is, ∫κ(y|x) dy = 1 holds ∀x, with y ranging over the transformed domain. Denoting the positive measures of y derived from p(x) and ρ(x) by pκ(y) and ρκ(y), the monotonicity is expressed by Df(p||ρ) ⩾ Df(pκ||ρκ).4 Further, an isomorphism can be stated for f-divergences in the following way. Let
2.15
y = h(x)
be an invertible function transforming positive measures p1(x) and ρ1(x) to p2(y) and ρ2(y). Then Df(p1||ρ1) = Df(p2||ρ2) holds, and the pairs (p1, ρ1) and (p2, ρ2) are called isomorph (Liese & Vajda, 1987). Conversely, if a measure D(p||ρ) = ∫ρ(x) · G(p(x), ρ(x)) dx for an integrable function G is invariant under invertible transformations h, then D is an f-divergence (Qiao & Minematsu, 2008). This isomorphism, as well as the monotonicity, recommends f-divergences for application in speech, signal, and pattern recognition (Basseville, 1988; Qiao & Minematsu, 2008). Finally, Cichocki et al. (2009) suggested a generalization of the f-divergences Df in which f is no longer required to be convex, denoted as a generalized f-divergence. It is proposed to be
2.16
with cf = f′(1) ≠ 0. As a consequence of this relaxation of the convexity condition, in the case of p and ρ being probability densities, the first term vanishes, such that the usual form of f-divergences is obtained. Thus, as a famous example, the Hellinger divergence (Taneja & Kumar, 2004) is
2.17
DH(p||ρ) = ∫ (√p(x) − √ρ(x))² dx,
with the determining function f(u) = (√u − 1)² for u = p/ρ. According to Cichocki et al. (2009), DH(p||ρ) is a properly defined f-divergence only for probability densities p and ρ.
As the β-divergences in the case of Bregman divergences, one can identify an important subset of the f-divergences: the so-called α-divergences according to the definition given in Cichocki et al. (2009):
2.18
Dα(p||ρ) = (1/(α(α − 1))) · ∫ (p(x)^α · ρ(x)^(1−α) − α · p(x) + (α − 1) · ρ(x)) dx,
2.19
with the determining f-function
f(u) = (u^α − α · u + α − 1)/(α(α − 1)),
u = p/ρ, and α ≠ 0, 1. In the limit α → 1, the generalized Kullback-Leibler divergence DGKL (see equation 2.4) is obtained. Further, Cichocki et al. (2009) state that β-divergences can be generated from α-divergences by applying suitable nonlinear transforms.

In addition to the general properties of the f-divergences stated here, one can derive a characteristic behavior for the α-divergences directly from equation 2.18 depending on the choice of the parameter α (Minka, 2005). For α ≪ 0 the minimization of Dα(p||ρ) to estimate ρ(x) may exclude modes of the target p(x). Further, for α ⩽ 0, the α-divergence is zero-forcing (i.e., p(x) = 0 forces ρ(x) = 0), while for α ⩾ 1, it is zero-avoiding (i.e., ρ(x)>0 whenever p(x)>0). For α → ∞, ρ(x) covers p(x) completely, and the α-divergence is called inclusive in that case.
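A small numerical check (ours, again with discretized positive measures) of the α-divergence and its α → 1 limit:

```python
import numpy as np

def alpha_div(p, rho, alpha):
    """Discretized alpha-divergence (eq. 2.18), alpha != 0, 1."""
    integrand = p**alpha * rho**(1 - alpha) - alpha * p + (alpha - 1) * rho
    return np.sum(integrand) / (alpha * (alpha - 1))

def gkl(p, rho):                       # generalized Kullback-Leibler, eq. 2.4
    return np.sum(p * np.log(p / rho)) - np.sum(p - rho)

rng = np.random.default_rng(2)
p = rng.uniform(0.2, 1.0, 40)
rho = rng.uniform(0.2, 1.0, 40)

# Nonnegative, and the limit alpha -> 1 recovers the generalized
# Kullback-Leibler divergence.
assert alpha_div(p, rho, 0.5) > 0
assert np.isclose(alpha_div(p, rho, 1 + 1e-6), gkl(p, rho), rtol=1e-4)
```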

The Tsallis divergence is a widely applied divergence related to the α-divergence (see equation 2.18); however, it is defined only for probability densities:
2.20
DTα(p||ρ) = (1/(α − 1)) · (∫ p(x)^α · ρ(x)^(1−α) dx − 1),
with the convention
2.21
logα(x) = (x^(1−α) − 1)/(1 − α),
such that
2.22
DTα(p||ρ) = −∫ p(x) · logα(ρ(x)/p(x)) dx,
and α ≠ 1. Obviously, this is a rescaled version of the α-divergence (see equation 2.18), which holds only for probability densities (Cichocki & Amari, 2010):
2.23
DTα(p||ρ) = α · Dα(p||ρ).
The Tsallis divergence is based on the Tsallis entropy,
2.24
HTα(p) = (1/(α − 1)) · (1 − ∫ p(x)^α dx)
2.25
    = −∫ p(x)^α · logα(p(x)) dx,
with logα(p) as defined in equation 2.21. In the limit α → 1, HTα(p) becomes the Shannon entropy (see equation 2.6), and the divergence DTα(p||ρ) converges to the Kullback-Leibler divergence (see equation 2.5).
Further, the α-divergences are closely related to the generalized Rényi divergences, defined as (Amari, 1985; Cichocki et al., 2009)
2.26
DGRα(p||ρ) = (1/(α − 1)) · log(∫ (p(x)^α · ρ(x)^(1−α) − α · p(x) + (α − 1) · ρ(x)) dx + 1)
for positive measures ρ and p. Lemma 1 can be used to write the generalized Rényi divergence in terms of the α-divergence:5
2.27
DGRα(p||ρ) = (1/(α − 1)) · log(α(α − 1) · Dα(p||ρ) + 1).
For probability densities, DGRα(p||ρ) reduces to the usual Rényi divergence (Renyi, 1961, 1970):
2.28
DRα(p||ρ) = (1/(α − 1)) · log ∫ p(x)^α · ρ(x)^(1−α) dx.
The divergence DRα(p||ρ) is based on the Rényi entropy,
2.29
HRα(p) = (1/(1 − α)) · log ∫ p(x)^α dx,
via equation 2.13. The Rényi entropy fulfills the additivity property for independent probabilities p and q:
HRα(p · q) = HRα(p) + HRα(q).
Further, the entropy HRα(p) is related to the Tsallis entropy (see equation 2.25) by
HRα(p) = (1/(1 − α)) · log(1 + (1 − α) · HTα(p)),
which, however, has in consequence a different (pseudo-additive) subadditivity property,
HTα(p · q) = HTα(p) + HTα(q) + (1 − α) · HTα(p) · HTα(q),
for α ≠ 1.
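The stated relation between Rényi and Tsallis entropies, and the common Shannon limit, are easy to verify numerically; a small sketch (ours) for a discrete probability vector:

```python
import numpy as np

def tsallis_entropy(p, alpha):
    """Tsallis entropy (eq. 2.25) of a discrete probability vector p."""
    return (1.0 - np.sum(p**alpha)) / (alpha - 1.0)

def renyi_entropy(p, alpha):
    """Renyi entropy (eq. 2.29) of a discrete probability vector p."""
    return np.log(np.sum(p**alpha)) / (1.0 - alpha)

p = np.array([0.1, 0.2, 0.3, 0.4])
alpha = 2.5

# Relation between the two entropies:
# H_R = log(1 + (1 - alpha) * H_T) / (1 - alpha)
ht = tsallis_entropy(p, alpha)
hr = renyi_entropy(p, alpha)
assert np.isclose(hr, np.log(1.0 + (1.0 - alpha) * ht) / (1.0 - alpha))

# Both converge to the Shannon entropy for alpha -> 1.
shannon = -np.sum(p * np.log(p))
assert np.isclose(tsallis_entropy(p, 1 + 1e-8), shannon, rtol=1e-5)
assert np.isclose(renyi_entropy(p, 1 + 1e-8), shannon, rtol=1e-5)
```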

2.3.  γ-Divergences.

A class of divergences that is robust with respect to outliers has been proposed by Fujisawa and Eguchi (2008).6 Called γ-divergences, they are defined for positive measures ρ and p as
2.30
Dγ(p||ρ) = (1/(γ(γ + 1))) · log ∫ p(x)^(γ+1) dx + (1/(γ + 1)) · log ∫ ρ(x)^(γ+1) dx − (1/γ) · log ∫ p(x) · ρ(x)^γ dx
2.31
    = (1/γ) · log [ (∫ p(x)^(γ+1) dx)^(1/(γ+1)) · (∫ ρ(x)^(γ+1) dx)^(γ/(γ+1)) / ∫ p(x) · ρ(x)^γ dx ].
The divergence Dγ(p||ρ) is invariant under scalar multiplication with positive constants c1 and c2:
Dγ(c1 · p||c2 · ρ) = Dγ(p||ρ).
The equation Dγ(p||ρ) = 0 holds only if p = c · ρ (c > 0) in the case of positive measures. Yet for probability densities, c = 1 is required. In contrast to the f-divergences, an isomorphism can be stated here only for h-transformations (see equation 2.15) that are more strictly assumed to be affine.
As for Bregman divergences, a modified Pythagorean relation between positive measures can be stated for special choices of positive measures p, ρ, τ. Let pε be a distortion of ρ defined as a convex combination with a positive distortion measure δ:
pε = (1 − ε) · ρ + ε · δ.
Further, a positive measure g is denoted as δ-consistent if
νg = ∫ δ(x) · g(x)^γ dx
is sufficiently small. If two positive measures ρ and τ are δ-consistent according to a distortion measure δ, then the Pythagorean relation approximately holds for ρ, τ, and the distortion pε of ρ,
Dγ(pε||τ) = Dγ(pε||ρ) + Dγ(ρ||τ) + O(ε · ν),
with ν = max{νρ, ντ}. This property implies the robustness of γ-divergences with respect to distortions according to the resulting approximation,
Dγ(pε||τ) ≈ Dγ(pε||ρ) + Dγ(ρ||τ),
and Dγ(pε||ρ) should be small because pε is assumed to be a distortion of ρ (Fujisawa & Eguchi, 2008).
In the limit γ → 0, Dγ(p||ρ) becomes the usual Kullback-Leibler divergence (see equation 2.5) applied to the normalized densities:
lim_{γ→0} Dγ(p||ρ) = DKL(p/W(p) || ρ/W(ρ)).
For γ = 1, the γ-divergence becomes the Cauchy-Schwarz divergence,
2.32
DCS(p||ρ) = (1/2) · log(V(p, p) · V(ρ, ρ)) − log V(p, ρ),
with
2.33
V(p, ρ) = ∫ p(x) · ρ(x) dx
being the cross-correlation potential. The Cauchy-Schwarz divergence DCS(p||ρ) was introduced by Principe, Xu, and Fisher (2000) considering the Cauchy-Schwarz inequality for norms. It is based on the quadratic Rényi entropy HR2(p) from equation 2.29 (Jenssen, 2005). Obviously, DCS(p||ρ) is symmetric. It is frequently applied for Parzen window estimation and is particularly suitable for spectral clustering as well as for related graph cut problems (Jenssen, Principe, Erdogmus, & Eltoft, 2006).
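A numerical sketch (ours) of the discretized γ-divergence, checking the scale invariance and the symmetry of the Cauchy-Schwarz case γ = 1:

```python
import numpy as np

def gamma_div(p, rho, gamma):
    """Discretized gamma-divergence (eq. 2.30), gamma > 0."""
    t1 = np.log(np.sum(p**(gamma + 1))) / (gamma * (gamma + 1))
    t2 = np.log(np.sum(rho**(gamma + 1))) / (gamma + 1)
    t3 = np.log(np.sum(p * rho**gamma)) / gamma
    return t1 + t2 - t3

rng = np.random.default_rng(3)
p = rng.uniform(0.1, 1.0, 60)
rho = rng.uniform(0.1, 1.0, 60)

# Invariance under scalar multiplication with positive constants c1, c2.
assert np.isclose(gamma_div(3.0 * p, 0.5 * rho, 1.7), gamma_div(p, rho, 1.7))

# gamma = 1 yields the Cauchy-Schwarz divergence (eq. 2.32), which is
# symmetric and nonnegative by the Cauchy-Schwarz inequality.
assert np.isclose(gamma_div(p, rho, 1.0), gamma_div(rho, p, 1.0))
assert gamma_div(p, rho, 1.0) >= 0.0
```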

3.  Derivatives of Divergences: A Functional Analytic Approach

In this section, we provide the mathematical formalism of generalized derivatives for functionals of p and ρ, known as Fréchet derivatives or functional derivatives. First, we briefly review the theory of functional derivatives. Then we investigate the divergence classes within this framework. In particular, we derive their Fréchet derivatives.

3.1.  Functional (Fréchet) Derivatives.

Suppose X and Y are Banach spaces, U ⊂ X is open, and F: X → Y. F is called Fréchet differentiable at x ∈ U if there exists a bounded linear operator A: X → Y such that for h ∈ X, the limit
lim_{‖h‖→0} ‖F(x + h) − F(x) − A[h]‖_Y / ‖h‖_X = 0
holds. This general definition can be specialized to functional mappings. Let L be a functional mapping from a linear functional Banach space B to ℝ. Further, let B be equipped with a norm ‖ · ‖, and let f, h ∈ B be two functionals. The Fréchet derivative of L at point f is formally defined as
(δL[f]/δf)[h] = lim_{ε→0} (L[f + εh] − L[f])/ε,
with (δL[f]/δf)[h] linear in h. The existence and continuity of the limit are equivalent to the existence and continuity of the derivative. (For a detailed introduction, see Kantorowitsch & Akilow, 1978.)
If L is linear, then L[f + εh] − L[f] = εL[h] and, hence, (δL[f]/δf)[h] = L[h]. Further, an analogon of the chain rule known from differential calculus can be stated: let F: ℝ → ℝ be a continuously differentiable mapping. We consider the functional
L[f] = ∫ F(f(x)) dx.
Then the Fréchet derivative is determined by the derivative F′, as can be seen from
(L[f + εh] − L[f])/ε = ∫ (F(f(x) + εh(x)) − F(f(x)))/ε dx → ∫ F′(f(x)) · h(x) dx for ε → 0
and use of the linearity of the integral operator.

This property motivates an important remark about divergences, which can be seen as special integral operators:

Remark 1.

Let Lg be an integral operator Lg[f] = ∫Fg(f(x)) dx depending on a fixed functional g ∈ B. Then the Fréchet derivative is determined by the integral kernel Fg(f(x)) = Q(g(x), f(x)), being a function in x. Therefore, the Fréchet derivative is frequently identified simply with Q(g(x), f(x)) and written as δLg[f]/δf, keeping in mind its original interpretation as an integral kernel defining the integral operator. We will make use of this abbreviation in the following, considering divergences as integral operators D(p||ρ) = Lp[ρ], and write δD(p||ρ)/δρ, also denoted here as Fréchet derivative, for simplicity.

Finally, we remark that the Fréchet derivative in finite-dimensional spaces reduces to the usual partial derivative. In particular, it is represented in coordinates by the Jacobi matrix. Thus, the Fréchet derivative is a generalization of the directional derivatives.

3.2.  Fréchet Derivatives for the Different Divergence Classes.

We are now ready to investigate functional derivatives of divergences. In particular we focus on Fréchet derivatives.

3.2.1.  Bregman Divergences.

We investigate the Fréchet derivative for the Bregman divergences (see equation 2.2) and formally obtain
3.1
δDBΦ(p||ρ)/δρ = −(δ²Φ(ρ)/δρ²)[p − ρ],
with δ²Φ(ρ)/δρ² being the second Fréchet derivative (the Hessian) of the generating function Φ.
In the case of the generalized Kullback-Leibler divergence (see equation 2.4), this reads as
3.2
δDGKL(p||ρ)/δρ = 1 − p/ρ,
whereas for the usual Kullback-Leibler divergence, equation 2.5,
3.3
δDKL(p||ρ)/δρ = −p/ρ
is obtained.
For the Itakura-Saito divergence, equation 2.7, we get
3.4
δDIS(p||ρ)/δρ = (1/ρ) · (1 − p/ρ).
The η-divergence, equation 2.8, leads to
3.5
δDη(p||ρ)/δρ = −η(η − 1) · ρ^(η−2) · (p − ρ),
which reduces in the case of η = 2 to the derivative of the squared Euclidean distance, −2(p − ρ), commonly used in many vector quantization algorithms, including the online variant of k-means, SOMs, NG, and so on.
Further, for the subset of β-divergences, equation 2.9, we have
3.6
δDβ(p||ρ)/δρ = ρ^(β−1) − p · ρ^(β−2)
3.7
    = −ρ^(β−2) · (p − ρ).
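Since the Fréchet derivative reduces to the partial derivative in the discrete case, the closed-form expressions above can be validated by finite differences. A sketch (ours) for the η-divergence and equation 3.5, including the Euclidean special case η = 2:

```python
import numpy as np

def eta_div(p, rho, eta):
    """Discretized eta-divergence (eq. 2.8)."""
    return np.sum(p**eta + (eta - 1) * rho**eta - eta * p * rho**(eta - 1))

def eta_div_grad(p, rho, eta):
    """Frechet derivative (eq. 3.5), reduced to a partial derivative."""
    return -eta * (eta - 1) * rho**(eta - 2) * (p - rho)

rng = np.random.default_rng(4)
p = rng.uniform(0.2, 1.0, 10)
rho = rng.uniform(0.2, 1.0, 10)
eta = 1.5

# Compare the closed-form gradient with central finite differences.
h = 1e-6
num_grad = np.empty_like(rho)
for j in range(len(rho)):
    e = np.zeros_like(rho); e[j] = h
    num_grad[j] = (eta_div(p, rho + e, eta) - eta_div(p, rho - e, eta)) / (2 * h)
assert np.allclose(num_grad, eta_div_grad(p, rho, eta), atol=1e-5)

# For eta = 2 the gradient is the familiar -2(p - rho) of the
# squared Euclidean distance.
assert np.allclose(eta_div_grad(p, rho, 2.0), -2.0 * (p - rho))
```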

3.2.2.  f-Divergences.

For f-divergences, equation 2.11, the Fréchet derivative is
3.8
δDf(p||ρ)/δρ = f(u) − u · f′(u),
with u = p/ρ. As a famous example, we get for the Hellinger divergence, equation 2.17,
3.9
δDH(p||ρ)/δρ = 1 − √(p/ρ).
The subset of α-divergences, equation 2.18, can be handled by
3.10
δDα(p||ρ)/δρ = −(1/α) · ((p/ρ)^α − 1).
The related Tsallis divergence DTα, equation 2.22, leads to the derivative
3.11
δDTα(p||ρ)/δρ = −(p/ρ)^α,
depending on the parameter α. The generalized Rényi divergences, equation 2.26, are treated according to
3.12
δDGRα(p||ρ)/δρ = (1 − (p/ρ)^α) / (∫ (p(x)^α · ρ(x)^(1−α) − α · p(x) + (α − 1) · ρ(x)) dx + 1),
which is reduced to
3.13
δDRα(p||ρ)/δρ = −(p/ρ)^α / ∫ p(x)^α · ρ(x)^(1−α) dx
in the case of the usual Rényi divergences, equation 2.28.

3.2.3.  γ-Divergences.

For the γ-divergences, we rewrite equation 2.30 as
Dγ(p||ρ) = (1/(γ(γ + 1))) · log F1(p) + (1/(γ + 1)) · log F2(ρ) − (1/γ) · log F3(p, ρ),
with F1(p) = ∫p(x)^(γ+1) dx, F2(ρ) = ∫ρ(x)^(γ+1) dx, and F3(p, ρ) = ∫p(x) · ρ(x)^γ dx. Then we get
δF2(ρ)/δρ = (γ + 1) · ρ^γ
and
δF3(p, ρ)/δρ = γ · p · ρ^(γ−1),
such that the chain rule finally yields
3.14
δDγ(p||ρ)/δρ = ρ^γ/F2(ρ) − p · ρ^(γ−1)/F3(p, ρ)
3.15
    = ρ^γ/∫ρ(x)^(γ+1) dx − p · ρ^(γ−1)/∫p(x) · ρ(x)^γ dx.
Considering the important special case γ = 1, the Fréchet derivative of the Cauchy-Schwarz divergence, equation 2.32, is derived:
3.16
δDCS(p||ρ)/δρ = ρ/V(ρ, ρ) − p/V(p, ρ).

4.  Divergence-Based Online Vector Quantization Using Derivatives

Supervised and unsupervised vector quantization are frequently described in terms of dissimilarities or distances. Suppose data are given as data vectors vi, i = 1, …, N.

Here we focus on prototype-based vector quantization: data processing (clustering or classification) is realized using prototypes as representatives, whereby the dissimilarity between data points, as well as between data and prototypes, is determined by dissimilarity measures ξ (not necessarily fulfilling triangle inequality or symmetry restrictions).

Frequently, such algorithms optimize a cost function E depending on the dissimilarity between the data points and the prototypes; usually one has E = E(ξ(vi, wk)) with i = 1, …, N indexing the data and k = 1, …, C the prototypes. This cost function can be a variant of the usual classification error in supervised learning or a modified mean squared error of the dissimilarities ξ(vi, wk).

If E = E(ξ(vi, wk)) is differentiable with respect to ξ, and ξ is differentiable with respect to the prototype w, then stochastic gradient minimization is a widely used optimization scheme for E. This methodology implies the calculation of the dissimilarity derivatives ∂ξ(v, w)/∂w, which now have to be considered in light of the above functional-analytic investigations for divergence measures (i.e., we replace the dissimilarity measure ξ by divergences).

Therefore, we now assume that the data vectors are discrete representations of continuous positive measures p(x) with vi = p(xi), i = 1, …, N, as required for divergences. Such data may be spectra or other frequency data occurring in many kinds of applications like remote-sensing data analysis, mass spectrometry, or signal processing. Thereby, the restriction vi ∈ [0, 1] for positive measures can be fulfilled simply by dividing all data vectors by the maximum vector entry, taken over all vectors and vector components of the data set. In the case of probability densities, a subsequent normalization to ensure ‖v‖1 = 1 is required.

Further, we also identify the prototypes as discrete realizations of positive measures ρ(x). Then the derivative ∂ξ(v, w)/∂w has to be replaced by the (abbreviated) Fréchet derivative δD(p||ρ)/δρ in the continuous case (see remark 1), which reduces to usual partial derivatives in the discrete case. This is formally achieved by replacing p and ρ by their vectorial counterparts v and w in the formulas of the divergences provided in section 3.2 and translating integrals into sums.

In the following, we give prominent examples of unsupervised and supervised vector quantization, which can be optimized by gradient methods using the framework already introduced.

4.1.  Unsupervised Vector Quantization.

4.1.1.  Basic Vector Quantization.

Unsupervised vector quantization is a class of algorithms for distributing prototypes W = {wk}, k ∈ Z, such that data points are faithfully represented in terms of a dissimilarity measure ξ. Thereby, C = card(Z) is the cardinality of the index set Z. More formally, the data point v is represented by the prototype ws(v) minimizing the dissimilarity ξ(v, wk):
4.1
s(v) = argmin_{k∈Z} ξ(v, wk).
The aim of the algorithm is to distribute the prototypes in such a way that the quantization error
4.2
EVQ = (1/2) ∫ P(v) · ξ(v, ws(v)) dv
is minimized, where P(v) is the data density. In its simplest form, basic vector quantization (VQ) leads to a (stochastic) gradient descent on EVQ with
4.3
Δws(v) = −ε · ∂ξ(v, ws(v))/∂ws(v)
for the update of the winning prototype ws(v) according to equation 4.1, also known as the online variant of the LBG algorithm (C-means; Linde, Buzo, & Gray, 1980; Zador, 1982). Here, ε is a small, positive value called the learning rate. As we see, update 4.3 takes into account the derivative of the dissimilarity measure ξ with respect to the prototype. Besides the common choice of ξ being the squared Euclidean distance, the choice is up to the user, with the restriction of differentiability. Hence, we are allowed to apply divergences here, using derivatives in the sense of Fréchet derivatives.
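As a sketch of how such a divergence-based update looks in code (our own illustration, not from the letter), the following replaces the Euclidean derivative in update 4.3 by the Fréchet derivative of the generalized Kullback-Leibler divergence (equation 3.2); the clipping that keeps prototypes positive is an implementation choice of ours:

```python
import numpy as np

def gkl_grad(v, w):
    """Frechet derivative of the generalized Kullback-Leibler divergence
    with respect to the prototype (eq. 3.2), discretized: 1 - v/w."""
    return 1.0 - v / w

def online_vq(data, w_init, eps=0.05, n_steps=5000, seed=0):
    """Basic online vector quantization (eq. 4.3) with the GKL
    divergence in place of the squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    w = w_init.copy()
    for _ in range(n_steps):
        v = data[rng.integers(len(data))]
        # winner determination (eq. 4.1) with the GKL divergence
        d = np.sum(v * np.log(v / w) - (v - w), axis=1)
        s = np.argmin(d)
        # stochastic gradient step; clipping keeps the prototype positive
        w[s] = np.maximum(w[s] - eps * gkl_grad(v, w[s]), 1e-6)
    return w

rng = np.random.default_rng(42)
# two clusters of positive "spectra" (components in (0, 1])
data = np.vstack([rng.uniform(0.2, 0.4, size=(100, 8)),
                  rng.uniform(0.6, 0.9, size=(100, 8))])
w = online_vq(data, w_init=data[[0, 150]])
# one prototype should settle in each cluster
means = np.sort(w.mean(axis=1))
assert means[0] < 0.5 < means[1]
```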

4.1.2.  Self-Organizing Maps and Neural Gas.

There are several variants of the basic vector quantization scheme that avoid local minima or realize a projective mapping. For example, the latter can be obtained by introducing a topological structure A, usually a regular grid, in the index set Z. The resulting vector quantization scheme is the self-organizing map (SOM) introduced by Kohonen (1997). The respective cost function (in the variant of Heskes, 1999) is
4.4
ESOM = (1/(2K(σ))) ∫ P(v) Σ_{r′} hσ(s(v), r′) · ξ(v, wr′) dv,
with the so-called neighborhood function
hσ(r, r′) = exp(−‖r − r′‖²A/(2σ²)),
where ‖r − r′‖A is the distance in A according to the topological structure. K(σ) is a normalization constant depending on the neighborhood range σ. For this SOM, the mapping rule, equation 4.1, is modified to
4.5
s(v) = argmin_r Σ_{r′} hσ(r, r′) · ξ(v, wr′),
which yields in the limit σ → 0 the original mapping (see equation 4.1). The prototype update for all prototypes then is given as (Heskes, 1999)
4.6
Δwr = −ε · hσ(s(v), r) · ∂ξ(v, wr)/∂wr.
As above, the utilization of a divergence-based update is straightforward for SOM as well.
If the aspect of projective mapping can be ignored while keeping the neighborhood cooperativeness to avoid local minima in vector quantization, then the neural gas algorithm (NG), presented by Martinetz, Berkovich, and Schulten (1993), is an alternative to SOM. The cost function of NG to be minimized is
4.7
ENG = (1/(2C(σ))) Σ_j ∫ P(v) · hσ(v, W, j) · ξ(v, wj) dv,
with
4.8
hσ(v, W, j) = exp(−kj(v, W)/σ)
and the rank function
4.9
kj(v, W) = Σ_k θ(ξ(v, wj) − ξ(v, wk)),
where θ is the Heaviside function and C(σ) is a normalization constant. The mapping is realized as in basic VQ (see equation 4.1), and the prototype update for all prototypes is similar to that of SOM:
4.10
Δwj = −ε · hσ(v, W, j) · ∂ξ(v, wj)/∂wj.
Again, the incorporation of divergences is obvious also for NG.
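The NG update 4.10 differs from basic VQ only by the rank-based neighborhood factor; a compact sketch (ours), again with the generalized Kullback-Leibler divergence and an annealed neighborhood range:

```python
import numpy as np

def gkl_grad(v, w):                     # Frechet derivative, eq. 3.2
    return 1.0 - v / w

def ng_step(v, w, eps, sigma):
    """One neural gas update (eq. 4.10): every prototype moves along the
    negative divergence derivative, weighted by its rank (eqs. 4.8, 4.9)."""
    d = np.array([np.sum(v * np.log(v / wk) - (v - wk)) for wk in w])  # GKL
    ranks = np.argsort(np.argsort(d))   # rank k_j(v, W): 0 for the winner
    h = np.exp(-ranks / sigma)          # neighborhood factor, eq. 4.8
    for j in range(len(w)):
        # clipping keeps the prototype a positive measure (our choice)
        w[j] = np.maximum(w[j] - eps * h[j] * gkl_grad(v, w[j]), 1e-6)
    return w

rng = np.random.default_rng(5)
data = rng.uniform(0.2, 1.0, size=(200, 6))
w = data[:4].copy()
for t in range(3000):
    sigma = 2.0 * 0.005 ** (t / 3000)   # annealed neighborhood range
    w = ng_step(data[rng.integers(len(data))], w, eps=0.02, sigma=sigma)
assert w.shape == (4, 6) and np.all(w > 0)
```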
Again, the incorporation of divergences is obvious also for NG.

4.1.3.  Further Vector Quantization Approaches.

There exists a long list of other vector quantization approaches, like kernelized SOMs (Hulle, 2000, 2002a, 2002b), generative topographic mapping (GTM; Bishop, Svensén, & Williams, 1998), and soft topographic mapping (Graepel, Burger, & Obermayer, 1998), to name just a few. Most of them use the Euclidean metric and the respective derivatives for adaptation. Thus, the idea of divergence-based processing can be transferred to these in a similar manner.

A somewhat reverse SOM, the exploration machine (XOM; Wismüller, 2009), has been proposed recently for embedding data into an embedding space S. The XOM can be seen as a structure-preserving projective mapping of the input data into the embedding space and therefore shows similarities to MDS. In the XOM approach, the data points vk, k = 1, …, N, are uniquely associated with prototypes wk in the embedding space, W = {wk}, k = 1, …, N. The dissimilarity ξS in the embedding space is usually chosen to be the quadratic Euclidean metric. Further, a hypothesis about the topological structure of the data vk to be embedded is formulated for the embedding space by defining a probability distribution for so-called sampling vectors s ∈ S. A cost function of XOM can be defined as
4.11
with the mapping rule
4.12
as pointed out in Bunte, Hammer, Villmann, Biehl, and Wismüller (2010). As in usual SOMs, the neighborhood cooperativeness is given in XOMs by a gaussian,
with the data dissimilarity ξV(vk, vj) defined as the Euclidean distance in the original XOM. The update of the prototypes in the embedding space is obtained in complete analogy to SOM as
4.13
As one can see, we can apply divergences to both ξV and ξS. In the case of the latter, the prototype update, equation 4.13, has to be changed accordingly using the respective Fréchet derivatives.

4.2.  Learning Vector Quantization.

Learning vector quantization (LVQ) is the supervised counterpart of basic VQ. Now the data to be learned are equipped with class information cv. Suppose we have K classes; we define cv ∈ [0, 1]^K. If the components of cv sum to 1, the labeling is probabilistic, and possibilistic otherwise. In the case of a probabilistic labeling with cv ∈ {0, 1}^K, the labeling is called crisp.

We now briefly explore how divergences can be used for supervised learning. Again we start with the widely applied basic LVQ approaches and then outline the procedure for some more sophisticated methods without any claim of completeness.

4.2.1.  Basic LVQ Algorithms.

The basic LVQ schemes were invented by Kohonen (1997). For standard LVQ, a crisp data labeling is assumed. Further, the prototypes wj with labels yj correspond to the K classes in such a way that at least one prototype is assigned to each class. For simplicity, we take exactly one prototype for each class. The task is to distribute the prototypes in such a manner that the classification error is reduced. The respective algorithms LVQ1 to LVQ3 are heuristically motivated.

As in unsupervised vector quantization, the similarity between data and prototypes in LVQ is judged by a dissimilarity measure ξ(v, wj). Apart from some small modifications, the basic LVQ schemes LVQ1 to LVQ3 mainly consist of the determination of the most proximate prototype(s) ws(v) for a given v according to the mapping rule, equation 4.1, and a subsequent adaptation. Depending on the agreement of cv and ys(v), the adaptation of the prototype(s) takes place according to
4.14
where α = 1 iff cv = ys(v), and α = −1 otherwise.
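A minimal sketch of the LVQ1 update, equation 4.14, with a pluggable dissimilarity and its gradient (names are hypothetical; the squared Euclidean case is shown in the usage):

```python
import numpy as np

def lvq1_step(v, c_v, W, Y, dissim, dissim_grad, eps=0.05):
    # one online LVQ1 step: find the closest prototype w_s; attract it
    # if its label agrees with c_v (alpha = +1), otherwise repel (alpha = -1)
    s = int(np.argmin([dissim(v, w) for w in W]))
    alpha = 1.0 if Y[s] == c_v else -1.0
    W[s] = W[s] - alpha * eps * dissim_grad(v, W[s])
    return s, W
```

With the squared Euclidean distance, `dissim_grad(v, w) = -2 * (v - w)`, so an agreeing winner moves toward v; a divergence and its Fréchet derivative can be substituted directly.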

A popular generalization of these standard algorithms is the generalized LVQ (GLVQ) introduced by Sato and Yamada (1996). In GLVQ the classification error is replaced by a dissimilarity-based cost function that is closely related to the classification error but not identical to it.

For a given data point v with class label cv, the two best-matching prototypes with respect to the data metric ξ, usually the quadratic Euclidean, are determined: the prototype w+ has minimum distance ξ+ = ξ(v, w+) under the constraint that the class labels are identical, whereas the other best prototype, w−, has minimum distance ξ− = ξ(v, w−) supposing the class labels are different. Then the classifier function μ(v) is defined as
4.15
being negative in the case of a correct classification. The value ξ+ − ξ− yields the hypothesis margin of the classifier (Crammer, Gilad-Bachrach, Navot, & Tishby, 2002). Then generalized LVQ (GLVQ) is derived as a gradient descent on the cost function
4.16
with respect to the prototypes. In each learning step, for a given data point, both w+ and w− are adapted in parallel. Taking the respective derivatives with respect to w+ and w−, we get for the updates
4.17
and
4.18
with the scaling factors
4.19
The values ϵ+ and ϵ− ∈ (0, 1) are the learning rates.
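The GLVQ step described by equations 4.15 to 4.19 can be sketched as follows for the squared Euclidean distance and an identity transfer function (a hedged illustration; any divergence with its Fréchet derivative could replace the distance and its gradient):

```python
import numpy as np

def glvq_step(v, c_v, W, Y, eps_p=0.05, eps_m=0.05):
    # one GLVQ step: attract the best correct prototype, repel the best
    # incorrect one, both scaled by the derivative of the classifier function
    d = np.sum((W - v) ** 2, axis=1)
    idx_p = min((i for i in range(len(W)) if Y[i] == c_v), key=lambda i: d[i])
    idx_m = min((i for i in range(len(W)) if Y[i] != c_v), key=lambda i: d[i])
    xi_p, xi_m = d[idx_p], d[idx_m]
    mu = (xi_p - xi_m) / (xi_p + xi_m)        # classifier function, eq. 4.15
    theta_p = 2 * xi_m / (xi_p + xi_m) ** 2   # scaling factors, eq. 4.19
    theta_m = 2 * xi_p / (xi_p + xi_m) ** 2
    # gradient of xi = ||v - w||^2 with respect to w is -2 * (v - w)
    W[idx_p] -= eps_p * theta_p * (-2 * (v - W[idx_p]))   # attract
    W[idx_m] += eps_m * theta_m * (-2 * (v - W[idx_m]))   # repel
    return mu
```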

Obviously, the distance measure ξ could be replaced in all of these LVQ schemes by one of the introduced divergences. This offers a new possibility for information-theoretic learning in classification schemes, which differs significantly from previous approaches. These earlier approaches stress the information-optimum class representation, whereas here the expected information loss in terms of the applied divergence measure is optimized (Torkkola & Campbell, 2000; Torkkola, 2003; Villmann, Hammer, et al., 2008).

Apart from the basic LVQ schemes, many more sophisticated prototype-based learning schemes have been proposed for classification learning. Here we restrict ourselves to approaches that can deal with probabilistically or possibilistically labeled training data (uncertain decisions) and that are related to the basic unsupervised and supervised vector quantization algorithms mentioned so far in this letter.

In particular, we focus on the fuzzy-labeled SOM (FLSOM) and the very similar fuzzy-labeled NG (FLNG) (Villmann, Schleif, Kostrzewa, Walch, & Hammer, 2008; Villmann, Hammer, Schleif, Geweniger, & Herrmann, 2006). Both approaches extend the cost function of their unsupervised counterparts in the following shorthand manner,
where EFL measures the classification accuracy. The factor β ∈ [0, 1) balances unsupervised and supervised learning. The classification accuracy term EFL is defined as
4.20
where gγ(v, wr) is a gaussian kernel describing a neighborhood range in the data space
4.21
using the dissimilarity ξ(v, wr) in the data space. The term ψ(cv, yr) judges the dissimilarity between the label vectors of data and prototypes; it was originally suggested to be the quadratic Euclidean distance.
Note that EFL depends on the dissimilarity in the data space ξ(v, wr) via gγ(v, wr). Hence, prototype adaptation in FLSOM/FLNG is influenced by the classification accuracy
4.22
which yields
4.23
The label adaptation is influenced only by the second part, EFL. The derivative yields
4.24
with learning rate ϵl > 0 (Villmann, Schleif, et al., 2008; Villmann et al., 2006). This label learning leads to a weighted average yr of the fuzzy labels cv of those data v that are close to the associated prototype according to ξ(v, wr).
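For the quadratic Euclidean label dissimilarity ψ, the label update amounts to moving each yr toward cv with weight gγ(v, wr); a sketch under that assumption (hypothetical names):

```python
import numpy as np

def flsom_label_step(v, c_v, W, Ylabels, xi, gamma=0.5, eps_l=0.1):
    # label adaptation sketch for FLSOM with quadratic-Euclidean psi:
    # each prototype label moves toward c_v, weighted by the data-space
    # neighborhood g_gamma(v, w_r)
    g = np.array([np.exp(-xi(v, w) / (2 * gamma ** 2)) for w in W])
    Ylabels += eps_l * g[:, None] * (c_v - Ylabels)
    return Ylabels
```

Iterated over the data, this drives yr toward the neighborhood-weighted average of the fuzzy labels, matching the interpretation above.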

It should be noted at this point that a similar approach can easily be installed for XOM in an analogous manner, yielding FLXOM.

Clearly, besides choosing a divergence measure for ξ(v, wr) as in the unsupervised case, nothing prevents doing so for the label dissimilarity ψ(cv, yr) in these FL methods as well. As before, simply plugging in the respective discrete divergence variants and their Fréchet derivatives modifies the algorithms such that semisupervised learning can proceed relying on divergences for both dissimilarities.

5.  SOM Simulations for Various Divergences

In this section, we demonstrate the influence of the chosen divergence and the dependence on divergence parameters for prototype-based unsupervised vector quantization. For this purpose, we consider an artificial but illustrative data set. In the case of parameterized divergences, we vary the parameter settings to show how the resulting prototype distribution depends on them. Further, we investigate the behavior of different divergence types, always comparing the results with Euclidean distance-based learning as the standard to show the differences.

These investigations for the toy problem should lead readers to think about the choice of divergences for a specific application as well as optimum parameter settings. The demonstration itself is far from a realistic scenario, which would also have to deal with such matters as high-dimensional problems and heterogeneous data distributions.

As an example vector quantization model, we consider the Heskes-SOM according to equation 4.4 using a chain lattice with 100 units r and their prototypes wr. The example data distribution consists of 10^7 data points v = (v1, v2) ∈ [0, 1]2, which are constrained such that v1 + v2 = 1 (i.e., the data v can be taken as probability densities). Further, generating the data set, the first component v1 is chosen randomly according to the data density P1(v1) = 2 · v1, whereas v2 is subsequently calculated according to the constraint.
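Such a data set can be generated by inverse transform sampling: the density P1(v1) = 2 · v1 has CDF F(v1) = v1^2, so v1 = sqrt(u) with u uniform on [0, 1]; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_toy_data(n):
    # draw v1 ~ P1(v1) = 2*v1 on [0, 1] by inverse transform sampling
    # (CDF F(v1) = v1**2, hence v1 = sqrt(u) for u ~ U[0, 1]) and set
    # v2 = 1 - v1, so each v = (v1, v2) is a discrete probability vector
    u = rng.random(n)
    v1 = np.sqrt(u)
    return np.column_stack([v1, 1.0 - v1])

V = sample_toy_data(100_000)   # the letter uses 10**7 points
```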

The initial values for the learning rate ε and the neighborhood range σ were appropriately chosen; during SOM learning, both were decreased to the final values εfinal = 10−6 and σfinal = 1, respectively.
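A common choice for such schedules is exponential interpolation between the initial and final values; a sketch (the letter fixes only the final values, so the start values and the schedule shape here are assumptions):

```python
def anneal(t, t_max, start, final):
    # exponential interpolation from start to final over t_max steps,
    # a common schedule for SOM learning rate and neighborhood range
    return start * (final / start) ** (t / t_max)
```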

We trained SOM networks for the divergences as introduced in section 2 using the Fréchet derivatives deduced in section 3.1 with different parameter values.

For the η-divergence (belonging to the Bregman divergences), the results are depicted in Figure 2. One can observe that the influence of the parameter η is only marginal, yet small variations can be detected. For the special choice η = 2, Euclidean learning is recovered.

Figure 2:

Prototype distribution for η-divergence-based SOM for different η-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).


For the β-divergence, the influence of the parameter value β is stronger than for the η-divergences (see Figure 3). In particular, significant deviations can be observed for higher prototype w1-values, hinting at a better discrimination property in this probability range. Lower prototype w1-values were captured markedly better by the β-divergences than by Euclidean learning.

Figure 3:

Prototype distribution for β-divergence-based SOM for different β-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).


The α-divergence-based learning shows the inclusive and exclusive properties mentioned above. For a positive choice of the control parameter α, the range of captured prototype w1-values is considerably larger than the one covered using negative α-values. However, only small variations can be detected within the two α-domains (positive and negative); that is, the divergence is relatively robust with respect to the control parameter α (see Figure 4).

Figure 4:

Prototype distribution for α-divergence-based SOM for different α-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).


For the Tsallis divergence, the influence of the control parameter α is already detectable in the central range of prototype w1-values and is significant in the upper range (see Figure 5). Especially in comparison to Euclidean learning, this hints at a quite good discrimination property over a wide probability range.

Figure 5:

Prototype distribution for Tsallis divergence-based SOM for different α-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).


In contrast to the β-divergence, the influence of the control parameter α of the Rényi divergence is observed primarily in the region with sparse data density (see Figure 6). However, the Rényi divergence-based learning covers a wider range of prototype w1-values than Euclidean learning.

Figure 6:

Prototype distribution for Rényi divergence-based SOM for different α-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).


The γ-divergence shows the most sensitive behavior of all parameterized divergences investigated here (see Figure 7). In particular, the choice of the control parameter γ influences both probability ranges, the low and the high one, with approximately the same sensitivity.

Figure 7:

Prototype distribution for γ-divergence-based SOM for different γ-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).


Thus, it differs from the sensitivity observed for β-divergences. This behavior offers the possibility of tuning the divergence precisely depending on the specific vector quantization task. Together with the stated robustness of the γ-divergence (Fujisawa & Eguchi, 2008), this adaptive specificity could provide high potential for a wide range of applications. This is underscored by the applications in supervised and unsupervised vector quantization based on the Cauchy-Schwarz divergence (γ = 1) (Jenssen et al., 2006; Mwebaze et al., 2010; Principe et al., 2000; Villmann, Haase, Schleif, & Hammer, 2010).

Figure 8 shows the results of the prototype-based unsupervised vector quantization using various nonparameterized divergences.

Figure 8:

Prototype distribution for divergence-based SOM, using various divergences. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).


These simulations should be seen, on one hand, as a proof of concept. On the other hand, one can clearly see quite different behavior for the various divergences, resulting in distinct prototype distributions and, in consequence, diverse vector quantization properties. Therefore, the choice of a divergence for a specific application should be made carefully, taking the special properties of the divergences into account.

6.  Extensions for the Basic Adaptation Scheme: Hyperparameter and Relevance Learning

6.1.  Hyperparameter Learning for γ-, α-, β-, and η-Divergences.

6.1.1.  Theoretical Considerations.

Considering the parameterized divergence families of γ-, α-, β-, and η-divergences, one could further think about the optimal choice of the so-called hyperparameters γ, α, β, η, as suggested in a similar manner for other parameterized LVQ algorithms (Schneider, Biehl, & Hammer, 2009). In the case of supervised learning schemes for classification based on differentiable cost functions, the optimization can be handled by a gradient descent–based adaptation procedure. Thus, the parameter is optimized for the classification task at hand.

Suppose the classification accuracy for a certain approach is given as
depending on a parameterized divergence ξθ with parameter θ. If E and ξθ are both differentiable with respect to θ according to
a gradient-based optimization is derived by
depending on the derivative for a certain choice of the divergence ξθ.
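As an illustration with the discrete β-divergence, the hyperparameter gradient can also be approximated numerically when the closed-form derivative is unwieldy; a sketch (the finite-difference stand-in is an assumption, not the letter's closed-form result from appendix A):

```python
import numpy as np

def beta_divergence(p, r, beta):
    # discrete beta-divergence (beta not in {0, 1}), cf. eq. 2.9
    p, r = np.asarray(p, float), np.asarray(r, float)
    return np.sum(p * (p ** (beta - 1) - r ** (beta - 1)) / (beta - 1)
                  - (p ** beta - r ** beta) / beta)

def hyper_grad(div, p, r, theta, h=1e-5):
    # central finite-difference d D_theta(p||r) / d theta, a numerical
    # stand-in for the closed-form hyperparameter derivatives
    return (div(p, r, theta + h) - div(p, r, theta - h)) / (2 * h)

# one gradient step on the hyperparameter
p, r = [0.6, 0.4], [0.3, 0.7]
beta = 2.0
beta -= 0.1 * hyper_grad(beta_divergence, p, r, beta)
```

For β = 2 the β-divergence reduces to half the squared Euclidean distance, which provides a simple sanity check.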

We assume in the following that the (positive) measures p and ρ are continuously differentiable. Then, considering derivatives of parameterized divergences with respect to the parameter θ, integration and differentiation may be interchanged if the resulting integral exists (Fichtenholz, 1964). Hence, we can differentiate parameterized divergences with respect to their hyperparameter in that case. For the several α-, β-, γ-, and η-divergences characterized in section 2, we obtain after some elementary calculations:

• η-divergence Dη(p||ρ) from equation 2.8:
• β-divergence Dβ(p||ρ) from equation 2.9 (see appendix A):
• α-divergence Dα(p||ρ) from equation 2.18 (see appendix A):
• Tsallis divergence DTα(p||ρ) from equation 2.22:
• Generalized Rényi divergence DGRα(p||ρ) from equation 2.26 (see appendix A):
• Rényi divergence DRα(p||ρ) from equation 2.28:
• γ-divergence Dγ(p||ρ) from equation 2.30 (see appendix A):

6.1.2.  Example: Hyperparameter Learning for γ-Divergences in GLVQ.

We now provide a simulation example for hyperparameter learning. We apply the GLVQ algorithm for classification, the cost function of which is given by equation 4.16. Mwebaze et al. (2010) pointed out that GLVQ performs weakly if the Kullback-Leibler divergence is used, whereas the Cauchy-Schwarz divergence yields good results. Therefore, we demonstrate hyperparameter learning for the γ-divergence, which includes both the Kullback-Leibler and the Cauchy-Schwarz divergence via the parameter settings γ → 0 and γ = 1, respectively. The hyperparameter update for this algorithm reads as
with the scaling factors θ+ and θ− taken from equation 4.19.

For this purpose, we investigate a simple classification example: the well-known three-class IRIS data set. We rescaled the data vectors such that the requirements of positive measures are satisfied. We used two prototypes for each class and 10-fold cross-validation. We initialized the γ-parameter as γ0 = 0.5 to lie in the middle between the Kullback-Leibler and Cauchy-Schwarz divergences, following Mwebaze et al. (2010).

Without γ-parameter update, for γ = 0 a classification accuracy of 78.34% is obtained with standard deviation σ = 6.17, the best result being 91.3%. For γ = 1, the average is 95.16% with σ = 1.87, the best run yielding 97.3%. The hyperparameter-controlled simulations give only a slight improvement, achieving an average performance of 95.89%, but with decreased deviation σ = 0.43. The γ-parameter converged to γfinal = 0.9016 with standard deviation σγ < 10−4. As expected from the noncontrolled experiments, the final γ-value is in the proximity of the Cauchy-Schwarz divergence, although slightly but consistently decreased. A typical learning progress of γ is depicted in Figure 9. As for the Cauchy-Schwarz divergence (γ = 1), the best performance in the controlled case was 97.3%.

Figure 9:

Example run of γ-parameter control for the γ-divergence in the case of GLVQ applied to the well-known Iris data set.


Summarizing, this small experiment shows that hyperparameter optimization works well and may lead to better performance and stability.

6.2.  Relevance Learning for Divergences.

Density functions are required to fulfill the normalization condition, whereas positive measures are more flexible. This offers the possibility of transferring the idea of relevance learning to divergence-based learning vector quantization. Relevance learning in learning vector quantization weights the input data dimensions such that classification accuracy is improved (Hammer & Villmann, 2002).

In the framework of divergence-based gradient descent learning, we multiplicatively weight a positive measure q(x) by λ(x) with 0 ⩽ λ(x) < ∞ and the regularization condition ∫λ(x)dx = 1. Incorporating this idea into the above approaches, we replace p by p · λ and ρ by ρ · λ in the divergences. Doing so, we can optimize λ(x) during learning for better performance by gradient descent, as known from vectorial relevance learning. This leads again to Fréchet derivatives of the divergences, now with respect to the weighting function λ(x). The respective framework based on GLVQ for vectorial data is given by the generalized relevance learning vector quantization scheme (GRLVQ; Hammer & Villmann, 2002). In complete analogy, we obtain the functional relevance update,
with s+(p) and s−(p) playing the same role as in GLVQ. For vectorial representations v and w of p and ρ, respectively, this reduces to the ordinary partial derivatives:
Applying this methodology, we obtain for the Bregman divergence,
6.1
with
This yields the generalized Kullback-Leibler divergence:
6.2
In the case of the η-divergence (equation 2.8), we calculate
6.3
which reduces for the choice η = 2 (Euclidean distance) to
as it is known from Hammer and Villmann (2002). Further, for the β-divergence, equation 2.9, which also belongs to the Bregman divergence class, we have
6.4
For the class of f-divergences, equation 2.11, we consider
6.5
with using the fact that . The relevance learning of the subclass of α-divergences, equation 2.18, follows,
6.6
whereas the respective gradient of generalized Rényi divergences, equation 2.26, can be derived from this as
6.7
The subset of Tsallis divergences is treated by
6.8
The γ-divergence classes finally yield
Again the important special case γ = 1 is considered: the relevance learning scheme for the Cauchy-Schwarz divergence, equation 2.32, is derived as
6.9
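The relevance-weighting idea of this section can be sketched numerically: weight both measures by λ, compute the GLVQ-style classifier function, and descend on λ under the normalization condition (finite differences stand in for the closed-form Fréchet derivatives; all names are hypothetical):

```python
import numpy as np

def weighted_gen_kl(p, r, lam):
    # generalized KL divergence of the relevance-weighted measures
    # (p -> p*lambda, rho -> rho*lambda), all entries assumed positive
    wp, wr = lam * p, lam * r
    return np.sum(wp * np.log(wp / wr) - wp + wr)

def relevance_step(v, w_plus, w_minus, lam, eps_lam=0.01, h=1e-6):
    # numerical gradient step on lambda for the GLVQ-style classifier
    # function mu = (xi+ - xi-)/(xi+ + xi-); the letter's closed-form
    # Frechet derivatives would replace the finite differences in practice
    def mu(l):
        xp = weighted_gen_kl(v, w_plus, l)
        xm = weighted_gen_kl(v, w_minus, l)
        return (xp - xm) / (xp + xm)
    grad = np.array([(mu(lam + h * e) - mu(lam - h * e)) / (2 * h)
                     for e in np.eye(len(lam))])
    lam = np.clip(lam - eps_lam * grad, 1e-12, None)
    return lam / lam.sum()   # enforce the normalization condition
```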

7.  Conclusion

Divergence-based supervised and unsupervised vector quantization has so far been done by applying only a few divergences, primarily the Kullback-Leibler divergence. Recent applications also refer to the Itakura-Saito divergence, the Cauchy-Schwarz divergence, and the γ-divergence. These approaches are not online adaptation schemes involving gradient learning but are based on batch mode, requiring all the data at one time. However, in many cases, online learning is mandatory for several reasons: the huge amount of data, a subsequently increasing data set, or the need for very careful learning in complex problems, for example (Alex, Hasenfuss, & Hammer, 2009). In these cases, online learning is required or may at least be advantageous.

In this letter we give a mathematical foundation for gradient-based vector quantization bearing on the derivatives of the applied divergences. We provide a general framework for the use of arbitrary divergences and their derivatives such that they can immediately be plugged into existing gradient-based vector quantization schemes.

For this purpose, we first characterized the main subclasses of divergences—Bregman-, α-, β-, γ-, and f-divergences—following Cichocki et al. (2009). We then used the mathematical methodology of Fréchet derivatives to calculate the functional divergence derivatives.

We showed how to use this methodology with well-known examples of supervised and unsupervised vector quantization, including SOM, NG, and GLVQ. In particular, we explained that divergences can be taken as suitable dissimilarity measures for data, which leads to the use of the respective Fréchet derivatives in the online learning schemes. Further, we showed how a parameter adaptation can be integrated into supervised learning to achieve improved classification results in the case of the parameterized α-, β-, γ-, and η-divergences. In the last step, we considered a weighting function for generalized divergences based on a positive measure. The optimization scheme for this weighting function is again obtained by Fréchet derivatives, yielding a relevance learning scheme in analogy to relevance learning in the usual supervised learning vector quantization (Hammer & Villmann, 2002).

Table 1 provides an overview of representatives of the three main classes of divergences characterized in section 2 and their related Fréchet derivatives. Table 2 provides the derived expressions for relevance learning and hyperparameter learning.

Table 1:
Table of Divergences and Their Fréchet Derivatives.
Table 2:
Table of Divergences and Their Derivatives for Relevance Learning and Hyperparameter Learning.

As a proof of concept, the simulations of an illustrative example for the several parametric and nonparametric divergences give promising results regarding their sensitivity. The differences from Euclidean learning are obvious. Moreover, the dependencies in the case of parameterized divergences give hints for possible real-world applications, which should be the next step in this work.

Appendix A:  Calculation of the Derivatives of the Parameterized Divergences with Respect to the Hyperparameters

We assume for the differentiation of the divergences with respect to their hyperparameters that the (positive) measures p and ρ are continuously differentiable. Then, considering derivatives of divergences, integration and differentiation can be interchanged, if the resulting integral exists (Fichtenholz, 1964).

A.1.  β-Divergence.

The β-divergence is, according to equation 2.9,
We treat both integrals independently:
Thus,
if the integral exists for an appropriate choice of β.

A.2.  α-Divergences.

We consider the α-divergence, equation 2.18:
We have
The derivative yields
Finally, we get

A.3.  Rényi Divergences.

Considering the generalized Rényi divergence DGRα(p||ρ) from equation 2.26,
we get
with
Summarizing, the differentiation yields
We now turn to the usual Rényi divergence DRα(p||ρ) from equation 2.28:
We analogously achieve

A.4.  γ-Divergences.

The remaining divergences are the γ-divergences, equation 2.30:
The derivative is obtained according to
Next, we calculate the derivatives , , and :
Collecting all intermediate results, we finally have

Appendix B:  Proof of Lemma 1

We now give the proof of lemma 1. For the proof, we need a proposition given in Liese and Vajda (1987):

Proposition 3.
Let A = [0, ∞)2 and . Further, let f be a function defined by
for an arbitrary with the definitions , . Further, let us denote and . Then there exists such that
and

Proof.

See Liese and Vajda (1987).

This proposition provides the essential ingredients to prove the lemma:

Lemma.
The f-divergence Df for positive measures p and ρ is bounded (if the limit exists and is finite):
with.

Proof.
Let p* be a nonnegative integrable function defined as
Further, let us define
Then it follows directly from the above proposition that there is such that
With f being a determining function of an f-divergence, it holds that f(1) = 0 and thus
We now get
Since p and ρ are positive measures with weights W(p) ⩽ 1 and W(ρ) ⩽ 1 according to equation 2.1, this finally yields
which completes the proof of the lemma.

Notes

1

Each set of arbitrary nonnegative integrable functionals f with domain V can be transformed into a set of positive measures simply by with .

2

If S follows a statistical distribution with existing functional expectation value ES, then the mean μ can be replaced by ES.

3

The relations and hold.

4

The equality holds iff the conditional densities and are identical (see Amari & Nagaoka, 2000).

5

A careful transformation of the parameter α is required for exact transformations between both divergences. For details, see Amari (1985) and Cichocki et al. (2009). Further, this statement was given in this book without proving the bounds of the underlying f-divergence for positive measures as it is given in this letter by lemma 1.

6

The divergence Dγ(p||ρ) is proposed to be robust for γ ∈ [0, 1] with the existence of Dγ=0 in the limit γ → 0. A detailed analysis of robustness is given in Fujisawa and Eguchi (2008).

References

Alex
,
N.
,
Hasenfuss
,
A.
, &
Hammer
,
B.
(
2009
).
Patch clustering for massive data sets
.
Neurocomputing
,
72
(
7
9
),
1455
1469
.
Amari
,
S.-I.
(
1985
).
Differential-geometrical methods in statistics
.
Berlin
:
Springer
.
Amari
,
S.-I.
, &
Nagaoka
,
H
. (
2000
).
Methods of information geometry
.
New York
:
Oxford University Press
.
Banerjee
,
A.
,
Merugu
,
S.
,
Dhillon
,
I.
, &
Ghosh
,
J.
(
2005
).
Clustering with Bregman divergences
.
Journal of Machine Learning Research
,
6
,
1705
1749
.
Basseville
,
M
(
1988
).
Distance measures for signal processing and pattern recognition
.
(Tech. Rep. 899).
Paris
:
Institut National de Recherche en Informatique et en Automatique
.
Basu
,
A.
,
Harris
,
I.
,
Hjort
,
N.
, &
Jones
,
M.
(
1998
).
Robust and efficient estimation by minimising a density power divergence
.
Biometrika
,
85
(
3
),
549
559
.
Bertin
,
N.
,
Fevotte
,
C.
, &
,
R.
(
2009
).
A tempering approach for Itakura-saito non-negative matrix factorization. with application to music transcription
. In:
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(pp.
1545
1548
).
Piscataway, NJ
:
IEEE Press
.
Bishop
,
C. M.
,
Svensén
,
M.
,
Williams
, &
C.K.I.
(
1998
).
GTM: The generative topographic mapping
.
Neural Computation
,
10
,
215
234
.
Bregman
,
L.
(
1967
).
The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming
.
USSR Computational Mathematics and Mathematical Physics
,
7
(
3
),
200
217
.
Bunte
,
K.
,
Hammer
,
B.
,
Villmann
,
T.
,
Biehl
,
M.
, &
Wismüller
,
A.
(
2010
).
Exploratory observation machine (XOM) with Kullback-Leibler divergence for dimensionality reduction and visualziation.
In M. Verleysen (Ed.)
,
Proc. of European Symposium on Artificial Neural Networks
(pp.
87
92
).
Evere, Belgium
:
d-side publications
.
Cichocki
,
A.
, &
Amari
,
S.-I.
(
2010
).
Families of alpha- beta- and gamma- divergences: Flexible and robust measures of similarities
.
Entropy
,
12
,
1532
1568
.
Cichocki
,
A.
,
Lee
,
H.
,
Kim
,
Y.-D.
, &
Choi
,
S.
(
2008
).
Non-negative matrix factorization with α-divergence
.
Pattern Recognition Letters
,
29
,
1433
1440
.
Cichocki
,
A.
,
Zdunek
,
R.
,
Phan
,
A.
, &
Amari
,
S.-I.
(
2009
).
Nonnegative matrix and tensor factorizations
.
Hoboken, NJ
:
Wiley
.
Crammer
,
K.
,
,
R.
,
Navot
,
A.
, &
Tishby
,
A.
(
2002
).
Margin analysis of the LVQ algorithm
.
In S. Becker, S. Thrün, & K. Obermayer (Eds.)
,
Advances in neural information processing systems
,
15
(pp.
462
468
):
Cambridge, MA
:
MIT Press
.
Csiszár
,
I.
(
1967
).
Information-type measures of differences of probability distributions and indirect observations
.
Studia Sci. Math. Hungaria
,
2
,
299
318
.
Eguchi
,
S.
, &
Kano
,
Y.
(
2001
).
Robustifying maximum likelihood estimation
(
Tech. Rep. 802
).
Tokyo
:
Tokyo Institute of Statistical Mathematics
.
Erdogmus
,
D.
(
2002
).
Information theoretic learning: Renyi's entropy and its application to adaptive systems training
.
Unpublished doctoral dissertation, University of Florida
.
Fichtenholz
,
G.
(
1964
).
Differential- und Integralrechnung
(9th ed.).
Berlin
:
Deutscher Verlag der Wissenschaften
.
Frigyik
,
B.
,
Srivastava
,
S.
, &
Gupta
,
M.
(
2008a
).
Functional Bregman divergence and Bayesian estimation of distributions
.
IEEE Transactions on Information Theory
,
54
(
11
),
5130
5139
.
Frigyik
,
B. A.
,
Srivastava
,
S.
, &
Gupta
,
M.
(
2008b
).
An introduction to functional derivatives
(Tech. Rep. UWEETR-2008-0001). Seattle: Department of Electrical Engineering, University of Washington
.
Fujisawa
,
H.
, &
Eguchi
,
S.
(
2008
).
Robust parameter estimation with a small bias against heavy contamination
.
Journal of Multivariate Analysis
,
99
,
2053
2081
.
Graepel
,
T.
,
Burger
,
M.
, &
Obermayer
,
K.
(
1998
).
Self-organizing maps: Generalizations and new optimization techniques
.
Neurocomputing
,
21
(
1–3
),
173
190
.
Hammer
,
B.
, &
Villmann
,
T.
(
2002
).
Generalized relevance learning vector quantization
.
Neural Networks
,
15
(
8–9
),
1059
1068
.
Hegde
,
A.
,
Erdogmus
,
D.
,
Lehn-Schiøler
,
T.
,
Rao
,
Y.
, &
Principe
,
J.
(
2004
).
Vector quantization by density matching in the minimum Kullback-Leibler-divergence sense.
In
Proc. of the International Joint Conference on Artificial Neural Networks
(pp.
105
109
).
Piscataway, NJ
:
IEEE Press
.
Heskes
,
T.
(
1999
).
Energy functions for self-organizing maps
. In
E.
Oja
&
S.
(Eds.),
Kohonen maps
(pp.
303
316
).
Amsterdam
:
Elsevier
.
Hulle
,
M.M.V.
(
2000
).
Faithful representations and topographic maps
.
Hoboken, NJ
:
Wiley
.
Hulle
,
M.M.V.
(
2002a
).
Joint entropy maximization in kernel-based topographic maps
.
Neural Computation
,
14
(
8
),
1887
1906
.
Hulle
,
M.M.V.
(
2002b
).
Kernel-based topographic map formation achieved with an information theoretic approach
.
Neural Networks
,
15
,
1029
1039
.
Itakura
,
F.
, &
Saito
,
S.
(
1973
).
Analysis synthesis telephony based on the maximum likelihood method
. In
J.
Flanagan
&
R.
Rabiner
(Eds.),
Speech synthesis
(pp.
289
292
).
Stroudsburg, PA
:
Dowden, Hutchinson, & Ross
.
Jang
,
E.
,
Fyfe
,
C.
, &
Ko
,
H.
(
2008
).
Bregman divergences and the self organising map. In C. Fyfe
,
D. Kim, S.-Y., Lee, & H. Yin (Eds.), Intelligent data engineering and automated learning
(pp. 452–458). New York: Springer.
Jenssen
,
R.
(
2005
).
An information theoretic approach to machine learning
.
Unpublished doctoral dissertation, University of Tromsø
.
Jenssen
,
R.
,
Principe
,
J.
,
Erdogmus
,
D.
, &
Eltoft
,
T.
(
2006
).
The Cauchy-Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels
.
Journal of the Franklin Institute
,
343
(
6
),
614
629
.
Kantorowitsch
,
I.
, &
Akilow
,
G.
(
1978
).
Funktionalanalysis in normierten Räumen
(2nd ed.).
Berlin
:
.
Kapur
,
J.
(
1994
).
Measures of information and their application
.
Hoboken, NJ
:
Wiley
.
Kohonen
,
T.
(
1997
).
Self-organizing maps
(2nd ext. ed.).
New York
:
Springer
.
Kullback
,
S.
, &
Leibler
,
R.
(
1951
).
On information and sufficiency
.
Annals of Mathematical Statistics
,
22
,
79
86
.
Lai, P., & Fyfe, C. (2009). Bregman divergences and multi-dimensional scaling. In M. Köppen, N. Kasabov, & G. Coghill (Eds.), Proceedings of the International Conference on Information Processing 2008 (pp. 935–942). New York: Springer.
Lee, J., & Verleysen, M. (2005). Generalization of the Lp norm for time series and its application to self-organizing maps. In M. Cottrell (Ed.), Proc. of Workshop on Self-Organizing Maps (pp. 733–740). Paris: Sorbonne.
Lee, J., & Verleysen, M. (2007). Nonlinear dimensionality reduction. New York: Springer.
Lehn-Schiøler, T., Hegde, A., Erdogmus, D., & Principe, J. (2005). Vector quantization using information theoretic concepts. Natural Computing, 4(1), 39–51.
Liese, F., & Vajda, I. (1987). Convex statistical distances. Leipzig: Teubner-Verlag.
Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10), 4394–4412.
Linde, Y., Buzo, A., & Gray, R. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, 28, 84–95.
Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Martinetz, T. M., Berkovich, S. G., & Schulten, K. J. (1993). "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4), 558–569.
Minami, M., & Eguchi, S. (2002). Robust blind source separation by beta divergence. Neural Computation, 14, 1859–1886.
Minka, T. (2005). Divergence measures and message passing (Tech. Rep. 173). Cambridge, UK: Microsoft Research.
Mwebaze, E., Schneider, P., Schleif, F.-M., Haase, S., Villmann, T., & Biehl, M. (2010). Divergence based learning vector quantization. In M. Verleysen (Ed.), Proc. of European Symposium on Artificial Neural Networks (pp. 247–252). Evere, Belgium: D-side.
Nielsen, F., & Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6), 2882–2903.
Principe, J. C., Xu, D., & Fisher, J., III (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering. Hoboken, NJ: Wiley.
Qiao, Y., & Minematsu, N. (2008). f-divergence is a generalized invariant measure between distributions. In INTERSPEECH—Proc. of the Annual Conference of the International Speech Communication Association (pp. 1349–1352). N.p.: International Speech Communication Association.
Ramsay, J., & Silverman, B. (2006). Functional data analysis (2nd ed.). New York: Springer.
Renyi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.
Renyi, A. (1970). Probability theory. Amsterdam: North-Holland.
Rossi, F., Delannay, N., Conan-Gueza, B., & Verleysen, M. (2005). Representation of functional data in neural networks. Neurocomputing, 64, 183–210.
Santos-Rodríguez, R., Guerrero-Curieses, A., Alaiz-Rodríguez, R., & Cid-Sueiro, J. (2009). Cost-sensitive learning based on Bregman divergences. Machine Learning, 76(2–3), 271–285.
Sato, A., & Yamada, K. (1996). Generalized learning vector quantization. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 423–429). Cambridge, MA: MIT Press.
Schneider, P., Biehl, M., & Hammer, B. (2009). Hyperparameter learning in robust soft LVQ. In M. Verleysen (Ed.), Proceedings of the European Symposium on Artificial Neural Networks (pp. 517–522). Evere, Belgium: D-side.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–432.
Taneja, I., & Kumar, P. (2004). Relative information of type s, Csiszár's f-divergence, and information inequalities. Information Sciences, 166, 105–125.
Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415–1438.
Torkkola, K., & Campbell, W. (2000). Mutual information in learning feature transformations. In Proc. of the International Conference on Machine Learning. San Francisco: Morgan Kaufmann.
Villmann, T. (2007). Sobolev metrics for learning of functional data—mathematical and theoretical aspects (Machine Learning Reports 1, pp. 1–15). Available online at http://www.uni-leipzig.de/compint/mlr/mlr_01_2007.pdf.
Villmann, T., Haase, S., Schleif, F.-M., & Hammer, B. (2010). Divergence based online learning in vector quantization. In L. Rutkowski, W. Duch, J. Kaprzyk, & J. Korbicz (Eds.), Proc. of the International Conference on Artificial Intelligence and Soft Computing. New York: Springer.
Villmann, T., Haase, S., Simmuteit, S., Haase, M., & Schleif, F.-M. (2010). Functional vector quantization based on divergence learning. Ulmer Informatik-Berichte, 2010-05, 8–11.
Villmann, T., Hammer, B., Schleif, F.-M., Geweniger, T., & Herrmann, W. (2006). Fuzzy classification by fuzzy labeled neural gas. Neural Networks, 19, 772–779.
Villmann, T., Hammer, B., Schleif, F.-M., Hermann, W., & Cottrell, M. (2008). Fuzzy classification using information theoretic learning vector quantization. Neurocomputing, 71, 3070–3076.
Villmann, T., & Schleif, F.-M. (2009). Functional vector quantization by neural maps. In J. Chanussot (Ed.), Proceedings of First Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (pp. 1–4). Piscataway, NJ: IEEE Press.
Villmann, T., Schleif, F.-M., Kostrzewa, M., Walch, A., & Hammer, B. (2008). Classification of mass-spectrometric data in clinical proteomics using learning vector quantization methods. Briefings in Bioinformatics, 9(2), 129–143.
Wismüller, A. (2009). The exploration machine: A novel method for data visualization. In J. Principe & R. Miikkulainen (Eds.), Advances in self-organizing maps—Proceedings of the 7th International Workshop (pp. 344–352). New York: Springer.
Zador, P. L. (1982). Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transactions on Information Theory, 28, 149–159.