## Abstract

Supervised and unsupervised vector quantization methods for classification and clustering traditionally use dissimilarities, frequently taken as Euclidean distances. In this article, we investigate the applicability of divergences instead, focusing on online learning. We deduce the mathematical fundamentals for their utilization in gradient-based online vector quantization algorithms. These rely on the generalized derivatives of the divergences, known as Fréchet derivatives in functional analysis, which reduce in finite-dimensional problems to partial derivatives in a natural way. We demonstrate the application of this methodology for widely applied supervised and unsupervised online vector quantization schemes, including self-organizing maps, neural gas, and learning vector quantization. Additionally, principles for hyperparameter optimization and relevance learning for parameterized divergences in the case of supervised vector quantization are given to achieve improved classification accuracy.

## 1. Introduction

Supervised and unsupervised vector quantization for classification and clustering is strongly associated with the concept of dissimilarity, usually judged in terms of distances. The most common choice is the Euclidean metric. Recently, however, alternative dissimilarity measures have become attractive for advanced data processing. Examples are functional metrics like Sobolev distances or kernel-based dissimilarity measures (Villmann & Schleif, 2009; Lee & Verleysen, 2007). These metrics take the functional structure of the data into account (Lee & Verleysen, 2005; Ramsay & Silverman, 2006; Rossi, Delannay, Conan-Gueza, & Verleysen, 2005; Villmann, 2007).

Information theory–based vector quantization approaches are proposed considering divergences for clustering (Banerjee, Merugu, Dhillon, & Ghosh, 2005; Jang, Fyfe, & Ko, 2008; Lehn-Schiøler, Hegde, Erdogmus, & Principe, 2005; Hegde, Erdogmus, Lehn-Schiøler, Rao, & Principe, 2004). For other data processing methods like multidimensional scaling (MDS; Lai & Fyfe, 2009), stochastic neighbor embedding (Maaten & Hinton, 2008), blind source separation (Minami & Eguchi, 2002), or nonnegative matrix factorization (Cichocki, Lee, Kim, & Choi, 2008), divergence-based approaches are also introduced. In prototype-based classification, first approaches using information-theoretic approaches have been proposed (Erdogmus, 2002; Torkkola, 2003; Villmann, Hammer, Schleif, Hermann, & Cottrell, 2008).

A systematic analysis of prototype-based clustering and classification relying on divergences, however, has not yet been given. Further, the existing approaches are usually carried out in batch mode for optimization and are not available for online learning, which requires calculating the derivatives of the underlying metrics (i.e., divergences).

In this letter, we offer a systematic approach for divergence-based vector quantization using divergence derivatives. For this purpose, important but general classes of divergences are identified, widely following and extending the scheme introduced by Cichocki, Zdunek, Phan, and Amari (2009). The mathematical framework for functional derivatives of continuous divergences is given by the functional-analytic generalization of common derivatives—the concept of Fréchet derivatives (Frigyik, Srivastava, & Gupta, 2008b; Kantorowitsch & Akilow, 1978). This can be seen as a generalization of partial derivatives for discrete variants of the divergences. The functional approach is here preferred for clarity. Yet it also offers greater flexibility in specific variants of functional data processing (Villmann, Haase, Simmuteit, Haase, & Schleif, 2010).

After characterizing the different classes of divergences and introducing Fréchet derivatives, we apply this framework to several divergences and divergence classes to obtain generalized derivatives, which can be used for online learning in divergence-based methods for supervised and unsupervised vector quantization as well as other gradient-based approaches. We explicitly explore the derivatives to provide examples.

Then we consider some of the most prominent approaches for unsupervised as well as supervised prototype-based vector quantization in the light of divergence-based online learning using Fréchet derivatives, including self-organizing maps (SOM), neural gas (NG), and generalized learning vector quantization (GLVQ). For the supervised GLVQ approach, we also provide a gradient-based learning scheme, hyperparameter adaptation, for optimizing parameters that occur in the case of parameterized divergences.

The focus of the letter is mainly on giving a unified framework for the application of a wide range of divergences and classes thereof in gradient-based online vector quantization, together with their mathematical foundation. We formulate the problem in a functional manner following the approaches in Frigyik et al. (2008b), Csiszár (1967), and Liese and Vajda (2006). This allows a compact description of the mathematical theory based on the concept of Fréchet derivatives. We also note that the functional approach includes a larger class of divergence functionals than the discrete (pointwise) approach, as Frigyik, Srivastava, and Gupta (2008a) point out. Besides these extensions, the functional approach using Fréchet derivatives obviously reduces to partial derivatives in the discrete case. We therefore prefer the functional approach in this letter.

However, as a proof of concept, we demonstrate the utilization of several classes of parameterized divergences in SOM learning for an artificial but illustrative example, in comparison to Euclidean distance learning as the standard.

## 2. Characterization of Divergences

We consider nonnegative integrable measure functions *p* and ρ with a domain *V* and the constraints *p*(*x*) ⩽ 1 and ρ(*x*) ⩽ 1 for all *x* ∈ *V*. We denote such measure functions as positive measures. The weight of the functional *p* is defined as

$$W(p) = \int_V p(x)\,dx.$$

Positive measures *p* with weight *W*(*p*) = 1 are denoted as (probability) density functions.^{1}

Divergences *D*(*p*||ρ) are defined as functionals that have to be nonnegative and zero iff *p* ≡ ρ except on a set of measure zero. Further, *D*(*p*||ρ) is required to be convex with respect to the first argument. Yet divergences are neither necessarily symmetric nor required to fulfill the triangle inequality, as is supposed for metrics. According to the classification given in Cichocki et al. (2009), one can distinguish at least three main classes of divergences emphasizing different properties: Bregman divergences, Csiszár's *f*-divergences, and γ-divergences. We state some basic properties of these classes but do not go into detail about them, because this would be outside the scope of the letter. (For detailed property investigations, see Cichocki & Amari, 2010, and Cichocki et al., 2009.)

We generally assume that *p* and ρ are positive measures (densities) that are not necessarily normalized. In the case of (normalized) densities, we explicitly refer to these as probability densities.

### 2.1. Bregman Divergences.

Bregman divergences are defined by generating convex functions Φ in the following way, using a functional interpretation (Bregman, 1967; Frigyik et al., 2008b):

$$D^B_{\Phi}(p\|\rho) = \Phi(p) - \Phi(\rho) - \frac{\delta \Phi(\rho)}{\delta \rho}\,(p-\rho),$$

where δΦ(ρ)/δρ denotes the Fréchet derivative of Φ at ρ (see section 3).

The Bregman divergence *D*^{B}_{Φ}(*p*||ρ) can be interpreted as a measure of convexity of the generating function Φ. Taking *p* and ρ as points in a functional space, *D*^{B}_{Φ}(*p*||ρ) plays the role of the vertical distance between *p* and the tangential hyperplane to the graph of Φ at the point ρ, which is illustrated in Figure 1.

*D*^{B}_{Φ}(*p*||ρ) is invariant under affine transforms Γ(*q*) = Φ(*q*) + Ψ_{g}[*q*] + ξ of the generating function for positive measures *g* and *q*, where Ψ_{g} is supposed to be a linear operator independent of *q* (Frigyik et al., 2008a) and ξ is a scalar. In that case,

$$D^B_{\Gamma}(p\|\rho) = D^B_{\Phi}(p\|\rho)$$

is valid. Further, the generalized Pythagorean theorem holds for any triple *p*, ρ, τ of positive measures:

$$D^B_{\Phi}(p\|\tau) = D^B_{\Phi}(p\|\rho) + D^B_{\Phi}(\rho\|\tau) + \int \left(\frac{\delta \Phi(\rho)}{\delta \rho} - \frac{\delta \Phi(\tau)}{\delta \tau}\right)(p-\rho)\,dx.$$

The sensitivity of a Bregman divergence at *p* in the direction τ is defined as

$$s(p,\tau) = \left.\frac{\partial^2}{\partial \varepsilon^2}\, D^B_{\Phi}(p\|p+\varepsilon\tau)\right|_{\varepsilon=0},$$

with the restriction that ∫τ(*x*) *dx* = 0 (Santos-Rodríguez, Guerrero-Curieses, Alaiz-Rodríguez, & Cid-Sueiro, 2009). Note that the second functional derivative appearing here is the Hessian of the generating function. The sensitivity *s*(*p*, τ) measures the velocity of change of the divergence at the point *p* in the direction of τ.

A last property mentioned here is an optimality one (Banerjee et al., 2005). Given a set *S* of positive measures *p* with the (functional) mean μ = *E*[*p* ∈ *S*] and the additional restriction that μ is a relative interior point of *S*,^{2} then for given *p* ∈ *S*, the unique minimizer of *E*_{p}[*D*^{B}_{Φ}(*p*||ρ)] is ρ = μ. The converse of this statement is also true: if *E*_{p}[*D*^{B}_{F}(*p*||ρ)] is minimal for ρ = μ, then *D*^{B}_{F}(*p*||ρ) is a Bregman divergence. This property predestines Bregman divergences for clustering problems.

Finally, we give some important examples:

- • Generalized Kullback-Leibler divergence for non-normalized *p* and ρ (Cichocki et al., 2009):

  $$D_{GKL}(p\|\rho) = \int p(x)\,\ln\frac{p(x)}{\rho(x)}\,dx - \int \big(p(x)-\rho(x)\big)\,dx, \tag{2.4}$$

  with the generating function $\Phi(f) = \int f(x)\ln f(x) - f(x)\,dx$. If *p* and ρ are normalized densities (probability densities), *D*_{GKL}(*p*||ρ) reduces to the usual Kullback-Leibler divergence (Kullback & Leibler, 1951; Kapur, 1994),

  $$D_{KL}(p\|\rho) = \int p(x)\,\ln\frac{p(x)}{\rho(x)}\,dx, \tag{2.5}$$

  which is related to the Shannon entropy $H(p) = -\int p(x)\ln p(x)\,dx$ (Shannon, 1948) via $D_{KL}(p\|\rho) = Cr(p,\rho) - H(p)$, where $Cr(p,\rho) = -\int p(x)\ln\rho(x)\,dx$ is Shannon's cross-entropy.

- • Itakura-Saito divergence (Itakura & Saito, 1973), based on the Burg entropy $H_B(p) = \int \ln p(x)\,dx$, whose negative also serves as the generating function:

  $$D_{IS}(p\|\rho) = \int \left(\frac{p(x)}{\rho(x)} - \ln\frac{p(x)}{\rho(x)} - 1\right)dx. \tag{2.7}$$

  The Itakura-Saito divergence is also known as the negative cross-Burg entropy and fulfills the scale-invariance property, that is, *D*_{IS}(*c* · *p*||*c* · ρ) = *D*_{IS}(*p*||ρ). So the same relative weight is given to low- and high-energy components of *p* (Bertin, Fevotte, & Badeau, 2009). Due to this, the Itakura-Saito divergence is frequently applied in image processing and sound processing.

- • The Euclidean distance in terms of a Bregman divergence is obtained by the generating function $\Phi(f) = \int f(x)^2\,dx$. We extend this definition and introduce a parameterized version, defining the η-divergence, also known as the norm-like divergence (Nielsen & Nock, 2009):

  $$D_{\eta}(p\|\rho) = \int \left(p(x)^{\eta} + (\eta-1)\,\rho(x)^{\eta} - \eta\,p(x)\,\rho(x)^{\eta-1}\right)dx, \tag{2.8}$$

  generated by $\Phi(f) = \int f(x)^{\eta}\,dx$, which converges to the (squared) Euclidean distance for η → 2. To ensure the convexity of Φ(*f*), the restriction η > 1 is required.
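As a quick numerical illustration (a sketch with our own function names, not part of the original letter), the discrete counterparts of two of these Bregman divergences can be coded directly from the formulas above; nonnegativity, asymmetry, and the scale invariance of the Itakura-Saito divergence are easy to check.

```python
import math

def gkl(p, r):
    # discrete generalized Kullback-Leibler divergence for positive measures
    return sum(pi * math.log(pi / ri) - (pi - ri) for pi, ri in zip(p, r))

def itakura_saito(p, r):
    # discrete Itakura-Saito divergence
    return sum(pi / ri - math.log(pi / ri) - 1.0 for pi, ri in zip(p, r))

p = [0.2, 0.5, 0.3]
rho = [0.3, 0.4, 0.3]
```

Both functions vanish for *p* = ρ and are asymmetric in their arguments, as expected for divergences.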

If *p* and ρ are positive measures, an important subset of Bregman divergences belongs to the class of β-divergences (Eguchi & Kano, 2001), which are defined, following Cichocki et al. (2009), as

$$D_{\beta}(p\|\rho) = \int p(x)\,\frac{p(x)^{\beta-1} - \rho(x)^{\beta-1}}{\beta-1}\,dx - \int \frac{p(x)^{\beta} - \rho(x)^{\beta}}{\beta}\,dx$$

with β ≠ 1 and β ≠ 0, with the generating function $\Phi(f) = \frac{1}{\beta(\beta-1)}\int f(x)^{\beta}\,dx$. In the limit β → 1, the divergence *D*_{β}(*p*||ρ) becomes the generalized Kullback-Leibler divergence (see equation 2.4).^{3} The limit β → 0 gives the Itakura-Saito divergence (see equation 2.7). Further, β-divergences are equivalent, up to a β-dependent scaling factor, to the density power divergences introduced in Basu, Harris, Hjort, and Jones (1998). Obviously, the η-divergence (see equation 2.8) is a rescaled version of the β-divergence:

$$D_{\eta}(p\|\rho) = \eta(\eta-1)\,D_{\beta=\eta}(p\|\rho).$$

Thus, we see that for β = 2, the β-divergence *D*_{β}(*p*||ρ) becomes (half) the squared Euclidean distance.
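The stated limit behavior can be checked numerically with a small sketch (function names are ours): for β close to 1 the discrete β-divergence approaches the generalized Kullback-Leibler divergence, for β close to 0 the Itakura-Saito divergence, and for β = 2 half the squared Euclidean distance.

```python
import math

def beta_div(p, r, beta):
    # discrete beta-divergence, defined for beta not in {0, 1}
    a = sum(pi * (pi ** (beta - 1) - ri ** (beta - 1)) / (beta - 1) for pi, ri in zip(p, r))
    b = sum((pi ** beta - ri ** beta) / beta for pi, ri in zip(p, r))
    return a - b

def gkl(p, r):
    return sum(pi * math.log(pi / ri) - (pi - ri) for pi, ri in zip(p, r))

def itakura_saito(p, r):
    return sum(pi / ri - math.log(pi / ri) - 1.0 for pi, ri in zip(p, r))

p = [0.2, 0.6, 0.4]
rho = [0.3, 0.5, 0.2]
```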

### 2.2. Csiszár's *f*-Divergences.

*f*-divergences are defined for real-valued, convex, continuous functions *f*: [0, ∞) → ℝ with *f*(1) = 0 (without loss of generality). The *f*-divergences *D*_{f} for positive measures *p* and ρ are given by

$$D_f(p\|\rho) = \int \rho(x)\, f\!\left(\frac{p(x)}{\rho(x)}\right) dx, \tag{2.11}$$

with the definitions $0\cdot f\!\left(\frac{0}{0}\right) = 0$ and $0\cdot f\!\left(\frac{a}{0}\right) = a\,\lim_{u\to\infty}\frac{f(u)}{u}$ (Csiszár, 1967; Liese & Vajda, 2006; Taneja & Kumar, 2004). *f* is called the determining function for *D*_{f}(*p*||ρ). It corresponds to a generalized *f*-entropy (Cichocki et al., 2009) of the form

$$H_f(p) = c - \int f(p(x))\,dx$$

via

$$H_f(p) = c - D_f(p\|\mathbf{1}), \tag{2.13}$$

with **1** being the constant function of value 1 and *c* a divergence-depending constant (Cichocki & Amari, 2010).

The *f*-divergence *D*_{f} can be interpreted as an average (with respect to ρ) of the likelihood ratio *p*/ρ, describing the change rate of *p* with respect to ρ, weighted by the determining function *f*.

*D*_{f}(*p*||ρ) is jointly convex in both *p* and ρ. Further, *f* defines an equivalence class in the sense that $D_{\tilde f}(p\|\rho) = D_f(p\|\rho)$ iff $\tilde f(x) = f(x) + c\cdot(x-1)$ for $c \in \mathbb{R}$; that is, *D*_{f}(*p*||ρ) is invariant under a linear shift of the determining function *f*. For *f*-divergences, a certain kind of symmetry can be stated. Let $f^{*}(x) = x\cdot f\!\left(\frac{1}{x}\right)$ for *x* ∈ (0, ∞) be the conjugate function of *f*. Then the relation *D*_{f}(*p*||ρ) = *D*_{f}(ρ||*p*) is valid iff the conjugate differs from the original by a linear shift as above: *f*(*x*) = *f**(*x*) + *c* · (*x* − 1). A symmetric divergence can be obtained for an arbitrary convex function *g* using its conjugate *g** for the definition *f* = *g* + *g** as a determining function. Further, the conjugate is important for an upper bound of the divergence. Let $f(0) = \lim_{x\searrow 0} f(x)$ and $f^{*}(0) = \lim_{x\searrow 0} f^{*}(x)$, and let *p* as well as ρ be densities. Then the *f*-divergence is bounded by

$$0 \leqslant D_f(p\|\rho) \leqslant f(0) + f^{*}(0) \tag{2.14}$$

if the limits exist, as was shown in Liese and Vajda (1987). Yet this statement can be extended to *p* and ρ being positive measures:

**Lemma 1.** *Let p and ρ be positive measures. Then the bounds given in equation 2.14 are still valid.*

The proof is given in appendix B.

*f*-divergences show a monotonicity behavior under stochastic transitions of *p* and ρ, which is similar to the monotonicity of the Fisher metric (Amari & Nagaoka, 2000). Let κ(*y*|*x*) with *y* ∈ *Y*, *Y* being the range of *y*, describe a transition probability density; that is, ∫κ(*y*|*x*) *dy* = 1 holds ∀*x* ∈ *V*. Denoting the positive measures of *y* derived from *p*(*x*) and ρ(*x*) by *p*_{κ}(*y*) and ρ_{κ}(*y*), the monotonicity is expressed by *D*_{f}(*p*||ρ) ⩾ *D*_{f}(*p*_{κ}||ρ_{κ}).^{4}

Further, an isomorphism can be stated for *f*-divergences in the following way. Let

$$y = h(x) \tag{2.15}$$

be an invertible function transforming positive measures *p*_{1}(*x*) and ρ_{1}(*x*) into *p*_{2}(*y*) and ρ_{2}(*y*). Then *D*_{f}(*p*_{1}||ρ_{1}) = *D*_{f}(*p*_{2}||ρ_{2}) holds, and the pairs (*p*_{1}, ρ_{1}) and (*p*_{2}, ρ_{2}) are called isomorph (Liese & Vajda, 1987). Conversely, if a measure *D*(*p*||ρ) = ∫ρ(*x*) · *G*(*p*(*x*), ρ(*x*)) *dx* for an integrable function *G* is invariant under invertible transformations *h*, then *D* is an *f*-divergence (Qiao & Minematsu, 2008). This isomorphism, as well as the monotonicity, recommend *f*-divergences for applications in speech, signal, and pattern recognition (Basseville, 1988; Qiao & Minematsu, 2008).

Finally, Cichocki et al. (2009) suggested a generalization of the *f*-divergences *D*_{f} in which *f* is no longer required to be convex, with *c*_{f} = *f*′(1) ≠ 0, denoted as a generalized *f*-divergence. As a consequence of this relaxation of the convexity condition, in the case of *p* and ρ being probability densities, the additional correction term of the generalized form vanishes, such that the usual form of *f*-divergences is obtained. Thus, as a famous example, the Hellinger divergence (Taneja & Kumar, 2004) is

$$D_H(p\|\rho) = \int \left(\sqrt{p(x)} - \sqrt{\rho(x)}\right)^2 dx, \tag{2.17}$$

with the determining function $f(u) = \left(\sqrt{u} - 1\right)^2$. According to Cichocki et al. (2009), *D*_{H}(*p*||ρ) is a properly defined *f*-divergence only for probability densities *p* and ρ.

An important subset of *f*-divergences is the so-called α-divergences, according to the definition given in Cichocki et al. (2009):

$$D_{\alpha}(p\|\rho) = \frac{1}{\alpha(\alpha-1)} \int \left( p(x)^{\alpha}\,\rho(x)^{1-\alpha} - \alpha\,p(x) + (\alpha-1)\,\rho(x) \right) dx, \tag{2.18}$$

with the generating *f*-function determined accordingly and α ≠ 0, α ≠ 1. In the limit α → 1, the generalized Kullback-Leibler divergence *D*_{GKL} (see equation 2.4) is obtained. Further, Cichocki et al. (2009) state that β-divergences can be generated from α-divergences by applying suitable nonlinear transforms of *p* and ρ.

In addition to the general properties of the *f*-divergences stated here, one can derive a characteristic behavior for the α-divergences directly from equation 2.18, depending on the choice of the parameter α (Minka, 2005). For α ≪ 0, the minimization of *D*_{α}(*p*||ρ) to estimate ρ(*x*) may exclude modes of the target *p*(*x*). Further, for α ⩽ 0, the α-divergence is zero-forcing (i.e., *p*(*x*) = 0 forces ρ(*x*) = 0), while for α ⩾ 1, it is zero-avoiding (i.e., ρ(*x*) > 0 whenever *p*(*x*) > 0). For α → ∞, ρ(*x*) covers *p*(*x*) completely, and the α-divergence is called inclusive in that case.
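To make the role of α concrete, here is a small numerical sketch of the discrete α-divergence (function names are ours); the assertions check nonnegativity and the limit α → 1 toward the generalized Kullback-Leibler divergence.

```python
import math

def alpha_div(p, r, alpha):
    # discrete alpha-divergence, defined for alpha not in {0, 1}
    s = sum(pi ** alpha * ri ** (1.0 - alpha) - alpha * pi + (alpha - 1.0) * ri
            for pi, ri in zip(p, r))
    return s / (alpha * (alpha - 1.0))

def gkl(p, r):
    return sum(pi * math.log(pi / ri) - (pi - ri) for pi, ri in zip(p, r))

p = [0.1, 0.7, 0.2]
rho = [0.25, 0.5, 0.25]
```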

The α-divergence is related to the Tsallis divergence

$$D^{T}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\left(\int p(x)^{\alpha}\,\rho(x)^{1-\alpha}\,dx - 1\right), \tag{2.22}$$

defined for probability densities *p* and ρ, which is based on the Tsallis entropy

$$H^{T}_{\alpha}(p) = \frac{1}{1-\alpha}\left(\int p(x)^{\alpha}\,dx - 1\right). \tag{2.25}$$

Lemma 1 can be used to write the generalized Rényi divergence in terms of the α-divergence:^{5}

$$D^{GR}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\,\ln\!\left(\int \left(p(x)^{\alpha}\,\rho(x)^{1-\alpha} - \alpha\,p(x) + (\alpha-1)\,\rho(x)\right)dx + 1\right). \tag{2.26}$$

For probability densities, *D*^{GR}_{α}(*p*||ρ) reduces to the usual Rényi divergence (Renyi, 1961, 1970):

$$D^{R}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\,\ln \int p(x)^{\alpha}\,\rho(x)^{1-\alpha}\,dx. \tag{2.28}$$

The divergence *D*^{R}_{α}(*p*||ρ) is based on the Rényi entropy

$$H^{R}_{\alpha}(p) = \frac{1}{1-\alpha}\,\ln \int p(x)^{\alpha}\,dx \tag{2.29}$$

via equation 2.13. The Rényi entropy fulfills the additivity property for independent probabilities *p* and *q*:

$$H^{R}_{\alpha}(p\cdot q) = H^{R}_{\alpha}(p) + H^{R}_{\alpha}(q).$$

Further, the entropy *H*^{R}_{α}(*p*) is related to the Tsallis entropy (see equation 2.25) by

$$H^{T}_{\alpha}(p) = \frac{1}{1-\alpha}\left(e^{(1-\alpha)\,H^{R}_{\alpha}(p)} - 1\right),$$

which, however, has in consequence a different (pseudo-)additivity property for α ≠ 1:

$$H^{T}_{\alpha}(p\cdot q) = H^{T}_{\alpha}(p) + H^{T}_{\alpha}(q) + (1-\alpha)\,H^{T}_{\alpha}(p)\,H^{T}_{\alpha}(q).$$

### 2.3. γ-Divergences.

A further class of divergences was recently proposed by Fujisawa and Eguchi (2008).^{6} Called γ-divergences, it is defined for positive measures ρ and *p* as

$$D_{\gamma}(p\|\rho) = \ln\!\left(\frac{\left(\int p(x)^{\gamma+1}\,dx\right)^{\frac{1}{\gamma(\gamma+1)}} \cdot \left(\int \rho(x)^{\gamma+1}\,dx\right)^{\frac{1}{\gamma+1}}}{\left(\int p(x)\,\rho(x)^{\gamma}\,dx\right)^{\frac{1}{\gamma}}}\right).$$

*D*_{γ}(*p*||ρ) is invariant under scalar multiplication with positive constants *c*_{1} and *c*_{2}:

$$D_{\gamma}(c_1\cdot p\,\|\,c_2\cdot\rho) = D_{\gamma}(p\|\rho).$$

The equation *D*_{γ}(*p*||ρ) = 0 holds only if *p* = *c* · ρ (*c* > 0) in the case of positive measures. Yet for probability densities, *c* = 1 is required. In contrast to the *f*-divergences, an isomorphism here can be stated for *h*-transformations (see equation 2.15), which are more strictly assumed to be affine.

A modified Pythagorean relation holds between specific positive measures *p*, ρ, τ. Let *p*_{ε} be a distortion of ρ defined as a convex combination with a positive distortion measure δ:

$$p_{\varepsilon}(x) = (1-\varepsilon)\,\rho(x) + \varepsilon\,\delta(x).$$

Further, a positive measure *g* is denoted as δ-consistent if an associated quantity ν_{g} (see Fujisawa & Eguchi, 2008, for its precise definition) is sufficiently small. If two positive measures ρ and τ are δ-consistent according to a distortion measure δ, then the Pythagorean relation approximately holds for ρ, τ, and the distortion *p*_{ε} of ρ:

$$D_{\gamma}(p_{\varepsilon}\|\tau) = D_{\gamma}(p_{\varepsilon}\|\rho) + D_{\gamma}(\rho\|\tau) + O(\varepsilon\nu),$$

with ν = max{ν_{ρ}, ν_{τ}}. This property implies the robustness of γ-divergences with respect to distortions according to the resulting approximation

$$D_{\gamma}(p_{\varepsilon}\|\tau) \approx D_{\gamma}(p_{\varepsilon}\|\rho) + D_{\gamma}(\rho\|\tau),$$

and *D*_{γ}(*p*_{ε}||ρ) should be small because *p*_{ε} is assumed to be a distortion of ρ (Fujisawa & Eguchi, 2008).

In the limit γ → 0, *D*_{γ}(ρ||*p*) becomes the usual Kullback-Leibler divergence (see equation 2.5) for normalized densities. For γ = 1, the Cauchy-Schwarz divergence

$$D_{CS}(p\|\rho) = \frac{1}{2}\,\ln\!\left(\frac{\int p(x)^2\,dx \cdot \int \rho(x)^2\,dx}{\left(\int p(x)\,\rho(x)\,dx\right)^2}\right)$$

is obtained. *D*_{CS}(*p*||ρ) was introduced by Principe, Xu, and Fisher (2000) considering the Cauchy-Schwarz inequality for norms. It is based on the quadratic Rényi entropy *H*^{R}_{2}(*p*) from equation 2.29 (Jenssen, 2005). Obviously, *D*_{CS}(*p*||ρ) is symmetric. It is frequently applied for Parzen window estimation and is particularly suitable for spectral clustering as well as for related graph cut problems (Jenssen, Principe, Erdogmus, & Eltoft, 2006).
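The stated invariance under scalar multiplication, the symmetry of the Cauchy-Schwarz divergence, and its coincidence with the γ-divergence for γ = 1 are easy to verify numerically with a discrete sketch (function names are ours):

```python
import math

def gamma_div(p, r, gamma):
    # discrete gamma-divergence of Fujisawa & Eguchi for positive measures
    t1 = math.log(sum(pi ** (gamma + 1) for pi in p)) / (gamma * (gamma + 1))
    t2 = math.log(sum(ri ** (gamma + 1) for ri in r)) / (gamma + 1)
    t3 = math.log(sum(pi * ri ** gamma for pi, ri in zip(p, r))) / gamma
    return t1 + t2 - t3

def cs_div(p, r):
    # Cauchy-Schwarz divergence (= gamma-divergence with gamma = 1)
    pp = sum(pi * pi for pi in p)
    rr = sum(ri * ri for ri in r)
    pr = sum(pi * ri for pi, ri in zip(p, r))
    return 0.5 * math.log(pp * rr / (pr * pr))

p = [0.2, 0.5, 0.3]
rho = [0.4, 0.4, 0.2]
```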

## 3. Derivatives of Divergences: A Functional Analytic Approach

In this section, we provide the mathematical formalism of generalized derivatives for functionals of *p* and ρ, known as Fréchet derivatives or functional derivatives. First, we briefly review the theory of functional derivatives. Then we investigate the divergence classes within this framework. In particular, we explain their Fréchet derivatives.

### 3.1. Functional (Fréchet) Derivatives.

Let *L* be a functional mapping from a linear functional Banach space *B* into ℝ. Further, let *B* be equipped with a norm ‖ · ‖, and let *f*, *h* ∈ *B* be two functionals. The Fréchet derivative $\frac{\delta L[f]}{\delta f}$ of *L* at the point *f* is formally defined by

$$\lim_{\|h\|\to 0} \frac{\left| L[f+h] - L[f] - \frac{\delta L[f]}{\delta f}[h] \right|}{\|h\|} = 0,$$

with $\frac{\delta L[f]}{\delta f}[h]$ linear in *h*. The existence and continuity of the limit are equivalent to the existence and continuity of the derivative. (For a detailed introduction, see Kantorowitsch & Akilow, 1978.)

If *L* is linear, then *L*[*f* + ε*h*] − *L*[*f*] = ε*L*[*h*] and, hence, $\frac{\delta L[f]}{\delta f}[h] = L[h]$. Further, an analogon of the chain rule known from differential calculus can be stated: let $F:\mathbb{R}\to\mathbb{R}$ be a continuously differentiable mapping. We consider the functional

$$L[f] = \int F(f(x))\,dx.$$

Then the Fréchet derivative is determined by the derivative *F*′ via

$$\frac{\delta L[f]}{\delta f}[h] = \int F'(f(x))\cdot h(x)\,dx,$$

as can be seen from the Taylor expansion of *F*(*f*(*x*) + ε*h*(*x*)) and use of the linearity of the integral operator.

This property motivates an important remark about divergences, which can be seen as special integral operators:

**Remark 1.** Let *L*_{g} be an integral operator *L*_{g}[*f*] = ∫*F*_{g}(*f*(*x*)) *dx* depending on a fixed functional *g* ∈ *B*. Then the Fréchet derivative is determined by the integral kernel *F*′_{g}(*f*(*x*)) = *Q*(*g*(*x*), *f*(*x*)), being a function in *x*. Therefore, the Fréchet derivative is frequently identified simply with *Q*(*g*(*x*), *f*(*x*)) and written as

$$\frac{\delta L_g[f]}{\delta f} = Q(g(x), f(x)),$$

keeping in mind its original interpretation as an integral kernel defining the integral operator. We will make use of this abbreviation in the following, considering divergences as integral operators *D*(*p*||ρ) = *L*_{p}[ρ] and writing $\frac{\delta D(p\|\rho)}{\delta \rho}$, also denoted here as the Fréchet derivative, for simplicity.

_{p}Finally, we remark that the Fréchet derivative in finite-dimensional spaces reduces to the usual partial derivative. In particular, it is represented in coordinates by the Jacobi matrix. Thus, the Fréchet derivative is a generalization of the directional derivatives.

### 3.2. Fréchet Derivatives for the Different Divergence Classes.

We are now ready to investigate functional derivatives of divergences. In particular we focus on Fréchet derivatives.

#### 3.2.1. Bregman Divergences.

For Bregman divergences, the Fréchet derivative with respect to ρ follows directly from the definition of *D*^{B}_{Φ} and its generating function. For the generalized Kullback-Leibler divergence, equation 2.4, one obtains

$$\frac{\delta D_{GKL}(p\|\rho)}{\delta \rho} = 1 - \frac{p}{\rho},$$

and for the Itakura-Saito divergence, equation 2.7,

$$\frac{\delta D_{IS}(p\|\rho)}{\delta \rho} = \frac{1}{\rho}\left(1 - \frac{p}{\rho}\right).$$

For the η-divergence, equation 2.8, the derivative reads

$$\frac{\delta D_{\eta}(p\|\rho)}{\delta \rho} = \eta(\eta-1)\,\rho^{\eta-2}\,(\rho - p),$$

which for η = 2 reduces to −2(*p* − ρ), the derivative of the squared Euclidean distance commonly used in many vector quantization algorithms, including the online variant of *k*-means, SOMs, NG, and so on.

#### 3.2.2. *f*-Divergences.

For the *f*-divergences, equation 2.11, the Fréchet derivative is

$$\frac{\delta D_f(p\|\rho)}{\delta \rho} = f(u) - u\,f'(u),$$

with the likelihood ratio $u = \frac{p}{\rho}$. As a famous example, we get for the Hellinger divergence, equation 2.17,

$$\frac{\delta D_H(p\|\rho)}{\delta \rho} = 1 - \sqrt{\frac{p}{\rho}}.$$

The subset of α-divergences, equation 2.18, can be handled by

$$\frac{\delta D_{\alpha}(p\|\rho)}{\delta \rho} = \frac{1}{\alpha}\left(1 - \left(\frac{p}{\rho}\right)^{\alpha}\right).$$

The related Tsallis divergence *D*^{T}_{α}, equation 2.22, leads to the derivative

$$\frac{\delta D^{T}_{\alpha}(p\|\rho)}{\delta \rho} = -\left(\frac{p}{\rho}\right)^{\alpha},$$

depending on the parameter α. The generalized Rényi divergences, equation 2.26, are treated according to

$$\frac{\delta D^{GR}_{\alpha}(p\|\rho)}{\delta \rho} = \frac{1 - \left(\frac{p}{\rho}\right)^{\alpha}}{\int \left(p^{\alpha}\rho^{1-\alpha} - \alpha p + (\alpha-1)\rho\right)dx + 1},$$

which reduces to

$$\frac{\delta D^{R}_{\alpha}(p\|\rho)}{\delta \rho} = -\frac{\left(\frac{p}{\rho}\right)^{\alpha}}{\int p^{\alpha}\rho^{1-\alpha}\,dx}$$

in the case of the usual Rényi divergences, equation 2.28.

#### 3.2.3. γ-Divergences.

For the γ-divergences, the Fréchet derivative can be computed directly from the logarithmic form of *D*_{γ}:

$$\frac{\delta D_{\gamma}(p\|\rho)}{\delta \rho} = \frac{\rho^{\gamma}}{\int \rho^{\gamma+1}\,dx} - \frac{p\,\rho^{\gamma-1}}{\int p\,\rho^{\gamma}\,dx}.$$

For γ = 1, this yields the Fréchet derivative of the Cauchy-Schwarz divergence:

$$\frac{\delta D_{CS}(p\|\rho)}{\delta \rho} = \frac{\rho}{\int \rho^{2}\,dx} - \frac{p}{\int p\,\rho\,dx}.$$

## 4. Divergence-Based Online Vector Quantization Using Derivatives

Supervised and unsupervised vector quantization are frequently described in terms of dissimilarities or distances. Suppose data are given as data vectors **v** ∈ ℝ^{n}.

Here we focus on prototype-based vector quantization: data processing (clustering or classification) is realized using prototypes as representatives, whereby the dissimilarity between data points, as well as between data and prototypes, is determined by dissimilarity measures ξ (not necessarily fulfilling triangle inequality or symmetry restrictions).

Frequently, such algorithms optimize, at least approximately, a cost function *E* depending on the dissimilarities between the data points and the prototypes; usually one has *E* = *E*(ξ(**v**_{i}**, w**_{k})), with *i* = 1, …, *N* indexing the data and *k* = 1, …, *C* the prototypes. This cost function can be a variant of the usual classification error in supervised learning or a modified mean squared error of the dissimilarities ξ(**v**_{i}**, w**_{k}).

If *E* = *E*(ξ(**v**_{i}**, w**_{k})) is differentiable with respect to ξ, and ξ is differentiable with respect to the prototype **w**, then stochastic gradient descent is a widely used optimization scheme for *E*. This methodology implies the calculation of the dissimilarity derivatives ∂ξ(**v**, **w**)/∂**w**, which now have to be considered in light of the above functional-analytic investigations for divergence measures (i.e., we replace the dissimilarity measure ξ by divergences).

Therefore, we now assume that the data vectors are discrete representations of continuous positive measures *p*(*x*); that is, *v*_{i} = *p*(*x*_{i}), *i* = 1, …, *N*, as required for divergences. Such data may be spectra or other frequency data occurring in many kinds of applications like remote-sensing data analysis, mass spectrometry, or signal processing. Thereby, the restriction *v*_{i} ∈ [0, 1] for positive measures can be fulfilled simply by dividing all data vectors by the maximum vector entry, taken over all vectors and vector components of the data set. In the case of probability densities, a subsequent normalization to ensure ‖**v**‖_{1} = 1 is required.
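The preprocessing just described can be sketched as follows (function names are ours):

```python
def to_positive_measures(data):
    # scale by the global maximum entry so that all components lie in [0, 1]
    m = max(x for v in data for x in v)
    return [[x / m for x in v] for v in data]

def to_probability_densities(data):
    # additionally normalize each vector to unit L1 norm
    return [[x / sum(v) for x in v] for v in to_positive_measures(data)]

spectra = [[2.0, 6.0, 4.0], [1.0, 3.0, 2.0]]
```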

Further, we also identify the prototypes as discrete realizations of positive measures ρ(*x*). Then the derivative ∂ξ(**v**, **w**)/∂**w** has to be replaced by the (abbreviated) Fréchet derivative δ*D*(*p*||ρ)/δρ in the continuous case (see remark 1), which reduces to usual partial derivatives in the discrete case. This is formally achieved by replacing *p* and ρ by their vectorial counterparts **v** and **w** in the formulas of the divergences provided in section 3.2 and further translating integrals into sums.

In the following, we give prominent examples of unsupervised and supervised vector quantization, which can be optimized by gradient methods using the framework already introduced.

### 4.1. Unsupervised Vector Quantization.

#### 4.1.1. Basic Vector Quantization.

The aim of basic vector quantization is the distribution of prototypes *W* = {**w**_{k}}_{k∈Z} in the data space such that data points are faithfully represented in terms of a dissimilarity measure ξ. Thereby, *C* = *card*(*Z*) is the cardinality of the index set *Z*. More formally, the data point **v** is represented by the prototype **w**_{s(**v**)} minimizing the dissimilarity ξ(**v, w**_{k}):

$$s(\mathbf{v}) = \operatorname{argmin}_{k\in Z}\; \xi(\mathbf{v},\mathbf{w}_k). \tag{4.1}$$

The aim of the algorithm is to distribute the prototypes in such a way that the quantization error

$$E_{VQ} = \frac{1}{2}\int p(\mathbf{v})\,\xi\!\left(\mathbf{v},\mathbf{w}_{s(\mathbf{v})}\right)d\mathbf{v}$$

is minimized. In its simplest form, basic vector quantization (VQ) leads to a (stochastic) gradient descent on *E*_{VQ} with the update

$$\Delta \mathbf{w}_{s(\mathbf{v})} = -\varepsilon\,\frac{\partial \xi(\mathbf{v},\mathbf{w}_{s(\mathbf{v})})}{\partial \mathbf{w}_{s(\mathbf{v})}} \tag{4.3}$$

for the winning prototype **w**_{s(**v**)} according to equation 4.1, also known as the online variant of the LBG algorithm (*C*-means; Linde, Buzo, & Gray, 1980; Zador, 1982). Here, ε is a small, positive value called the learning rate. As we see, update 4.3 takes into account the derivative of the dissimilarity measure ξ with respect to the prototype. Beside the common choice of ξ being the squared Euclidean distance, the choice is given to the user, with the restriction of differentiability. Hence, here we are allowed to apply divergences, using derivatives in the sense of Fréchet derivatives.
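A minimal online sketch of update 4.3 with the generalized Kullback-Leibler divergence as dissimilarity (1 − *v*/*w* is its discrete Fréchet derivative; the function names and the clipping safeguard that keeps prototypes positive are our own additions):

```python
import math, random

def gkl(v, w):
    return sum(vi * math.log(vi / wi) - (vi - wi) for vi, wi in zip(v, w))

def dgkl_dw(v, w):
    # discrete Frechet derivative of D_GKL(v||w) with respect to w: 1 - v/w
    return [1.0 - vi / wi for vi, wi in zip(v, w)]

def online_vq(data, n_protos, n_steps=2000, eps=0.05, seed=0):
    rng = random.Random(seed)
    protos = [list(data[k % len(data)]) for k in range(n_protos)]  # simple init
    for _ in range(n_steps):
        v = rng.choice(data)
        s = min(range(n_protos), key=lambda k: gkl(v, protos[k]))  # winner, eq. 4.1
        g = dgkl_dw(v, protos[s])
        # gradient step (update 4.3); clip to keep the prototype positive
        protos[s] = [max(wi - eps * gi, 1e-12) for wi, gi in zip(protos[s], g)]
    return protos
```

Swapping in another divergence only requires replacing `gkl` and `dgkl_dw` by the respective pair of divergence and Fréchet derivative.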

#### 4.1.2. Self-Organizing Maps and Neural Gas.

A topographic structure can be imposed on the index set *Z*; denoting this structure as *A*, it is usually a regular grid. The resulting vector quantization scheme is the self-organizing map (SOM) introduced by Kohonen (1997). The respective cost function (in the variant of Heskes, 1999) is

$$E_{SOM} = \frac{1}{2K(\sigma)}\int p(\mathbf{v}) \sum_{\mathbf{r}} \delta_{\mathbf{r}}^{s(\mathbf{v})} \sum_{\mathbf{r}'} h_{\sigma}(\mathbf{r},\mathbf{r}')\,\xi(\mathbf{v},\mathbf{w}_{\mathbf{r}'})\,d\mathbf{v}$$

with the so-called neighborhood function

$$h_{\sigma}(\mathbf{r},\mathbf{r}') = \exp\!\left(-\frac{\|\mathbf{r}-\mathbf{r}'\|_A^2}{2\sigma^2}\right),$$

where ‖**r − r**′‖_{A} is the distance in *A* according to the topological structure and *K*(σ) is a normalization constant depending on the neighborhood range σ. For this SOM, the mapping rule, equation 4.1, is modified to

$$s(\mathbf{v}) = \operatorname{argmin}_{\mathbf{r}} \sum_{\mathbf{r}'} h_{\sigma}(\mathbf{r},\mathbf{r}')\,\xi(\mathbf{v},\mathbf{w}_{\mathbf{r}'}),$$

which yields in the limit σ → 0 the original mapping (see equation 4.1). The prototype update for all prototypes then is given as (Heskes, 1999)

$$\Delta \mathbf{w}_{\mathbf{r}} = -\varepsilon\, h_{\sigma}(s(\mathbf{v}),\mathbf{r})\, \frac{\partial \xi(\mathbf{v},\mathbf{w}_{\mathbf{r}})}{\partial \mathbf{w}_{\mathbf{r}}}.$$

The same reasoning applies to neural gas (NG), where the grid-based neighborhood is replaced by a rank-based neighborhood defined in the data space; again, only the derivative of ξ enters the prototype update. As above, the utilization of a divergence-based update is straightforward for SOM and NG as well.
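The SOM update can be sketched as a single online step in which every prototype is pulled toward the data point with a neighborhood weight; the dissimilarity and its derivative are passed in as functions, so a discrete divergence and its Fréchet derivative can be plugged in directly (all names are our own; a 1-d chain serves as the grid *A*, and the winner is determined by the simple mapping rule, equation 4.1, rather than the neighborhood-averaged variant).

```python
import math

def sq_euclid(v, w):
    return sum((vi - wi) ** 2 for vi, wi in zip(v, w))

def sq_euclid_grad(v, w):
    return [-2.0 * (vi - wi) for vi, wi in zip(v, w)]

def som_step(protos, grid, v, sigma, eps, dist, dist_grad):
    # winner determination (simple mapping rule)
    s = min(range(len(protos)), key=lambda r: dist(v, protos[r]))
    for r in range(len(protos)):
        d = abs(grid[s] - grid[r])                    # grid distance in A
        h = math.exp(-d * d / (2.0 * sigma * sigma))  # neighborhood h_sigma
        g = dist_grad(v, protos[r])
        protos[r] = [wi - eps * h * gi for wi, gi in zip(protos[r], g)]
    return protos
```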

#### 4.1.3. Further Vector Quantization Approaches.

There exists a long list of other vector quantization approaches, like kernelized SOMs (Hulle, 2000, 2002a, 2002b), generative topographic mapping (GTM; Bishop, Svensén, & Williams, 1998), and soft topographic mapping (Graepel, Burger, & Obermayer, 1998), to name just a few. Most of them use the Euclidean metric and the respective derivatives for adaptation. Thus, the idea of divergence-based processing can be transferred to these in a similar manner.

One recent example is the exploratory observation machine (XOM) for structure-preserving data embedding (Bunte, Hammer, Villmann, Biehl, & Wismüller, 2010). Here, the data points **v**_{k}, *k* = 1, …, *N*, are uniquely associated with prototypes *W* = {**w**_{k}}^{N}_{k=1} in the embedding space. The dissimilarity in the embedding space usually is chosen to be the quadratic Euclidean metric. Further, a hypothesis about the topological structure of the data **v**_{k} to be embedded is formulated for the embedding space by defining a probability distribution for so-called sampling vectors. A cost function of XOM, together with the respective mapping rule, can be defined in analogy to the SOM, as pointed out in Bunte, Hammer, Villmann, Biehl, and Wismüller (2010). As in usual SOMs, the neighborhood cooperativeness is given in XOM by a gaussian, with the data dissimilarity ξ_{V}(**v**_{k}**, v**_{j}) defined as the Euclidean distance in the original XOM. The update of the prototypes in the embedding space (equation 4.13) is obtained in complete analogy to SOM as a stochastic gradient step. As one can see, we can apply divergences to both the data dissimilarity ξ_{V} and the dissimilarity in the embedding space. In the case of the latter, the prototype update, equation 4.13, has to be changed accordingly using the respective Fréchet derivatives.

### 4.2. Learning Vector Quantization.

Learning vector quantization (LVQ) is the supervised counterpart of basic VQ. Now the data to be learned are equipped with class information **c**_{v}. Suppose we have *K* classes; we define **c**_{v} ∈ [0, 1]^{K}. If ∑^{K}_{k=1}(**c**_{v})_{k} = 1, the labeling is probabilistic, and possibilistic otherwise. In the case of a probabilistic labeling with **c**_{v} ∈ {0, 1}^{K}, the labeling is called crisp.

We now briefly explore how divergences can be used for supervised learning. Again we start with the widely applied basic LVQ approaches and then outline the procedure for some more sophisticated methods without any claim of completeness.

#### 4.2.1. Basic LVQ Algorithms.

The basic LVQ schemes were invented by Kohonen (1997). For standard LVQ, a crisp data labeling is assumed. Further, the prototypes **w**_{j} with labels *y*_{j} correspond to the *K* classes in such a way that at least one prototype is assigned to each class. For simplicity, we take exactly one prototype for each class. The task is to distribute the prototypes in such a manner that the classification error is reduced. The respective algorithms LVQ1 to LVQ3 are heuristically motivated.

The dissimilarity between data and prototypes is usually judged by the quadratic Euclidean distance ξ(**v, w**_{j}). Beside some small modifications, the basic LVQ schemes LVQ1 to LVQ3 mainly consist of the determination of the most proximate prototype(s) **w**_{s(**v**)} for given **v** according to the mapping rule, equation 4.1, and subsequent adaptation. Depending on the agreement of **c**_{v} and *y*_{s(**v**)}, the adaptation of the prototype(s) takes place according to

$$\Delta \mathbf{w}_{s(\mathbf{v})} = \alpha\,\varepsilon\,\left(\mathbf{v} - \mathbf{w}_{s(\mathbf{v})}\right),$$

with α = 1 iff **c**_{v} = *y*_{s(**v**)}, and α = −1 otherwise.

A popular generalization of these standard algorithms is the generalized LVQ (GLVQ) introduced by Sato and Yamada (1996). In GLVQ the classification error is replaced by a dissimilarity-based cost function that is closely related to the classification error but not identical to it.

In GLVQ, for a given data point **v** with class label **c**_{v}, the two best matching prototypes with respect to the data metric ξ, usually the quadratic Euclidean, are determined: **w**^{+} has minimum distance ξ^{+} = ξ(**v, w**^{+}) under the constraint that the class labels are identical, whereas the other best prototype, **w**^{−}, has minimum distance ξ^{−} = ξ(**v, w**^{−}) supposing the class labels are different. Then the classifier function μ(**v**) is defined as

$$\mu(\mathbf{v}) = \frac{\xi^{+} - \xi^{-}}{\xi^{+} + \xi^{-}},$$

being negative in the case of a correct classification. The value ξ^{+} − ξ^{−} yields the hypothesis margin of the classifier (Crammer, Gilad-Bachrach, Navot, & Tishby, 2002). Then the generalized LVQ (GLVQ) is derived as a gradient descent on the cost function

$$E_{GLVQ} = \sum_{\mathbf{v}} \mu(\mathbf{v})$$

with respect to the prototypes. In each learning step, for a given data point, both **w**^{+} and **w**^{−} are adapted in parallel. Taking the derivatives of *E*_{GLVQ} with respect to **w**^{+} and **w**^{−}, we get for the updates

$$\Delta \mathbf{w}^{+} = -\epsilon^{+}\,\theta^{+}\,\frac{\partial \xi^{+}}{\partial \mathbf{w}^{+}} \quad\text{and}\quad \Delta \mathbf{w}^{-} = \epsilon^{-}\,\theta^{-}\,\frac{\partial \xi^{-}}{\partial \mathbf{w}^{-}},$$

with the scaling factors

$$\theta^{+} = \frac{2\,\xi^{-}}{(\xi^{+}+\xi^{-})^2} \quad\text{and}\quad \theta^{-} = \frac{2\,\xi^{+}}{(\xi^{+}+\xi^{-})^2}.$$

The values ϵ^{+} and ϵ^{−} ∈ (0, 1) are the learning rates.

Obviously the distance measure ξ could be replaced for all of these LVQ schemes by one of the introduced divergences. This offers a new possibility for information-theoretic learning in classification schemes, which differs from the previous approaches significantly. These earlier approaches stress the information-optimum class representation, whereas here, the expected information loss in terms of the applied divergence measure is optimized (Torkkola & Campbell, 2000; Torkkola, 2003; Villmann, Hammer, et al., 2008).
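A single GLVQ learning step with a pluggable dissimilarity can be sketched as follows (function names are ours); passing a discrete divergence and its Fréchet derivative as `dist` and `dist_grad` yields the divergence-based variant described above.

```python
def glvq_step(v, label, protos, proto_labels, dist, dist_grad, eps_p=0.05, eps_m=0.05):
    correct = [k for k, y in enumerate(proto_labels) if y == label]
    wrong = [k for k, y in enumerate(proto_labels) if y != label]
    kp = min(correct, key=lambda k: dist(v, protos[k]))   # best matching, same class
    km = min(wrong, key=lambda k: dist(v, protos[k]))     # best matching, other class
    dp, dm = dist(v, protos[kp]), dist(v, protos[km])
    denom = (dp + dm) ** 2
    theta_p, theta_m = 2.0 * dm / denom, 2.0 * dp / denom  # scaling factors
    gp, gm = dist_grad(v, protos[kp]), dist_grad(v, protos[km])
    protos[kp] = [w - eps_p * theta_p * g for w, g in zip(protos[kp], gp)]  # attract
    protos[km] = [w + eps_m * theta_m * g for w, g in zip(protos[km], gm)]  # repel
    return protos
```

One step decreases the classifier function μ(**v**) for the presented data point, since both prototypes move along the respective descent directions.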

#### 4.2.2. Advanced Learning Vector Quantization.

Apart from the basic LVQ schemes, many more sophisticated prototype-based learning schemes are proposed for classification learning. Here we will restrict ourselves to approaches that can deal with probabilistic or possibilistic labeled training data (uncertain decisions) that are, in addition, related to the basic unsupervised and supervised vector quantization algorithms mentioned in this letter so far.

Examples are the fuzzy-labeled SOM (FLSOM) and fuzzy-labeled NG (FLNG) (Villmann, Schleif, et al., 2008; Villmann et al., 2006), which extend the respective unsupervised cost function by an additional term,

$$E_{FLSOM/FLNG} = (1-\beta)\,E_{SOM/NG} + \beta\,E_{FL},$$

where *E*_{FL} measures the classification accuracy. The factor β ∈ [0, 1) balances unsupervised and supervised learning. The classification accuracy term *E*_{FL} is defined as

$$E_{FL} = \frac{1}{2}\int p(\mathbf{v}) \sum_{\mathbf{r}} g_{\gamma}(\mathbf{v},\mathbf{w}_{\mathbf{r}})\,\psi(\mathbf{c}_{\mathbf{v}}, y_{\mathbf{r}})\,d\mathbf{v},$$

where *g*_{γ}(**v, w**_{r}) is a gaussian kernel describing a neighborhood range in the data space using the dissimilarity ξ(**v, w**_{r}). ψ(**c**_{v}**,** *y*_{r}) judges the dissimilarities between the label vectors of data and prototypes; it is originally suggested to be the quadratic Euclidean distance.

*E*_{FL} depends on the dissimilarity ξ(**v, w**_{r}) in the data space via *g*_{γ}(**v, w**_{r}). Hence, prototype adaptation in FLSOM/FLNG is influenced by the classification accuracy. The label adaptation is influenced only by the second part, *E*_{FL}; its derivative yields the label update with learning rate ϵ_{l} > 0 (Villmann, Schleif, et al., 2008; Villmann et al., 2006). This label learning leads to a weighted average *y*_{r} of the fuzzy labels **c**_{v} of those data **v** that are close to the associated prototype according to ξ(**v, w**_{r}).

It should be noted at this point that a similar approach can easily be installed for XOM, yielding FLXOM.

Clearly, besides the possibility of choosing a divergence measure for ξ(**v, w**_{r}) as in the unsupervised case, nothing prevents doing the same for the label dissimilarity ψ(**c**_{v}**,***y*_{r}) in these FL methods. As before, simply plugging in the respective discrete divergence variants and their Fréchet derivatives modifies the algorithms such that semisupervised learning can proceed by relying on divergences for both dissimilarities.
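For illustration, a discrete divergence for the label part ψ might look as follows. This is a hypothetical helper, not taken from the letter; fuzzy label vectors are clipped and renormalized before comparison to guard against zero entries.

```python
import numpy as np

def label_kl(c, y, eps=1e-12):
    # discrete Kullback-Leibler divergence between fuzzy label vectors
    c = np.clip(c, eps, None)
    c = c / c.sum()
    y = np.clip(y, eps, None)
    y = y / y.sum()
    return float(np.sum(c * np.log(c / y)))
```

The same plug-in principle applies: the gradient of this expression with respect to *y* replaces the Euclidean label gradient in the FL update.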

## 5. SOM Simulations for Various Divergences

In this section, we demonstrate the influence of the chosen divergence and the dependence on divergence parameters for prototype-based unsupervised vector quantization. For this purpose, we consider an artificial but illustrative data set. In the case of parameterized divergences, we vary the parameter settings to show how the resulting prototype distributions depend on them. Further, we investigate the behavior of different divergence types, always comparing the results with Euclidean distance-based learning as the standard to show the differences.

These investigations for the toy problem should lead readers to think about the choice of divergences for a specific application as well as optimum parameter settings. The demonstration itself is far from a realistic scenario, which would also have to deal with such matters as high-dimensional problems and heterogeneous data distributions.

As an example vector quantization model, we consider the Heskes-SOM according to equation 4.4 using a chain lattice with 100 units **r** and their prototypes. The example data distribution consists of 10^{7} data points **v =** (*v*_{1}, *v*_{2}) ∈ [0, 1]^{2}, which are constrained such that *v*_{1} + *v*_{2} = 1 (i.e., the data **v** can be taken as probability densities). Further, generating the data set, the first component *v*_{1} is chosen randomly according to the data density *P*_{1}(*v*_{1}) = 2 · *v*_{1}, whereas *v*_{2} is subsequently calculated according to the constraint.
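The construction of this data set can be sketched by inverse transform sampling: for the density *P*_{1}(*v*_{1}) = 2 · *v*_{1} on [0, 1], one draws *v*_{1} = √u with uniform u. The sketch below uses a much smaller sample size than the 10^{7} points of the letter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                 # reduced sample size for illustration
u = rng.random(n)
v1 = np.sqrt(u)            # inverse transform sampling for P(v1) = 2*v1
v2 = 1.0 - v1              # constraint v1 + v2 = 1
V = np.column_stack([v1, v2])
```

Each row of `V` then sums to 1 and can be read as a discrete probability density on two bins.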

The initial values for the learning rate ε and the neighborhood range σ were appropriately chosen. During SOM learning, both converged to the final values ε_{final} = 10^{−6} and σ_{final} = 1, respectively.

We trained SOM networks for the divergences as introduced in section 2 using the Fréchet derivatives deduced in section 3.1 with different parameter values.
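A minimal online step of such a divergence-based SOM might look as follows. This sketch is not the letter's implementation: it uses a simplified best-match winner rule instead of Heskes's local-error winner, and the generalized Kullback-Leibler divergence serves as the example dissimilarity.

```python
import numpy as np

def gkl(v, w):
    # generalized Kullback-Leibler divergence for positive measures
    return np.sum(v * np.log(v / w) - v + w)

def som_step(v, W, pos, eps=0.05, sigma=1.0):
    # winner determination (simplified: plain best match by divergence)
    s = int(np.argmin([gkl(v, w) for w in W]))
    # gaussian neighborhood function on the chain lattice
    h = np.exp(-((pos - pos[s]) ** 2) / (2.0 * sigma**2))
    # partial derivative of the divergence with respect to each prototype: 1 - v/w
    W -= eps * h[:, None] * (1.0 - v[None, :] / W)
    return W
```

Replacing `gkl` and its gradient by another divergence pair changes only these two lines, which is exactly the plug-in property exploited in the letter.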

For the η-divergence (belonging to the Bregman divergences) the results are depicted in Figure 2. One can observe that the influence of the parameter η is only marginal. Yet small variations can be detected. For the special choice η = 2, Euclidean learning is realized.

For the β-divergence, the influence of the parameter value β is stronger than the parameter effect for η-divergences (see Figure 3). In particular, significant deviations can be observed for higher prototype *w*_{1}-values, giving a hint of a better discrimination property for this probability range. Lower prototype *w*_{1}-values were captured by the β-divergences markedly better than by the Euclidean learning.

The α-divergence-based learning shows the inclusive and exclusive properties mentioned above. For a positive choice of the control parameter α, the range of prototype *w*_{1}-values captured is considerably larger than that covered using negative α-values. However, only small variations can be detected within the two α-domains (positive and negative); that is, the divergence is relatively robust with respect to the control parameter α (see Figure 4).

For the Tsallis divergence, the influence of the control parameter α is already detectable in the central range of prototype *w*_{1}-values and significant in the upper range (see Figure 5). Especially in comparison to Euclidean learning, this gives a hint of a quite good discrimination property for a wide probability range.

In contrast to the β-divergence, the influence of the control parameter α of the Rényi divergence is primarily detected in the region with sparse data density (see Figure 6). However, the Rényi divergence-based learning covers a wider range of prototype *w*_{1}-values than the Euclidean learning.

The γ-divergence shows the most sensitive behavior of all parameterized divergences investigated here (see Figure 7). In particular, the choice of the control parameter γ influences both ranges of probability—the low and the high one—with approximately the same sensitivity (see Figure 7).

Thus, it differs from the sensitivity observed for β-divergences. This behavior offers the possibility of tuning the divergence precisely depending on the specific vector quantization task. Together with the stated robustness of the γ-divergence (Fujisawa & Eguchi, 2008), this adaptive specificity could provide high potential for a wide range of applications. This is underscored by the applications in supervised and unsupervised vector quantization based on the Cauchy-Schwarz divergence (γ = 1) (Jenssen et al., 2006; Mwebaze et al., 2010; Principe et al., 2000; Villmann, Haase, Schleif, & Hammer, 2010).

Figure 8 shows the results of the prototype-based unsupervised vector quantization using various nonparameterized divergences.

These simulations should be seen, on one hand, as a proof of concept. On the other hand, one can clearly see quite different behavior for the various divergences, resulting in distinct prototype distributions and, in consequence, diverse vector quantization properties. Therefore, the choice of a divergence for a specific application should be made carefully, taking the special properties of the divergences into account.

## 6. Extensions for the Basic Adaptation Scheme: Hyperparameter and Relevance Learning

### 6.1. Hyperparameter Learning for γ-, α-, β-, and η-Divergences.

#### 6.1.1. Theoretical Considerations.

Considering the parameterized divergence families of γ-, α-, β-, and η-divergences, one could further think about the optimal choice of the so-called hyperparameters γ, α, β, η as suggested in a similar manner for other parameterized LVQ algorithms (Schneider, Biehl, & Hammer, 2009). In case of supervised learning schemes for classification based on differentiable cost functions, the optimization can be handled as an object of a gradient descent–based adaptation procedure. Thus, the parameter is optimized for the classification task at hand.

Suppose the cost function *E* is based on a parameterized divergence ξ_{θ} with hyperparameter θ. If *E* and ξ_{θ} are both differentiable with respect to θ, a gradient-based optimization Δθ ∼ −∂*E*/∂θ is obtained, depending on the derivative ∂ξ_{θ}/∂θ for the chosen divergence ξ_{θ}.

We assume in the following that the (positive) measures *p* and ρ are continuously differentiable. Then, considering derivatives of parameterized divergences with respect to the parameter θ, it is allowed to interchange integration and differentiation if the resulting integral exists (Fichtenholz, 1964). Hence, we can differentiate parameterized divergences with respect to their hyperparameter in that case. For the several α-, β-, γ -, and η-divergences, characterized in section 2, we obtain after some elementary calculations:

The explicit derivative formulas for these divergence families are collected in appendix A.
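As a sanity check for such hyperparameter derivatives, the analytic expression can be compared against a finite-difference approximation. The sketch below (illustrative code, not part of the letter) does this for the η-divergence, whose special case η = 2 recovers the squared Euclidean distance.

```python
import numpy as np

def eta_div(p, r, eta):
    # eta-divergence (Bregman family with generator p**eta);
    # eta = 2 gives the squared Euclidean distance
    return np.sum(p**eta + (eta - 1) * r**eta - eta * p * r**(eta - 1))

def eta_div_deta(p, r, eta):
    # analytic derivative with respect to the hyperparameter eta,
    # obtained by term-by-term differentiation
    return np.sum(p**eta * np.log(p) + r**eta + (eta - 1) * r**eta * np.log(r)
                  - p * r**(eta - 1) - eta * p * r**(eta - 1) * np.log(r))
```

In a hyperparameter-controlled run, `eta_div_deta` would be combined with the chain rule through the cost function to update η by gradient descent.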

#### 6.1.2. Example: Hyperparameter Learning for γ-Divergences in GLVQ.

We exemplify hyperparameter adaptation for the γ-divergence in GLVQ, using the scaling factors θ^{+} and θ^{−} taken from equation 4.19.

For this purpose, we investigate a simple classification example: the well-known three-class IRIS data set. We rescaled the data vectors such that the requirements of positive measures are satisfied. We used two prototypes for each class and 10-fold cross-validation. We initialized the γ-parameter as γ_{0} = 0.5 to be in the middle between the Kullback-Leibler and Cauchy-Schwarz divergences according to Mwebaze et al. (2010).

Without a γ-parameter update, fixing γ = 0, a classification accuracy of 78.34% is obtained with standard deviation σ = 6.17, the best result being 91.3%. For fixed γ = 1, the average is 95.16% with σ = 1.87, the best run yielding 97.3%. The hyperparameter-controlled simulations give only a slight improvement, achieving an average performance of 95.89%, but with decreased deviation σ = 0.43. The γ-parameter converged to γ_{final} = 0.9016 with standard deviation σ_{γ} < 10^{−4}. As expected from the noncontrolled experiments, the final γ-value is in the proximity of the Cauchy-Schwarz divergence, though slightly but consistently decreased. A typical learning progress of γ is depicted in Figure 9. As for the Cauchy-Schwarz divergence (γ = 1), the best performance in the controlled case was 97.3%.

Summarizing, this small experiment shows that hyperparameter optimization works well and may lead to better performance and stability.

### 6.2. Relevance Learning for Divergences.

Density functions are required to fulfill the normalization condition, whereas positive measures are more flexible. This offers the possibility of transferring the idea of relevance learning to divergence-based learning vector quantization. Relevance learning in learning vector quantization weights the input data dimensions such that classification accuracy is improved (Hammer & Villmann, 2002).

The idea is to weight the positive measure *q*(*x*) by λ(*x*) with 0 ⩽ λ(*x*) < ∞ and the regularization condition ∫λ(*x*) *dx* = 1. Incorporating this idea into the above approaches, we have to replace in the divergences *p* by *p* · λ and ρ by ρ · λ. Doing so, we can optimize λ(*x*) during learning for better performance by gradient descent optimization, as is known from vectorial relevance learning. This leads, again, to Fréchet derivatives of the divergences but now with respect to the weighting function λ(*x*). The respective framework based on GLVQ for vectorial data is given by the generalized relevance learning vector quantization scheme (GRLVQ; Hammer & Villmann, 2002). In complete analogy, we obtain the functional relevance update, with *s*^{+}(*p*) and *s*^{−}(*p*) playing the same role as in GLVQ. For vectorial representations **v** and **w** of *p* and ρ, respectively, this reduces to the ordinary partial derivatives.

For the *f*-divergences, equation 2.11, the relevance gradient follows from the corresponding Fréchet derivative with respect to λ. The relevance learning rule for the subclass of α-divergences, equation 2.18, follows as a special case; the respective gradient of the generalized Rényi divergences, equation 2.26, can be derived from this; and the subset of Tsallis divergences is treated analogously.
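A discrete sketch of such a relevance update might look as follows. The helper names are hypothetical and not from the letter; the generalized Kullback-Leibler divergence of the weighted measures λ·v and λ·w is assumed, and λ is renormalized after each step to respect the regularization condition.

```python
import numpy as np

def weighted_gkl(v, w, lam):
    # generalized KL divergence of the relevance-weighted measures lam*v and lam*w
    p, r = lam * v, lam * w
    return float(np.sum(p * np.log(p / r) - p + r))

def relevance_step(v, wp, wm, lam, theta_p, theta_m, eps=0.01):
    # per-dimension gradient of the weighted divergence with respect to lambda:
    # d/d lam_i [lam_i*(v_i*log(v_i/w_i) - v_i + w_i)] = v_i*log(v_i/w_i) - v_i + w_i
    gp = v * np.log(v / wp) - v + wp
    gm = v * np.log(v / wm) - v + wm
    # descend for the correct-prototype term, ascend for the incorrect one
    lam = lam - eps * (theta_p * gp - theta_m * gm)
    lam = np.maximum(lam, 1e-12)   # keep the weighting nonnegative
    return lam / lam.sum()         # renormalize: sum(lambda) = 1
```

The clip-and-renormalize step is one simple way to enforce the constraints on λ; a multiplicative or exponentiated-gradient update would be an alternative design choice.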

## 7. Conclusion

Divergence-based supervised and unsupervised vector quantization has been done so far by applying only a few divergences, primarily the Kullback-Leibler divergence. Recent applications also refer to the Itakura-Saito divergence, the Cauchy divergence, and the γ-divergence. These approaches are not online adaptation schemes involving gradient learning but are based on batch mode, requiring all the data at one time. However, in many cases, online learning is mandatory, for several reasons: the huge amount of data, a subsequently increasing data set, or the need for very careful learning in complex problems, for example (Alex, Hasenfuss, & Hammer, 2009). In these cases, online learning is required or may at least be advantageous.

In this letter we give a mathematical foundation for gradient-based vector quantization bearing on the derivatives of the applied divergences. We provide a general framework for the use of arbitrary divergences and their derivatives such that they can immediately be plugged into existing gradient-based vector quantization schemes.

For this purpose, we first characterized the main subclasses of divergences—Bregman-, α-, β-, γ-, and *f* -divergences—following Cichocki et al. (2009). We then used the mathematical methodology of Fréchet derivatives to calculate the functional divergence derivatives.

We show how to use this methodology with prominent examples of supervised and unsupervised vector quantization, including SOM, NG, and GLVQ. In particular, we explained that the divergences can be taken as suitable dissimilarity measures for data, which leads to the use of the respective Fréchet derivatives in the online learning schemes. Further, we described how a parameter adaptation can be integrated into supervised learning to achieve improved classification results in the case of the parameterized α-, β-, γ-, and η-divergences. In the last step, we considered a weighting function for generalized divergences based on positive measures. The optimization scheme for this weight function is again obtained by Fréchet derivatives, yielding a relevance learning scheme in analogy to relevance learning in the usual supervised learning vector quantization (Hammer & Villmann, 2002).

As a proof of concept, the simulations for an illustrative example with the several parametric and nonparametric divergences give promising results regarding their sensitivity. The differences from Euclidean learning are obvious. Moreover, the dependencies in the case of parameterized divergences give hints for possible real-world applications, which should be the next step of this work.

## Appendix A: Calculation of the Derivatives of the Parameterized Divergences with Respect to the Hyperparameters

We assume for the differentiation of the divergences with respect to their hyperparameters that the (positive) measures *p* and ρ are continuously differentiable. Then, considering derivatives of divergences, integration and differentiation can be interchanged, if the resulting integral exists (Fichtenholz, 1964).

### A.1. -Divergence.

### A.2. -Divergences.

### A.3. Rényi Divergences.

For the generalized Rényi divergence *D*^{GR}_{α}(*p*||ρ) from equation 2.26, the differentiation with respect to α is carried out term by term and then summarized. The Rényi divergence *D*^{R}_{α}(*p*||ρ) from equation 2.28 is treated analogously.

### A.4. -Divergences.

## Appendix B: Proof of Lemma 1

We now give the proof of lemma 1. For the proof, we need a proposition given in Liese and Vajda (1987):

See Liese and Vajda (1987).

This proposition provides the essential ingredients to prove the lemma:

Let *p** be a nonnegative integrable function as defined in the lemma. It then follows directly from the above proposition that the required intermediate bound holds. Since *p* and ρ are positive measures with weights *W*(*p*) ⩽ 1 and *W*(ρ) ⩽ 1 according to equation 2.1, this finally yields the claimed inequality, which completes the proof of the lemma.

## Notes

^{1}

Each set of arbitrary nonnegative integrable functionals *f* with domain *V* can be transformed into a set of positive measures simply by with .

^{2}

If *S* follows a statistical distribution with existing functional expectation value *E*_{S}, then the mean μ can be replaced by *E*_{S}.

^{3}

The relations and hold.

^{4}

The equality holds iff the conditional densities and are identical (see Amari & Nagaoka, 2000).

^{5}

A careful transformation of the parameter α is required for exact transformations between both divergences. For details, see Amari (1985) and Cichocki et al. (2009). Further, this statement was given in this book without proving the bounds of the underlying *f*-divergence for positive measures as it is given in this letter by lemma 1.

^{6}

The divergence *D*_{γ}(*p*||ρ) is proposed to be robust for γ ∈ [0, 1] with the existence of *D*_{γ=0} in the limit γ → 0. A detailed analysis of robustness is given in Fujisawa and Eguchi (2008).

## References

*l*_{p} norm for time series and its application to self-organizing maps

*f*-divergence is a generalized invariant measure between distributions

*f*-divergence, and information inequalities