## Abstract

This study considers the common situation in data analysis when there are few observations of the distribution of interest, the target distribution, while abundant observations are available from auxiliary distributions. In this situation, it is natural to compensate for the lack of data from the target distribution by using data sets from these auxiliary distributions—in other words, approximating the target distribution in a subspace spanned by a set of auxiliary distributions. Mixture modeling is one of the simplest ways to integrate information from the target and auxiliary distributions in order to express the target distribution as accurately as possible. There are two typical mixtures in the context of information geometry: the m- and e-mixtures. The m-mixture is applied in a variety of research fields because of the presence of the well-known expectation-maximization (EM) algorithm for parameter estimation, whereas the e-mixture is rarely used because of the difficulty of its estimation, particularly for nonparametric models. The e-mixture, however, is a well-tempered distribution that satisfies the principle of maximum entropy. To model a target distribution with scarce observations accurately, this letter proposes a novel framework for nonparametric modeling of the e-mixture and a geometrically inspired estimation algorithm. As numerical examples of the proposed framework, a transfer learning setup is considered. The experimental results show that this framework works well for three types of synthetic data sets, as well as a real-world EEG data set.

## 1 Introduction

*and* all the auxiliary pdfs. Second, the set of mixture models, which is parameterized by the mixture ratio, should be as simple as possible. This consideration is important to maintain generalization ability and avoid overfitting (Wang, Greiner, & Wang, 2013). The e-mixture satisfies both of these characteristics, while the m-mixture satisfies only the first. Hence, a set of e-mixtures is simpler than that of m-mixtures in terms of the principle of maximum entropy, as shown in section 3.3: e-mixtures are included in the exponential family. Since entropy is a measure of ambiguity, a pdf with small entropy has most of its probability mass condensed in specific small areas of the data space. Particularly in the problem of model estimation from finite samples, small entropy implies overfitting. The exponential family is known to be a natural solution to the maximum entropy problem. The principle of maximum entropy shows that the e-mixture is the flattest, or minimally informative, pdf in a certain family of distributions (Cover & Thomas, 1991).
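For reference, the two mixtures can be written down explicitly. The following are the standard definitions (see, e.g., Amari, 2016) for auxiliary pdfs $p_1, \dots, p_K$ and a mixture ratio $\pi = (\pi_1, \dots, \pi_K)$ on the simplex; the notation here is ours, not necessarily the letter's:

```latex
\text{m-mixture (arithmetic):}\quad
p_m(x) = \sum_{i=1}^{K} \pi_i \, p_i(x),
\qquad \pi_i \ge 0, \quad \sum_{i=1}^{K} \pi_i = 1,
\\[4pt]
\text{e-mixture (geometric, normalized):}\quad
p_e(x) = \exp\!\Big( \sum_{i=1}^{K} \pi_i \log p_i(x) - b(\pi) \Big),
```

where $b(\pi)$ is the log-normalization constant ensuring that $p_e$ integrates to 1.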

In some cases, the e-mixture is a more natural modeling approach than the m-mixture. For example, suppose distinct data sets are well approximated by gaussian distributions with different means and covariances. In this case, it is natural to model a new data set with a gaussian distribution rather than a mixture of gaussians. Further, the gaussian distribution belongs to the exponential family, which is known to be closed under e-mixing; that is, an e-mixture of gaussians is again a gaussian. Thus, one might say that the e-mixture model is a natural extension of the exponential family that contains all the auxiliary pdfs (Akaho, 2004).
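This closure property is easy to verify numerically. The sketch below (our own illustration, with arbitrary parameter values) checks that the normalized weighted geometric mean of two gaussian densities coincides with the gaussian whose precision and precision-weighted mean are the corresponding convex combinations:

```python
import numpy as np

def log_gauss(x, mu, var):
    # log density of a univariate gaussian
    return -(x - mu) ** 2 / (2 * var) - 0.5 * np.log(2 * np.pi * var)

# Two auxiliary gaussians and a mixture ratio (illustrative values).
mu1, var1 = 0.0, 1.0
mu2, var2 = 3.0, 2.0
ratio = 0.3

# Closed form: precisions and precision-weighted means combine linearly.
prec = ratio / var1 + (1 - ratio) / var2
var_e = 1.0 / prec
mu_e = var_e * (ratio * mu1 / var1 + (1 - ratio) * mu2 / var2)

# Numerical e-mixture: normalized weighted geometric mean of the densities.
x = np.linspace(-10.0, 15.0, 20001)
dx = x[1] - x[0]
q = np.exp(ratio * log_gauss(x, mu1, var1) + (1 - ratio) * log_gauss(x, mu2, var2))
q /= q.sum() * dx  # normalize on the grid

assert np.allclose(q, np.exp(log_gauss(x, mu_e, var_e)), atol=1e-5)
```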

In a variety of research fields such as speaker verification (Reynolds, Quatieri, & Dunn, 2000), background subtraction (Zivkovic, 2004), and genetic analysis (Ji, Wu, Liu, Wang, & Coombes, 2005), m-mixtures are applied along with the EM algorithm (Dempster, Laird, & Rubin, 1977). However, few authors have applied e-mixtures despite their good properties (Genest & Zidek, 1986), since estimating them can be computationally intractable because of the use of log and exponential functions. Indeed, an appropriate distribution family, for which the e-mixture form can be calculated, must be selected for the auxiliary pdfs.

We construct our nonparametric e-mixture estimation algorithm with the aid of two theorems: the characterization of the e-mixture and the Pythagorean relation. In exploratory data analysis, it is preferable not to assume any specific form of probability distribution behind the data in advance, and therefore nonparametric approaches are often preferred. In addition, when we consider a parametric e-mixture, both the auxiliary and target distributions must be restricted to a certain family of distributions to ensure the feasibility of calculating the e-mixture. To keep the modeling flexible, we aim to estimate the e-mixture in a nonparametric manner.

The remainder of this letter is organized as follows. In section 2, we introduce the information geometry required to explain our approach. The detailed problem addressed in this letter is described in section 3, and our proposed framework is explained in section 4. Section 5 presents the experimental results obtained with both artificial and real-world data sets, and the last section is devoted to a discussion and conclusion.

## 2 Preliminary on Information Geometry

Information geometry is a framework for discussing the mechanisms of statistical inference or machine learning by focusing on the geometrical structure of the manifold of probability distributions. In this section, we discuss some basic ideas of information geometry, focusing on the Pythagorean relation derived from the Kullback–Leibler (KL) divergence.

### 2.1 KL Divergence

### 2.2 Geodesics and Flatness

As opposed to Euclidean space, a statistical space is curved and distorted in general. Theorem 1 induces two types of geodesics in the space of pdfs.

The m-geodesic of two arbitrary pdfs in the space is included in the space. The m-flat subspace is specified by the set of pdfs that spans it; however, when it is clear from the context or when there is no need to specify them, we omit the set of pdfs and simply speak of the m-flat subspace. The e-geodesic and the e-flat subspace are derived in the same way as the m-geodesic and the m-flat subspace.

We note that it is possible to define the e-flat subspace while allowing the coefficients to take negative values, because the product of exponential functions is positive. In this letter, however, we define the e-flat subspace as all the internal points of the subspace spanned by a finite number of pdfs; that is, the coefficients are restricted to nonnegative values to simplify the argument.
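For discrete pmfs, the two geodesics take a particularly simple form. The following sketch (our own illustration) interpolates two pmfs arithmetically for the m-geodesic and geometrically, with renormalization, for the e-geodesic:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

def m_geodesic(p, q, t):
    # m-geodesic: pointwise arithmetic interpolation (already normalized)
    return (1 - t) * p + t * q

def e_geodesic(p, q, t):
    # e-geodesic: pointwise geometric interpolation, renormalized
    r = p ** (1 - t) * q ** t
    return r / r.sum()

# Both curves stay inside the space of pmfs for all t in [0, 1].
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    for r in (m_geodesic(p, q, t), e_geodesic(p, q, t)):
        assert np.all(r >= 0) and abs(r.sum() - 1.0) < 1e-12

assert np.allclose(m_geodesic(p, q, 0.0), p) and np.allclose(e_geodesic(p, q, 1.0), q)
```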

### 2.3 Projection

The m-projection onto an e-flat subspace and the e-projection onto an m-flat subspace are uniquely determined (Amari, 2016).

### 2.4 Mixture Models and Their Characterization

## 3 Nonparametric e-Mixture

In this section, to consider the problem of nonparametric e-mixture estimation from the viewpoint of information geometry, we define and restate the notions introduced in the previous section in the nonparametric setting.

### 3.1 Problem Formulation

Figure 4 shows a conceptual diagram of our framework in the parametric and nonparametric settings from the viewpoint of information geometry (Amari, 1991, 2016; Amari & Nagaoka, 2000). The curved surface with the solid lines in the left panel of Figure 4 shows the subspace of a certain parameterized family of pdfs. In the conventional parametric mixture estimation setting, each parameter of the model is estimated from the corresponding data set, and this procedure is regarded as the projection of the empirical distribution of that data set onto the model subspace. Then the mixture ratio is updated by minimizing the divergence between the target empirical distribution and the mixture, namely, by projecting the target empirical distribution onto the subspace of mixtures.

Figure 5 illustrates two ways of estimating the parametric e-mixture: the conventional approach that uses the gradient descent method and the proposed method we introduce in section 4. Suppose we have two auxiliary data sets and a target data set. When we adopt gaussian distributions as the probabilistic models of those data sets, we obtain the parameters of the gaussian distributions from the respective data sets. The goal of parametric e-mixture estimation is to obtain the mixture ratio that minimizes the KL divergence between the target pdf and the e-mixture. As noted above, we can consider two methods to estimate the optimal mixture ratio: the gradient descent method and the proposed Pythagorean relation-based method, introduced in the next section. In the gradient descent method, we minimize the KL divergence by using its gradient with respect to the mixture ratio. In the one-dimensional gaussian case, we can derive the gradient of the KL divergence as shown in appendix B. The left panels of Figure 5 show the results obtained by the gradient descent method. The upper panel shows the estimated mixture ratios over the iterations; the dotted lines indicate the ground-truth mixture ratios. The lower panel shows the parameter space spanned by the two gaussian parameters, and each marked point indicates the parameter estimated at each iteration. We can see that the estimated point approaches the ground truth as the iterations proceed. The right panels in Figure 5 show the results obtained by the proposed method based on the Pythagorean relation. As shown in Figure 5, the proposed method estimates the mixture ratio equally well. Since the proposed method requires only the KL divergence itself, we need not calculate the complicated derivatives of the KL divergence.
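The gradient descent baseline can be sketched in the one-dimensional gaussian case. The letter's appendix B derives the analytic gradient; the illustration below (our own, with made-up parameter values) substitutes a finite-difference gradient and exploits the closed gaussian form of the e-mixture:

```python
import numpy as np

def emix_gauss(r, m1, v1, m2, v2):
    # The e-mixture of two gaussians is gaussian: precisions and
    # precision-weighted means combine linearly with the ratio r.
    prec = r / v1 + (1 - r) / v2
    var = 1.0 / prec
    return var * (r * m1 / v1 + (1 - r) * m2 / v2), var

def kl_gauss(m0, v0, m1, v1):
    # closed-form KL divergence between univariate gaussians
    return 0.5 * (np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

m1, v1, m2, v2 = 0.0, 1.0, 4.0, 2.0
true_ratio = 0.7
mt, vt = emix_gauss(true_ratio, m1, v1, m2, v2)  # target pdf parameters

def objective(r):
    me, ve = emix_gauss(r, m1, v1, m2, v2)
    return kl_gauss(mt, vt, me, ve)

# Projected gradient descent on the ratio with a finite-difference gradient.
ratio, lr, eps = 0.5, 0.05, 1e-6
for _ in range(1000):
    grad = (objective(ratio + eps) - objective(ratio - eps)) / (2 * eps)
    ratio = min(1.0, max(0.0, ratio - lr * grad))

assert abs(ratio - true_ratio) < 1e-3
```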

### 3.2 Nonparametric -Mixture Modeling

Finally, we note that equation 3.4 does not represent an m-mixture of the auxiliary pdfs, because each weight in equation 3.4 contains the mixture ratio implicitly. Since the m- and e-mixtures impose different restrictions on the mixture ratio, the restrictions on the weight also differ between the two mixture models. Equation 3.4 is the m-representation of the e-mixture, and we develop an algorithm that optimizes the weight vector in equation 3.4 for the e-mixture estimation.

### 3.3 Requisite Conditions for the e-Mixture

As described in section 3.2, the gradient descent method cannot be used for nonparametric e-mixture estimation, because the nonparametric mixture model is available only through the m-representation. Therefore, we consider a geometrical algorithm that needs only the KL divergence. To estimate the e-mixture in a nonparametric manner, the following two conditions are imposed:

The estimate is the e-mixture of the auxiliary empirical distributions.

The estimate is the projection of the target empirical distribution onto the subspace spanned by the auxiliary empirical distributions.

Let us consider the first condition. According to theorem 6 (characterization of the e-mixture), the pdf that minimizes the weighted KL divergence is written in the form of the e-mixture. Therefore, if we obtain the weight in equation 2.11 that minimizes the weighted KL divergence in equation 2.12, the resulting pdf is regarded as the e-mixture of the auxiliary pdfs with the given mixture ratios.
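This characterization can be checked numerically on a discrete alphabet: the normalized weighted geometric mean of the auxiliary pmfs attains a weighted KL value no larger than that of randomly drawn pmfs. This is an illustrative sketch of the standard result, not the letter's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Three auxiliary pmfs on a 4-point alphabet and fixed mixture ratios.
p = rng.dirichlet(np.ones(4), size=3)
ratio = np.array([0.5, 0.3, 0.2])

# e-mixture: normalized weighted geometric mean of the auxiliary pmfs.
g = np.exp(ratio @ np.log(p))
g /= g.sum()

def weighted_kl(q):
    # the weighted KL divergence that the e-mixture minimizes
    return sum(r * kl(q, pi) for r, pi in zip(ratio, p))

best = weighted_kl(g)
for _ in range(1000):
    q = rng.dirichlet(np.ones(4))
    assert weighted_kl(q) >= best - 1e-12
```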

### 3.4 Nonparametric KL Divergence Estimator
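A widely used estimator of this kind is the nearest-neighbor divergence estimator of Wang, Kulkarni, and Verdú (2009); whether this matches the letter's equation 3.12 in every detail is an assumption on our part. A minimal one-dimensional sketch:

```python
import numpy as np

def knn_kl(x, y, k=1):
    # k-nearest-neighbor estimator of KL(P || Q) from samples x ~ P, y ~ Q
    # in one dimension (Wang, Kulkarni, & Verdu, 2009).
    n, m = len(x), len(y)
    dx = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(dx, np.inf)            # exclude each point itself
    rho = np.sort(dx, axis=1)[:, k - 1]     # k-NN distance within the P sample
    nu = np.sort(np.abs(x[:, None] - y[None, :]), axis=1)[:, k - 1]
    return float(np.mean(np.log(nu / rho)) + np.log(m / (n - 1.0)))

rng = np.random.default_rng(1)
xs = rng.normal(0.0, 1.0, 1000)
ys = rng.normal(1.0, 1.0, 1000)
est = knn_kl(xs, ys)  # true KL(N(0,1) || N(1,1)) = 0.5
assert 0.2 < est < 0.8
```

The brute-force distance matrices keep the sketch short; sorting-based neighbor search, as discussed in section 4, is what a practical implementation would use.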

### 3.5 Why the e-Mixture?

This theorem shows that a pdf belonging to an exponential family satisfies the principle of maximum entropy. Thus, the e-mixture also satisfies the principle of maximum entropy. Conversely, we now consider the problem of representing a mixture of the auxiliary pdfs as a pdf in the exponential family. The most natural way to include all the auxiliary pdfs in this exponential form is to take their log densities as the sufficient statistics, with the mixture ratios as the natural parameters and a normalization constant. This form is easily derived by considering mixture ratios that are all zero except for a single one, which recovers the corresponding auxiliary pdf. Therefore, the e-mixture is the most natural pdf in the form of an exponential family that can represent all the auxiliary pdfs.

## 4 Algorithm for the Nonparametric e-Mixture

To estimate the e-mixture of distributions that approximates the target distribution, we first explore a general algorithmic framework. Considering the two conditions stated in section 3.3, it is natural to find the estimate by algorithm 1. In the parametric setting, the projection of the target distribution onto the subspace spanned by the auxiliary distributions and the calculation of the mixture ratio are both straightforward. However, in our nonparametric setting, this task is difficult in general. To overcome this difficulty, we derive a specific nonparametric e-mixture estimation algorithm by using the techniques introduced in sections 2 and 3.1.

Since the e-mixture of the nonparametric models is determined by the weight and the mixture ratio, we denote the e-mixture by both of them when it makes the exposition clearer. Step 1 of the proposed algorithm computes the weight given a fixed mixture ratio, and step 2 computes the mixture ratio given a fixed weight. These two steps are repeated alternately until the estimate converges. For the initialization, we start with uniform weights and uniform ratios.

### 4.1 Step 1

In step 1, the e-mixture is estimated by using a given mixture ratio based on theorem 6:

*Step 1:* Compute the weight that minimizes the weighted KL divergence for a fixed mixture ratio. By substituting the estimator defined by equation 3.12 into equation 4.1, we obtain the objective function of equation 4.2, in which the uniform weight appears; with suitable shorthand, equation 4.2 is written as equation 4.3. We obtain the weight vector by minimizing equation 4.3 with the gradient projection method.

Since the objective function is discontinuous because of the neighbor indices in the estimator, we evaluate those indices using the weight obtained in the previous iteration. That is, with some abuse of notation, the gradient at each iteration is approximated by freezing the neighbor indices, and the updating formula applies a gradient step with a learning rate followed by the projection operator onto the weight simplex.

When we estimate the entropy and the cross-entropy by using equations 3.10 and 3.11, the neighborhood parameter is set to be small to reduce the bias. Then we can assume that, for most weights, a small change in the weight does not change the neighbor distances and that the derivatives of those distances with respect to the weight, if they exist, are zero. For the weights at which the derivative does not exist, a small change causes a jump in the value of the objective. The size of the jump is of the order of the distance between two data points, and it is divided by the sum of the distances; hence, we expect the resulting absolute value to be small, and it is further multiplied by a small weight. From this consideration, we omit this term from the gradient of the objective function. One can construct a pathological example in which the jump is too large to ignore, but in our experiments, the objective function decreased monotonically under the approximated gradient descent.
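The projection operator onto the weight simplex can be implemented with the standard sort-based algorithm (Duchi, Shalev-Shwartz, Singer, & Chandra, 2008). The sketch below applies it inside a projected-gradient loop on a stand-in quadratic objective, since the actual objective of equation 4.3 depends on the divergence estimator:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex
    # (sort-based algorithm; Duchi et al., 2008).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

# Projected gradient descent on a toy quadratic objective over weights w.
target = np.array([0.5, 0.3, 0.2])
grad = lambda w: 2.0 * (w - target)   # gradient of ||w - target||^2
w = np.full(3, 1.0 / 3.0)             # uniform initialization
for _ in range(100):
    w = project_simplex(w - 0.1 * grad(w))

assert abs(w.sum() - 1.0) < 1e-9 and np.all(w >= 0)
assert np.allclose(w, target, atol=1e-6)
```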

In our experiments, the learning rate is set by a parameter that determines the updating speed. We then run this gradient projection method until the weight converges.

### 4.2 Step 2

We next determine the mixture ratio so that the e-mixture becomes the projection of the target empirical distribution onto the subspace spanned by the auxiliary distributions. Figure 6 shows the geometrical interpretation of step 2. If the e-mixture is the projection of the target onto the subspace, the dotted line between the target and the e-mixture and the solid line between the e-mixture and each auxiliary distribution are orthogonal; namely, the three pdfs satisfy the Pythagorean relation. Based on this geometrical intuition, we update the mixture ratio according to the violation of the Pythagorean relation, which takes 0 when the mixture ratio is optimal. The two top triangles in Figure 6 show the relations among the three pdfs. The upper left panel shows the case of an acute-angled triangle, where the violation is positive. In this case, the mixture ratio is smaller than optimal, and the e-mixture should be closer to the corresponding auxiliary distribution. Conversely, the upper right panel shows the case of an obtuse-angled triangle, where the mixture ratio is larger than optimal.

To reflect this geometrical understanding, we introduce a weakly increasing piecewise-linear function of the violation, as shown in Figure 7. The function controls the mixture ratio, reflecting the degree of violation of the Pythagorean relation: the mixture ratio increases (decreases) when the violation is positive (negative). By using this function, we update the mixture ratio as follows.

*Step 2:* Update the mixture ratio according to the violation of the Pythagorean relation, where the violation is estimated by the divergence estimator of equation 3.12 and the update function is defined with a positive constant.

Note that the function leaves the mixture ratio unchanged when the Pythagorean relation holds exactly; its constants were fixed in our experiments. After updating the mixture ratios for all components, they are normalized so that they sum to 1. These two steps search for the e-mixture of the auxiliary pdfs closest to the target empirical distribution by assigning weights to the samples in the pooled data set. Since we assume that the weight vector satisfies the definition of a probability distribution, the proposed algorithm eventually provides a way to sample from the pooled data set.
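The step 2 update can be sketched as follows. The piecewise-linear function and its constants here are hypothetical stand-ins for the letter's equation 4.7, chosen only to satisfy the stated properties (weakly increasing, no change at zero violation):

```python
import numpy as np

def update_factor(delta, c=0.5, cap=2.0):
    # Hypothetical weakly increasing piecewise-linear function f with
    # f(0) = 1; the constants c and cap are illustrative, not the letter's.
    return np.clip(1.0 + c * delta, 1.0 / cap, cap)

# Multiplicative update of the mixture ratios, then renormalization.
ratio = np.array([0.5, 0.5])
delta = np.array([0.4, -0.4])        # illustrative Pythagorean violations
ratio = ratio * update_factor(delta)
ratio = ratio / ratio.sum()

assert abs(ratio.sum() - 1.0) < 1e-12
assert ratio[0] > ratio[1]           # positive violation raises the ratio
```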

The hyperparameter introduced in equation 4.4 controls the update speed, and the constant in equation 4.7 controls the penalty for violating the Pythagorean relation; both were tuned in our preliminary experiments.

As for the computational cost of the proposed algorithm, the most time-consuming part is estimating the KL divergence by using equation 3.12, which requires repeatedly sorting the data points; the overall cost is dominated by this sorting, and it is required only in the proposed algorithm. We close the section with the pseudocode of the nonparametric e-mixture estimation algorithm in algorithm 2.

## 5 Experiments

We conducted a set of experiments on three synthetic data sets and one real-world data set to evaluate the proposed algorithm. In the real-world experiment, we considered the situation in which we have only a few samples from the target pdf, and our algorithm is used for data augmentation in a classification problem. In all the experiments, to reduce the computational cost of estimating the KL divergence, we sampled half the number of data points from each data set, which is then redefined as this sampled subset of the original.

### 5.1 Synthetic Data

#### 5.1.1 Simple Setup

The top panels of Figure 8A show the two auxiliary data sets and 2000 points sampled from the target distribution. The bottom panels show the uniformly weighted empirical distributions of the auxiliary data sets, the target data set, and the result of the nonparametric e-mixture estimation, respectively. The size of each mark in the bottom panels represents the weight of each sample.

The mixture ratio estimated over the iterations is shown in Figure 8B. The horizontal dotted lines are the true values of the mixture ratio. In this experiment, the underlying distributions for both the target and the auxiliaries are gaussians, which are characterized by their means and covariance matrices. Figure 8C shows the contours of the empirical covariance matrices centered at the empirical means, which are estimated by using the obtained weights, to illustrate the behavior of the estimates as the algorithm progresses. Each ellipse includes 90% of the probability mass of the corresponding gaussian distribution. The solid ellipses express the original gaussian distributions. The solid gray ellipse is the ground-truth e-mixture, while the dotted ellipses express the e-mixtures estimated over the iterations.
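The weighted empirical means and covariance matrices visualized in Figure 8C can be computed generically as follows (our own sketch; the uniform weights below are only a placeholder for the weights produced by the algorithm):

```python
import numpy as np

def weighted_mean_cov(x, w):
    # empirical mean and covariance under normalized sample weights
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mean = w @ x
    xc = x - mean
    return mean, (w[:, None] * xc).T @ xc

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.5, 0.8]])
x = rng.normal(size=(5000, 2)) @ A + np.array([1.0, -1.0])

# Uniform weights stand in for the weights obtained by the algorithm.
mean, cov = weighted_mean_cov(x, np.full(len(x), 1.0 / len(x)))

assert np.allclose(mean, [1.0, -1.0], atol=0.1)
assert np.allclose(cov, A.T @ A, atol=0.1)
```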

These experimental results suggest that the proposed algorithm successfully approximates the ground-truth distribution of the target data set by the nonparametric e-mixture constructed from the given data sets with the appropriate weights.

#### 5.1.2 The e-Mixture of pdfs Represented by Gaussian Mixtures

In the second experiment with synthetic data, we assess how the proposed algorithm performs in the specific case where the auxiliary pdfs are nongaussian. In particular, we consider the case where the auxiliary pdfs are multimodal, represented by m-mixtures of gaussians. We use two five-component gaussian mixture models (GMMs) as the auxiliary pdfs and obtain two data sets, each of which has 2500 points sampled from a five-component GMM, which is not included in the exponential family; the data points of these two data sets form an S-shape. In this experiment, we use the e-mixture of these two GMMs as the target pdf. The left panel of Figure 9 shows the two auxiliary data sets and the target data set sampled from the target pdf. The points of the target data set were sampled as follows. Initially, we created two five-component GMMs; detailed values of the parameters are shown in appendix C. Their density functions are shown in panels a and b of Figure 9, respectively. Then we obtained the e-mixture of the two GMMs with a fixed mixture ratio, whose density function is shown in panel c. We sampled the data points from this density by using the rejection sampling method.
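Rejection sampling from an (unnormalized) e-mixture density needs only pointwise evaluations. The sketch below uses small one-dimensional GMMs with made-up parameters in place of the five-component GMMs of appendix C, and a uniform proposal with a numerically estimated envelope:

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_pdf(x, means, vars_, weights):
    # density of a 1-D gaussian mixture model at points x
    comp = np.exp(-(x[:, None] - means) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
    return comp @ weights

# Made-up GMM parameters standing in for those of appendix C.
m1, v1, w1 = np.array([-2.0, 0.0, 2.0]), np.array([0.3, 0.3, 0.3]), np.array([0.3, 0.4, 0.3])
m2, v2, w2 = np.array([-1.0, 1.0, 3.0]), np.array([0.5, 0.5, 0.5]), np.array([0.2, 0.5, 0.3])
ratio = 0.5

def emix_unnorm(x):
    # unnormalized e-mixture density: weighted geometric mean of the GMMs
    return gmm_pdf(x, m1, v1, w1) ** ratio * gmm_pdf(x, m2, v2, w2) ** (1 - ratio)

# Rejection sampling with a uniform proposal on [lo, hi].
lo, hi = -6.0, 6.0
bound = emix_unnorm(np.linspace(lo, hi, 4001)).max() * 1.1  # envelope constant

samples = []
while len(samples) < 1000:
    cand = rng.uniform(lo, hi, 5000)
    u = rng.uniform(0.0, bound, 5000)
    samples.extend(cand[u < emix_unnorm(cand)])
samples = np.array(samples[:1000])

assert len(samples) == 1000 and samples.min() >= lo and samples.max() <= hi
```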