Abstract

This study considers the common situation in data analysis when there are few observations of the distribution of interest, or target distribution, while abundant observations are available from auxiliary distributions. In this situation, it is natural to compensate for the lack of data from the target distribution by using data sets from these auxiliary distributions; in other words, the target distribution is approximated in a subspace spanned by a set of auxiliary distributions. Mixture modeling is one of the simplest ways to integrate information from the target and auxiliary distributions in order to express the target distribution as accurately as possible. There are two typical mixtures in the context of information geometry: the m- and e-mixtures. The m-mixture is applied in a variety of research fields because of the presence of the well-known expectation-maximization algorithm for parameter estimation, whereas the e-mixture is rarely used because of its difficulty of estimation, particularly for nonparametric models. The e-mixture, however, is a well-tempered distribution that satisfies the principle of maximum entropy. To model a target distribution with scarce observations accurately, this letter proposes a novel framework for nonparametric modeling of the e-mixture and a geometrically inspired estimation algorithm. As numerical examples of the proposed framework, a transfer learning setup is considered. The experimental results show that this framework works well for three types of synthetic data sets, as well as an EEG real-world data set.

1  Introduction

Constructing a mixture model of probability distributions (Everitt & Hand, 1981; McLachlan & Peel, 2000) is a standard approach for integrating information from different sources and representing the presence of different subpopulations underlying the overall population. In this letter, we present a nonparametric e-mixture estimation method, namely, an algorithm to estimate a logarithmic mixture of nonparametric models, for solving a variant of user adaptation or transfer learning (Blum & Mitchell, 1998; Pan & Yang, 2010). Our task is to construct a good model of the target data set when it includes an insufficient number of examples by using auxiliary data sets that contain a sufficient amount of relevant information. This problem is considered to be one of approximating the target distribution in a subspace spanned by a set of auxiliary distributions. In such situations, mixture modeling is a popular method for estimating the target distribution. There are two typical mixtures in the context of information geometry (Amari & Nagaoka, 2000): the m-mixture and the e-mixture. The m-mixture is a convex combination of auxiliary probability density functions (pdfs) $p_i(x)$,
$$
p_m(x; \pi) = \sum_{i=1}^{c} \pi_i\, p_i(x),
\qquad (1.1)
$$
where $\pi = (\pi_1, \dots, \pi_c)$ is a mixture ratio vector of the pdfs $p_i(x)$, satisfying $\pi_i \ge 0$ and $\sum_{i=1}^{c} \pi_i = 1$, and $c$ is the number of the pdfs. The gaussian mixture model (GMM) is an example of the m-mixture (McLachlan & Peel, 2000). The e-mixture has the following form,
$$
p_e(x; \pi) = \exp\!\left( \sum_{i=1}^{c} \pi_i \log p_i(x) - b(\pi) \right),
\qquad (1.2)
$$
where $b(\pi)$ is the normalization term. In the m-mixture form, the pdfs are combined by a weighted arithmetic average. In the e-mixture form, on the contrary, a weighted average of log densities is used. Figure 1 shows the m- and the e-mixtures of two gaussian distributions. The solid lines indicate the two gaussian distributions, and the dashed and dotted lines indicate the m- and the e-mixtures of these two gaussian distributions, respectively, with a uniform mixture ratio. The difference between the m- and the e-mixtures can be understood through an analogy with logical OR and AND. The idea of the e-mixture is also related to the classical mixture of experts (Jordan & Jacobs, 1994; Heskes, 1998; Choi, Choi, & Choe, 2013). To motivate e-mixture modeling, we present two important characteristics for mixture models. First, the set of models should contain all the auxiliary pdfs; indeed, it is natural to require the mixture model to contain (or well approximate) the target pdf and all the auxiliary pdfs. Second, the set of mixture models, which is parameterized by $\pi$, should be as simple as possible. This consideration is important to maintain generalization ability and avoid overfitting (Wang, Greiner, & Wang, 2013). The e-mixture satisfies both of these characteristics, while the m-mixture satisfies only the first. Hence, a set of e-mixtures is simpler than that of m-mixtures in terms of the principle of maximum entropy, as shown in section 3.5: e-mixtures are included in the exponential family. Since entropy is a measure of ambiguity, a pdf with small entropy has most of its probability mass condensed in specific small areas of the data space. Particularly in the problem of model estimation from finite samples, small entropy implies overfitting. The exponential family is known to be a natural solution to the maximum entropy problem. The principle of maximum entropy shows that the e-mixture is the flattest, or minimally informative, pdf in a certain family of distributions (Cover & Thomas, 1991).
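As a small illustration of the two constructions (a numerical sketch in Python, not taken from the letter; the means, variances, and grid are arbitrary), the following code evaluates both mixtures of two gaussian densities on a grid:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
p1 = norm.pdf(x, loc=-2.0, scale=1.0)
p2 = norm.pdf(x, loc=3.0, scale=1.5)
pi1, pi2 = 0.5, 0.5

# m-mixture: weighted arithmetic average of the densities (equation 1.1).
p_m = pi1 * p1 + pi2 * p2

# e-mixture: weighted average of the log densities, renormalized (equation 1.2).
p_e = np.exp(pi1 * np.log(p1) + pi2 * np.log(p2))
p_e /= p_e.sum() * dx          # numerical counterpart of exp(-b(pi))

print(p_m.sum() * dx, p_e.sum() * dx)   # both integrate to ~1
# p_m is bimodal ("OR"-like), whereas p_e places its mass where both p1 and
# p2 have appreciable density ("AND"-like), as in Figure 1.
```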
Figure 1:

An example of the m- and e-mixtures of two gaussian distributions.


In some cases, the e-mixture is a more natural modeling approach than the m-mixture. For example, suppose distinct data sets are each well approximated by gaussian distributions with different means and covariances. In this case, it is natural to model such a data set with a gaussian distribution rather than with a mixture of gaussians. Further, the gaussian distribution belongs to the exponential family, which is known to be closed under the e-mixture operation; that is, an e-mixture of gaussians is again a gaussian. Thus, one might say that the e-mixture model is a natural extension of the exponential family that contains all the auxiliary pdfs (Akaho, 2004).
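This closure can be checked by a direct calculation (our own notation, not reproduced from the letter): for gaussian densities $N(x; \mu_i, \Sigma_i)$,

$$
\sum_{i=1}^{c} \pi_i \log N(x; \mu_i, \Sigma_i)
= -\frac{1}{2}\, x^{\top}\!\left( \sum_{i=1}^{c} \pi_i \Sigma_i^{-1} \right) x
+ x^{\top} \sum_{i=1}^{c} \pi_i \Sigma_i^{-1} \mu_i + \mathrm{const},
$$

so, after renormalization, the e-mixture is again gaussian, with covariance $\Sigma_e = \big( \sum_{i} \pi_i \Sigma_i^{-1} \big)^{-1}$ and mean $\mu_e = \Sigma_e \sum_{i} \pi_i \Sigma_i^{-1} \mu_i$ (cf. equations 5.1 and 5.2).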

In a variety of research fields such as speaker verification (Reynolds, Quatieri, & Dunn, 2000), background subtraction (Zivkovic, 2004), and genetic analysis (Ji, Wu, Liu, Wang, & Coombes, 2005), m-mixtures are applied along with the EM algorithm (Dempster, Laird, & Rubin, 1977). However, few authors have applied e-mixtures despite their good properties (Genest & Zidek, 1986), since estimating them can be computationally intractable because of the use of log and exponential functions. Indeed, an appropriate distribution family for the auxiliary pdfs must be selected so that the e-mixture form can actually be calculated.

In this letter, from the viewpoint of information geometry, we propose a novel framework for estimating e-mixtures of nonparametric models. Suppose we are given a set of observed data points $\{x_j\}_{j=1}^{N}$. A naive approach for nonparametric e-mixture modeling would be to replace the auxiliary distributions in equation 1.2 with empirical distributions. However, as shown in section 3.1, the substitution of empirical distributions into the logarithmic function is prohibited. Instead, we consider a nonparametric representation of e-mixture models by using a weighted empirical distribution function of all the given data,
$$
\hat{p}(x; w) = \sum_{j=1}^{N} w_j\, \delta(x - x_j),
\qquad (1.3)
$$
where $w = (w_1, \dots, w_N)$ is a weight vector, that is, each element $w_j$ of $w$ represents the sampling probability of a datum $x_j$, and $\delta(\cdot)$ is the Dirac delta function. When a sufficient number of data are given, nonparametric models such as an empirical distribution function can express the underlying distribution more precisely than parametric models. We can control the empirical distribution function, equation 1.3, by changing the weights of the data points. Here, our objective is to determine the weights of all the given data so that the empirical distribution, equation 1.3, becomes the e-mixture of the auxiliary distributions.

We construct our nonparametric e-mixture estimation algorithm with the aid of two theorems: the characterization of the e-mixture and the Pythagorean relation. In exploratory data analysis, it is preferable not to assume any specific form of probability distribution behind the data in advance, and therefore nonparametric approaches are often preferred. In addition, when we consider a parametric e-mixture, both the auxiliary and target distributions must be restricted to a certain family of distributions to ensure the feasibility of calculating the e-mixture. To keep the modeling flexible, we aim to estimate the e-mixture in a nonparametric manner.

The remainder of this letter is organized as follows. In section 2, we introduce the information geometry required to explain our approach. The detailed problem addressed in this letter is described in section 3, and our proposed framework is explained in section 4. Section 5 presents the experimental results by using both artificial and real-world data sets, and the last section is devoted to a discussion and conclusion.

2  Preliminary on Information Geometry

Information geometry is a framework for discussing the mechanisms of statistical inference or machine learning by focusing on the geometrical structure of the manifold of probability distributions. In this section, we discuss some basic ideas of information geometry, focusing on the Pythagorean relation derived from the Kullback–Leibler (KL) divergence.

2.1  KL Divergence

We consider a statistical space $\mathcal{S}$ composed of arbitrary pdfs $p(x)$ of a random variable $x$. A point in the space $\mathcal{S}$ corresponds to a pdf. We also consider a subspace of $\mathcal{S}$ composed of pdfs $p(x; \theta)$, where $\theta$ is a parameter of the pdf. For example, $\theta$ is composed of the mean and variance when we consider a family of gaussian distributions. The parameter $\theta$ plays the role of a coordinate system in this subspace. The problem of statistical inference is reduced to the problem of searching for the closest point on the subspace from the given data, where closeness is measured by using a certain divergence function. A schematic diagram of these notions of information geometry is shown in Figure 2. The KL divergence (Kullback & Leibler, 1951) is an example of a divergence between two probability distributions $p$ and $q$, and is defined as
$$
D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.
\qquad (2.1)
$$
Let us consider a relationship based on the KL divergence among three pdfs, $p$, $q$, and $r$:
$$
D_{\mathrm{KL}}(p \,\|\, r) - D_{\mathrm{KL}}(p \,\|\, q) - D_{\mathrm{KL}}(q \,\|\, r)
= \int \big( p(x) - q(x) \big)\big( \log q(x) - \log r(x) \big)\, dx.
\qquad (2.2)
$$
When the right-hand side of equation 2.2 equals zero, $p - q$ and $\log q - \log r$ can be regarded as orthogonal vectors in the statistical space $\mathcal{S}$, as shown in Figure 3.
Figure 2:

Schematic diagram of the statistical inference.


Figure 3:

Geometrical view of the Pythagorean relation.


Theorem 1
(Pythagorean relation) (Amari & Nagaoka, 2000). Let $p$, $q$, and $r$ be probability density functions. If $p - q$ and $\log q - \log r$ are orthogonal, namely, $\int (p(x) - q(x))(\log q(x) - \log r(x))\,dx = 0$, then the Pythagorean relation holds:
$$
D_{\mathrm{KL}}(p \,\|\, r) = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, r).
\qquad (2.3)
$$
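The decomposition in equation 2.2 and the resulting Pythagorean relation can be checked numerically; the following sketch (not part of the letter) verifies the identity for three arbitrary discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = (rng.random(6) for _ in range(3))
p, q, r = p / p.sum(), q / q.sum(), r / r.sum()

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

lhs = kl(p, r) - kl(p, q) - kl(q, r)
rhs = float(np.sum((p - q) * (np.log(q) - np.log(r))))
print(np.isclose(lhs, rhs))   # True: the correction term of equation 2.2
# When this term vanishes, D(p||r) = D(p||q) + D(q||r), as in theorem 1.
```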

2.2  Geodesics and Flatness

Unlike a Euclidean space, a statistical space is in general curved and distorted. Theorem 1 induces two types of geodesics in $\mathcal{S}$.

Definition 1
(m-geodesic and m-flat subspace). Let $p$ and $q$ be probability density functions. The m-geodesic is defined as the set of internal divisions between $p$ and $q$ parameterized by $t \in [0, 1]$:
$$
p(x; t) = (1 - t)\, p(x) + t\, q(x), \qquad t \in [0, 1].
\qquad (2.4)
$$
Every internal division is also a pdf because its integral is equal to one. The m-flat subspace is defined as the set generated from pdfs $p_1, \dots, p_c$ by the m-mixture,
$$
\mathcal{M}(p_1, \dots, p_c) = \left\{ \sum_{i=1}^{c} \pi_i\, p_i(x) \;\middle|\; \pi_i \ge 0,\ \sum_{i=1}^{c} \pi_i = 1 \right\}.
\qquad (2.5)
$$

The m-geodesic of two arbitrary pdfs in $\mathcal{M}$ is included in $\mathcal{M}$. The m-flat subspace is specified by the set of pdfs that spans it; however, when it is clear from the context or when there is no need to specify them, we omit the pdfs and simply denote the m-flat subspace by $\mathcal{M}$. The e-geodesic and the e-flat subspace are defined in the same way as the m-geodesic and the m-flat subspace.

Definition 2
(e-geodesic and e-flat subspace). Let $p$ and $q$ be probability density functions. The e-geodesic is defined as the set of internal divisions between $p$ and $q$ parameterized by $t \in [0, 1]$:
$$
p(x; t) = \exp\big( (1 - t) \log p(x) + t \log q(x) - b(t) \big), \qquad t \in [0, 1],
\qquad (2.6)
$$
where $b(t)$ is the normalization term defined as
$$
b(t) = \log \int p(x)^{1 - t}\, q(x)^{t}\, dx.
$$
The e-flat subspace is defined as the set generated from pdfs $p_1, \dots, p_c$ by the e-mixture:
$$
\mathcal{E}(p_1, \dots, p_c) = \left\{ \exp\!\left( \sum_{i=1}^{c} \pi_i \log p_i(x) - b(\pi) \right) \;\middle|\; \pi_i \ge 0,\ \sum_{i=1}^{c} \pi_i = 1 \right\},
\qquad (2.7)
$$
where $b(\pi)$ is a normalization constant. The e-geodesic of two arbitrary pdfs in $\mathcal{E}$ is included in $\mathcal{E}$.

We note that it is possible to define the e-flat subspace by allowing the $\pi_i$'s to take negative values, because the product of exponential functions is positive. In this letter, however, we define the e-flat subspace as the set of all internal points of the subspace spanned by a finite number of pdfs; that is, the $\pi_i$'s are restricted to nonnegative values to simplify the argument.
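As a concrete instance of the e-flat construction (a sketch with assumed notation, not the letter's code), the e-mixture of a finite family of discrete pdfs is simply their normalized weighted geometric mean:

```python
import numpy as np

def e_mixture(P, weights):
    """P: (c, K) array of c discrete pdfs over K states; weights: point on the c-simplex."""
    log_mix = weights @ np.log(P)          # sum_i pi_i log p_i(x)
    log_mix -= log_mix.max()               # numerical stability
    mix = np.exp(log_mix)
    return mix / mix.sum()                 # division realizes exp(-b(pi))

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
print(e_mixture(P, np.array([0.5, 0.5])))
```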

2.3  Projection

Let $p$, $q$, and $r$ be pdfs in $\mathcal{S}$. When the m-geodesic connecting $p$ and $q$ is orthogonal at $q$ to the e-geodesic connecting $q$ and $r$, the Pythagorean relation holds:
$$
D_{\mathrm{KL}}(p \,\|\, r) = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, r).
$$
When the Pythagorean relation holds for every $r$ in an e-flat subspace $\mathcal{E}$ containing $q$, the inequality $D_{\mathrm{KL}}(p \,\|\, r) \ge D_{\mathrm{KL}}(p \,\|\, q)$ also holds because of the nonnegativity of the KL divergence. Therefore, the intersection point $q$ of the two geodesics becomes the minimizer of the KL divergence between $p$ and an arbitrary point in $\mathcal{E}$. The pdf $q$ uniquely exists, and it is known as the projection of $p$.
Definition 3
(m-projection and e-projection). The m-projection from a point $p$ to an e-flat subspace $\mathcal{E}$ is given by finding the closest point of $\mathcal{E}$ from $p$:
$$
\hat{q} = \operatorname*{arg\,min}_{q \in \mathcal{E}} D_{\mathrm{KL}}(p \,\|\, q).
\qquad (2.8)
$$
Similarly, the e-projection from a point $p$ to an m-flat subspace $\mathcal{M}$ is given by finding the closest point of $\mathcal{M}$ from $p$:
$$
\hat{q} = \operatorname*{arg\,min}_{q \in \mathcal{M}} D_{\mathrm{KL}}(q \,\|\, p).
\qquad (2.9)
$$

The m-projection to an e-flat subspace and the e-projection to an m-flat subspace are uniquely determined (Amari, 2016).

2.4  Mixture Models and Their Characterization

In information geometry, points in a subspace can be represented in different coordinate systems called the m- and e-representations. We can therefore consider two types of mixtures of pdfs, one in the m-representation and one in the e-representation. Mixture models are regarded as flat subspaces spanned by a finite number of pdfs. The following two theorems characterize the m-mixture and the e-mixture of pdfs, respectively. Let $p_i$, $i = 1, \dots, c$, be pdfs and $\pi_i$ be the associated mixture ratios, where
$$
\pi_i \ge 0, \qquad \sum_{i=1}^{c} \pi_i = 1.
\qquad (2.10)
$$
Theorem 2
(characterization of the m-mixture) (Murata & Fujimoto, 2009). For any $\pi$ satisfying equation 2.10, the sum of the KL divergences weighted by $\pi$ is minimized at the m-mixture:
$$
p_m(x; \pi) = \operatorname*{arg\,min}_{q \in \mathcal{P}} \sum_{i=1}^{c} \pi_i\, D_{\mathrm{KL}}(p_i \,\|\, q) = \sum_{i=1}^{c} \pi_i\, p_i(x),
\qquad (2.11)
$$
where $\mathcal{P}$ is the set of probability density functions.
Theorem 3
(characterization of the e-mixture) (Murata & Fujimoto, 2009). For any $\pi$ satisfying equation 2.10, the sum of the KL divergences weighted by $\pi$ is minimized at the e-mixture:
$$
p_e(x; \pi) = \operatorname*{arg\,min}_{q \in \mathcal{P}} \sum_{i=1}^{c} \pi_i\, D_{\mathrm{KL}}(q \,\|\, p_i) = \exp\!\left( \sum_{i=1}^{c} \pi_i \log p_i(x) - b(\pi) \right),
\qquad (2.12)
$$
where $\mathcal{P}$ is the set of probability density functions.

The proofs of theorems 2 and 3 are given in appendixes A and B.
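Theorem 3 can be illustrated numerically. The following sketch (not from the letter) compares the weighted sum of KL divergences at the e-mixture with its value at randomly drawn pdfs; no random candidate does better:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3, 8)); P /= P.sum(axis=1, keepdims=True)   # three auxiliary discrete pdfs
pi = np.array([0.2, 0.5, 0.3])

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

def objective(q):
    return sum(w * kl(q, p) for w, p in zip(pi, P))          # sum_i pi_i D(q || p_i)

q_e = np.exp(pi @ np.log(P)); q_e /= q_e.sum()               # e-mixture of the p_i
best = objective(q_e)
candidates = rng.random((1000, 8))
candidates /= candidates.sum(axis=1, keepdims=True)
print(all(objective(q) >= best - 1e-9 for q in candidates))  # True
```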

3  Nonparametric e-Mixture

In this section, to treat the problem of nonparametric e-mixture estimation from the viewpoint of information geometry, we define and restate the notions introduced in the previous section in the nonparametric setting.

3.1  Problem Formulation

Suppose we have a target data set $D_0$ composed of samples generated from a probability distribution with a pdf $p_0(x)$. We also have $c$ auxiliary data sets $D_1, \dots, D_c$. The $i$th data set $D_i$ is composed of samples generated from a probability distribution with a pdf $p_i(x)$. We consider the situation in which the target data set has many fewer data than the auxiliary data sets, and we wish to obtain a more reliable estimate of $p_0$ by taking advantage of the informative auxiliary data sets. This situation is often seen, for example, in classification problems of EEG (Tu & Sun, 2011) and audio signals (Sturim, Reynolds, Singer, & Campbell, 2001), and is formulated as the transfer learning problem. We consider representing the target pdf as an e-mixture of the auxiliary pdfs by weighting the data in the auxiliary data sets. For the auxiliary pdfs to be informative for expressing the target pdf, the supports of the target and auxiliary pdfs must overlap sufficiently. To facilitate the discussion, we assume
$$
\mathrm{supp}(p_0) \subseteq \bigcap_{i=1}^{c} \mathrm{supp}(p_i),
\qquad (3.1)
$$
where $\mathrm{supp}(f)$ denotes the support of a function $f$.

Figure 4 shows a conceptual diagram of our framework in the parametric and nonparametric settings from the viewpoint of information geometry (Amari, 1991, 2016; Amari & Nagaoka, 2000). The curved surface with the solid lines in the left panel of Figure 4 shows the subspace of a certain family of pdfs parameterized by $\theta$. In the conventional parametric mixture estimation setting, each parameter $\theta_i$ of the model is estimated from each data set $D_i$, and this procedure is regarded as the projection of the empirical distribution of $D_i$ onto the subspace. Then the mixture ratio $\pi$ is updated by minimizing the divergence between the target empirical distribution and the mixture model, namely, by the projection of the target empirical distribution onto the subspace of mixture models.

Figure 4:

(Left) The curved surface with the solid lines represents the subspace parameterized by $\theta$. Typical algorithms such as the EM algorithm work on this surface. (Right) In contrast, our algorithm works on the curved surface with the dotted lines, which represents the subspace of the e-mixtures of the empirical distributions.


Figure 5 illustrates two ways of estimating a parametric e-mixture: the conventional gradient descent method and the proposed method introduced in section 4. Suppose we have two auxiliary data sets $D_1$ and $D_2$ and a target data set $D_0$. When we consider gaussian distributions as the probabilistic models of these data sets, we obtain the parameters of the gaussian distributions $\theta_1$, $\theta_2$, and $\theta_0$ from $D_1$, $D_2$, and $D_0$, respectively. The goal of parametric e-mixture estimation is to obtain the mixture ratio $\pi$ that minimizes the KL divergence between the target pdf and the e-mixture of the auxiliary models. As noted above, we can consider two methods to estimate the optimal mixture ratio $\pi$: the gradient descent method and the proposed Pythagorean relation-based method, introduced in the next section. In the gradient descent method, we minimize the KL divergence by using its gradient with respect to the mixture ratio $\pi$. In the one-dimensional gaussian case, we can derive the gradient of the KL divergence as shown in appendix B. The left panels of Figure 5 show the results obtained by using the gradient descent method. The upper panel shows the estimated mixture ratios $\pi_1$ and $\pi_2$ over the iterations. The dotted lines indicate the ground-truth mixture ratios. The lower panel shows the parameter space spanned by the mean and the variance. Each marked point indicates the estimated parameter at an iteration. We can see that the estimated point approaches the ground truth as the iterations proceed. The right panels in Figure 5 show the results obtained by using the proposed method based on the Pythagorean relation. As shown in Figure 5, the proposed method estimates the mixture ratio equally well. Since the proposed method requires only KL divergence evaluations, we need not calculate the complicated derivatives of the KL divergence.
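The gradient descent baseline for the one-dimensional gaussian case can be sketched as follows. This is an illustrative reimplementation with made-up parameter values and a numerical gradient rather than the closed-form gradient of appendix B:

```python
import numpy as np

def e_mix_params(t, mus, vars_):
    """Parameters of the e-mixture of N(mus[0], vars_[0]) and N(mus[1], vars_[1])
    with ratio (t, 1 - t); precisions and precision-weighted means add linearly."""
    w = np.array([t, 1.0 - t])
    var = 1.0 / np.sum(w / vars_)
    mu = var * np.sum(w * mus / vars_)
    return mu, var

def kl_gauss(m0, v0, m1, v1):
    """KL(N(m0, v0) || N(m1, v1)) for 1D gaussians."""
    return 0.5 * (np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

mus, vars_ = np.array([-1.0, 4.0]), np.array([1.0, 2.0])
target_mu, target_var = e_mix_params(0.7, mus, vars_)     # ground-truth ratio 0.7

def loss(t):
    m, v = e_mix_params(t, mus, vars_)
    return kl_gauss(target_mu, target_var, m, v)

t, lr, h = 0.5, 0.1, 1e-5
for _ in range(200):
    grad = (loss(t + h) - loss(t - h)) / (2.0 * h)         # numerical gradient
    t = float(np.clip(t - lr * grad, 0.0, 1.0))
print(t)   # approaches the ground-truth ratio 0.7
```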

Figure 5:

An illustrative example of two estimation methods for parametric e-mixture estimation. (Left) The estimation results obtained by using the gradient descent method. (Right) The estimation results obtained by using the proposed algorithm based on the Pythagorean relation. The figure shows that both methods recover the correct mixture ratio.


3.2  Nonparametric e-Mixture Modeling

In contrast to the m-mixture, the e-mixture is a nonlinear combination of the pdfs. In general, obtaining a closed-form solution for the e-mixture of distributions is impossible, even in the parametric framework. Furthermore, nonparametric modeling of the mixture of distributions is desirable when we have no prior knowledge of the form of the data distribution. The problem of nonparametric e-mixture estimation is more difficult than its parametric counterpart, as explained later. We denote the whole data set as
$$
D = D_0 \cup D_1 \cup \cdots \cup D_c = \{ x_1, \dots, x_N \},
\qquad (3.2)
$$
and let $N$ be the number of data points in $D$. In this letter, regarding the weighted data set as a nonparametric representation of a pdf, with the data points fixed and the weight $w$ playing the role of a parameter of the weighted empirical distribution, we optimize the parameter $w$ so that equation 1.3, defined over the entire data set, represents the e-mixture and thereby gives a nonparametric expression of the target pdf.
We construct an empirical distribution $\hat{p}_0$ from the target data set $D_0$. Similarly, we construct auxiliary empirical distributions $\hat{p}_1, \dots, \hat{p}_c$ from the auxiliary data sets. The e-mixture of these auxiliary empirical distributions can be written, with an abuse of the delta functions, as
$$
\hat{p}_e(x; \pi) = \exp\!\left( \sum_{i=1}^{c} \pi_i \log \hat{p}_i(x) - b(\pi) \right),
\qquad (3.3)
$$
where $\pi$ is the mixture ratio vector of the auxiliary pdfs $\hat{p}_1, \dots, \hat{p}_c$. Expression 3.3 is not mathematically well defined because of the logarithm of delta functions, which is another source of difficulty in nonparametric e-mixture modeling. Thus, we define a mixture model that satisfies equation 2.12 of theorem 3 as an e-mixture of the nonparametric models. Given a mixture parameter $\pi$, from theorem 3, we obtain a nonparametric e-mixture as the minimizer of the weighted sum of KL divergences. As the set of pdfs in theorem 3, in which the e-mixture is sought, we use the set of pdfs parameterized by the weights $w$, of the form
$$
\hat{p}(x; w) = \sum_{j=1}^{N} w_j\, \delta(x - x_j), \qquad w_j \ge 0, \quad \sum_{j=1}^{N} w_j = 1.
\qquad (3.4)
$$
A weight vector $w$ specifies a point in this set, and the point is determined by equation 2.11. Although the weight vector depends on the mixture ratio $\pi$, for simplicity, we write the weight vector as $w$ instead of $w(\pi)$.
Conversely, given a weight vector $w$ for all the data points in $D$, we aim to optimize the mixture parameter $\pi$ so that the e-mixture, represented in the form of equation 2.11, is a good approximation of the target distribution $\hat{p}_0$. The curved surface with the dotted lines in the right panel of Figure 4 shows the subspace of the e-mixtures of the empirical distributions, which is defined by
$$
\hat{\mathcal{E}} = \left\{ \hat{p}\big( x; w(\pi) \big) \;\middle|\; \pi_i \ge 0,\ \sum_{i=1}^{c} \pi_i = 1 \right\}.
\qquad (3.5)
$$
A mixture parameter $\pi$ specifies a point in $\hat{\mathcal{E}}$. Let the projection of $\hat{p}_0$ onto $\hat{\mathcal{E}}$ be $\hat{p}(x; w(\pi^*))$, namely,
$$
\pi^* = \operatorname*{arg\,min}_{\pi} D_{\mathrm{KL}}\big( \hat{p}_0 \,\|\, \hat{p}(x; w(\pi)) \big).
$$
Since we optimize $\pi$ so that the projection is closest to $\hat{p}_0$ and since the projection is specified by $w$, the optimal $\pi$ depends on the given $w$. For simplicity, we write the mixture parameter as $\pi$ instead of $\pi(w)$. We perform our proposed e-mixture estimation algorithm on this surface, in analogy to the conventional parametric mixture estimation method on the parametric surface, by satisfying the two requisite conditions for the e-mixture described in section 3.3.

Finally, we note that equation 3.4 does not directly give the e-mixture of the auxiliary pdfs, because each weight $w_j$ in equation 3.4 contains $\pi$ implicitly. Since the m- and e-mixtures place different restrictions on $\pi$, the restrictions on the weight $w$ also differ depending on the mixture model. Equation 3.4 is the m-representation of the e-mixture, and we develop an algorithm that optimizes the weight vector $w$ in equation 3.4 for the e-mixture estimation.

3.3  Requisite Conditions for the e-Mixture

As described in section 3.2, the gradient descent method cannot be used for the nonparametric e-mixture estimation because the e-representation of the nonparametric mixture model is only formal (see equation 3.3). Therefore, we consider a geometrical algorithm that needs only the KL divergence. To estimate the e-mixture in a nonparametric manner, the following two conditions are imposed:

  1. $\hat{p}(x; w)$ is the e-mixture of the auxiliary empirical distributions $\hat{p}_1, \dots, \hat{p}_c$.

  2. $\hat{p}(x; w)$ is the projection of $\hat{p}_0$ onto the subspace $\hat{\mathcal{E}}$.

Let us consider the first condition. According to theorem 3 (characterization of the e-mixture), the pdf that minimizes the weighted KL divergence is written in the form of the e-mixture. Therefore, if we obtain the weight $w$ of $\hat{p}(x; w)$, written in the m-mixture form of equation 2.11, that minimizes the weighted KL divergence in equation 2.12, then $\hat{p}(x; w)$ is regarded as the e-mixture of the auxiliary pdfs with the given mixture ratio $\pi$.

For the second condition, we consider the subspace $\hat{\mathcal{E}}$, which includes $\hat{p}(x; w)$. Our objective is to find the pdf in $\hat{\mathcal{E}}$ closest to $\hat{p}_0$ in the sense of the KL divergence. Any pdf in $\hat{\mathcal{E}}$ then satisfies theorem 1 (Pythagorean relation) when we take $\hat{p}_0$, the projection $\hat{p}(x; w)$, and that pdf as $p$, $q$, and $r$ in equation 2.3. Moreover, because each auxiliary pdf $\hat{p}_i$ is in $\hat{\mathcal{E}}$, the following equation holds:
$$
D_{\mathrm{KL}}(\hat{p}_0 \,\|\, \hat{p}_i) = D_{\mathrm{KL}}\big( \hat{p}_0 \,\|\, \hat{p}(x; w) \big) + D_{\mathrm{KL}}\big( \hat{p}(x; w) \,\|\, \hat{p}_i \big), \qquad i = 1, \dots, c.
\qquad (3.6)
$$
The subspace of weighted empirical distributions of the form of equation 3.4 and the subspace $\hat{\mathcal{E}}$ of equation 3.5 are the two spaces to which $\hat{p}(x; w)$ belongs. The former is the search space for $w$ given $\pi$ (condition 1), while the latter is the search space for $\pi$ given the weight $w$ (condition 2). To find the optimal $\hat{p}(x; w)$ in these subspaces, a nonparametric KL divergence estimator is thus required.

3.4  Nonparametric KL Divergence Estimator

Divergence estimators based on the $k$-nearest-neighbor method have been widely investigated (Wang, Kulkarni, & Verdú, 2005, 2009), and such methods have been extended to deal with weighted observations (Hino & Murata, 2013). Suppose we are given two weighted data sets $\mathcal{D} = \{(x_j, w_j)\}$ and $\mathcal{D}' = \{(x'_j, w'_j)\}$, whose empirical distributions are expressed by equation 1.3. We denote the index of the $k$th nearest point from an inspection point $x$ in $\mathcal{D}$ by $j_k(x)$ ($k = 1, \dots, |\mathcal{D}|$). Then we define the quantile of $k$ with respect to the inspection point $x$ by $\alpha_k(x) = \sum_{l=1}^{k} w_{j_l(x)}$. Conversely, when a quantile $\alpha \in (0, 1)$ is specified, the point $x_{j_k(x)}$ in $\mathcal{D}$, where $k$ is the smallest index such that $\alpha_k(x) \ge \alpha$, is called the $\alpha$-quantile point of $x$. Let $\epsilon_\alpha(x)$ be the Euclidean distance between the inspection point $x$ and its $\alpha$-quantile point in $\mathcal{D}$, and let the $\epsilon$-ball be the hypersphere of radius $\epsilon_\alpha(x)$ centered at $x$. The probability mass of the $\epsilon$-ball centered at $x$ is denoted by
$$
P_\alpha(x) = \int_{\| y - x \| \le \epsilon_\alpha(x)} p(y)\, dy.
\qquad (3.7)
$$
We obtain the following approximation by applying a Taylor expansion to the integrand in equation 3.7,
$$
P_\alpha(x) \simeq p(x)\, c_d\, \epsilon_\alpha(x)^{d}, \qquad c_d = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)},
\qquad (3.8)
$$
where $c_d$ is the volume of the unit ball in $\mathbb{R}^d$ and $\Gamma(\cdot)$ is the gamma function. Since the probability mass $P_\alpha(x)$ is approximately equal to the quantile $\alpha$, the probability density at the inspection point $x$ is estimated from the above expression as
$$
\hat{p}(x) = \frac{\alpha}{c_d\, \epsilon_\alpha(x)^{d}}.
\qquad (3.9)
$$
From this density estimator, we can obtain an estimator of Shannon's differential entropy as
$$
\hat{H}(p) = -\sum_{x_j \in \mathcal{D}} w_j \log \frac{\alpha}{c_d\, \epsilon_\alpha(x_j; \mathcal{D}_{-j})^{d}},
\qquad (3.10)
$$
where $\epsilon_\alpha(x; \mathcal{D})$ denotes the distance from $x$ to its $\alpha$-quantile point in a weighted data set $\mathcal{D}$, and $\mathcal{D}_{-j}$ is the renormalized weighted data set excluding $x_j$. When we have two weighted data sets, $\mathcal{D}$ and $\mathcal{D}'$, a cross-entropy estimator is written as
$$
\hat{H}(p; q) = -\sum_{x_j \in \mathcal{D}} w_j \log \frac{\alpha}{c_d\, \epsilon_\alpha(x_j; \mathcal{D}')^{d}}.
\qquad (3.11)
$$
By specifying the quantile $\alpha$, we can estimate the KL divergence in a nonparametric manner:
$$
\hat{D}_{\mathrm{KL}}(p \,\|\, q) = \hat{H}(p; q) - \hat{H}(p).
\qquad (3.12)
$$
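The following is a rough Python sketch of such a quantile-based estimator, written in the notation of equations 3.9 to 3.12; it is meant only to convey the idea, and the exact estimator of Hino and Murata (2013) may differ in its details:

```python
import numpy as np
from scipy.special import gammaln

def quantile_radius(x, data, weights, alpha):
    """Distance from x to its alpha-quantile point in the weighted data set."""
    d = np.linalg.norm(data - x, axis=1)
    order = np.argsort(d)
    cum = np.cumsum(weights[order])
    k = int(np.searchsorted(cum, alpha))          # smallest k with cumulative mass >= alpha
    return d[order[min(k, len(d) - 1)]]

def log_density(x, data, weights, alpha):
    dim = data.shape[1]
    log_cd = (dim / 2) * np.log(np.pi) - gammaln(dim / 2 + 1)   # log volume of the unit ball
    eps = quantile_radius(x, data, weights, alpha)
    return np.log(alpha) - log_cd - dim * np.log(eps)            # equation 3.9

def kl_estimate(X, wX, Y, wY, alpha=0.05):
    """KL(p || q) with p, q given by the weighted samples (X, wX) and (Y, wY)."""
    div = 0.0
    for j, x in enumerate(X):
        mask = np.arange(len(X)) != j                            # leave x_j out
        w = wX[mask] / wX[mask].sum()
        div += wX[j] * (log_density(x, X[mask], w, alpha)        # entropy part
                        - log_density(x, Y, wY, alpha))          # cross-entropy part
    return div

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2)); wX = np.full(300, 1 / 300)
Y = rng.normal(0.5, 1.0, size=(400, 2)); wY = np.full(400, 1 / 400)
print(kl_estimate(X, wX, Y, wY))   # roughly D_KL(N(0, I) || N([0.5, 0.5], I)) = 0.25
```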

3.5  Why the e-Mixture?

Before deriving the proposed algorithm, we emphasize our motivation for estimating the e-mixture. The major reason we consider the e-mixture is that it satisfies the principle of maximum entropy (Cover & Thomas, 1991). Let $\mathcal{F}$ be a set of pdfs satisfying
$$
p(x) \ge 0,
\qquad (3.13)
$$
$$
\int p(x)\, dx = 1,
\qquad (3.14)
$$
$$
\int p(x)\, \phi_i(x)\, dx = m_i, \qquad i = 1, \dots, c,
\qquad (3.15)
$$
where the $\phi_i$ are certain vector-valued functions and the $m_i$ are moments with respect to the functions $\phi_i$.
Theorem 4
(maximum entropy and e-mixture) (Murata & Fujimoto, 2009). Let $p^*$ be the maximum entropy pdf in $\mathcal{F}$, which is defined by
$$
p^* = \operatorname*{arg\,max}_{p \in \mathcal{F}} \left( -\int p(x) \log p(x)\, dx \right).
\qquad (3.16)
$$
Then the probability density function $p^*$ in equation 3.16 is written in the form of an exponential family as
$$
p^*(x) = \exp\!\left( \sum_{i=1}^{c} \theta_i^{\top} \phi_i(x) - b(\theta) \right),
\qquad (3.17)
$$
where the natural parameters $\theta_i$ are determined by the moment constraints and $b(\theta)$ is the normalization term.

This theorem shows that a pdf belonging to an exponential family satisfies the principle of maximum entropy. Thus, the e-mixture also satisfies the principle of maximum entropy. Conversely, we now consider the problem of representing a mixture of the auxiliary pdfs as a pdf in the exponential family. The most natural definition of $\phi_i$ for including all the auxiliary pdfs in this exponential form is $\phi_i(x) = \log p_i(x) + a_i$, where $a_i$ is a constant. This form can be easily derived by considering mixture ratios that are all zero except for a single one. Therefore, the e-mixture is the most natural pdf in the form of an exponential family that can represent all the auxiliary pdfs $p_1, \dots, p_c$.

4  Algorithm for the Nonparametric e-Mixture

To estimate the e-mixture of distributions that approximates the target distribution, we first describe a general algorithmic framework. Considering the two conditions stated in section 3.3, it is natural to find $\hat{p}(x; w)$ by algorithm 1. In the parametric setting, the projection of the target distribution onto the subspace spanned by the auxiliary distributions and the calculation of the mixture parameter $\pi$ for the projected point are both straightforward. However, in our nonparametric setting, this task is difficult in general. To overcome this difficulty, we derive a specific nonparametric e-mixture estimation algorithm by using the techniques introduced in sections 2 and 3.1.

Since the e-mixture of the nonparametric models is determined by the weight $w$ and the mixture ratio $\pi$, we denote the e-mixture by $\hat{p}(x; w, \pi)$ when this is clearer. Step 1 of the proposed algorithm computes the weight $w$ given a fixed mixture ratio $\pi$, and step 2 computes the mixture ratio $\pi$ given a fixed weight $w$. These two steps are repeated alternately until $\pi$ converges. For the initialization, we start with uniform weights $w_j = 1/N$ and uniform ratios $\pi_i = 1/c$.
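Schematically, the alternating procedure has the following structure (a paraphrase of the description above, not the letter's algorithm 2; step1 and step2 stand for the updates detailed in sections 4.1 and 4.2):

```python
import numpy as np

def estimate_e_mixture(step1, step2, n_data, n_aux, max_iter=100, tol=1e-6):
    """Alternate step 1 (weights w) and step 2 (mixture ratio pi) until pi converges."""
    w = np.full(n_data, 1.0 / n_data)      # uniform initial weights
    pi = np.full(n_aux, 1.0 / n_aux)       # uniform initial mixture ratio
    for _ in range(max_iter):
        w = step1(pi, w)                   # condition 1: e-mixture for the fixed pi
        pi_new = step2(w, pi)              # condition 2: restore the Pythagorean relation
        converged = np.max(np.abs(pi_new - pi)) < tol
        pi = pi_new
        if converged:
            break
    return w, pi
```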

4.1  Step 1

In step 1, the e-mixture is estimated by using a given mixture ratio $\pi$ based on theorem 3:

formula
Step 1: Compute the weight $w$ that minimizes the weighted KL divergence for a fixed mixture ratio $\pi$:
$$
\hat{w} = \operatorname*{arg\,min}_{w}\ \sum_{i=1}^{c} \pi_i\, D_{\mathrm{KL}}\big( \hat{p}(\cdot\,; w) \,\|\, \hat{p}_i \big)
\quad \text{subject to } w_j \ge 0,\ \sum_{j=1}^{N} w_j = 1.
\qquad (4.1)
$$
By substituting the estimator defined by equation 3.12 into equation 4.1, we obtain the objective function:
formula
4.2
where $u$ denotes the uniform weight. By denoting
formula
equation 4.2 is written as
formula
4.3
We obtain the weight vector $w$ by minimizing equation 4.3 with the gradient projection method. Since the objective function is discontinuous because of the index of the $\alpha$-quantile point, we evaluate $\epsilon_\alpha$ using the weight obtained in the previous iteration. That is, with some abuse of notation, the gradient at iteration $t$ is approximated with $\epsilon_\alpha$ held fixed at its previous value, and the updating formula is given by
$$
w^{(t+1)} = P\Big( w^{(t)} - \eta\, \nabla_w \hat{L}\big( w^{(t)} \big) \Big),
\qquad (4.4)
$$
where $\hat{L}(w)$ denotes the objective function in equation 4.3, $\eta$ is the learning rate, and $P$ is the projection operator for the weight $w$, that is, the projection onto $\{ w \mid w_j \ge 0,\ \sum_{j=1}^{N} w_j = 1 \}$. When we estimate the entropy and the cross-entropy by using equations 3.10 and 3.11, the parameter $\alpha$ is set to be small to reduce the bias. Then we can assume that for most of the data points, a small change in $w$ does not change the distance $\epsilon_\alpha$ and that the derivatives of $\epsilon_\alpha$ with respect to $w$, if they exist, are zero. For the data points such that the derivative of $\epsilon_\alpha$ does not exist, a small change in $w$ causes a jump in the value of $\epsilon_\alpha$. The size of the jump is of the order of the distance between two data points, it is divided by the sum of the distances, and it is multiplied by a small coefficient; hence, we expect the resultant absolute value to be small. From this consideration, we omit this term from the gradient of the objective function. We can construct a pathological example in which the jump is too large to ignore, but in our experiments, the objective function monotonically decreased under the approximated gradient descent.

We set the learning rate $\eta$ in our experiments by a parameter that determines the updating speed. We then run this gradient projection method until the weight $w$ converges.
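A minimal sketch of the update in equation 4.4 is the following; the Euclidean projection onto the probability simplex is one standard choice for the operator $P$, and the gradient values below are placeholders:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w_j >= 0, sum_j w_j = 1} (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def step1_update(w, grad, eta):
    """One projected gradient step on the weights (cf. equation 4.4)."""
    return project_simplex(w - eta * grad)

w = np.full(5, 0.2)
grad = np.array([0.3, -0.1, 0.0, 0.2, -0.4])   # placeholder gradient of the objective
print(step1_update(w, grad, eta=2.0))          # result stays on the probability simplex
```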

4.2  Step 2

We next determine the mixture ratio $\pi$ so that $\hat{p}(x; w)$ becomes the projection of $\hat{p}_0$ onto the subspace $\hat{\mathcal{E}}$. Figure 6 shows the geometrical interpretation of step 2. If $\hat{p}(x; w)$ is the projection of $\hat{p}_0$ onto $\hat{\mathcal{E}}$, the dotted line between $\hat{p}_0$ and $\hat{p}(x; w)$ and the solid line between $\hat{p}(x; w)$ and $\hat{p}_i$ are orthogonal; namely, $\hat{p}_0$, $\hat{p}(x; w)$, and $\hat{p}_i$ satisfy the Pythagorean relation. Based on this geometrical intuition, we update the mixture ratio $\pi_i$ according to the violation of the Pythagorean relation $\Delta_i$, which takes the value 0 when the mixture ratio is optimal. The two top triangles in Figure 6 show the relations among $\hat{p}_0$, $\hat{p}(x; w)$, and $\hat{p}_i$. The upper left panel shows the case of an acute-angled triangle, where $\Delta_i$ is positive. In this case, $\pi_i$ is smaller than optimal, and the mixture should be closer to $\hat{p}_i$. Conversely, the upper right panel shows the case of an obtuse-angled triangle, where $\pi_i$ is larger than optimal.

Figure 6:

Step 2 of the algorithm. Each auxiliary empirical distribution $\hat{p}_i$ is on the curved surface with the dotted lines, which is the subspace of the e-mixtures of the empirical distributions. In this space, the target distribution $\hat{p}_0$, each auxiliary distribution $\hat{p}_i$, and the weighted empirical distribution $\hat{p}(x; w)$ form triangles. For $\hat{p}(x; w)$ to be the projection of $\hat{p}_0$ onto the subspace, each triangle must satisfy the Pythagorean relation.


To reflect this geometrical understanding, we introduce a weakly increasing piecewise linear function $f$ of $\Delta_i$, as shown in Figure 7. The function $f$ controls $\pi_i$, reflecting the degree of the violation of the Pythagorean relation $\Delta_i$. The function increases (decreases) $\pi_i$ when $\Delta_i$ satisfies $\Delta_i > 0$ ($\Delta_i < 0$). By using this function $f$, we update $\pi$ as follows.

Figure 7:

The weakly increasing piecewise linear function $f$.


Step 2: Update the mixture ratio $\pi$ according to the violation of the Pythagorean relation:
formula
4.5
where $\Delta_i$ is estimated by the estimator in equation 3.12 as
$$
\Delta_i = \hat{D}_{\mathrm{KL}}(\hat{p}_0 \,\|\, \hat{p}_i) - \hat{D}_{\mathrm{KL}}\big( \hat{p}_0 \,\|\, \hat{p}(\cdot\,; w) \big) - \hat{D}_{\mathrm{KL}}\big( \hat{p}(\cdot\,; w) \,\|\, \hat{p}_i \big),
\qquad (4.6)
$$
and the function $f$ is defined as
formula
4.7
where $\kappa$ is a positive constant.

Note that the function $f$ satisfies $f(0) = 1$. In our experiments, we use fixed values of $\alpha$, $\eta$, and $\kappa$. After updating $\pi_i$ for all $i$, they are normalized so that $\sum_{i=1}^{c} \pi_i = 1$. These two steps search for the e-mixture of the auxiliary pdfs closest to $\hat{p}_0$ by assigning weights to the samples in the set $D$. Since we assume that the weight vector $w$ satisfies the definition of a probability distribution, our proposed algorithm eventually provides a way to sample from the set $D$.
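One plausible realization of this update is sketched below; the multiplicative form, the clipping thresholds, and the value of $\kappa$ are assumptions, because equations 4.5 and 4.7 are not reproduced here in full:

```python
import numpy as np

def pythagorean_violation(kl_t_aux, kl_t_mix, kl_mix_aux):
    """Delta_i = D(p0 || p_i) - D(p0 || p(w)) - D(p(w) || p_i), cf. equation 4.6."""
    return kl_t_aux - kl_t_mix - kl_mix_aux

def step2_update(pi, delta, kappa=0.1):
    f = 1.0 + kappa * np.clip(delta, -1.0, 1.0)   # a weakly increasing piecewise linear f with f(0) = 1
    pi_new = pi * f                               # enlarge pi_i when Delta_i > 0
    return pi_new / pi_new.sum()                  # renormalize onto the simplex

pi = np.array([0.5, 0.5])
delta = np.array([0.3, -0.2])        # estimated violations for the two auxiliary pdfs
print(step2_update(pi, delta))
```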

The hyperparameter $\eta$ is introduced to control the update speed in equation 4.4, and $\kappa$ is used to control the penalty for violating the Pythagorean relation in equation 4.7; both were tuned in our preliminary experiments.

As for the computational cost of the proposed algorithm, the most time-consuming part is estimating the KL divergence by using equation 3.12, which requires sorting the data points with respect to their distances from each of the $N$ inspection points and thus amounts to $O(N^2 \log N)$ computation. Note that this is mainly the cost of sorting, and it is required only in the proposed algorithm. We close the section with the pseudocode of the nonparametric e-mixture estimation algorithm in algorithm 2.

Algorithm 2: Pseudocode of the nonparametric e-mixture estimation algorithm (steps 1 and 2 iterated until convergence).

5  Experiments

We conducted a set of experiments on three synthetic data sets and one real-world data set to evaluate the proposed algorithm. For the real-world data set, we considered the situation in which we have only a few samples from the target pdf, and our algorithm is used for data augmentation in a classification problem. In all the experiments, to reduce the computational cost of estimating the KL divergence, we sampled half the number of data points from $D$, where $D$ is redefined as the sampled subset of the original.

5.1  Synthetic Data

5.1.1  Simple Setup

First, we used synthetic data to demonstrate how our proposed algorithm works. We show that our nonparametric e-mixture estimation algorithm works when the underlying distributions are gaussian. Suppose we have a set of auxiliary data sets $D_1, \dots, D_c$. Each data set has 2000 data points sampled from a gaussian distribution $N(\mu_i, \Sigma_i)$. We are also given a target data set $D_0$, which contains a small number of data points, sampled from the e-mixture of the auxiliary pdfs. The mean vector and covariance matrix of the e-mixture are calculated by
$$
\mu_e = \Sigma_e \sum_{i=1}^{c} \pi_i\, \Sigma_i^{-1} \mu_i
\qquad (5.1)
$$
and
$$
\Sigma_e = \left( \sum_{i=1}^{c} \pi_i\, \Sigma_i^{-1} \right)^{-1},
\qquad (5.2)
$$
respectively. For illustration purposes, we consider a two-component, two-dimensional gaussian mixture. Our aim is to generate 2000 points from the nonparametric e-mixture distribution constructed from $D_0$, $D_1$, and $D_2$. We set the quantile value $\alpha$ in equation 3.12, the updating speed $\eta$ in equation 4.4, and the coefficient $\kappa$ in equation 4.7 so as to facilitate convergence. The value of the quantile is determined so that the $\epsilon$-ball around an inspection point contains 5 to 10 data points when the weight is uniform.
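The synthetic setup can be sketched as follows; the means, covariances, and mixture ratio are illustrative stand-ins rather than the letter's values:

```python
import numpy as np

rng = np.random.default_rng(0)
mus = [np.array([0.0, 0.0]), np.array([4.0, 1.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
pi = [0.6, 0.4]

prec = sum(p * np.linalg.inv(S) for p, S in zip(pi, covs))
cov_e = np.linalg.inv(prec)                                    # equation 5.2
mu_e = cov_e @ sum(p * np.linalg.inv(S) @ m
                   for p, S, m in zip(pi, covs, mus))          # equation 5.1

D_aux = [rng.multivariate_normal(m, S, size=2000) for m, S in zip(mus, covs)]
D_target = rng.multivariate_normal(mu_e, cov_e, size=100)      # scarce target data
print(mu_e, cov_e)
```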

The top panels of Figure 8A show the data sets $D_1$, $D_2$, and 2000 points sampled from the ground-truth e-mixture. The bottom panels show the target data set $D_0$, the uniformly weighted empirical distribution, and the result of the nonparametric e-mixture estimation, respectively. The size of each mark in the bottom panels represents the weight of each sample.

Figure 8:

(A) Top panels show the scatter plots of the data sets $D_1$ and $D_2$ and 2000 points sampled from the ground-truth e-mixture. Bottom panels plot the target data set $D_0$, the uniformly weighted empirical distribution, and the estimated e-mixture. (B) Mixture ratios $\pi_1$ and $\pi_2$ over the iterations. (C) Estimated gaussian distributions over the iterations.


The estimated mixture ratios over the iterations are shown in Figure 8B. The horizontal dotted lines are the true values of the mixture ratio $\pi$. In this experiment, the underlying distributions for both the target and the auxiliary data are gaussians, which are characterized by their means and covariance matrices. Figure 8C shows the contours of the empirical covariance matrices centered at the empirical means, which are estimated by using the obtained weights, to see the behavior of the estimates as the algorithm progresses. Each ellipse includes 90% of the probability mass of the corresponding gaussian distribution. The solid ellipses express the original gaussian distributions $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$. The solid gray ellipse is the ground-truth e-mixture, while the dotted ellipses express the estimated e-mixtures over the iterations.

These experimental results suggest that the proposed algorithm successfully approximates the ground-truth distribution of the target data set by the nonparametric e-mixture constructed from the given data sets with the appropriate weights.

5.1.2  The e-Mixture of pdfs Represented by Gaussian Mixtures

In the second experiment with a synthetic data set, we assess how the proposed algorithm performs in the case where the auxiliary pdfs are nongaussian. In particular, we consider the case where the auxiliary pdfs are multimodal, each represented by an m-mixture of gaussians. We use two five-component GMMs as the auxiliary pdfs. We obtain two data sets $D_1$ and $D_2$, each of which has 2500 points sampled from a five-component GMM, which is not included in the exponential family; the data points of these two data sets form an S-shape. In this experiment, we use the e-mixture of these two GMMs as the target pdf. The left panel of Figure 9 shows the data sets $D_1$, $D_2$, and $D_0$, the last of which is sampled from the target pdf. The points of the target data set were sampled as follows. Initially, we created two five-component GMMs; detailed values of the parameters are shown in appendix C. Their density outputs are shown in panels a and b of Figure 9, respectively. Then we obtained the e-mixture of the two GMMs, which is shown in panel c, in the form of a density function with a given mixture ratio. We sampled the target data points from this density by using the rejection sampling method.

Figure 9:

(Left) The data sets $D_1$, $D_2$, and 500 points sampled from the e-mixture of the five-component GMMs. The right panels show the density outputs of (a) the first GMM; (b) the second GMM; (c) the e-mixture of the two GMMs; (d) the target data set, which is sampled by rejection sampling from the e-mixture of the GMMs; (e) the result of the nonparametric m-mixture estimation; (f) the result of the proposed algorithm. We use the quantile value $\alpha$, the updating speed $\eta$, and the parameter $\kappa$ for the violation of the Pythagorean relation.
