## Abstract

Divergence estimators based on direct approximation of density ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is a challenging task in practice. In this letter, we use relative divergences for distribution comparison, which involves approximation of relative density ratios. Since relative density ratios are always smoother than corresponding ordinary density ratios, our proposed method is favorable in terms of nonparametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposedapproach.

## 1. Introduction

Comparing probability distributions is a fundamental task in statistical data processing. It can be used for outlier detection (Smola, Song, & Teo, 2009; Hido, Tsuboi, Kashima, Sugiyama, & Kanamori, 2011), two-sample homogeneity test (Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2007; Sugiyama, Suzuki, Itoh, Kanamori, & Kimura, 2011), and transfer learning (Shimodaira, 2000; Sugiyama, Krauledat, & Müller, 2007), for example.

*p*(

**) and would be to estimate a divergence from**

*x**p*(

**) to , such as the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951): A naive way to estimate the KL divergence is to separately approximate the densities**

*x**p*(

**) and from data and plug the estimated densities into the above definition. However, since density estimation is known to be a hard task (Vapnik, 1998), this approach does not work well unless a good parametric model is available. Recently a divergence estimation approach has been proposed that directly approximates the density ratio, without going through separate approximation of densities**

*x**p*(

**) and (Sugiyama et al., 2008; Nguyen, Wainwright, & Jordan, 2010). Such density-ratio approximation methods were proved to achieve the optimal nonparametric convergence rate in the mini-max sense.**

*x**p*(

**) to is defined as The PE divergence is a squared-loss variant of the KL divergence, and both belong to the class of the Ali-Silvey-Csiszár divergences (which is also known as the**

*x**f*-divergences; see Ali & Silvey, 1966; Csiszár, 1967). Thus, the PE and KL divergences share similar properties; they are nonnegative and vanish if and only if .

Similar to the KL divergence estimation, the PE divergence can also be accurately estimated based on density-ratio approximation (Kanamori, Hido, & Sugiyama, 2009): the density-ratio approximator called unconstrained least-squares importance fitting (uLSIF) gives the PE divergence estimator analytically, which can be computed by solving a system of linear equations. The practical usefulness of the uLSIF-based PE divergence estimator was demonstrated in various applications such as outlier detection (Hido et al., 2011), two-sample homogeneity test (Sugiyama et al., 2011), and dimensionality reduction (Suzuki & Sugiyama, 2010).

In this letter, we first establish the nonparametric convergence rate of the uLSIF-based PE divergence estimator that elucidates its superior theoretical properties. However, it also reveals that its convergence rate is actually governed by the sup-norm of the true density-ratio function: max_{x}*r*(** x**). This implies that in the region where the denominator density takes small values, the density ratio tends to take large values, and therefore the overall convergence speed becomes slow. More critically, density ratios can diverge to infinity under a rather simple setting, as when the ratio of two gaussian functions is considered (Cortes, Mansour, & Mohri, 2010). This makes the paradigm of divergence estimation based on density-ratio approximation unreliable.

*p*(

**) to the -mixture density for . For example, the -relative PE divergence is given by We estimate the -relative divergence by direct approximation of the -relative density-ratio:**

*x*A notable advantage of this approach is that the -relative density ratio is always bounded above by when , even when the ordinary density ratio is unbounded. Based on this feature, we theoretically show that the -relative PE divergence estimator based on -relative density-ratio approximation is more favorable than the ordinary density-ratio approach in terms of nonparametric convergence speed.

We further prove that under a correctly specified parametric setup, the asymptotic variance of our -relative PE divergence estimator does not depend on the model complexity. This means that the proposed -relative PE divergence estimator barely overfits even with complex models.

Through extensive experiments on outlier detection, a two-sample homogeneity test, and transfer learning, we demonstrate that our proposed -relative PE divergence estimator compares favorably with alternative approaches.

The rest of this letter is structured as follows. In section 2, our proposed relative PE divergence estimator is described. In section 3, we providenonparametric analysis of the convergence rate and parametric analysis of the variance of the proposed PE divergence estimator. In section 4, we experimentally evaluate the performance of the proposed method on various tasks. In section 5, we conclude by summarizing our contributions and describing future prospects.

## 2. Estimation of Relative Pearson Divergence via Least-SquaresRelative Density-Ratio Approximation

In this section, we propose an estimator of the relative Pearson (PE) divergence based on least-squares relative density-ratio approximation.

### 2.1. Problem Formulation.

*d*-dimensional distribution

*P*with density

*p*(

**) and i.i.d. samples from another**

*x**d*-dimensional distribution with density : The goal of this letter is to compare the two underlying distributions

*P*and using only the two sets of samples, and .

*p*(

**) and : Let be the -relative density ratio of**

*x**p*(

**) and : We define the -relative PE divergence from**

*x**p*(

**) to as where denotes the expectation of**

*x**f*(

**) under**

*x**p*(

**): When , is reduced to the ordinary PE divergence. Thus, the -relative PE divergence can be regarded as a smoothed extension of the ordinary PE divergence.**

*x*Below, we give a method for estimating the -relative PE divergence based on the approximation of the -relative density ratio.

### 2.2. Direct Approximation of α-Relative Density Ratios.

Here, we describe a method for approximating the -relative density ratio, equation 2.1.

*n*-dimensional vector with the th element, It is easy to confirm that the solution of equation 2.4 can be analytically obtained as where

*I*_{n}denotes the

*n*-dimensional identity matrix. Finally, a density-ratio estimator is given as

When , the above method is reduced to a direct density-ratio estimator called unconstrained least-squares importance fitting (uLSIF; Kanamori et al., 2009). Thus, the method can be regarded as an extension of uLSIF to the -relative density ratio. For this reason, we refer to our method as relative uLSIF (RuLSIF).

The performance of RuLSIF depends on the choice of the kernel function (the kernel width in the case of the gaussian kernel) and the regularization parameter . Model selection of RuLSIF is possible based on cross-validation with respect to the squared-error criterion *J*, in the same way as the original uLSIF (Kanamori et al., 2009).

### 2.3. α-Relative PE Divergence Estimation Based on RuLSIF.

### 2.4. Illustrative Examples.

Here, we numerically illustrate the behavior of RuLSIF, equation 2.6, using toy data sets. Let the numerator distribution be *P*=*N*(0, 1), where denotes the normal distribution with mean and variance . The denominator distribution is set as follows:

:

*P*and are the same.: has smaller standard deviation than

*P*.: has larger standard deviation than

*P*.:

*P*and have different means.: contains an additional component to

*P*.

We draw samples from the above densities and compute RuLSIF for , 0.5, and 0.95.

Figure 1 shows the true densities, true densityratios, and their estimates by RuLSIF. As can be seen from the graphs, the profiles of the true -relative densityratios get smoother as increases. In particular, in data sets b and d, the true density ratios for diverge to infinity, while those for and 0.95 are bounded (by ). Overall, as gets large, the estimation quality of RuLSIF tends to be improved since the complexity of the true density-ratio functions is reduced.

Note that in data set a where, , the true density ratio does not depend on since for any . However, the estimated density ratios still depend on through the matrix (see equation 2.5).

## 3. Theoretical Analysis

In this section, we analyze the theoretical properties of the proposed PE divergence estimators. More specifically, we provide nonparametric analysis of the convergence rate in section 3.1 and parametric analysis of the estimation variance in section 3.2. Since our theoretical analysis is highly technical, we focus on explaining the practical insights we can gain from the theoretical results here. We describe all the mathematical details of the nonparametric convergence rate analysis in appendix A and the parametric variance analysis in appendix B.

### 3.1. Nonparametric Convergence Analysis.

First, we elucidate thenonparametric convergence rate of the proposed PE estimators. Here, we practically regard the function space as an infinite-dimensional reproducing kernel Hilbert space (RKHS; Aronszajn, 1950) such as the gaussian kernel space and as the associated RKHS norm.

#### 3.1.1. Theoretical Results.

#### 3.1.2. Interpretation.

The above results also show that and have different asymptotic convergence rates. The leading term in equation 3.2 is of order , while the leading term in equation 3.3 is of order , which is slightly slower (depending on the complexity ) than . Thus, would be more accurate than in large sample cases. Furthermore, when , holds, and thus *c*=0 holds (see equation 3.4). Then the leading term in equation 3.2 vanishes, and therefore has an even faster convergence rate of order , which is slightly slower (depending on the complexity ) than . Similarly, if is close to 1, and thus holds.

When is not large enough to be able to neglect the terms of , the terms of matter. If and are large (this can happen, e.g., when is close to 0), the coefficient of the -term in equation 3.2 can be larger than that in equation 3.3. Then would be more favorable than in terms of approximation accuracy.

#### 3.1.3. Numerical Illustration.

Let us numerically investigate the above interpretation using the same artificial data set as in section 2.4.

Figure 2 shows the mean and standard deviation of and over 100 runs for , 0.5, and 0.95, as functions of *n* ( in this experiment). The true (which was numerically computed) is also plotted in the graphs. The graphs show that both estimators, and , approach the true as the number of samples increases, and the approximation error tends to be smaller if is larger.

When is large, tends to perform slightly better than . When is small and the number of samples is small, slightly compares favorably with . Overall, these numerical results agree well with our theory.

### 3.2. Parametric Variance Analysis.

Next, we analyze the asymptotic variance of the PE divergence estimator , equation 2.7, under a parametric setup.

#### 3.2.1. Theoretical Results.

*b*is a finite number. Here we assume that the above parametric model is correctly specified; it includes the true relative density-ratio function : there exists such that Here, we use RuLSIF without regularization, that is, in equation 3.1.

#### 3.2.2. Interpretation.

Equation 3.5 shows that up to , the variance of depends on only the true relative density ratio , not on the estimator of . This means that the model complexity does not affect the asymptotic variance. Therefore, overfitting would hardly occur in the estimation of the relative PE divergence even when complex models are used. We note that the above superior property is applicable only to relative PE divergence estimation, not to relative density-ratio estimation. This implies that overfitting occurs in relative density-ratio estimation, but the approximation error cancels out in relative PE divergence estimation.

Equation 3.6 shows that the variance can be upper-bounded by the quantity depending on , which is monotonically lowered if is reduced. Since monotonically decreases as increases, our proposed approach of estimating the -relative PE divergence (with ) would be more advantageous than the naive approach of estimating the plain PE divergence (which corresponds to ) in terms of parametric asymptotic variance.

#### 3.2.3. Numerical Illustration.

*k*+1. The -relative density ratio can be expressed using the ordinary density ratio as Thus, when

*k*>1, the above model includes the true relative density ratio of data sets b and c. We test RuLSIF with and 0.8 for model 3.8 with degree . The parameter is learned so that equation 3.1 is minimized by a quasi-Newton method.

The standard deviations of and for data sets b and c are depicted in Figures 3 and 4, respectively. The graphs show that the degree of models does not significantly affect the standard deviation of (i.e., no overfitting), as long as the model includes the true relative density ratio (i.e., *k*>1). On the other hand, bigger models tend to produce larger standard deviations in . Thus, the standard deviation of more strongly depends on model complexity.

## 4. Experiments

In this section, we experimentally evaluate the performance of the proposed method in a two-sample homogeneity test, outlier detection, and transfer learning tasks.

### 4.1. Two-Sample Homogeneity Test.

First, we apply the proposed divergence estimator to two-sample homogeneity test.

#### 4.1.1. Divergence-Based Two-Sample Homogeneity Test.

Given two sets of samples and , the goal of the two-sample homogeneity test is to test the null hypothesis that the probability distributions *P* and are the same against its complementary alternative (i.e., the distributions are different).

By using an estimator of some divergence between the two distributions *P* and , the homogeneity of two distributions can be tested based on the permutation test procedure (Efron & Tibshirani, 1993) as follows:

- •
Obtain a divergence estimate using the original data sets and .

- •
Randomly permute the samples, and assign the first samples to a set and the remaining samples to another set .

- •
Obtain a divergence estimate using the randomly shuffled data sets and (note that since and can be regarded as being drawn from the same distribution, tends to be close to zero).

- •
Repeat this random shuffling procedure many times, and construct the empirical distribution of under the null hypothesis that the two distributions are the same.

- •
Approximate the p-value by evaluating the relative ranking of the original in the distribution of .

When an asymmetric divergence such as the KL divergence (Kullback & Leibler, 1951) or PE divergence (Pearson, 1900) is adopted for two-sample homogeneity test, the test results depend on the choice of directions: a divergence from *P* to or from to *P*. Sugiyama et al. (2011) proposed choosing the direction that gives a smaller *p*-value. It was experimentally shown that when the uLSIF-based PE divergence estimator is used for the two-sample homogeneity test (which is called the least-squares two-samplehomogeneity test, LSTT), the heuristic of choosing the direction with a smaller *p*-value contributes to reducing type 2 errors (the probability of accepting incorrect null hypotheses: two distributions are judged to be the same when they are actually different), while the increase of *type* 1 *errors* (the probability of rejecting correct null hypotheses: two distributions are judged to be different when they are actually the same) is kept moderate.

Below, we refer to LSTT with as the plain LSTT, LSTT with as the reciprocal LSTT, and LSTT with heuristically choosing the one with a smaller p-value as the adaptive LSTT.

#### 4.1.2. Artificial Data Sets.

We illustrate how the proposed method behaves in two-sample homogeneity test scenarios using artificial data sets a to d described in section 2.4. We test the plain LSTT, reciprocal LSTT, and adaptive LSTT for , 0.5, and 0.95, with a significance level of 5%.

The experimental results are shown in Figure 5. For data set a, where (i.e., the null hypothesis is correct), the plain LSTT and reciprocal LSTT correctly accept the null hypothesis with probability approximately 95%. This means that type 1 error is properly controlled in these methods. The adaptive LSTT tends to give slightly lower acceptance rates than 95% for this toy data set, but the adaptive LSTT with still works reasonably well. This implies that the heuristic of choosing the method with a smaller *p*-value does not have a critical influence on type 1 errors.

In data sets b, c, and d, *P* is different from (the null hypothesis is not correct), and thus we want to reduce the acceptance rate of the incorrect null hypothesis as much as possible. In the plain setup for data set b and the reciprocal setup for data set c, the true density-ratio functions with diverge to infinity, and thus larger makes the density-ratio approximation more reliable. However, does not work well because it produces an overly smoothed density-ratio function, and thus it is hard to distinguish from the completely constant density-ratio function (which corresponds to ). In the reciprocal setup for data set b and the plain setup for data set c, small performs poorly since density-ratio functions with large can more accurately be approximated than those with small (see Figure 1). In the adaptive setup, large tends to perform slightly better than small for data sets b and c.

In data set d, the true density-ratio function with diverges to infinity for both the plain and reciprocal setups. In this case, middle performs the best, which well balances the trade-off between high distinguishability from the completely constant density-ratio function (which corresponds to ) and easy approximability. The same tendency that middle works well can also be mildly observed in the adaptive LSTT for data set d.

Overall, if the plain LSTT (or the reciprocal LSTT) is used, small (or large ) sometimes works excellently. However, it performs poorly in other cases, and thus the performance is unstable depending on the true distributions. The plain LSTT (or the reciprocal LSTT) with middle tends to perform reasonably well for all data sets. On the other hand, the adaptive LSTT was shown to nicely overcome the instability problem when is small or large. However, when is set to be a middle value, both the plain LSTT and the reciprocal LSTT give similar results, and thus the adaptive LSTT provides only small improvement.

Our empirical finding is that if we have prior knowledge that one distribution has wider support than the other distribution, assigning the distribution with a wider support to and setting to be a large value seem to work well. If there is no knowledge on the true distributions or twodistributions have fewer overlapped supports, using middle in the adaptive setup seems to be a reasonable choice.

We will systematically investigate this issue using more complex data sets.

#### 4.1.3. Benchmark Data Sets.

Here, we apply the proposed two-sample homogeneity test to the binary classification data sets taken from the IDA repository (Rätsch, Onoda, & Müller, 2001).

We test the adaptive LSTT with the RuLSIF-based PE divergence estimator for , 0.5, and 0.95; we also test the maximum mean discrepancy (MMD; Borgwardt et al., 2006), a kernel-based two-sample homogeneity test method. The performance of MMD depends on the choice of the gaussian kernel width. Here, we adopt a version proposed by Sriperumbudur, Fukumizu, Gretton, Lanckriet, and Schölkopf (2009), which automatically optimizes the gaussian kernel width. The *p*-values of MMD are computed in the same way as LSTT based on the permutation test procedure.

First, we investigate the rate of accepting the null hypothesis when the null hypothesis is correct (i.e., the two distributions are the same). We split the positive training samples into two sets and perform a two-sample homogeneity test for the two sets of samples. The experimental results are summarized in Table 1, showing that the adaptive LSTT with compares favorably with those with and 1 and MMD in terms of type 1 error.

. | . | . | . | LSTT . | LSTT . | LSTT . |
---|---|---|---|---|---|---|

Data Sets . | d
. | . | MMD . | () . | () . | () . |

Banana | 2 | 100 | 0.96(0.20) | 0.93(0.26) | 0.92(0.27) | 0.92(0.27) |

Thyroid | 5 | 19 | 0.96(0.20) | 0.95(0.22) | 0.95(0.22) | 0.88(0.33) |

Titanic | 5 | 21 | 0.94(0.24) | 0.86(0.35) | 0.92(0.27) | 0.89(0.31) |

Diabetes | 8 | 85 | 0.96(0.20) | 0.87(0.34) | 0.91(0.29) | 0.82(0.39) |

Breast-Cancer | 9 | 29 | 0.98(0.14) | 0.91(0.29) | 0.94(0.24) | 0.92(0.27) |

Flare-Solar | 9 | 100 | 0.93(0.26) | 0.91(0.29) | 0.95(0.22) | 0.93(0.26) |

Heart | 13 | 38 | 1.00(0.00) | 0.85(0.36) | 0.91(0.29) | 0.93(0.26) |

German | 20 | 100 | 0.99(0.10) | 0.91(0.29) | 0.92(0.27) | 0.89(0.31) |

Ringnorm | 20 | 100 | 0.97(0.17) | 0.93(0.26) | 0.91(0.29) | 0.85(0.36) |

Waveform | 21 | 66 | 0.98(0.14) | 0.92(0.27) | 0.93(0.26) | 0.88(0.33) |

. | . | . | . | LSTT . | LSTT . | LSTT . |
---|---|---|---|---|---|---|

Data Sets . | d
. | . | MMD . | () . | () . | () . |

Banana | 2 | 100 | 0.96(0.20) | 0.93(0.26) | 0.92(0.27) | 0.92(0.27) |

Thyroid | 5 | 19 | 0.96(0.20) | 0.95(0.22) | 0.95(0.22) | 0.88(0.33) |

Titanic | 5 | 21 | 0.94(0.24) | 0.86(0.35) | 0.92(0.27) | 0.89(0.31) |

Diabetes | 8 | 85 | 0.96(0.20) | 0.87(0.34) | 0.91(0.29) | 0.82(0.39) |

Breast-Cancer | 9 | 29 | 0.98(0.14) | 0.91(0.29) | 0.94(0.24) | 0.92(0.27) |

Flare-Solar | 9 | 100 | 0.93(0.26) | 0.91(0.29) | 0.95(0.22) | 0.93(0.26) |

Heart | 13 | 38 | 1.00(0.00) | 0.85(0.36) | 0.91(0.29) | 0.93(0.26) |

German | 20 | 100 | 0.99(0.10) | 0.91(0.29) | 0.92(0.27) | 0.89(0.31) |

Ringnorm | 20 | 100 | 0.97(0.17) | 0.93(0.26) | 0.91(0.29) | 0.85(0.36) |

Waveform | 21 | 66 | 0.98(0.14) | 0.92(0.27) | 0.93(0.26) | 0.88(0.33) |

Notes: The mean (and standard deviations in parentheses) rate of accepting the null hypothesis (i.e., ) under significance level 5% is reported. The two sets of samples are taken from the positive training set (i.e., the null hypothesis is correct). Methods having the mean acceptance rate 0.95 according to the one-sample *t*-test at the significance level 5% are in boldface.

Next, we consider the situation where the null hypothesis is not correct (the two distributions are different). The numerator samples are generated in the same way as above, but half of the denominator samples are replaced with negative training samples. Thus, while the numerator sample set contains only positive training samples, the denominator sample set includes both positive and negative samples. The experimental results are summarized in Table 2, showing that the adaptive LSTT with again compares favorably with those with and 1. Furthermore, LSTT with tends to outperform MMD in terms of type 2 error.

. | . | . | . | LSTT . | LSTT . | LSTT . |
---|---|---|---|---|---|---|

Data Sets . | d
. | . | MMD . | () . | () . | () . |

Banana | 2 | 100 | 0.52(0.50) | 0.10(0.30) | 0.02(0.14) | 0.17(0.38) |

Thyroid | 5 | 19 | 0.52(0.50) | 0.81(0.39) | 0.65(0.48) | 0.80(0.40) |

Titanic | 5 | 21 | 0.87(0.34) | 0.86(0.35) | 0.87(0.34) | 0.88(0.33) |

Diabetes | 8 | 85 | 0.31(0.46) | 0.42(0.50) | 0.47(0.50) | 0.57(0.50) |

Breast-Cancer | 9 | 29 | 0.87(0.34) | 0.75(0.44) | 0.80(0.40) | 0.79(0.41) |

Flare-Solar | 9 | 100 | 0.51(0.50) | 0.81(0.39) | 0.55(0.50) | 0.66(0.48) |

Heart | 13 | 38 | 0.53(0.50) | 0.28(0.45) | 0.40(0.49) | 0.62(0.49) |

German | 20 | 100 | 0.56(0.50) | 0.55(0.50) | 0.44(0.50) | 0.68(0.47) |

Ringnorm | 20 | 100 | 0.00(0.00) | 0.00(0.00) | 0.00(0.00) | 0.02(0.14) |

Waveform | 21 | 66 | 0.00(0.00) | 0.00(0.00) | 0.02(0.14) | 0.00(0.00) |

. | . | . | . | LSTT . | LSTT . | LSTT . |
---|---|---|---|---|---|---|

Data Sets . | d
. | . | MMD . | () . | () . | () . |

Banana | 2 | 100 | 0.52(0.50) | 0.10(0.30) | 0.02(0.14) | 0.17(0.38) |

Thyroid | 5 | 19 | 0.52(0.50) | 0.81(0.39) | 0.65(0.48) | 0.80(0.40) |

Titanic | 5 | 21 | 0.87(0.34) | 0.86(0.35) | 0.87(0.34) | 0.88(0.33) |

Diabetes | 8 | 85 | 0.31(0.46) | 0.42(0.50) | 0.47(0.50) | 0.57(0.50) |

Breast-Cancer | 9 | 29 | 0.87(0.34) | 0.75(0.44) | 0.80(0.40) | 0.79(0.41) |

Flare-Solar | 9 | 100 | 0.51(0.50) | 0.81(0.39) | 0.55(0.50) | 0.66(0.48) |

Heart | 13 | 38 | 0.53(0.50) | 0.28(0.45) | 0.40(0.49) | 0.62(0.49) |

German | 20 | 100 | 0.56(0.50) | 0.55(0.50) | 0.44(0.50) | 0.68(0.47) |

Ringnorm | 20 | 100 | 0.00(0.00) | 0.00(0.00) | 0.00(0.00) | 0.02(0.14) |

Waveform | 21 | 66 | 0.00(0.00) | 0.00(0.00) | 0.02(0.14) | 0.00(0.00) |

Notes: The mean (and standard deviations in parentheses) rate of accepting the null hypothesis (i.e., ) under significance level 5% is reported. The set of samples corresponding to the numerator of the density ratio is taken from the positive training set, and the set of samples corresponding to the denominator of the density ratio is taken from the positive training set and the negative training set (i.e., the null hypothesis is not correct). The best method having the lowest mean acceptance rate and comparable methods according to the two-sample *t*-test at significance level 5% are in boldface.

Overall, LSTT with is shown to be a useful method for a two-sample homogeneity test.

### 4.2. Inlier-Based Outlier Detection.

Next, we apply the proposed method to outlier detection.

#### 4.2.1. Density-Ratio Approach to Inlier-Based Outlier Detection.

Let us consider an outlier detection problem of finding irregular samples in a data set (called an evaluation data set) based on another data set (called a model dataset) that contains only regular samples. Defining the density ratio over the two sets of samples, we can see that the density-ratio values for regular samples are close to one, while those for outliers tend to deviatesignificantly from one. Thus, density-ratio values could be used as an index of the degree of outlyingness (Smola et al., 2009; Hido et al., 2011).

Since the evaluation data set usually has wider support than the model data set, we regard the evaluation data set as samples corresponding to the denominator density and the model data set as samples corresponding to the numerator density *p*(** x**). Then outliers tend to have smaller density-ratio values (i.e., close to zero). As such, density-ratio approximators can be used for outlier detection.

When evaluating the performance of outlier detection methods, it is important to take into account both the detection rate (i.e., the number of true outliers an outlier detection algorithm can find) and detection accuracy (i.e., the number of true inliers an outlier detection algorithm misjudges as outliers). Since there is a trade-off between detection rate and detection accuracy, we adopt the area under the ROC curve (AUC) as our error metric (Bradley, 1997).

#### 4.2.2. Artificial Data Sets.

First, we illustrate how the proposed method behaves in outlier detection scenarios using artificial data sets.

*d*is the dimensionality of

**and**

*x*

*1*_{d}is the

*d*-dimensional vector with all one. Note that this setup is the same as data set e described in section 2.4 when

*d*=1. Here, the samples drawn from

*N*(0,

*I*_{d}) are regarded as inliers, while the samples drawn from

*N*(

*d*

^{−1/2}

*1*_{d},

*I*_{d}) are regarded as outliers. We use samples.

Table 3 describes the AUC values for input dimensionality *d*=1, 5, and 10 for RuLSIF with , 0.5, and 0.95. This shows that as the input dimensionality *d* increases, the AUC values overall get smaller. Thus, outlier detection becomes more challenging in high-dimensional cases.

Input . | RuLSIF . | RuLSIF . | RuLSIF . |
---|---|---|---|

Dimensionality d
. | () . | () . | () . |

1 | .933(.089) | .926(.100) | .896(.124) |

5 | .882(.099) | .891(.091) | .894(.086) |

10 | .842(.107) | .850(.103) | .859(.092) |

Input . | RuLSIF . | RuLSIF . | RuLSIF . |
---|---|---|---|

Dimensionality d
. | () . | () . | () . |

1 | .933(.089) | .926(.100) | .896(.124) |

5 | .882(.099) | .891(.091) | .894(.086) |

10 | .842(.107) | .850(.103) | .859(.092) |

Note: The best method in terms of the mean AUC score and comparable methods according to the two-sample *t*-test at the significance level 5% are in boldface.

The result also shows that RuLSIF with small tends to work well when the input dimensionality is low, and RuLSIF with large works better as the input dimensionality increases. This tendency can be interpreted as follows: if is small, the density-ratio function tends to have a sharp hollow for outlier points (see the left-most graph in Figure 2e). Thus, as long as the true density-ratio function can be accurately estimated, small would be preferable in the outlier detection. When data dimensionality is low, density-ratio approximation is rather easy, and thus small tends to perform well. However, as the data dimensionality increases, density-ratio approximation gets harder, and thus large , which produces a smoother density-ratio function, is more favorable, since a smoother function can be more easily approximated than a bumpy one produced bysmall .

#### 4.2.3. Real-World Data Sets.

Next, we evaluate the proposed outlier detection method using various real-world data sets:

- •
**IDA repository:**The IDA repository (Rätsch et al., 2001) contains various binary classification tasks. Each data set consists of positive or negative and training or test samples. We use positive training samples as inliers in the model set. In the evaluation set, we use at most 100 positive test samples as inliers and the first 5% of negative test samples as outliers. Thus, the positive samples are treated as inliers, and the negative samples are treated as outliers. - •
**Speech data set:**An in-house speech data set, which contains short utterance samples recorded from two male subjects speaking in French with a sampling rate of 44.1 kHz. From each utterance sample, we extracted a 50-dimensional line spectral frequencies vector (Kain & Macon, 1998). We randomly take 200 samples from one class and assign them to the model data set. Then we randomly take 200 samples from the same class and 10 samples from the other class. - •
**20 Newsgroup data set:**The 20-Newsgroups data set (http://people.csail.mit.edu/jrennie/20Newsgroups/) contains 20, 000 newsgroup documents, which contains the following four top-level categories: comp, rec, sci, and talk. Each document is expressed by a 100-dimensional bag-of-words vector of term frequencies. We randomly take 200 samples from the comp class and assign them to the model data set. Then we randomly take 200 samples from the same class and 10 samples from one of the other classes for the evaluation data set. - •
**USPS handwritten digit data set:**The USPS handwritten digit data set (http://www.gaussianprocess.org/gpml/data/) contains 9298 digit images. Each image consists of 256 (=) pixels, and each pixel takes an integer value between 0 and 255 as the intensity level. We regard samples in one class as inliers and samples in other classes as outliers. We randomly take 200 samples from the inlier class and assign them to the model data set. Then we randomly take 200 samples from the same inlier class and 10 samples from one of the other classes for the evaluation data set.

We compare the AUC scores of RuLSIF with , 0.5, and 0.95 and one-class support vector machine (OSVM) with the gaussian kernel (Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001). We used the LIBSVM implementation of OSVM (Chang & Lin, 2001). The gaussian width is set to the median distance between samples, which has been shown to be a useful heuristic (Schölkopf et al., 2001). Since there is no systematic method to determine the tuning parameter in OSVM, we report the results for and 0.1.

The mean and standard deviation of the AUC scores over 100 runs with random sample choice are summarized in Table 4, showing that RuLSIF overall compares favorably with OSVM. Among the RuLSIF methods, small tends to perform well for low-dimensional data sets, and large tends to work well for high-dimensional data sets. This tendency agrees well with that for the artificial data sets (see section 4.2.2).

. | . | OSVM . | OSVM . | RuLSIF . | RuLSIF . | RuLSIF . |
---|---|---|---|---|---|---|

Date Sets . | d
. | () . | () . | () . | () . | () . |

IDA:banana | 2 | .668(.105) | .676(.120) | .597(.097) | .619(.101) | .623(.115) |

IDA:thyroid | 5 | .760(.148) | .782(.165) | .804(.148) | .796(.178) | .722(.153) |

IDA:titanic | 5 | .757(.205) | .752(.191) | .750(.182) | .701(.184) | .712(.185) |

IDA:diabetes | 8 | .636(.099) | .610(.090) | .594(.105) | .575(.105) | .663(.112) |

IDA:b-cancer | 9 | .741(.160) | .691(.147) | .707(.148) | .737(.159) | .733(.160) |

IDA:f-solar | 9 | .594(.087) | .590(.083) | .626(.102) | .612(.100) | .584(.114) |

IDA:heart | 13 | .714(.140) | .694(.148) | .748(.149) | .769(.134) | .726(.127) |

IDA:german | 20 | .612(.069) | .604(.084) | .605(.092) | .597(.101) | .605(.095) |

IDA:ringnorm | 20 | .991(.012) | .993(.007) | .944(.091) | .971(.062) | .992(.010) |

IDA:waveform | 21 | .812(.107) | .843(.123) | .879(.122) | .875(.117) | .885(.102) |

Speech | 50 | .788(.068) | .830(.060) | .804 (.101) | .821(.076) | .836(.083) |

20News (‘rec’) | 100 | .598(.063) | .593(.061) | .628(.105) | .614(.093) | .767(.100) |

20News (‘sci’) | 100 | .592(.069) | .589(.071) | .620(.094) | .609(.087) | .704(.093) |

20News (‘talk’) | 100 | .661(.084) | .658(.084) | .672(.117) | .670(.102) | .823(.078) |

USPS (1 vs. 2) | 256 | .889(.052) | .926(.037) | .848(.081) | .878(.088) | .898(.051) |

USPS (2 vs. 3) | 256 | .823(.053) | .835(.050) | .803(.093) | .818(.085) | .879(.074) |

USPS (3 vs. 4) | 256 | .901(.044) | .939(.031) | .950(.056) | .961(.041) | .984(.016) |

USPS (4 vs. 5) | 256 | .871(.041) | .890(.036) | .857(.099) | .874(.082) | .941(.031) |

USPS (5 vs. 6) | 256 | .825(.058) | .859(.052) | .863(.078) | .867(.068) | .901(.049) |

USPS (6 vs. 7) | 256 | .910(.034) | .950(.025) | .972(.038) | .984(.018) | .994(.010) |

USPS (7 vs. 8) | 256 | .938(.030) | .967(.021) | .941(.053) | .951(.039) | .980(.015) |

USPS (8 vs. 9) | 256 | .721(.072) | .728(.073) | .721(.084) | .728(.083) | .761(.096) |

USPS (9 vs. 0) | 256 | .920(.037) | .966(.023) | .982(.048) | .989(.022) | .994(.011) |

. | . | OSVM . | OSVM . | RuLSIF . | RuLSIF . | RuLSIF . |
---|---|---|---|---|---|---|

Date Sets . | d
. | () . | () . | () . | () . | () . |

IDA:banana | 2 | .668(.105) | .676(.120) | .597(.097) | .619(.101) | .623(.115) |

IDA:thyroid | 5 | .760(.148) | .782(.165) | .804(.148) | .796(.178) | .722(.153) |

IDA:titanic | 5 | .757(.205) | .752(.191) | .750(.182) | .701(.184) | .712(.185) |

IDA:diabetes | 8 | .636(.099) | .610(.090) | .594(.105) | .575(.105) | .663(.112) |

IDA:b-cancer | 9 | .741(.160) | .691(.147) | .707(.148) | .737(.159) | .733(.160) |

IDA:f-solar | 9 | .594(.087) | .590(.083) | .626(.102) | .612(.100) | .584(.114) |

IDA:heart | 13 | .714(.140) | .694(.148) | .748(.149) | .769(.134) | .726(.127) |

IDA:german | 20 | .612(.069) | .604(.084) | .605(.092) | .597(.101) | .605(.095) |

IDA:ringnorm | 20 | .991(.012) | .993(.007) | .944(.091) | .971(.062) | .992(.010) |

IDA:waveform | 21 | .812(.107) | .843(.123) | .879(.122) | .875(.117) | .885(.102) |

Speech | 50 | .788(.068) | .830(.060) | .804 (.101) | .821(.076) | .836(.083) |

20News (‘rec’) | 100 | .598(.063) | .593(.061) | .628(.105) | .614(.093) | .767(.100) |

20News (‘sci’) | 100 | .592(.069) | .589(.071) | .620(.094) | .609(.087) | .704(.093) |

20News (‘talk’) | 100 | .661(.084) | .658(.084) | .672(.117) | .670(.102) | .823(.078) |

USPS (1 vs. 2) | 256 | .889(.052) | .926(.037) | .848(.081) | .878(.088) | .898(.051) |

USPS (2 vs. 3) | 256 | .823(.053) | .835(.050) | .803(.093) | .818(.085) | .879(.074) |

USPS (3 vs. 4) | 256 | .901(.044) | .939(.031) | .950(.056) | .961(.041) | .984(.016) |

USPS (4 vs. 5) | 256 | .871(.041) | .890(.036) | .857(.099) | .874(.082) | .941(.031) |

USPS (5 vs. 6) | 256 | .825(.058) | .859(.052) | .863(.078) | .867(.068) | .901(.049) |

USPS (6 vs. 7) | 256 | .910(.034) | .950(.025) | .972(.038) | .984(.018) | .994(.010) |

USPS (7 vs. 8) | 256 | .938(.030) | .967(.021) | .941(.053) | .951(.039) | .980(.015) |

USPS (8 vs. 9) | 256 | .721(.072) | .728(.073) | .721(.084) | .728(.083) | .761(.096) |

USPS (9 vs. 0) | 256 | .920(.037) | .966(.023) | .982(.048) | .989(.022) | .994(.011) |

Notes: Mean AUC score (and standard deviations in parentheses) over 100 trials is reported. The best method having the highest mean AUC score and comparable methods according to the two-sample *t*-test at the significance level 5% are in boldface. The data sets are sorted in ascending order of the input dimensionality *d*.

### 4.3. Transfer Learning.

Finally, we apply the proposed method to transfer learning.

#### 4.3.1. Transductive Transfer Learning by Importance Sampling.

Let us consider a problem of semisupervised learning (Chapelle, Schölkopf, & Zien, 2006) from labeled training samples and unlabeled test samples . The goal is to predict a test output value *y*^{te} for a test input point *x*^{te}. Here, we consider the setup where the labeled training samples are drawn i.i.d. from *p*(*y*|** x**)

*p*

_{tr}(

**), while the unlabeled test samples are drawn i.i.d. from**

*x**p*

_{te}(

**), which is generally different from**

*x**p*

_{tr}(

**); the (unknown) test sample (**

*x*

*x*^{te},

*y*

^{te}) follows

*p*(

*y*|

**)**

*x**p*

_{te}(

**). This setup means that the conditional probability**

*x**p*(

*y*|

**) is common to training and test samples, but the marginal densities**

*x**p*

_{tr}(

**) and**

*x**p*

_{te}(

**) are generally different for training and test input points. Such a problem is called transductive transfer learning (Pan & Yang, 2010), domain adaptation (Jiang & Zhai, 2007), or covariate shift (Shimodaira, 2000; Sugiyama & Kawanabe, 2012).**

*x**y*and (at input

**). Then the generalization error, which we would ultimately like to minimize, is defined as where**

*x**f*(

**) is a function model. Since the generalization error is inaccessible because the true probability**

*x**p*(

*y*|

**)**

*x**p*

_{te}(

**) is unknown, empirical error minimization is often used in practice (Vapnik, 1998): However, under the covariate shift setup, plain empirical error minimization is not consistent (i.e., it does not converge to the optimal function) if the model is misspecified (i.e., the true function is not included in the model; see Shimodaira, 2000). Instead, the following importance-weighted empirical error minimization is consistent under covariate shift: where**

*x**r*(

**) is called the importance (Fishman, 1996) in the context of covariate shift adaptation:**

*x**r*(

**) (i.e., ) is rather hard, as shown in this letter. Thus, when**

*x**r*(

**) is estimated poorly, all flattened weights are also unreliable, and then covariate shift adaptation does not work well in practice. To cope with this problem, we propose to use relative importance weights alternatively, where () is the -relative importance weight defined by Note that compared with the definition of the -relative density-ratio (1), and are swapped in order to be consistent with exponential flattening. Indeed, the relative importance weights play a similar role to exponentially flattened importance weights; corresponds to plain empirical error minimization, while corresponds to importance-weighted empirical error minimization; will give an intermediate estimator that balances the trade-off between efficiency and consistency. We note that the relative importance weights and exponentially flattened importance weights agree only when and ; for , they are generally different.**

*x*A possible advantage of the above relative importance weights is that its estimation for does not depend on that for , unlike exponentially flattened importance weights. Since -relative importance weights for can be reliably estimated by RuLSIF proposed in this letter, the performance of covariate shift adaptation is expected to be improved. Below, we experimentally investigate this effect.

#### 4.3.2. Artificial Data Sets.

First, we illustrate how the proposed method behaves in covariate shift adaptation using one-dimensional artificial data sets.

*N*(0, 0.01). We set

*n*

_{tr}=100 and

*n*

_{te}=200. Figure 6b shows a realization of training and test samples as well as learned functions obtained by RIW-LS with and EIW-LS with . This shows that RIW-LS with and EIW-LS with give almost the same functions, and both functions fit the true function well in the test region. Figure 6c shows the mean and standard deviation of the test error under the squared loss over 200 runs as functions of the relative flattening parameter in RIW-LS and the exponential flattening parameter in EIW-LS. The method having a lower mean test error and another method that is comparable according to the two-sample

*t*-test at the significance level 5% are specified by . As can be observed, the proposed RIW-LS compares favorably with EIW-LS.

#### 4.3.3. Real-World Data Sets.

Finally, we evaluate the proposed transfer learning method on a real-world transfer learning task.

We consider the problem of human activity recognition from accelerometer data collected by iPod Touch (http://alkan.mns.kyutech.ac.jp). In the data collection procedure, subjects were asked to perform a specific action such as walking, running, or bicycle riding. The duration of each task was arbitrary, and the sampling rate was 20 Hz with small variations. An example of three-axis accelerometer data for “walking” is plotted in Figure 8.

To extract features from the accelerometer data, each data stream was segmented in a sliding window manner with window width 5 seconds and sliding step 1 second. Depending on subjects, the position and orientation of iPod Touch well arbitrary—held by hand or kept in a pocket or a bag. For this reason, we decided to take the -norm of the three-dimensional acceleration vector at each time step and computed the following five orientation-invariant features from each window: mean, standard deviation, fluctuation of amplitude, average energy, and frequency-domain entropy (Bao & Intille, 2004; Bharatula, Stager, Lukowicz, & Troster, 2005).

Let us consider a situation where a new user wants to use the activity recognition system. However, since the new user is not willing to go to the trouble of labeling his or her accelerometer data, no labeled sample is available for the new user. However, unlabeled samples for the new user and labeled data obtained from existing users are available. Let labeled training data be the set of labeled accelerometer data for 20 existing users. Each user has at most 100 labeled samples for each action. Let unlabeled test data be unlabeled accelerometer data obtained from the new user.

We use kernel logistic regression (KLR) for activity recognition. We compare the following four methods:

- •
Plain KLR without importance weights (i.e., or )

- •
KLR with relative importance weights for

- •
KLR with exponentially flattened importance weights for

- •
KLR with plain importance weights (i.e., or )

The experiments are repeated 100 times with different sample choice for *n*_{tr}=500 and *n*_{te}=200. Table 5 depicts the classification accuracy for three binary-classification tasks: walk versus run, walk versus riding a bicycle, and walk versus taking a train. The classification accuracy is evaluated for 800 samples from the new user that are not used for classifier training (i.e., the 800 test samples are different from 200 unlabeled samples). The table shows that KLR with relative importance weights for compares favorably with other methods in terms of classification accuracy. KLR with plain importance weights and KLR with exponentially flattened importance weights for are outperformed by KLR without importance weights in the walk versus riding a bicycle task due to the instability of importance weight estimation for or .

. | KLR . | RIW-KLR . | EIW-KLR . | IW-KLR . |
---|---|---|---|---|

Task . | (, ) . | () . | () . | (, ) . |

Walks vs. run | 0.803(0.082) | 0.889(0.035) | 0.882(0.039) | 0.882(0.035) |

Walks vs. bicycle | 0.880(0.025) | 0.892(0.035) | 0.867(0.054) | 0.854(0.070) |

Walks vs. train | 0.985(0.017) | 0.992(0.008) | 0.989(0.011) | 0.983(0.021) |

. | KLR . | RIW-KLR . | EIW-KLR . | IW-KLR . |
---|---|---|---|---|

Task . | (, ) . | () . | () . | (, ) . |

Walks vs. run | 0.803(0.082) | 0.889(0.035) | 0.882(0.039) | 0.882(0.035) |

Walks vs. bicycle | 0.880(0.025) | 0.892(0.035) | 0.867(0.054) | 0.854(0.070) |

Walks vs. train | 0.985(0.017) | 0.992(0.008) | 0.989(0.011) | 0.983(0.021) |

Notes: Mean classification accuracy (and standard deviations in parentheses) over 100 runs for activity recognition of a new user is reported. The method having the lowest mean classification accuracy and comparable methods according to the two-sample *t*-test at the significance level 5% are in bold.

Overall, the proposed relative density-ratio estimation method was shown to be useful in transfer learning under covariate shift.

## 5. Conclusion

In this letter, we proposed using a relative divergence for robust distribution comparison. We gave a computationally efficient method for estimating the relative Pearson divergence based on direct relative density-ratio approximation. We theoretically elucidated the convergence rate of the proposed divergence estimator under a nonparametric setup, which showed that the proposed approach of estimating the relative Pearson divergence is preferable to the existing approach of estimating the plain Pearson divergence. Furthermore, we proved that the asymptotic variance of the proposed divergence estimator is independent of model complexity under a correctly specified parametric setup. Thus, the proposed divergence estimator hardly overfits even with complex models. Experimentally, we demonstrated the practical usefulness of the proposed divergence estimator in a two-sample homogeneity test, inlier-based outlier detection, and transductive transfer learning under covariate shift.

In addition to the two-sample homogeneity test, outlier detection, and transfer learning, density ratios were shown to be useful for tackling various machine learning problems, including multitask learning (Bickel, Bogojeska, Lengauer, & Scheffer, 2008; Simm, Sugiyama, & Kato, 2011), independence test (Sugiyama & Suzuki, 2011), feature selection (Suzuki, Sugiyama, Kanamori, & Sese, 2009), causal inference (Yamada & Sugiyama, 2010), independent component analysis (Suzuki & Sugiyama, 2011), dimensionality reduction (Suzuki & Sugiyama, 2010), unpaired data matching (Yamada & Sugiyama, 2011), clustering (Kimura & Sugiyama, 2011), conditional density estimation (Sugiyama et al., 2010), and probabilistic classification (Sugiyama, 2010). Thus, it would be promising to explore more applications of the proposed relative density-ratio approximator beyond two-sample homogeneity test, outlier detection, and transfer learning tasks.

## Appendix A: Technical Details of NonparametricConvergence Analysis

Here, we give the technical details of the nonparametric convergence analysis described in section 3.1.

### A.1. Results.

*S*as We estimate the Pearson divergence between

*p*and through estimating the density ratio, Let us consider the following density-ratio estimator: where and

*R*(

*g*) is a nonnegative regularization functional such that A possible estimator of the Pearson (PE) divergence is Another possibility is

*R*(

*g*). Suppose is an RKHS associated with bounded kernel : Let denote the norm in the RKHS . Then satisfies equation A.1: where we used the reproducing property of the kernel and Schwartz's inequality. Note that the gaussian kernel satisfies this with

*C*=1. It is known that the gaussian kernel RKHS spans a dense subset in the set of continuous functions. Another example of RKHSs is Sobolev space. The canonical norm for this space is the integral of the squared derivatives of functions. Thus, the regularization term imposes the solution to be smooth. The RKHS technique in Sobolev space has been well exploited in the context of spline models (Wahba, 1990). We intend that the regularization term

*R*(

*g*) is a generalization of the RKHS norm. Roughly speaking,

*R*(

*g*) is like a “norm” of the function space .

*M*>0: To derive the convergence rate of our estimator, we utilize the bracketing entropy that is a complexity measure of a function class (van der Vaart & Wellner, 1996).

*Given two functions l and u, the bracket [l, u] is the set of all functions f with for all x. An -bracket with respect to is a bracket [l, u] with . The bracketing entropy is the logarithm of the minimum number of -brackets with respect to needed to cover a function set *.

*M*>0, This quantity represents a complexity of function class —the larger is, the more complex the function class is because, for larger , more brackets are needed to cover the function class. The gaussian RKHS satisfies this condition for arbitrarily small (Steinwart & Scovel, 2007). Note that when

*R*(

*g*) is the RKHS norm, condition A.2 holds for all

*M*>0 if that holds for

*M*=1.

Then we have the following theorem:

In the proof of theorem 1, we use the following auxiliary lemma:

### A.2. Proof of Lemma 1.

First, we prove lemma 1.

*f*and

_{L}*f*be a -bracket for with respect to

_{U}*L*

_{2}(

*p*); and . Without loss of generality, we can assume that . Then and defined as are also a bracket such that for all s.t. and because and the following relation is met: Therefore, the condition for the bracketing entropies, equation A.2, gives. We can also show that in the same fashion.

*L*

_{2}(

*S*)-norm of

*f*as

The following is similar to the argument in theorem 10.6 in van de Geer (2000), but we give a simpler proof.

*a*,

*b*>0. Applying this relation to equation A.7, we obtain which indicates Therefore, by moving to the left-hand side, we obtain This gives Consequently, the proof of lemma 1 is completed.

### A.3. Proof of Theorem 1.

Based on lemma 1, we prove theorem 1.

## Appendix B: Technical Details of Parametric Variance Analysis

Here, we give the technical details of the parametric variance analysis described in section 3.2.

### B.1. Results.

*b*is a finite number. Let us consider the following estimator of -relative density ratio, Suppose that the model is correctly specified; there exists such that Then, under a mild assumption (see theorem 5.23 of van der Vaart, 2000), the estimator is consistent and the estimated parameter satisfies the asymptotic normality in the large sample limit. Then a possible estimator of the -relative Pearson divergence is Note that there are other possible estimators for such as

We study the asymptotic properties of . The expectation under the probability *p* () is denoted as (). Likewise, the variance is denoted as (). Then we have the following theorem.

### B.2. Proof of Theorem 2.

*G*as Likewise, the operator is defined for the samples from . Then we have The second equality follows from Then the asymptotic variance is given as We confirm that both and are nonnegative and increasing functions with respect to

*r*for any . Since the result is trivial for , we suppose . The function is represented as