Abstract

Many machine learning problems, such as nonstationarity adaptation, outlier detection, dimensionality reduction, and conditional density estimation, can be effectively solved by using the ratio of probability densities. Since the naive two-step procedure of first estimating the probability densities and then taking their ratio performs poorly, methods to directly estimate the density ratio from two sets of samples without density estimation have been extensively studied recently. However, these methods are batch algorithms that use the whole data set to estimate the density ratio, and they are inefficient in the online setup, where training samples are provided sequentially and solutions are updated incrementally without storing previous samples. In this letter, we propose two online density-ratio estimators based on the adaptive regularization of weight vectors. Through experiments on inlier-based outlier detection, we demonstrate the usefulness of the proposed methods.

1  Introduction

Almost all machine learning problems can be solved through density estimation, because knowing the probability density is equivalent to knowing everything about the data. Thus, density estimation is the most versatile approach to machine learning, and various methods have been proposed so far. However, without strong parametric assumptions, density estimation is hard to perform accurately in high-dimensional problems. Thus, it is desirable to solve target machine learning tasks directly without performing density estimation (Vapnik, 1998).

Following this idea, the machine learning approach based on the ratio of probability densities has attracted attention recently (Sugiyama, Suzuki, & Kanamori, 2012a). The rationale behind density ratio estimation is that many machine learning problems, such as transfer learning, outlier detection, change detection, and dimension reduction, can be solved in a unified manner using just the density ratio. By directly estimating this density ratio, the difficult task of density estimation can be avoided, leading to better empirical performance.

Due to their practical utility, several methods for density-ratio estimation have been proposed (Kanamori, Hido, & Sugiyama, 2009; Sugiyama, Suzuki, & Kanamori, 2012b; Izbicki, Lee, & Schafer, 2014). Sugiyama et al. (2012b) showed that density-ratio estimation may be performed by matching the density ratio to a model of the density ratio under a Bregman divergence. This view of density-ratio estimation is of great interest, since it unifies several existing methods, and the resulting estimators can be interpreted in terms of the Bregman divergence. The simplest Bregman divergence corresponds to the squared loss between the density ratio and its model (Kanamori et al., 2009). The main advantage of the least-squares-based density-ratio estimator is that the solution can be obtained analytically. Another choice for the Bregman divergence is the Kullback-Leibler loss. This Kullback-Leibler-based estimator also appears in the variational estimation of the Kullback-Leibler divergence (Nguyen, Wainwright, & Jordan, 2010). The main disadvantage of the Kullback-Leibler-based estimator is that it does not have a closed-form solution, and optimization is usually performed via gradient or quasi-Newton methods. Although these estimators have been demonstrated to work well on many different problems (Sugiyama et al., 2012a; Sugiyama & Kawanabe, 2012), they work only in a batch mode and thus are not efficient in online problems, where training samples are provided sequentially and solutions are updated incrementally without storing previous samples.

In this letter, we propose online versions of the least-squares and Kullback-Leibler-based density-ratio estimators in the framework of adaptive regularization of weight vectors (Crammer, Kulesza, & Dredze, 2009), which was originally proposed for regression and classification. We experimentally demonstrate that for a fixed computational budget, the proposed online algorithms achieve higher accuracy than both the batch solutions and online solutions obtained via naive stochastic gradient descent in inlier-based outlier detection.

2  Batch Density-Ratio Estimation

In this section, we formulate the batch density-ratio estimation problem and review density-ratio estimation using Bregman divergences.

2.1  Problem Formulation

Suppose that we are given a set of independent and identically distributed (i.i.d.) samples $\{x^p_i\}_{i=1}^{n_p}$ from a probability distribution with density $p(x)$ and another set of i.i.d. samples $\{x^q_j\}_{j=1}^{n_q}$ from a probability distribution with density $q(x)$ in the same domain. Under the assumption that $q(x) > 0$ for all $x$, our goal is to estimate the density-ratio function,
$$ r(x) = \frac{p(x)}{q(x)}, $$
from $\{x^p_i\}_{i=1}^{n_p}$ and $\{x^q_j\}_{j=1}^{n_q}$.

A naive approach is to estimate $p(x)$ and $q(x)$ from $\{x^p_i\}_{i=1}^{n_p}$ and $\{x^q_j\}_{j=1}^{n_q}$ separately and take the ratio of the estimated densities. However, such a two-step plug-in approach is not reliable because the first step of density estimation is performed without regard to the second step of taking the ratio (Sugiyama et al., 2012a). Below, we review a direct density-ratio estimation method that does not involve density estimation.

2.2  Batch Algorithm

Here, we review batch density-ratio estimation algorithms using Bregman divergences.

2.2.1  General Framework with Bregman Divergences

The density ratio can be estimated by minimizing the Bregman divergence between the true density ratio $r(x)$ and a parameterized model of the density ratio, $\hat{r}(x; \theta)$ (Sugiyama et al., 2012b). The Bregman divergence from $t^*$ to $t$ is defined as (Bregman, 1967)
$$ \mathrm{BD}_f(t^* \,\|\, t) = f(t^*) - f(t) - \partial f(t)\,(t^* - t), \tag{2.1} $$
where $f$ is a strictly convex function and $\partial f$ denotes its derivative. The above is minimized with respect to $t$ when $t = t^*$. The above distance is useful, since $t^*$ occurs linearly in all terms that include both $t^*$ and $t$. Minimizing the Bregman divergence between the true density ratio $r(x)$ and a model of the density ratio $\hat{r}(x;\theta)$, weighted by $q(x)$, gives
$$ \min_\theta \int q(x)\, \mathrm{BD}_f\!\left(r(x) \,\|\, \hat{r}(x;\theta)\right) \mathrm{d}x
= \min_\theta \int q(x)\!\left[\partial f(\hat{r}(x;\theta))\,\hat{r}(x;\theta) - f(\hat{r}(x;\theta))\right]\!\mathrm{d}x - \int p(x)\,\partial f(\hat{r}(x;\theta))\,\mathrm{d}x + C, $$
where $C$ is constant with regard to $\theta$. The empirical version of the above problem is therefore
$$ \min_\theta\; \frac{1}{n_q}\sum_{j=1}^{n_q}\left[\partial f(\hat{r}(x^q_j;\theta))\,\hat{r}(x^q_j;\theta) - f(\hat{r}(x^q_j;\theta))\right] - \frac{1}{n_p}\sum_{i=1}^{n_p}\partial f(\hat{r}(x^p_i;\theta)) + \frac{\lambda}{2}\|\theta\|^2, $$
where we included a regularization term with regularization parameter $\lambda \ge 0$.
Let us consider the following linear-in-parameter model for density ratios,
$$ \hat{r}(x;\theta) = \theta^\top \phi(x) = \sum_{\ell=1}^{b} \theta_\ell\, \phi_\ell(x), $$
where $b$ is the number of basis functions;
$\phi(x) = (\phi_1(x), \ldots, \phi_b(x))^\top$ is the vector of (nonnegative) basis functions;
$\theta = (\theta_1, \ldots, \theta_b)^\top$ is the vector of parameters; and $^\top$ denotes the transpose.

A final consideration is choosing a suitable Bregman divergence by defining the function $f$. We discuss two such choices for $f$ below.

2.2.2  Kullback-Leibler Approach

Choosing the Bregman divergence as $f(t) = t \log t - t$ results in the Kullback-Leibler (KL) divergence:
$$ \min_\theta\; \int q(x)\, \hat{r}(x;\theta)\, \mathrm{d}x - \int p(x) \log \hat{r}(x;\theta)\, \mathrm{d}x + C. $$
The empirical and regularized optimization problem is given by
$$ \min_\theta\; \frac{1}{n_q}\sum_{j=1}^{n_q} \theta^\top \phi(x^q_j) - \frac{1}{n_p}\sum_{i=1}^{n_p} \log\!\left(\theta^\top \phi(x^p_i)\right) + \frac{\lambda}{2}\|\theta\|^2, $$
where $\lambda \ge 0$ is the regularization parameter.
Since the objective function is smooth and convex, we may find the globally optimal solution by a standard optimization technique such as gradient descent or quasi-Newton methods. The gradient of the above objective function with respect to $\theta$ is given by
$$ \frac{1}{n_q}\sum_{j=1}^{n_q} \phi(x^q_j) - \frac{1}{n_p}\sum_{i=1}^{n_p} \frac{\phi(x^p_i)}{\theta^\top \phi(x^p_i)} + \lambda\theta. $$
The regularization parameter $\lambda$ and hyperparameters included in the basis functions $\phi(x)$ can be objectively selected via cross-validation with respect to the objective function.
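To make this concrete, here is a minimal numpy sketch of the batch KL estimator with a Gaussian-kernel linear-in-parameter model. All function and variable names are ours, and the nonnegativity projection (clipping $\theta$ at a small positive floor so that $\theta^\top\phi(x)$ stays positive inside the logarithm) is one practical choice for handling the positivity constraint, not a prescription from the letter.

```python
import numpy as np

def gauss_basis(X, centers, sigma):
    """Design matrix Phi[i, l] = exp(-||x_i - c_l||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def batch_kl_ratio(Phi_p, Phi_q, lam=0.1, lr=0.05, n_iter=5000):
    """Gradient descent on the empirical KL objective
    mean(Phi_q @ theta) - mean(log(Phi_p @ theta)) + (lam/2) ||theta||^2,
    with theta clipped to a small positive floor so the log stays defined."""
    theta = np.full(Phi_p.shape[1], 1.0 / Phi_p.shape[1])
    for _ in range(n_iter):
        rp = Phi_p @ theta  # model outputs on numerator samples
        grad = (Phi_q.mean(axis=0)
                - (Phi_p / rp[:, None]).mean(axis=0)
                + lam * theta)
        theta = np.maximum(theta - lr * grad, 1e-8)
    return theta
```

With centers drawn from the numerator samples, the fitted ratio comes out larger where $p(x)$ exceeds $q(x)$, as expected.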

2.2.3  Least-Squares Approach

Choosing the Bregman divergence as $f(t) = \frac{1}{2}(t-1)^2$ gives the squared loss:
$$ \min_\theta\; \frac{1}{2}\int q(x)\, \hat{r}(x;\theta)^2\, \mathrm{d}x - \int p(x)\, \hat{r}(x;\theta)\, \mathrm{d}x + C. $$
This corresponds to a least-squares fitting of the density-ratio model to the true density ratio (Kanamori et al., 2009). The empirical and regularized objective function can then be expressed as
$$ \min_\theta\; \frac{1}{2 n_q}\sum_{j=1}^{n_q} \hat{r}(x^q_j;\theta)^2 - \frac{1}{n_p}\sum_{i=1}^{n_p} \hat{r}(x^p_i;\theta) + \frac{\lambda}{2}\|\theta\|^2. $$
By substituting the linear model $\hat{r}(x;\theta) = \theta^\top\phi(x)$, this is simplified as a quadratic problem:
$$ \min_\theta\; \frac{1}{2}\,\theta^\top \hat{H}\theta - \hat{h}^\top\theta + \frac{\lambda}{2}\,\theta^\top\theta, $$
where $\hat{H}$ is a $b \times b$ matrix and $\hat{h}$ is a $b$-dimensional vector given as
$$ \hat{H} = \frac{1}{n_q}\sum_{j=1}^{n_q} \phi(x^q_j)\phi(x^q_j)^\top, \qquad \hat{h} = \frac{1}{n_p}\sum_{i=1}^{n_p} \phi(x^p_i). $$
The solution can then be analytically calculated as
$$ \hat{\theta} = \left(\hat{H} + \lambda I_b\right)^{-1}\hat{h}, $$
where $I_b$ is the $b \times b$ identity matrix. All hyperparameters in the model can be objectively set via cross-validation.
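The closed-form solution makes the LS estimator especially simple to implement. The numpy sketch below (names ours) builds $\hat{H}$ and $\hat{h}$ from Gaussian-kernel design matrices and solves the regularized linear system; with $\lambda > 0$ the system is always invertible.

```python
import numpy as np

def gauss_basis(X, centers, sigma):
    """Design matrix Phi[i, l] = exp(-||x_i - c_l||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def batch_ls_ratio(Phi_p, Phi_q, lam=0.1):
    """Analytic LS solution: theta = (H + lam I)^(-1) h, where
    H averages phi(x^q) phi(x^q)^T over denominator samples and
    h averages phi(x^p) over numerator samples."""
    H = Phi_q.T @ Phi_q / Phi_q.shape[0]
    h = Phi_p.mean(axis=0)
    return np.linalg.solve(H + lam * np.eye(H.shape[0]), h)
```

Unlike the KL estimator, no iterative optimization is needed, which is the main computational appeal of the LS approach in the batch setting.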

3  Online Density-Ratio Estimation

In this section, we consider an online learning setup where samples $x^p_t$ and $x^q_t$ following $p(x)$ and $q(x)$ are given sequentially at time step $t$. We first propose an online KL-based density-ratio estimator and then an LS-based online density-ratio estimator.

3.1  Online KL Density-Ratio Estimation

Given the current parameter $\mu_t$ that has been estimated using $x^p_1, \ldots, x^p_{t-1}$ and $x^q_1, \ldots, x^q_{t-1}$, the basic idea for the online method is to update the parameter to minimize the error for the next samples $x^p_t$ and $x^q_t$:
$$ \ell_t(\theta) = \theta^\top\phi(x^q_t) - \log\!\left(\theta^\top\phi(x^p_t)\right). $$
We employ the idea of adaptive regularization of weight vectors (AROW) (Crammer et al., 2009). For the parameter vector $\theta$, the normal distribution $N(\mu, \Sigma)$ with the mean vector $\mu$ and the covariance matrix $\Sigma$ is considered, and the update of parameters is penalized by the KL divergence. More specifically, the AROW KL-based training criterion to be minimized is
$$ J(\mu, \Sigma) = \mu^\top\phi(x^q_t) - \log\!\left(\mu^\top\phi(x^p_t)\right) + \frac{1}{\gamma}\,\mathrm{KL}\!\left(N(\mu,\Sigma)\,\|\,N(\mu_t,\Sigma_t)\right) + \frac{1}{2\gamma}\,\phi(x^p_t)^\top\Sigma\,\phi(x^p_t), \tag{3.1} $$
where $\gamma > 0$ is the passiveness parameter. The first two terms on the right-hand side of equation 3.1 correspond to the KL error for the next samples $x^p_t$ and $x^q_t$, the third term is the KL penalty for parameter updates, and the fourth term is the regularizer for the covariance matrix $\Sigma$.
The optimality condition with regard to the mean $\mu$ is
$$ \phi(x^q_t) - \frac{\phi(x^p_t)}{\mu^\top\phi(x^p_t)} + \frac{1}{\gamma}\,\Sigma_t^{-1}(\mu - \mu_t) = 0. $$
Note that since the domain of the logarithm is $(0, \infty)$, in the above we should take into account the constraint $\mu^\top\phi(x^p_t) > 0$. We can simplify the above by making the substitution
$$ m = \mu^\top\phi(x^p_t) $$
and multiplying the entire equation with $\gamma\,\Sigma_t$:
$$ \mu = \mu_t + \gamma\,\Sigma_t\!\left(\frac{\phi(x^p_t)}{m} - \phi(x^q_t)\right). \tag{3.2} $$
Multiplying this with $\phi(x^p_t)^\top$ from the left of equation 3.2, we have
$$ m = \mu_t^\top\phi(x^p_t) + \gamma\,\frac{\phi(x^p_t)^\top\Sigma_t\,\phi(x^p_t)}{m} - \gamma\,\phi(x^p_t)^\top\Sigma_t\,\phi(x^q_t). \tag{3.3} $$
Again multiplying with $m$ gives
$$ m^2 = m\,\mu_t^\top\phi(x^p_t) + \gamma\,\phi(x^p_t)^\top\Sigma_t\,\phi(x^p_t) - \gamma\,m\,\phi(x^p_t)^\top\Sigma_t\,\phi(x^q_t), $$
and collecting the terms gives
$$ m^2 - \left(\mu_t^\top\phi(x^p_t) - \gamma\,\phi(x^p_t)^\top\Sigma_t\,\phi(x^q_t)\right)m - \gamma\,\phi(x^p_t)^\top\Sigma_t\,\phi(x^p_t) = 0. $$
By defining
$$ a = \phi(x^p_t)^\top\Sigma_t\,\phi(x^p_t), \qquad c = \mu_t^\top\phi(x^p_t) - \gamma\,\phi(x^p_t)^\top\Sigma_t\,\phi(x^q_t), $$
we can solve it for $m$ as
$$ m = \frac{c \pm \sqrt{c^2 + 4\gamma a}}{2}. $$
Note that this quadratic has two solutions. However, since $m > 0$ because the domain of the logarithm is positive, the solution is given by
$$ \hat{m} = \frac{c + \sqrt{c^2 + 4\gamma a}}{2}. $$
This is then substituted in equation 3.2 to obtain the update rule:
$$ \mu_{t+1} = \mu_t + \gamma\,\Sigma_t\!\left(\frac{\phi(x^p_t)}{\hat{m}} - \phi(x^q_t)\right). \tag{3.4} $$
Next, we equate the derivative of equation 3.1 with respect to $\Sigma$ to zero:
$$ \frac{1}{2\gamma}\left(\Sigma_t^{-1} - \Sigma^{-1}\right) + \frac{1}{2\gamma}\,\phi(x^p_t)\phi(x^p_t)^\top = 0. $$
Using the Sherman-Morrison formula1 gives
$$ \Sigma_{t+1} = \Sigma_t - \beta_t\,\Sigma_t\,\phi(x^p_t)\phi(x^p_t)^\top\Sigma_t, \tag{3.5} $$
where we put
$$ \beta_t = \frac{1}{1 + \phi(x^p_t)^\top\Sigma_t\,\phi(x^p_t)}. $$
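The mean and covariance updates of this section can be sketched in a few lines of numpy. This is our own illustrative rendering of the derivation above (the scalar quadratic in $m = \mu^\top\phi(x^p_t)$ solved by its positive root, followed by the Sherman-Morrison covariance downdate); all names are ours.

```python
import numpy as np

def arow_kl_step(mu, Sigma, phi_p, phi_q, gamma=0.1):
    """One AROW-KL update for the sample pair (x_t^p, x_t^q),
    given their basis-function vectors phi_p and phi_q."""
    Sp = Sigma @ phi_p
    a = phi_p @ Sp                         # phi_p^T Sigma phi_p (> 0 for PD Sigma)
    c = mu @ phi_p - gamma * (phi_q @ Sp)  # linear coefficient of the quadratic in m
    m = 0.5 * (c + np.sqrt(c ** 2 + 4.0 * gamma * a))  # positive root: mu^T phi_p > 0
    mu_new = mu + gamma * (Sigma @ (phi_p / m - phi_q))
    Sigma_new = Sigma - np.outer(Sp, Sp) / (1.0 + a)   # Sherman-Morrison rank-one downdate
    return mu_new, Sigma_new
```

Because the covariance update is a rank-one downdate with a positive denominator, a positive-definite $\Sigma$ stays positive definite, and the step sizes along frequently seen basis directions shrink over time.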

3.2  Online LS Density Ratio Estimation

Given the current parameter $\mu_t$ that has been estimated using $x^p_1, \ldots, x^p_{t-1}$ and $x^q_1, \ldots, x^q_{t-1}$, the basic idea for the online method is to update the parameter to minimize the error for the next samples $x^p_t$ and $x^q_t$:
$$ \ell_t(\theta) = \frac{1}{2}\left(\theta^\top\phi(x^q_t)\right)^2 - \theta^\top\phi(x^p_t). $$
Then the AROW-type update rule, derived just as in section 3.1, is
$$ \mu_{t+1} = \left(\Sigma_t^{-1} + \gamma\,\phi(x^q_t)\phi(x^q_t)^\top\right)^{-1}\left(\Sigma_t^{-1}\mu_t + \gamma\,\phi(x^p_t)\right) = \mu_t + \gamma\,\Sigma_t\left(\phi(x^p_t) - \hat{s}_t\,\phi(x^q_t)\right), $$
where the second expression follows from the application of the Sherman-Morrison formula and $\hat{s}_t$ is defined as
$$ \hat{s}_t = \frac{\phi(x^q_t)^\top\mu_t + \gamma\,\phi(x^q_t)^\top\Sigma_t\,\phi(x^p_t)}{1 + \gamma\,\phi(x^q_t)^\top\Sigma_t\,\phi(x^q_t)}. $$
The update rule for $\Sigma$ is exactly the same as that for the KL method (see equation 3.5).

Note that the above online LS density-ratio estimator can be regarded as an application of classical recursive least-squares (Haykin, 2002) to density-ratio estimation.
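The LS update is likewise a closed-form, recursive-least-squares-style step. The sketch below is our own rendering: the scalar $s$ makes the otherwise implicit mean update explicit, and the covariance follows the same rule as the KL method; all names are ours.

```python
import numpy as np

def arow_ls_step(mu, Sigma, phi_p, phi_q, gamma=0.1):
    """One AROW-LS update. The scalar s equals mu_{t+1}^T phi_q,
    obtained in closed form via Sherman-Morrison, which resolves
    the implicit mean update."""
    Sq = Sigma @ phi_q
    s = (mu @ phi_q + gamma * (phi_p @ Sq)) / (1.0 + gamma * (phi_q @ Sq))
    mu_new = mu + gamma * (Sigma @ phi_p - s * Sq)
    Sp = Sigma @ phi_p
    Sigma_new = Sigma - np.outer(Sp, Sp) / (1.0 + phi_p @ Sp)  # same rule as equation 3.5
    return mu_new, Sigma_new
```

A single step with $\mu_t = 0$, $\Sigma_t = I$, and orthogonal basis vectors moves the mean by $\gamma$ along $\phi(x^p_t)$ and shrinks the covariance along that direction, which matches the passive-aggressive intuition behind AROW.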

4  Experiments

In this section, we experimentally investigate the performance of the proposed online density-ratio estimators on the problem of online inlier-based outlier detection (Hido, Tsuboi, Kashima, Sugiyama, & Kanamori, 2011).

4.1  Setup

Following Hido et al. (2011), we formulate the inlier-based outlier detection problem as the problem of estimating the density ratio $r(x) = p(x)/q(x)$, where $p(x)$ is the density of inliers and $q(x)$ is the density of unlabeled samples (i.e., a mixture of inliers and outliers). Samples for which the density ratio is low tend to be outliers (see Figure 1). This is due to the fact that the probability that a sample is an inlier is proportional to the density ratio. To see this, assume that inliers are labeled as $y = 1$ and outliers are labeled as $y = 0$. Then the probability that a sample $x$ is an inlier is
$$ p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{q(x)} = p(y = 1)\, r(x) \propto r(x), $$
since the inlier density is $p(x) = p(x \mid y = 1)$ and the unlabeled density is $q(x)$.
To identify outliers, we are therefore interested in areas where the density ratio $r(x)$ is low. Note that traditional outlier detection methods assume that outliers occur in areas where the inlier density $p(x)$ is low. Inlier-based outlier detection is, however, not constrained by such an assumption and can identify outliers that occur even in high-density areas (see Figures 1 and 2).
Figure 1:

An example of density-ratio-based outlier detection. (a) Densities for a data set consisting of inliers, $p(x)$, and a corrupted data set consisting of both inliers and outliers, $q(x)$. (b) The density ratio $r(x) = p(x)/q(x)$ takes a small value in the region where $q(x)$ significantly differs from $p(x)$. It is in this region where outliers occur.

Figure 2:

An example of density ratio–based outlier detection when the inlier and outlier distributions overlap. We see that (a) even when outliers occur in high-density areas, they can be (b) identified via the density ratio.

In the experiments that follow, we first prepare training and test sets that both contain inliers and outliers. Then we form the inlier set from the inliers contained in the training set and form the unlabeled set from both the inliers and outliers in the training set.

We initially give to outlier detectors inlier samples and unlabeled samples randomly chosen from the inlier and unlabeled sets, respectively. Then a pair of randomly chosen inlier and unlabeled samples is given to the outlier detectors in an online manner over iterations. In practice, the class prior of the mixture data set is unknown, so the score calculated by the estimated density ratio $\hat{r}(x)$ cannot be appropriately thresholded.2 Thus, instead of the classification accuracy, the performance of outlier detectors is evaluated by the area under the receiver operating characteristic curve (AUC) for the test set. The AUC is used since it is independent of the particular class prior of the unlabeled data set and of the threshold.
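The AUC used here can be computed directly as the probability that a randomly drawn inlier receives a higher score than a randomly drawn outlier (the Mann-Whitney statistic). A small sketch, with names of our choosing:

```python
import numpy as np

def auc(inlier_scores, outlier_scores):
    """AUC = P(inlier score > outlier score), ties counted as 1/2.
    Under density-ratio scoring, inliers should score high and
    outliers low, so a good detector gives an AUC close to 1."""
    si = np.asarray(inlier_scores, dtype=float)[:, None]
    so = np.asarray(outlier_scores, dtype=float)[None, :]
    return float((si > so).mean() + 0.5 * (si == so).mean())
```

Because the statistic depends only on the ranking of scores, it is invariant to any monotone rescaling of $\hat{r}(x)$ and to the unknown class prior, which is exactly why it is used here.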

For density-ratio estimation, we use gaussian kernels as the basis functions $\phi_\ell(x)$:
$$ \phi_\ell(x) = \exp\!\left(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\right), \qquad \ell = 1, \ldots, b, $$
where $\|\cdot\|$ denotes the Euclidean norm, $\sigma$ is the gaussian bandwidth, and $c_1, \ldots, c_b$ are the gaussian centers chosen from the initial numerator (inlier) samples. The density ratio was estimated using these methods:
• The batch KL method (Batch-KL) described in section 2

• The online KL method (AROW-KL) proposed in section 3

• The online KL method with naive stochastic gradient descent (SGD-KL)

• The batch LS method (Batch-LS) described in section 2

• The online LS method (AROW-LS) described in section 3

• The online LS method with naive stochastic gradient descent (SGD-LS)

Density-ratio estimators contain unknown hyperparameters. Cross-validation is the standard way to select these hyperparameter values. However, previous samples are not stored in the online methods, and therefore cross-validation cannot be directly employed. Here, we maintain all models with different hyperparameter values throughout the online learning process and use the newest samples for validation to choose the best model at each time step. More specifically, estimation is carried out using samples only up to time t, and the latest samples are used for computing the validation error with respect to the objective function. For fair comparison, we also use the same hyperparameter selection scheme for the batch methods.

The gaussian kernel model for all density-ratio estimators contains a hyperparameter for the kernel width, which was selected from the following candidates:
In addition to the kernel width, each method contains one additional hyperparameter. For each batch density-ratio estimator, a regularization parameter is selected from
and for each AROW-based density-ratio estimator, a passiveness parameter is selected from
For the stochastic gradient descent methods, the step size was treated as a hyperparameter and selected from
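The online model-selection scheme described above can be sketched generically: keep one model per candidate hyperparameter value, hold the newest samples in a validation buffer, train each model only on samples that have left the buffer, and select the model with the lowest validation loss at each step. Everything below (names, callables, buffer length) is our own illustrative framing, not code from the letter.

```python
import numpy as np
from collections import deque

def online_model_selection(stream, states, update, val_loss, buffer_len=5):
    """states maps each candidate hyperparameter value to its model state.
    New samples enter a validation buffer; once a sample leaves the buffer,
    every model is trained on it. The model with the lowest average
    validation loss over the buffer is selected at each step."""
    recent = deque()
    chosen = []
    for sample in stream:
        recent.append(sample)
        if len(recent) > buffer_len:
            oldest = recent.popleft()          # sample leaves the validation
            for hp in states:                  # window and is used for training
                states[hp] = update(hp, states[hp], oldest)
        chosen.append(min(
            states,
            key=lambda hp: np.mean([val_loss(hp, states[hp], s) for s in recent]),
        ))
    return chosen
```

As a toy usage, the "models" can be running means with different step sizes tracking a constant stream; the faster learner fits the recent buffer better and is eventually selected.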

4.2  Spambase Data Set

First, we perform experiments using the Spambase data set, which contains e-mail samples described by numerical attributes.3 The data set contains both spam and nonspam samples. Among the attributes, we use only the word-frequency attributes (the percentage of words in an e-mail) for outlier detection. Twenty-five percent of the data set is used for evaluation, and the remaining 75% is used for training. It is assumed that all the positive samples are inliers. The probability that an outlier occurs is set to .

Figure 3a (left) depicts the AUC values as a function of the sample size. This figure shows that the batch methods are generally more accurate than their online counterparts. This is expected since the batch methods have access to all samples. The proposed online methods are not much worse than their batch counterparts and much better than naive stochastic gradient descent–based online methods.

Figure 3:

Experimental results for the Spambase (left column) and MNIST (right column) data sets. The standard error is given as error bars. Batch-KL and Batch-LS use all the samples in a batch setup, and AROW-KL and AROW-LS are the proposed AROW-based online methods. SGD-KL and SGD-LS are the naive stochastic gradient descent–based online methods.

Another interesting thing to note is that the batch KL method is generally better than the batch LS method. Figure 1 shows that outliers occur in areas where the density ratio is small. In Figure 4, the KL loss and the squared loss are plotted. From these plots, we can confirm that the KL loss penalizes errors more severely when the density ratio is small. This may contribute to the greater accuracy in identifying outliers.

Figure 4:

The squared loss and the KL loss for different values of the density ratio. The KL loss penalizes errors much more strongly when the density ratio is small.

Figure 3b (left) depicts the cumulative computation time as a function of the sample size, showing that the computation time of the online methods is significantly lower than that of the batch methods. Stochastic gradient descent–based online methods are faster than the proposed AROW-based online methods, perhaps since the proposed methods maintain the covariance matrix $\Sigma$, which is updated at each time step. This may be mitigated by approximating $\Sigma$ by a diagonal matrix, as in the original AROW paper (Crammer et al., 2009). Because the LS method has an analytic solution, the batch LS method is much faster than the batch KL method. This is a major reason that the LS method is often preferred over the KL method. However, we see that in the online setup, both the KL and LS methods are about the same speed.

In Figure 3c (left), we plot the cumulative computation time against the AUC values. From this, we see that with a limited computational budget, the proposed online methods significantly outperform both their batch and the stochastic gradient descent counterparts.

4.3  MNIST Data Set

Next, we use the MNIST data set, which contains images of handwritten digits of size 28 × 28 pixels.4 Each image is represented by a 784-dimensional feature vector, which is much higher dimensional than the previous Spambase data set. Each pixel in the images is normalized to $[0, 1]$, representing its gray-scale intensity level.

For the first experiment, we use the images of the digits 4 and 9 in the data sets and regard 4 as the inlier class and 9 as the outlier class. Twenty-five percent of the data set is used for evaluation, and the remaining 75% is training data. The probability that an outlier is drawn in the unlabeled data set was set to .

The experimental results are given in the right-hand column of Figure 3. From the graphs, we see that the results show similar tendencies to the previous results. That is, the proposed online methods are significantly faster than their batch versions. Furthermore, due to the better estimate of the density ratio, their outlier detection accuracy is much higher than that of the online stochastic gradient methods. With a limited computational budget, the proposed methods also give higher classification accuracy than all other methods.

For the second experiment, we choose one digit as the inlier class and regard all other digits as outliers. The initial inlier data set contained samples and the unlabeled data set samples. The probability that an outlier occurs in the unlabeled data set was set to . This experiment was performed by selecting each of the digits 1–9 in turn as the inlier class (i.e., nine separate experiments). The AUC versus cumulative computation time is plotted in Figure 5. From the graphs, we again see that for a limited computational budget, the proposed online KL method gives more accurate results.

Figure 5:

Computational time versus accuracy when a single digit is considered an inlier and the rest are considered outliers.

5  Conclusion

Various machine learning problems can be solved using density-ratio estimation, which can be performed by matching the density ratio to a model under a Bregman divergence. Two popular approaches are to use the Kullback-Leibler loss and the squared loss. In this letter, we extended the original batch density-ratio estimators to an online learning scenario based on the idea of adaptive regularization of weight vectors, which has been successfully used in regression and classification (Crammer et al., 2009). Through experiments on inlier-based outlier detection (Hido et al., 2011), we demonstrated the usefulness of the proposed methods. We showed that for a given computational budget, online AROW-based methods outperform both online stochastic gradient–descent and batch methods. We also showed that the KL divergence–based loss may be more suited to the outlier detection problem.

Acknowledgments

We thank Tomoya Sakai for his valuable comments. M.C.dP. was supported by the JST CREST project, and M.S. was supported by KAKENHI 25700022.

References

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200–217.

Crammer, K., Kulesza, A., & Dredze, M. (2009). Adaptive regularization of weight vectors. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 414–422).

Haykin, S. (2002). Adaptive filter theory. Upper Saddle River, NJ: Prentice Hall.

Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. (2011). Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2), 309–336.

Izbicki, R., Lee, A., & Schafer, C. (2014). High-dimensional density ratio estimation with extensions to approximate likelihood computation. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (pp. 420–429). N.p.: JMLR.

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.

Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.

Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge, MA: MIT Press.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012a). Density ratio estimation in machine learning. Cambridge: Cambridge University Press.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012b). Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5), 1009–1044.

Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Notes

1. For a matrix $A$ and vectors $u$ and $v$ of compatible dimension, it holds that
$$ \left(A + uv^\top\right)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1 + v^\top A^{-1}u}. $$

2. In practice, a common strategy is to rank the unlabeled samples according to the score and then remove a percentage of the samples with the lowest score. This percentage is specified by the practitioner based on domain knowledge.

3. The data set was obtained from http://archive.ics.uci.edu/ml/.

4. The data set was obtained from http://yann.lecun.com/exdb/mnist/.