## Abstract

Many machine learning problems, such as nonstationarity adaptation, outlier detection, dimensionality reduction, and conditional density estimation, can be effectively solved by using the ratio of probability densities. Since the naive two-step procedure of first estimating the probability densities and then taking their ratio performs poorly, methods to directly estimate the density ratio from two sets of samples without density estimation have been extensively studied recently. However, these methods are batch algorithms that use the whole data set to estimate the density ratio, and they are inefficient in the online setup, where training samples are provided sequentially and solutions are updated incrementally without storing previous samples. In this letter, we propose two online density-ratio estimators based on the adaptive regularization of weight vectors. Through experiments on inlier-based outlier detection, we demonstrate the usefulness of the proposed methods.

## 1 Introduction

Almost all machine learning problems can be solved through density estimation, because knowing the probability density is equivalent to knowing everything about the data. Thus, density estimation is the most versatile approach to machine learning, and various methods have been proposed so far. However, without strong parametric assumptions, density estimation is hard to perform accurately in high-dimensional problems. Thus, it is desirable to solve target machine learning tasks directly without performing density estimation (Vapnik, 1998).

Following this idea, the machine learning approach based on the ratio of probability densities has attracted attention recently (Sugiyama, Suzuki, & Kanamori, 2012a). The rationale behind density ratio estimation is that many machine learning problems, such as transfer learning, outlier detection, change detection, and dimension reduction, can be solved in a unified manner using just the density ratio. By directly estimating this density ratio, the difficult task of density estimation can be avoided, leading to better empirical performance.

Owing to its practical utility, several methods for density-ratio estimation have been proposed (Kanamori, Hido, & Sugiyama, 2009; Sugiyama, Suzuki, & Kanamori, 2012b; Izbicki, Lee, & Schafer, 2014). Sugiyama et al. (2012b) showed that density-ratio estimation may be performed by matching the density ratio to a model of the density ratio under a Bregman divergence. This view of density-ratio estimation is of great interest, since it relates to several existing methods, and the resulting estimators can be interpreted in terms of the Bregman divergence. The simplest Bregman divergence corresponds to the squared loss between the density ratio and its model (Kanamori et al., 2009). The main advantage of the least-squares-based density-ratio estimator is that the solution can be obtained analytically. Another choice for the Bregman divergence is the Kullback-Leibler loss. This Kullback-Leibler-based estimator also appears in the variational estimation of the Kullback-Leibler divergence (Nguyen, Wainwright, & Jordan, 2010). The main disadvantage of the Kullback-Leibler-based estimator is that it does not have a closed-form solution, and optimization is usually performed via gradient or quasi-Newton methods. Although these estimators have been demonstrated to work well on many different problems (Sugiyama et al., 2012a; Sugiyama & Kawanabe, 2012), they work only in batch mode and thus are not efficient in online problems where training samples are provided sequentially and solutions are updated incrementally without storing previous samples.

In this letter, we propose online versions of the least-squares and Kullback-Leibler-based density-ratio estimators in the framework of adaptive regularization of weight vectors (Crammer, Kulesza, & Dredze, 2009), which was originally proposed for regression and classification. We experimentally demonstrate that, for a fixed computational budget, our proposed online algorithms achieve higher accuracy in inlier-based outlier detection than both the batch solutions and online solutions obtained via naive stochastic gradient descent.

## 2 Batch Density-Ratio Estimation

In this section, we formulate the batch density-ratio estimation problem and review density-ratio estimation using Bregman divergences.

### 2.1 Problem Formulation

A naive approach is to estimate the two densities separately from their respective samples and take the ratio of the estimated densities. However, such a two-step plug-in approach is not reliable because the first step of density estimation is performed without regard to the second step of taking the ratio (Sugiyama et al., 2012a). Below, we review a direct density-ratio estimation method that does not involve density estimation.

### 2.2 Batch Algorithm

Here, we review batch density-ratio estimation algorithms using Bregman divergences.

#### 2.2.1 General Framework with Bregman Divergences

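The Bregman divergence from $t^*$ to $t$ is defined as (Bregman, 1967)

$$\mathrm{BR}'_f(t^* \,\|\, t) = f(t^*) - f(t) - \partial f(t)\,(t^* - t),$$

where $f$ is a strictly convex function and $\partial f$ denotes its derivative. The above is minimized with respect to $t$ when $t = t^*$. The above distance is useful, since $t^*$ occurs linearly in all terms that include both $t^*$ and $t$.

Let $r^*(\boldsymbol{x}) = p_{\mathrm{nu}}(\boldsymbol{x}) / p_{\mathrm{de}}(\boldsymbol{x})$ denote the true density ratio of a numerator density $p_{\mathrm{nu}}$ to a denominator density $p_{\mathrm{de}}$. Minimizing the Bregman divergence between the true density ratio $r^*(\boldsymbol{x})$ and a model of the density ratio $r(\boldsymbol{x})$, weighted by $p_{\mathrm{de}}(\boldsymbol{x})$, gives

$$\mathrm{BR}_f(r^* \,\|\, r) = C - \int p_{\mathrm{de}}(\boldsymbol{x})\, f(r(\boldsymbol{x}))\, \mathrm{d}\boldsymbol{x} + \int p_{\mathrm{de}}(\boldsymbol{x})\, \partial f(r(\boldsymbol{x}))\, r(\boldsymbol{x})\, \mathrm{d}\boldsymbol{x} - \int p_{\mathrm{nu}}(\boldsymbol{x})\, \partial f(r(\boldsymbol{x}))\, \mathrm{d}\boldsymbol{x},$$

where $C = \int p_{\mathrm{de}}(\boldsymbol{x})\, f(r^*(\boldsymbol{x}))\, \mathrm{d}\boldsymbol{x}$ is constant with regard to $r$. Given $n_{\mathrm{nu}}$ samples $\boldsymbol{x}^{\mathrm{nu}}_i$ from $p_{\mathrm{nu}}$ and $n_{\mathrm{de}}$ samples $\boldsymbol{x}^{\mathrm{de}}_j$ from $p_{\mathrm{de}}$, the empirical version of the above problem is therefore to minimize

$$\frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \Big[ \partial f\big(r(\boldsymbol{x}^{\mathrm{de}}_j)\big)\, r(\boldsymbol{x}^{\mathrm{de}}_j) - f\big(r(\boldsymbol{x}^{\mathrm{de}}_j)\big) \Big] - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \partial f\big(r(\boldsymbol{x}^{\mathrm{nu}}_i)\big)$$

with respect to the model parameters, where a regularization term with regularization parameter $\lambda$ is added to the objective.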

A final consideration is choosing a suitable Bregman divergence by defining the function *f*. We discuss two such choices for *f*.
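As a small numerical illustration of this framework, the sketch below evaluates the pointwise Bregman divergence for two convex functions that are commonly paired with the KL and LS approaches in the density-ratio literature: f(t) = t log t − t and f(t) = (t − 1)²/2. These two forms and all helper names (`bregman`, `kl_div`, `ls_div`) are assumptions made for illustration, not definitions taken from this letter.

```python
import math


def bregman(t_star, t, f, df):
    """Pointwise Bregman divergence f(t*) - f(t) - f'(t) * (t* - t)."""
    return f(t_star) - f(t) - df(t) * (t_star - t)


# Assumed generator for the KL approach: f(t) = t log t - t, so f'(t) = log t.
def f_kl(t):
    return t * math.log(t) - t


def df_kl(t):
    return math.log(t)


# Assumed generator for the LS approach: f(t) = (t - 1)^2 / 2, so f'(t) = t - 1.
def f_ls(t):
    return 0.5 * (t - 1.0) ** 2


def df_ls(t):
    return t - 1.0


def kl_div(t_star, t):
    """Reduces to t* log(t*/t) - t* + t (requires t*, t > 0)."""
    return bregman(t_star, t, f_kl, df_kl)


def ls_div(t_star, t):
    """Reduces to (t* - t)^2 / 2."""
    return bregman(t_star, t, f_ls, df_ls)
```

Under these assumed generators, both divergences vanish when the model value equals the true value and are positive otherwise, which is the property the matching framework relies on.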

#### 2.2.2 Kullback-Leibler Approach

#### 2.2.3 Least-Squares Approach
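The analytic solution available under the squared loss can be sketched as follows. This is a minimal illustration that assumes a linear-in-parameter model r(x) = θᵀφ(x) with Gaussian kernel basis functions and a squared ℓ2 regularizer, in the spirit of Kanamori et al. (2009). The function names, the kernel width `sigma`, the regularization value `lam`, and the `x_nu`/`x_de` naming for numerator and denominator samples are hypothetical choices, not the settings used in this letter.

```python
import numpy as np


def gaussian_basis(x, centers, sigma):
    """Return the (n, b) matrix with entries exp(-||x_i - c_l||^2 / (2 sigma^2))."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def batch_ls_ratio(x_nu, x_de, centers, sigma=1.0, lam=0.1):
    """Closed-form minimizer of 0.5 * theta^T H theta - h^T theta + 0.5 * lam * ||theta||^2."""
    phi_nu = gaussian_basis(x_nu, centers, sigma)
    phi_de = gaussian_basis(x_de, centers, sigma)
    H = phi_de.T @ phi_de / len(x_de)   # second moment under the denominator samples
    h = phi_nu.mean(axis=0)             # first moment under the numerator samples
    return np.linalg.solve(H + lam * np.eye(len(centers)), h)


def predict_ratio(x, theta, centers, sigma=1.0):
    """Evaluate the estimated density ratio theta^T phi(x) at each row of x."""
    return gaussian_basis(x, centers, sigma) @ theta
```

Because the objective is quadratic in θ, a single linear solve suffices, which is why the batch LS estimator is cheap compared with iterative optimization.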

## 3 Online Density-Ratio Estimation

In this section, we consider an online learning setup where a pair of samples, one following each of the two densities, is given sequentially at each time step *t*. We first propose an online KL-based density-ratio estimator and then an LS-based online density-ratio estimator.

### 3.1 Online KL Density-Ratio Estimation

### 3.2 Online LS Density Ratio Estimation

Note that the above online LS density-ratio estimator can be regarded as an application of classical recursive least-squares (Haykin, 2002) to density-ratio estimation.
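The connection to recursive least squares can be illustrated with a small sketch. It assumes a linear-in-parameter model θᵀφ(x), a fixed ridge term `gamma`, and one numerator/denominator pair of feature vectors per step. The inverse matrix is maintained with the Sherman-Morrison formula. The class name and the exact update are illustrative assumptions in the spirit of classical RLS, not the update rule derived in this letter.

```python
import numpy as np


class RecursiveLSRatio:
    """Tracks theta_t = (gamma * I + sum_s phi_de_s phi_de_s^T)^{-1} sum_s phi_nu_s."""

    def __init__(self, dim, gamma=1.0):
        self.P = np.eye(dim) / gamma   # inverse of the regularized second-moment matrix
        self.h = np.zeros(dim)         # running sum of numerator feature vectors
        self.theta = np.zeros(dim)

    def update(self, phi_nu, phi_de):
        # Sherman-Morrison rank-one update of the inverse; no past samples are stored.
        P_phi = self.P @ phi_de
        self.P -= np.outer(P_phi, P_phi) / (1.0 + phi_de @ P_phi)
        self.h += phi_nu
        self.theta = self.P @ self.h
        return self.theta
```

Each step costs O(b²) for b basis functions, and the result coincides with re-solving the regularized least-squares problem on all pairs seen so far.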

## 4 Experiments

In this section, we experimentally investigate the performance of the proposed online density-ratio estimators on the problem of online inlier-based outlier detection (Hido, Tsuboi, Kashima, Sugiyama, & Kanamori, 2011).

### 4.1 Setup

In the experiments that follow, we first prepare training and test sets that both contain inliers and outliers. Then we form the inlier set from the inliers contained in the training set and form the unlabeled set from both the inliers and outliers in the training set.

We initially give the outlier detectors inlier samples and unlabeled samples randomly chosen from the inlier and unlabeled sets, respectively. Then a pair of randomly chosen inlier and unlabeled samples is given to the outlier detectors in an online manner over the subsequent iterations. In practice, the class prior of the mixture data set is unknown, so the outlier score cannot be appropriately thresholded.^{2} Thus, instead of the classification accuracy, the performance of the outlier detectors is evaluated by the area under the receiver operating characteristic curve (AUC) on the test set. The AUC is used since it is independent of both the class prior of the unlabeled data set and the threshold.
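The threshold independence of the AUC can be seen from its pairwise form: it is the fraction of (inlier, outlier) pairs that the score ranks correctly. The sketch below is a plain illustration that assumes a larger score means "more inlier-like" (as with a density ratio that is small where outliers occur). The function name `auc` is hypothetical.

```python
def auc(inlier_scores, outlier_scores):
    """Fraction of (inlier, outlier) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for s_in in inlier_scores:
        for s_out in outlier_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(inlier_scores) * len(outlier_scores))


# Three of the four pairs are ordered correctly.
print(auc([0.9, 0.3], [0.5, 0.1]))  # 0.75
```

Only the ordering of the scores matters, so any strictly increasing transformation of the score leaves the value unchanged.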

The following methods are compared:

- The batch KL method (Batch-KL), described in section 2
- The online KL method (AROW-KL), proposed in section 3
- The online KL method with naive stochastic gradient descent (SGD-KL)
- The batch LS method (Batch-LS), described in section 2
- The online LS method (AROW-LS), proposed in section 3
- The online LS method with naive stochastic gradient descent (SGD-LS)
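For reference, a naive stochastic gradient step for the LS objective can be sketched as below. The per-pair loss, the linear-in-parameter model θᵀφ(x), the step size `lr`, and the regularization value `lam` are assumptions chosen for illustration. The exact form of the SGD baselines used in the experiments may differ.

```python
import numpy as np


def ls_pair_loss(theta, phi_nu, phi_de, lam=0.01):
    """Assumed per-pair LS loss: 0.5 (theta^T phi_de)^2 - theta^T phi_nu + 0.5 lam ||theta||^2."""
    return 0.5 * (theta @ phi_de) ** 2 - theta @ phi_nu + 0.5 * lam * (theta @ theta)


def sgd_ls_step(theta, phi_nu, phi_de, lr=0.1, lam=0.01):
    """One gradient step on the per-pair loss above; only the current pair is used."""
    grad = phi_de * (theta @ phi_de) - phi_nu + lam * theta
    return theta - lr * grad
```

Unlike a second-order update, this step uses a single scalar step size for all directions, which makes it cheap but sensitive to the choice of `lr`.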

Density-ratio estimators contain unknown hyperparameters. Cross-validation is the standard way to select these hyperparameter values. However, previous samples are not stored in the online methods, and therefore cross-validation cannot be directly employed. Here, we maintain all models with different hyperparameter values throughout the online learning process and use newer samples for validation to choose the best model at each time step. More specifically, at each time step, estimation is carried out using samples only up to an earlier time *t*, and the latest samples, which arrived after *t*, are used for computing the validation error with respect to the objective function. For a fair comparison, we also use the same hyperparameter selection scheme for the batch methods.
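This selection scheme can be sketched as follows. The sketch keeps every candidate model, holds the newest `k` sample pairs in a small window for validation, and trains the candidates only on pairs that have left the window. The class names, the window length `k`, and the toy scalar model are hypothetical and serve only to make the example runnable.

```python
from collections import deque


class HoldoutSelector:
    """Maintains all candidate models and scores each on the newest k pairs."""

    def __init__(self, models, k):
        self.models = models
        self.k = k
        self.recent = deque()   # validation window; older pairs are not stored

    def observe(self, pair):
        self.recent.append(pair)
        if len(self.recent) > self.k:
            old = self.recent.popleft()   # this pair leaves the validation window
            for model in self.models:
                model.update(old)         # training uses only the older pairs

    def best(self):
        if not self.recent:
            return self.models[0]
        def val_error(model):
            return sum(model.loss(p) for p in self.recent) / len(self.recent)
        return min(self.models, key=val_error)


class ScalarLS:
    """Toy candidate: r(x) = theta * x, trained by SGD on a per-pair LS loss."""

    def __init__(self, lr):
        self.lr = lr
        self.theta = 0.0

    def loss(self, pair):
        x_nu, x_de = pair
        return 0.5 * (self.theta * x_de) ** 2 - self.theta * x_nu

    def update(self, pair):
        x_nu, x_de = pair
        self.theta -= self.lr * (self.theta * x_de * x_de - x_nu)
```

A usage pattern would be to call `observe` once per incoming pair and `best` whenever a score is needed. Only the `k` most recent pairs are kept in memory.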

### 4.2 Spambase Data Set

First, we perform experiments using the Spambase data set, which contains e-mail samples with attributes.^{3} There are spam samples in the data set. Among attributes, we use only (the percentage of words in an e-mail) for outlier detection. Twenty-five percent of the data set is used for evaluation, and the remaining 75% is used as training data. It is assumed that all the positive samples are inliers. The probability that an outlier occurs is set to .

Figure 3a (left) depicts the AUC values as a function of the sample size. This figure shows that the batch methods are generally more accurate than their online counterparts. This is expected since the batch methods have access to all samples. The proposed online methods are not much worse than their batch counterparts and much better than naive stochastic gradient descent–based online methods.

Another notable point is that the batch KL method is generally better than the batch LS method. Figure 1 shows that outliers occur in areas where the density ratio is small. In Figure 4, the KL divergence and the squared loss are plotted. From these plots, we can confirm that the KL divergence penalizes errors more severely when the density ratio is small. This may contribute to the greater accuracy in identifying outliers.

Figure 3b (left) depicts the cumulative computation time as a function of the sample size, showing that the computation time of the online methods is significantly lower than that of the batch methods. The stochastic gradient descent–based online methods are faster than the proposed AROW-based online methods, perhaps because the proposed methods contain a matrix that is updated at each time step. This cost may be mitigated by approximating this matrix by a diagonal matrix, as in the original AROW paper (Crammer et al., 2009). Because the LS method has an analytic solution, the batch LS method is much faster than the batch KL method. This aspect of the LS method is a major reason why it is often preferred over the KL method. However, we see that in the online setup, the KL and LS methods run at about the same speed.

In Figure 3c (left), we plot the cumulative computation time against the AUC values. From this, we see that with a limited computational budget, the proposed online methods significantly outperform both their batch and the stochastic gradient descent counterparts.

### 4.3 MNIST Data Set

Next, we use the MNIST data set, which contains images of handwritten digits of size pixels.^{4} Each image is represented by a -dimensional feature vector that is much higher dimensional than the previous Spambase data set. Each pixel in the images is normalized to , representing its gray-scale intensity level.

For the first experiment, we use the images of 4 or 9 in the data set and regard 4 as inliers and 9 as outliers. Twenty-five percent of the data set is used for evaluation, and the remaining 75% is used as training data. The probability that an outlier is drawn in the unlabeled data set was set to .

The experimental results are given in the right-hand column of Figure 3. The graphs show tendencies similar to those of the previous results. That is, the proposed online methods are significantly faster than their batch versions. Furthermore, owing to the better estimate of the density ratio, their outlier detection accuracy is much higher than that of the online stochastic gradient methods. With a limited computational budget, the proposed methods also give higher classification accuracy than all other methods.

For the second experiment, we choose one digit as an inlier class and regard all other digits as outliers. The initial inlier data set contained samples and the unlabeled data set samples. The probability that an outlier occurs in the unlabeled data set was set to . This experiment was performed by selecting 1–9 in turn as inliers (i.e., nine separate experiments). The AUC versus cumulative computation time is plotted in Figure 5. From the graphs, we again see that for a limited computational budget, the proposed online KL method gives more accurate results.

## 5 Conclusion

Various machine learning problems can be solved using density-ratio estimation, which can be performed by matching the density ratio to a model under a Bregman divergence. Two popular approaches are to use the Kullback-Leibler loss and the squared loss. In this letter, we extended the original batch density-ratio estimators to an online learning scenario based on the idea of adaptive regularization of weight vectors, which has been successfully used in regression and classification (Crammer et al., 2009). Through experiments on inlier-based outlier detection (Hido et al., 2011), we demonstrated the usefulness of the proposed methods. We showed that for a given computational budget, the online AROW-based methods outperform both online stochastic gradient descent and batch methods. We also showed that the KL divergence–based loss may be better suited to the outlier detection problem.

## Acknowledgments

We thank Tomoya Sakai for his valuable comments. M.C.dP. was supported by the JST CREST project, and M.S. was supported by KAKENHI 25700022.

## References

## Notes

^{2} In practice, a common strategy is to rank the unlabeled samples according to the score and then remove a percentage of the samples with the lowest score. This percentage is specified by the practitioner based on domain knowledge.

^{3} The data set was obtained from http://archive.ics.uci.edu/ml/.

^{4} The data set was obtained from http://yann.lecun.com/exdb/mnist/.