Abstract

Log-density gradient estimation is a fundamental statistical problem with various practical applications, such as clustering and measuring nongaussianity. A naive two-step approach of first estimating the density and then taking its log gradient is unreliable because an accurate density estimate does not necessarily lead to an accurate log-density gradient estimate. To cope with this problem, a method to directly estimate the log-density gradient without density estimation has been explored and demonstrated to work much better than the two-step method. The objective of this letter is to improve the performance of this direct method in multidimensional cases. Our idea is to regard the problem of log-density gradient estimation in each dimension as a task and apply regularized multitask learning to the direct log-density gradient estimator. We experimentally demonstrate the usefulness of the proposed multitask method in log-density gradient estimation and mode-seeking clustering.

1  Introduction

Multitask learning is a paradigm of machine learning for solving multiple related learning tasks simultaneously with the expectation that information brought by other related tasks can be mutually exploited to improve accuracy (Caruana, 1997). Multitask learning is particularly useful when one has many related learning tasks to solve but only a few training samples are available for each task, which is often the case in many real-world problems such as therapy screening (Bickel, Bogojeska, Lengauer, & Scheffer, 2008) and face verification (Wang, Zhang, & Zhang, 2009).

Multitask learning has been attracting a great deal of attention, and extensive studies have been conducted both theoretically and experimentally (Thrun, 1996; Evgeniou & Pontil, 2004; Ando & Zhang, 2005; Zhang, 2013; Baxter, 2000). Thrun (1996) proposed the lifelong learning framework, which transfers the knowledge obtained from tasks experienced in the past to a newly given task, and it was demonstrated to improve the performance of image recognition. Baxter (2000) defined a multitask learning framework, inductive bias learning, and derived a generalization error bound. The semisupervised multitask learning method proposed by Ando and Zhang (2005) generates many auxiliary learning tasks from unlabeled data and seeks good feature mappings for target learning tasks. Among various methods of multitask learning, one of the simplest and most practical approaches is regularized multitask learning (Evgeniou & Pontil, 2004; Evgeniou, Micchelli, & Pontil, 2005), which uses a regularizer that encourages the solutions of related tasks to be close to each other. Thanks to its generic and simple formulation, regularized multitask learning has been applied to various types of learning problems, including regression and classification (Evgeniou & Pontil, 2004; Evgeniou et al., 2005). In this letter, we explore a novel application of regularized multitask learning to the problem of log-density gradient estimation (Beran, 1976; Cox, 1985; Sasaki, Hyvärinen, & Sugiyama, 2014).

The goal of log-density gradient estimation is to estimate the gradient of the logarithm of an unknown probability density function using samples following it. Log-density gradient estimation has various applications, such as clustering (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002; Sasaki et al., 2014), measuring nongaussianity (Huber, 1985), and other fundamental statistical topics (Singh, 1977).

Beran (1976) proposed a method for directly estimating gradients without going through density estimation, which we refer to as least-squares log-density gradients (LSLDG). This direct method was experimentally shown to outperform the naive one consisting of density estimation followed by log-gradient computation, and it was demonstrated to be useful in clustering (Sasaki et al., 2014).

The objective of this letter is to estimate log-density gradients more accurately in multidimensional cases, which remains a challenging topic even for LSLDG. It is important to note that since the output dimensionality of the log-density gradient is the same as its input dimensionality d, multidimensional log-density gradient estimation can be regarded as having multiple learning tasks if we regard estimation of each output dimension as a task. Based on this view, we propose in this letter to apply regularized multitask learning to LSLDG. We also provide a practically useful design of parametric models for successfully applying regularized multitask learning to log-density gradient estimation. We experimentally demonstrate that the accuracy of LSLDG can be significantly improved by the proposed multitask method in multidimensional log-density gradient estimation problems and that a mode-seeking clustering method based on the proposed method outperforms other methods.

The organization of this letter is as follows. In section 2, we formulate the problem of log-density gradient estimation and review LSLDG. Section 3 reviews the core idea of regularized multitask learning. Section 4 presents our proposed log-density gradient estimator and algorithms for computing its solution. In section 5, we experimentally demonstrate that the proposed method performs well on both artificial and benchmark data. Application to mode-seeking clustering is given in section 6. Section 7 concludes this letter with potential extensions of this work.

2  Log-Density Gradient Estimation

In this section, we formulate the problem of log-density gradient estimation and then review LSLDG.

2.1  Problem Formulation and a Naive Method

Suppose that we are given a set of samples, $\{x_i\}_{i=1}^{n}$, that are independent and identically distributed from a probability distribution with unknown density $p(x)$ on $\mathbb{R}^d$. The problem is to estimate the gradient of the logarithm of the density from $\{x_i\}_{i=1}^{n}$,
$$ \nabla_x \log p(x) = \big( \partial_1 \log p(x), \dots, \partial_d \log p(x) \big)^{\top}, $$
where $\partial_j$ denotes the partial derivative operator with respect to the $j$th coordinate $x^{(j)}$.
A naive method for estimating the log-density gradient is to first estimate the probability density, which is performed by, for example, kernel density estimation (KDE) as
$$ \widehat{p}(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} \exp\!\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right), $$
where $\sigma > 0$ denotes the gaussian bandwidth, and then to take the gradient of the logarithm of $\widehat{p}(x)$ as
$$ \nabla_x \log \widehat{p}(x) = \frac{\nabla_x \widehat{p}(x)}{\widehat{p}(x)}. $$
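For illustration, this two-step estimator can be written in a few lines of NumPy. The following minimal sketch (the function name and the fixed bandwidth are ours, not part of the letter) exploits the fact that the KDE normalization constant cancels in the log-gradient:

```python
import numpy as np

def kde_log_density_gradient(X, x_query, sigma=1.0):
    """Two-step (naive) estimator: gaussian KDE followed by the log-gradient.

    X: (n, d) training samples, x_query: (m, d) evaluation points.
    Returns an (m, d) array of estimated log-density gradients.
    """
    diff = x_query[:, None, :] - X[None, :, :]          # (m, n, d)
    sq_dist = np.sum(diff ** 2, axis=2)                 # (m, n)
    w = np.exp(-sq_dist / (2.0 * sigma ** 2))           # gaussian kernel values
    p_hat = w.sum(axis=1)                               # unnormalized KDE (constants cancel below)
    # grad p_hat(x) = -(1/sigma^2) * sum_i (x - x_i) k(x, x_i), with the same constant
    grad_p = -(diff * w[:, :, None]).sum(axis=1) / sigma ** 2
    return grad_p / p_hat[:, None]                      # grad log p_hat = grad p_hat / p_hat
```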

However, this two-step method does not work well because an accurate density estimate does not necessarily provide an accurate log-density gradient estimate. For example, Figure 1 illustrates that a worse (or better) density estimate can produce a better (or worse) gradient estimate.

Figure 1:

A comparison of two log-density gradient estimates based on density estimation. (a) One density estimate is a better estimate of the true density p than the other. (b) Nevertheless, the log-density gradient computed from the worse density estimate is a better estimate of the true log-density gradient.


To overcome this problem, LSLDG, a single-step method that directly estimates the gradient without going through density estimation, was proposed (Beran, 1976; Cox, 1985; Sasaki et al., 2014) and has been experimentally demonstrated to work well. Next, we review LSLDG.

2.2  Direct Estimation of Log-Density Gradients

The basic idea of LSLDG is to directly fit a model $g_j(x)$ to the true partial derivative $\partial_j \log p(x)$ under the squared loss,
$$ J_j(g_j) = \int \big( g_j(x) - \partial_j \log p(x) \big)^2 p(x)\,\mathrm{d}x - C_j = \int g_j(x)^2 p(x)\,\mathrm{d}x + 2 \int \partial_j g_j(x)\, p(x)\,\mathrm{d}x, $$
where $C_j = \int \big( \partial_j \log p(x) \big)^2 p(x)\,\mathrm{d}x$ is a constant that does not depend on $g_j$, and the last equality comes from integration by parts under the mild condition that $g_j(x) p(x) \to 0$ as $|x^{(j)}| \to \infty$.
Then the LSLDG score is given as an empirical approximation to the risk with the constant $C_j$ subtracted:
$$ \widehat{J}_j(g_j) = \frac{1}{n} \sum_{i=1}^{n} \Big[ g_j(x_i)^2 + 2\, \partial_j g_j(x_i) \Big]. \tag{2.1} $$
As $g_j(x)$, a linear-in-parameter model is used,
$$ g_j(x) = \sum_{i=1}^{b} \theta_{j,i}\, \psi_{j,i}(x) = \theta_j^{\top} \psi_j(x), \tag{2.2} $$
where $\theta_j = (\theta_{j,1}, \dots, \theta_{j,b})^{\top}$ is a parameter vector, $\psi_{j,i}$ is a differentiable basis function, and $b$ is the number of basis functions. By substituting equation 2.2 into 2.1 and adding an $\ell_2$-regularizer, we can analytically obtain the optimal solution as
$$ \widehat{\theta}_j = \mathop{\mathrm{argmin}}_{\theta_j \in \mathbb{R}^b} \Big[ \theta_j^{\top} G_j \theta_j + 2\, \theta_j^{\top} h_j + \lambda_j \|\theta_j\|^2 \Big] = -\big( G_j + \lambda_j I_b \big)^{-1} h_j, $$
where $\lambda_j \ge 0$ is the regularization parameter, $I_b$ is the $b \times b$ identity matrix, and
$$ G_j = \frac{1}{n} \sum_{i=1}^{n} \psi_j(x_i)\, \psi_j(x_i)^{\top}, \qquad h_j = \frac{1}{n} \sum_{i=1}^{n} \partial_j \psi_j(x_i). $$
Finally, an estimator of the log-density gradient is obtained by
$$ \widehat{g}_j(x) = \widehat{\theta}_j^{\top} \psi_j(x). $$

It has been experimentally shown that LSLDG produces much more accurate estimates of log-density gradients than the KDE-based gradient estimator and that the clustering method based on LSLDG performs well (Sasaki et al., 2014).
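To make the estimator concrete, the following is a minimal NumPy sketch of the single-task LSLDG fit, using the gaussian-kernel-derivative basis that section 4.3 later adopts. The helper names (`fit_lsldg`, `lsldg_gradient`) and the fixed hyperparameters are our own choices; in practice $\sigma$ and $\lambda$ would be selected by cross-validation as described in section 4.4.

```python
import numpy as np

def _gauss_kernel(X, C, sigma):
    """phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2)); returns an (n, b) matrix."""
    sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit_lsldg(X, C, sigma=1.0, lam=1e-3):
    """Single-task LSLDG with basis psi_{j,i}(x) = d/dx^(j) phi_i(x).

    X: (n, d) samples, C: (b, d) kernel centers.
    Returns a (d, b) array Theta whose j-th row is theta_j = -(G_j + lam I)^{-1} h_j.
    """
    n, d = X.shape
    b = C.shape[0]
    K = _gauss_kernel(X, C, sigma)                      # (n, b)
    diff = X[:, None, :] - C[None, :, :]                # (n, b, d)
    Theta = np.empty((d, b))
    for j in range(d):
        # psi_{j,i}(x) = -(x^(j) - c_i^(j)) / sigma^2 * phi_i(x)
        Psi = -diff[:, :, j] / sigma ** 2 * K           # (n, b)
        # d/dx^(j) psi_{j,i}(x) = ((x^(j) - c_i^(j))^2 / sigma^4 - 1 / sigma^2) * phi_i(x)
        dPsi = (diff[:, :, j] ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * K
        G = Psi.T @ Psi / n                             # G_j = (1/n) sum_i psi psi^T
        h = dPsi.mean(axis=0)                           # h_j = (1/n) sum_i d_j psi
        Theta[j] = -np.linalg.solve(G + lam * np.eye(b), h)
    return Theta

def lsldg_gradient(X_query, Theta, C, sigma=1.0):
    """Evaluate ghat_j(x) = theta_j^T psi_j(x) at the query points; returns (m, d)."""
    K = _gauss_kernel(X_query, C, sigma)                # (m, b)
    diff = X_query[:, None, :] - C[None, :, :]          # (m, b, d)
    Psi = -diff / sigma ** 2 * K[:, :, None]            # Psi[m, i, j] = d_j phi_i(x_m)
    return np.einsum('mbd,db->md', Psi, Theta)
```

For example, choosing the centers as a random subset of the training samples, `Theta = fit_lsldg(X, X[:100], sigma=1.0, lam=1e-3)` yields the d coefficient vectors consumed by `lsldg_gradient`.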

3  Regularized Multitask Learning

In this section, we review a multitask learning framework, regularized multitask learning (Evgeniou & Pontil, 2004; Evgeniou et al., 2005), which is powerful and widely applicable to many machine learning methods.

Consider that we have T tasks of supervised learning as follows. Task $t$ is to learn an unknown function $f_t$ from samples of input-output pairs $\{(x_i^{t}, y_i^{t})\}_{i=1}^{n_t}$, where $y_i^{t}$ is the noisy output at the input $x_i^{t}$. When $f_t$ is modeled by a parameterized function $f_t(x; \theta_t)$, learning is performed by finding the parameter $\theta_t$ that minimizes the empirical risk associated with some loss function $\ell$:
$$ \widehat{R}_t(\theta_t) = \frac{1}{n_t} \sum_{i=1}^{n_t} \ell\big( y_i^{t}, f_t(x_i^{t}; \theta_t) \big), $$
where $\ell(y, \widehat{y})$ measures the discrepancy between an observed output $y$ and a predicted output $\widehat{y}$.
In regularized multitask learning, the objective function has regularization terms that encourage every pair of parameters to be close to each other while the empirical risks $\widehat{R}_1, \dots, \widehat{R}_T$ are jointly minimized,
$$ \min_{\theta_1, \dots, \theta_T}\; \sum_{t=1}^{T} \widehat{R}_t(\theta_t) + \nu \sum_{t < t'} \gamma_{t, t'} \|\theta_t - \theta_{t'}\|^2, $$
where $\nu \ge 0$ is the regularization parameter and $\gamma_{t, t'} \ge 0$ are the similarity parameters between tasks $t$ and $t'$.

It was experimentally demonstrated that the multitask support vector regression (Evgeniou & Pontil, 2004; Evgeniou et al., 2005) performs better than the single-task counterpart (Vapnik, Golowich, & Smola, 1997), especially when the tasks are highly related to each other.
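As a concrete toy instance of this framework (not taken from the letter), the sketch below solves T coupled ridge-regression tasks under the squared loss. Because the joint objective is quadratic in the stacked parameters, the minimizer is available in closed form, with the pairwise penalty entering through the graph Laplacian of the similarity matrix:

```python
import numpy as np

def multitask_ridge(Xs, ys, lam=1e-2, nu=1.0, Gamma=None):
    """Regularized multitask learning for linear regression with squared loss.

    Xs, ys: lists of (n_t, p) design matrices and (n_t,) targets, one per task.
    Minimizes  sum_t ||y_t - X_t w_t||^2 / n_t + lam * sum_t ||w_t||^2
               + nu * sum_{t<t'} Gamma[t, t'] * ||w_t - w_t'||^2.
    Returns a (T, p) array of task parameters.
    """
    T, p = len(Xs), Xs[0].shape[1]
    if Gamma is None:
        Gamma = np.ones((T, T)) - np.eye(T)     # all tasks equally related
    # The pairwise penalty is w^T (nu * L kron I_p) w, with L the graph Laplacian.
    L = np.diag(Gamma.sum(axis=1)) - Gamma
    A = np.zeros((T * p, T * p))
    c = np.zeros(T * p)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        n_t = X.shape[0]
        block = slice(t * p, (t + 1) * p)
        A[block, block] += X.T @ X / n_t + lam * np.eye(p)
        c[block] += X.T @ y / n_t
    A += nu * np.kron(L, np.eye(p))             # coupling across tasks
    return np.linalg.solve(A, c).reshape(T, p)
```

With `nu = 0` the tasks decouple into independent ridge regressions, and for large `nu` all task parameters are pushed toward a common vector; the same two limits reappear for MT-LSLDG in section 5.1.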

4  Proposed Method

In this section, we present our proposed method and algorithms.

4.1  Basic Idea

Our goal in this letter is to improve the performance of LSLDG in multidimensional cases. For multidimensional input $x \in \mathbb{R}^d$, the log-density gradient has $d$ output dimensions, meaning that its estimation consists of multiple learning tasks. Our basic idea is to apply regularized multitask learning to solve these tasks simultaneously instead of learning them independently.

This idea is supported by the fact that the target functions of these tasks, $\partial_1 \log p(x), \dots, \partial_d \log p(x)$, are all derived from the same log-density $\log p(x)$, and thus they must be strongly related to each other. Under such strong relatedness, learning them jointly while sharing information across tasks would improve estimation accuracy, as has been observed in other multitask learning work.

4.2  Regularized Multitask Learning for Least-Squares Log-Density Gradients

Here, we propose the regularized multitask learning for least-squares log-density gradients (MT-LSLDG) method, which is obtained by applying regularized multitask learning to LSLDG. Specifically, we consider the problem of minimizing the following objective function,
$$ J(\theta_1, \dots, \theta_d) = \sum_{j=1}^{d} \Big[ \theta_j^{\top} G_j \theta_j + 2\, \theta_j^{\top} h_j + \lambda_j \|\theta_j\|^2 \Big] + \nu \sum_{j < j'} \gamma_{j, j'} \|\theta_j - \theta_{j'}\|^2, \tag{4.1} $$
where $\nu \ge 0$ and $\gamma_{j, j'} \ge 0$ are the multitask regularization and similarity parameters, and the last term is the multitask regularizer that encourages the parameters $\theta_1, \dots, \theta_d$ to be close to each other.
Denoting the minimizers of equation 4.1 by $\widehat{\theta}_1, \dots, \widehat{\theta}_d$, the estimator is given by, for $j = 1, \dots, d$,
$$ \widehat{g}_j(x) = \widehat{\theta}_j^{\top} \psi_j(x). \tag{4.2} $$

4.3  Design of the Basis Functions

The design of the basis functions $\psi_j$ in MT-LSLDG is crucial to realizing the advantage of regularized multitask learning. A simple design would be to use a common vector of basis functions for all dimensions, that is, $\psi_1 = \cdots = \psi_d = \phi$. From equations 4.1 and 4.2, in this design, the multitask regularizer promotes $\widehat{\theta}_1, \dots, \widehat{\theta}_d$ to be closer to each other so that
$$ \widehat{g}_1(x) \approx \widehat{g}_2(x) \approx \cdots \approx \widehat{g}_d(x). $$
However, it is inappropriate for all $\widehat{g}_j$ to be similar because the different true partial derivatives, say, $\partial_j \log p(x)$ and $\partial_{j'} \log p(x)$ for $j \neq j'$, show different profiles in general.
To avoid this problem, we propose using the partial derivatives of a common function vector $\phi = (\phi_1, \dots, \phi_b)^{\top}$ as basis functions:
$$ \psi_{j,i}(x) = \partial_j \phi_i(x), \qquad i = 1, \dots, b. $$
For this basis function design, it holds that
$$ \widehat{g}_j(x) = \widehat{\theta}_j^{\top} \partial_j \phi(x) = \partial_j \big( \widehat{\theta}_j^{\top} \phi(x) \big). $$
Thus, $\widehat{\theta}_1^{\top} \phi(x), \dots, \widehat{\theta}_d^{\top} \phi(x)$ are approximations of the true log-density $\log p(x)$ (up to constants). Since the multitask regularizer encourages these log-density approximations to be more similar, this basis design would be reasonable.
As a specific choice of $\phi_i(x)$, we use a gaussian kernel,
$$ \phi_i(x) = \exp\!\left( -\frac{\|x - c_i\|^2}{2\sigma^2} \right), $$
where $c_1, \dots, c_b$ are the centers of the kernels and $\sigma > 0$ is the gaussian bandwidth (in practice, a separate bandwidth $\sigma_j$ may be used for each output dimension $j$).
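For completeness, with this gaussian kernel the basis functions and the derivatives entering $h_j$ have the following closed forms (a routine calculation, stated here for convenience):
$$ \psi_{j,i}(x) = \partial_j \phi_i(x) = -\frac{x^{(j)} - c_i^{(j)}}{\sigma^2}\, \phi_i(x), \qquad \partial_j \psi_{j,i}(x) = \left( \frac{\big(x^{(j)} - c_i^{(j)}\big)^2}{\sigma^4} - \frac{1}{\sigma^2} \right) \phi_i(x). $$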

4.4  Hyperparameter Tuning

As in LSLDG, the hyperparameters, which are the $\ell_2$-regularization parameters $\lambda_j$, the gaussian bandwidths $\sigma_j$, and the multitask parameters $\nu$ and $\gamma_{j,j'}$, can be cross-validated in MT-LSLDG. The procedure of K-fold cross-validation is as follows. First, we randomly partition the set of training samples into K folds $F_1, \dots, F_K$. Next, for each $k = 1, \dots, K$, we estimate the log-density gradient using the samples not in $F_k$, which is denoted by $\widehat{g}_j^{(k)}$, and then calculate the LSLDG score on the samples in $F_k$:
$$ \widehat{J}_j^{(k)} = \frac{1}{|F_k|} \sum_{x \in F_k} \Big[ \widehat{g}_j^{(k)}(x)^2 + 2\, \partial_j \widehat{g}_j^{(k)}(x) \Big]. $$
We average these LSLDG scores to obtain the K-fold cross-validated LSLDG score:
$$ \mathrm{CV} = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{d} \widehat{J}_j^{(k)}. $$
Finally, we choose the hyperparameters that minimize $\mathrm{CV}$. Throughout this letter, we set K = 5.
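A minimal sketch of this selection loop is given below. The callables `fit` and `score` are assumed to wrap an LSLDG-type estimator and the held-out LSLDG score above; the function name and grid handling are ours.

```python
import numpy as np

def cv_lsldg_score(X, fit, score, candidates, K=5, seed=0):
    """K-fold cross-validation over a list of hyperparameter settings.

    `fit(X_train, params)` returns a fitted gradient model, and
    `score(model, X_heldout)` returns the held-out LSLDG score
    (1/|F_k|) sum_x sum_j [ g_j(x)^2 + 2 d_j g_j(x) ].  Smaller is better.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    best_params, best_cv = None, np.inf
    for params in candidates:
        cv = 0.0
        for k in range(K):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[m] for m in range(K) if m != k])
            model = fit(X[train_idx], params)
            cv += score(model, X[test_idx]) / K
        if cv < best_cv:
            best_params, best_cv = params, cv
    return best_params, best_cv
```

Here `candidates` could, for example, enumerate a grid of $(\sigma, \lambda, \nu)$ triples such as those used in section 5.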

When the number of similarity parameters is large (i.e., the data dimensionality is high), cross-validating all of them may be computationally inefficient. In this case, we can use a heuristic procedure that alternately updates the estimates $\widehat{\theta}_1, \dots, \widehat{\theta}_d$ and the similarity parameters $\gamma_{j,j'}$, as described in algorithm 1, where the scale of the update is a hyperparameter to be selected by cross-validation. In the update formula, equation 4.3, the procedure determines the similarity parameters to be used in the next iteration depending on how close the estimated parameters $\widehat{\theta}_j$ and $\widehat{\theta}_{j'}$ are.

Algorithm 1: Similarity parameter tuning for MT-LSLDG. Starting from an initial setting of the similarity parameters, alternate between (i) estimating $\widehat{\theta}_1, \dots, \widehat{\theta}_d$ with the current similarity parameters fixed and (ii) updating each $\gamma_{j,j'}$ from the current estimates by equation 4.3, until convergence.

4.5  Optimization Algorithms in MT-LSLDG

Here, we develop two algorithms for minimizing equation 4.1. One algorithm is to directly evaluate the analytic solution, and the other is an iterative method based on block coordinate descent (Warga, 1963).

4.5.1  Analytic Solution

For simplicity, we assume that the similarity parameters are symmetric: $\gamma_{j,j'} = \gamma_{j',j}$. Then the objective function, equation 4.1, can be expressed as a quadratic function of the stacked parameter vector $\theta = (\theta_1^{\top}, \dots, \theta_d^{\top})^{\top}$ as
$$ J(\theta) = \theta^{\top} A\, \theta + 2\, \theta^{\top} h, $$
where
$$ A = \mathrm{blockdiag}\big( G_1 + \lambda_1 I_b, \dots, G_d + \lambda_d I_b \big) + \nu\, L \otimes I_b, \qquad h = (h_1^{\top}, \dots, h_d^{\top})^{\top}, $$
$L$ is the $d \times d$ matrix with entries $L_{j,j'} = \delta_{j,j'} \sum_{j''} \gamma_{j,j''} - \gamma_{j,j'}$ (the graph Laplacian of the similarity parameters), $\mathrm{blockdiag}(\cdot)$ is the block-diagonal matrix whose diagonal blocks are its arguments, and $\otimes$ denotes the Kronecker product. The minimizer of $J(\theta)$ is analytically computed by
$$ \widehat{\theta} = -A^{-1} h. \tag{4.4} $$
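The following sketch assembles this linear system with NumPy, taking the per-dimension matrices $G_j$ and vectors $h_j$ from section 2.2 as inputs (the function name is ours):

```python
import numpy as np

def mt_lsldg_analytic(G_list, h_list, lams, nu, Gamma):
    """Analytic MT-LSLDG solution in the style of equation 4.4.

    G_list: list of d (b, b) matrices G_j;  h_list: list of d (b,) vectors h_j;
    lams: length-d array of ell_2 parameters;  Gamma: symmetric (d, d) similarity
    matrix with zero diagonal;  nu: multitask regularization parameter.
    Returns a (d, b) array whose j-th row is theta_j.
    """
    d, b = len(G_list), G_list[0].shape[0]
    A = np.zeros((d * b, d * b))
    for j in range(d):
        block = slice(j * b, (j + 1) * b)
        A[block, block] = G_list[j] + lams[j] * np.eye(b)
    L = np.diag(Gamma.sum(axis=1)) - Gamma       # graph Laplacian of the similarities
    A += nu * np.kron(L, np.eye(b))              # multitask coupling term
    h = np.concatenate(h_list)
    return -np.linalg.solve(A, h).reshape(d, b)
```

For moderate d this direct solve is unproblematic; the BCD algorithm in section 4.5.2 avoids forming the $db \times db$ matrix when d is large.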

4.5.2  Block Coordinate Descent Method

Direct computation of the analytic solution, equation 4.4, involves inverting a $db \times db$ matrix. This may be not only expensive in terms of computation time but also infeasible in terms of memory space when the dimensionality d is very large.

Alternatively, we propose an algorithm based on block coordinate descent (BCD) (Warga, 1963). It is an iterative algorithm that manipulates only a relatively small matrix at each iteration. This alleviates the memory requirement and, provided the number of iterations is not too large, can also reduce the computation time.

Algorithm 2: Block coordinate descent for MT-LSLDG. Initialize $\theta_1, \dots, \theta_d$; then repeatedly sweep over $j = 1, \dots, d$, updating $\theta_j$ by equation 4.5 with the other parameter vectors fixed, until convergence.

A pseudocode of the algorithm is shown in algorithm 2. At each update, equation 4.5 in the algorithm, only one vector $\theta_j$ is optimized in closed form while the other parameters $\theta_{j'}$ ($j' \neq j$) are fixed. This update requires only the inverse of a $b \times b$ matrix, which is advantageous over evaluating the analytic solution in terms of both computation cost and memory requirement.

Another important technique to reduce the overall computation time is warm-start initialization: when the optimal value of $\nu$ is searched for by cross-validation, we may use the solution obtained for one candidate value of $\nu$ as the initial value for the next.
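A minimal sketch of the resulting BCD loop is given below (the update solves the stationarity condition of equation 4.1 with respect to a single $\theta_j$; function and argument names are ours):

```python
import numpy as np

def mt_lsldg_bcd(G_list, h_list, lams, nu, Gamma, n_iter=100, tol=1e-8):
    """Block coordinate descent for MT-LSLDG.

    Each sweep solves d small (b x b) linear systems instead of one (db x db)
    system, which is the memory advantage discussed in the text.
    """
    d, b = len(G_list), G_list[0].shape[0]
    Theta = np.zeros((d, b))                      # or warm-start from a previous nu
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(d):
            s_j = Gamma[j].sum() - Gamma[j, j]    # total similarity to the other tasks
            coupling = nu * (Gamma[j] @ Theta - Gamma[j, j] * Theta[j])
            lhs = G_list[j] + (lams[j] + nu * s_j) * np.eye(b)
            new_theta = np.linalg.solve(lhs, coupling - h_list[j])
            max_change = max(max_change, np.max(np.abs(new_theta - Theta[j])))
            Theta[j] = new_theta
        if max_change < tol:
            break
    return Theta
```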

5  Experiments on Log-Density Gradient Estimation

In this section, we illustrate the behavior of the proposed method and experimentally investigate its performance.

5.1  Experimental Setting

In each experiment, training samples and test samples are drawn independently from an unknown density $p(x)$. We estimate $\nabla_x \log p(x)$ from the training samples and then evaluate the estimation performance by the test score,
$$ \mathrm{TestScore} = \sum_{j=1}^{d} \frac{1}{n_{\mathrm{te}}} \sum_{i=1}^{n_{\mathrm{te}}} \Big[ \widehat{g}_j(\widetilde{x}_i)^2 + 2\, \partial_j \widehat{g}_j(\widetilde{x}_i) \Big], $$
where $\widehat{g}_j$ is an estimated log-density gradient and $\widetilde{x}_1, \dots, \widetilde{x}_{n_{\mathrm{te}}}$ are the test samples. This score is an empirical approximation of the expected squared loss of $\widehat{g}_j$ over the test samples without the constant $C_j$ (see section 2.2), and a smaller score means a better estimate.

We compare the following three methods:

  • The multitask LSLDG (MT-LSLDG)—our method proposed in section 4.

  • The single-task LSLDG (S-LSLDG)—the existing method (Beran, 1976; Cox, 1985) reviewed in section 2.2. This method agrees with MT-LSLDG at $\nu = 0$.

  • The common-parameter LSLDG (C-LSLDG)—LSLDG with a common parameter vector $\theta_1 = \cdots = \theta_d = \theta$ learned for all dimensions simultaneously. The solution is given as
    $$ \widehat{\theta} = -\Big( \sum_{j=1}^{d} G_j + \lambda I_b \Big)^{-1} \sum_{j=1}^{d} h_j, $$
    where $\lambda \ge 0$ is the $\ell_2$-regularization parameter. This method agrees with MT-LSLDG in the limit $\nu \to \infty$.

In all the methods, we fix the number of basis functions b and randomly choose the kernel centers $c_1, \dots, c_b$ uniformly from the training samples without replacement. For the hyperparameters, we use a common $\ell_2$-regularization parameter and a common bandwidth for all dimensions: $\lambda_1 = \cdots = \lambda_d = \lambda$ and $\sigma_1 = \cdots = \sigma_d = \sigma$. We also set all the similarity parameters to $\gamma_{j,j'} = 1$, which assumes that all dimensions are equally related to each other. In order to examine whether this assumption is reasonable, we experimentally compare MT-LSLDG under this assumption with MT-LSLDG with similarity parameter tuning (see algorithm 1) in section 5.2.

5.2  Artificial Data

We conduct numerical experiments on artificial data to investigate the basic behavior of MT-LSLDG. As the data density $p(x)$, we consider the following two cases:

  • Single gaussian: The d-dimensional gaussian density whose mean is 0 and whose covariance matrix is the diagonal matrix in which the first half of the diagonal elements are 1 and the others are 5.

  • Double gaussian: A mixture of two d-dimensional gaussian densities with different means, identity covariance matrices, and fixed mixing coefficients.

The dimensionality d and sample size n are specified later. First, we investigate whether MT-LSLDG improves the estimation accuracy of LSLDG at appropriate values of $\nu$. We prepare data sets with different dimensionalities d and sample sizes n and apply MT-LSLDG to each data set over a range of $\nu$ values. The gaussian bandwidth $\sigma$ and the $\ell_2$-regularization parameter $\lambda$ are chosen by five-fold cross-validation, as described in section 4.4, from fixed candidate lists. The solution of MT-LSLDG is computed analytically as in equation 4.4.

The results are plotted in Figure 2. In the figure, the relative test score is defined as the test score minus that of S-LSLDG, and thus negative relative scores indicate that MT-LSLDG improved the performance of S-LSLDG. For the lowest-dimensional data, MT-LSLDG does not improve the performance at any value of $\nu$ (see Figures 2a and 2b). However, for higher-dimensional data, the performance is improved at appropriate values of $\nu$ (see Figures 2a and 2b). Similar improvement is also observed for smaller sample sizes (e.g., n = 30) in Figures 2c and 2d.

Figure 2:

Averages (and standard errors) of relative test scores over 100 runs. The relative test scores refer to test scores from which the test score of S-LSLDG is subtracted. The black dotted lines indicate the relative score zero.


These results confirm that MT-LSLDG improves the performance of S-LSLDG at an appropriate value of $\nu$ when the data are relatively high-dimensional and the sample size is small. Since such a $\nu$ is usually unknown in advance, we need to find a reasonable value in practice.

Next, we investigate whether an appropriate value of $\nu$ can be chosen by cross-validation. In this experiment, the cross-validation procedure in section 4.4 is performed to choose $\nu$ as well, from a fixed candidate list. The other experimental settings, such as the data generation and the LSLDG variants, are the same as in the previous experiment, except that we also run MT-LSLDG with similarity parameter tuning by algorithm 1, whose hyperparameter is also cross-validated from a fixed candidate list.

Table 1 shows that MT-LSLDG improves the performance especially when the dimensionality of the data is relatively high and the sample size is small. These results indicate that the proposed cross-validation procedure allows us to choose a reasonable value of $\nu$. In most cases, MT-LSLDG with $\gamma_{j,j'} = 1$ gives estimation accuracy comparable to that with the similarity parameter tuning procedure. In the remaining experiments, we use cross-validated $\nu$ and fixed $\gamma_{j,j'} = 1$.

Table 1:
Averages and Standard Errors of Test Scores on the Artificial Data with Cross-Validation over 100 Runs.
Density            n    MT-LSLDG       MT-LSLDG-T     S-LSLDG        C-LSLDG
Single gaussian    10   2.87 (0.22)    −2.33 (0.37)   0.37 (0.31)    2.58 (0.07)
                   30   5.34 (0.04)    5.38 (0.02)    −4.97 (0.08)   −3.29 (0.03)
                   50   5.63 (0.02)    5.64 (0.01)    −5.55 (0.02)   −4.13 (0.04)
Double gaussian    10   6.83 (0.14)    6.80 (0.17)    1.01 (0.54)    −5.02 (0.12)
                   30   8.45 (0.03)    8.45 (0.07)    −7.63 (0.10)   −7.84 (0.04)
                   50   −8.67 (0.02)   8.71 (0.03)    −8.29 (0.10)   −8.48 (0.02)

Density            d    MT-LSLDG       MT-LSLDG-T     S-LSLDG        C-LSLDG
Single gaussian         (0.19)         0.21 (0.15)    0.11 (0.15)    (0.15)
                   10   5.34 (0.04)    5.38 (0.02)    −4.97 (0.08)   −3.29 (0.03)
                   20   10.77 (0.03)   10.76 (0.04)   −9.98 (0.13)   −6.39 (0.01)
Double gaussian         (0.22)         (0.25)         (0.27)         (0.22)
                   10   8.45 (0.03)    8.45 (0.07)    −7.63 (0.10)   −7.84 (0.04)
                   20   16.88 (0.14)   17.06 (0.125)  −14.90 (0.10)  −15.26 (0.06)

Notes: MT-LSLDG-T in the table refers to MT-LSLDG with similarity parameter tuning by algorithm 1. In each row, the best and comparable-to-the-best scores in terms of paired t-test with significance level 5% are in bold.

5.3  Benchmark Data

In this section, we demonstrate the usefulness of MT-LSLDG in gradient estimation on various benchmark data sets. This experiment uses some IDA benchmark data sets (Rätsch, Onoda, & Müller, 2001) and UCI benchmark data sets (Catlett, 1991; Kaya, Tüfekci, & Gürgen, 2012; Lucas et al., 2013; Tüfekci, 2014; Lichman, 2013). All the data sets are standardized in advance.

For MT-LSLDG, the hyperparameters $\sigma$, $\lambda$, and $\nu$ are chosen by cross-validation from fixed candidate lists. For S-LSLDG and C-LSLDG, the candidate lists of $\sigma$ and $\lambda$ are the same as for MT-LSLDG. The solution of MT-LSLDG is computed by the BCD algorithm described in section 4.5.2.

The results are presented in Table 2. MT-LSLDG significantly improves the performance of either S-LSLDG or C-LSLDG on most of the data sets.

Table 2:
Averages and Standard Errors of Test Scores on the Benchmark Data Sets.
Data Set (d, n)           MT-LSLDG    S-LSLDG    C-LSLDG
Thyroid (5, 140)
CCPP (5, 200)
Diabetes (8, 468)
Flare-solar (9, 666)
Breast-cancer (9, 200)
Shuttle (9, 1000)
Image (18, 1300)
Popfailures (18, 50)
German
Twonorm (20, 400)
Waveform (21, 400)
Splice (60, 1000)

Notes: In each data set, the best and comparable-to-the-best scores in terms of the paired t-test with significance level 5% are in bold. The number of trials is 20 for the image and splice data sets and 100 for the other data sets.

We also run MT-LSLDG for three different sample sizes n on each of the data sets, with $\nu$ varied from 0 up to the largest candidate value. Figure 3 summarizes the results of the experiments. The plots show that the performance depends strongly on $\nu$ and that the best $\nu$ differs from one data set to another, which means that it is very important to select a good $\nu$ by cross-validation. We can also see that the blue plot, corresponding to the smallest n, shows the largest improvement over S-LSLDG on most of the data sets. This implies that MT-LSLDG is especially advantageous when the sample size is small.

Figure 3:

Averages and standard errors of relative test scores on the IDA data sets and the other real data sets. The relative test scores refer to test scores from which the test score of S-LSLDG is subtracted. The black dotted lines indicate the relative score of zero.


6  Application to Mode-Seeking Clustering

In this section, we apply MT-LSLDG to mode-seeking clustering and experimentally demonstrate its usefulness.

6.1  Mode-Seeking Clustering

A practical application of log-density gradient estimation is mode-seeking clustering (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002; Sasaki et al., 2014). Mode-seeking clustering methods update each data point toward a nearby mode by gradient ascent and assign the same clustering label to the data points that converged to the same mode (see Figure 4). Their notable advantage is that we need not specify the number of clusters in advance. Mode-seeking clustering has been successfully applied to a variety of real-world problems such as object tracking (Comaniciu, Ramesh, & Meer, 2000), image segmentation (Comaniciu & Meer, 2002; Sasaki et al., 2014), and line edge detection in images (Bandera, Pérez-Lorenzo, Bandera, & Sandoval, 2006).

Figure 4:

Transition of data points during a mode-seeking process. Data samples are drawn from a mixture of gaussians, and the data points sampled from the same gaussian component are specified by the same color (red, green, or blue) and marker (plus symbol, circle, or triangle). White squares indicate the points to which data points converged.


In mode seeking, the essential ingredient is the gradient of the data density. To estimate the gradients, mean shift clustering (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002), one of the most popular mode-seeking clustering methods, employs the two-step method of first estimating the data density by kernel density estimation and then taking its gradient. However, as we mentioned earlier, this two-step method does not work well since accurately estimating the density does not necessarily lead to an accurate estimate of the gradient.

In order to overcome this problem, LSLDG clustering (Sasaki et al., 2014) adopts LSLDG instead of the two-step method. Sasaki et al. (2014) also provided a practically useful fixed-point algorithm for mode seeking, as in mean shift clustering (Cheng, 1995): when the partial derivatives of gaussian kernels are used as the basis functions, the model can be transformed as
$$ \widehat{g}_j(x) = \sum_{i=1}^{b} \widehat{\theta}_{j,i}\, \partial_j \phi_i(x) = \frac{1}{\sigma^2} \left( \sum_{i=1}^{b} \widehat{\theta}_{j,i}\, c_i^{(j)} \phi_i(x) - x^{(j)} \sum_{i=1}^{b} \widehat{\theta}_{j,i}\, \phi_i(x) \right), $$
where we assume that $\sum_{i=1}^{b} \widehat{\theta}_{j,i}\, \phi_i(x)$ is nonzero. Setting $\widehat{g}_j(x)$ to zero yields a fixed-point update formula as
$$ x^{(j)} \longleftarrow \frac{\sum_{i=1}^{b} \widehat{\theta}_{j,i}\, c_i^{(j)} \phi_i(x)}{\sum_{i=1}^{b} \widehat{\theta}_{j,i}\, \phi_i(x)}. $$
It has been experimentally shown that LSLDG clustering performs significantly better than mean-shift clustering (Sasaki et al., 2014).
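A minimal sketch of the resulting mode-seeking loop, assuming a fitted coefficient matrix `Theta`, kernel centers `C`, and bandwidth `sigma` as in the earlier sketches (the convergence tolerance and the rounding-based label assignment are our own simplifications), is:

```python
import numpy as np

def lsldg_mode_seek(X, Theta, C, sigma, n_iter=200, tol=1e-6):
    """Fixed-point mode seeking with an LSLDG/MT-LSLDG model.

    X: (m, d) points to move; Theta: (d, b) fitted coefficients; C: (b, d) centers.
    Each coordinate is updated by
      x^(j) <- sum_i theta_{j,i} c_i^(j) phi_i(x) / sum_i theta_{j,i} phi_i(x),
    assuming the denominator is nonzero, and points that converge to the same
    location receive the same cluster label.
    """
    Z = X.copy()
    for _ in range(n_iter):
        sq = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-sq / (2.0 * sigma ** 2))              # (m, b) kernel values
        num = np.einsum('mb,db,bd->md', K, Theta, C)      # sum_i theta_{j,i} c_i^(j) phi_i(x)
        den = K @ Theta.T                                 # (m, d): sum_i theta_{j,i} phi_i(x)
        Z_new = num / den
        if np.max(np.abs(Z_new - Z)) < tol:
            Z = Z_new
            break
        Z = Z_new
    # points whose fixed points coincide (up to rounding) get the same label
    _, labels = np.unique(np.round(Z, 3), axis=0, return_inverse=True)
    return labels, Z
```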

Here, we apply MT-LSLDG to LSLDG clustering and investigate whether it improves the mode-seeking clustering performance, in particular for relatively high-dimensional data.

6.2  Experiments

Next, we conduct numerical experiments for mode-seeking clustering.

6.2.1  Experimental Setting

We apply the following four clustering methods to various data sets:

  • MT-LSLDGC: LSLDG clustering with MT-LSLDG

  • S-LSLDGC: LSLDG clustering with S-LSLDG (Sasaki et al., 2014)

  • C-LSLDGC: LSLDG clustering with C-LSLDG (Sasaki et al., 2014)

  • Mean-shift: mean shift clustering (Comaniciu & Meer, 2002)

For MT-, S-, and C-LSLDG, all the hyperparameters are cross-validated as described in section 4.4, and for mean-shift, log-likelihood cross-validation is used.

We evaluate the clustering performance by the adjusted Rand index (ARI) (Hubert & Arabie, 1985). ARI gives one to the perfect clustering assignment and zero on average to a random clustering assignment. A larger ARI value means a better clustering result.
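For reference, ARI is available off the shelf, for example, in scikit-learn (which the letter does not mention); a minimal check of the two properties just stated:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
# ARI is invariant to label permutation: a perfect assignment scores 1.0.
print(adjusted_rand_score(true_labels, [1, 1, 0, 0, 2, 2]))   # 1.0
# Assignments unrelated to the true labels score around 0 (possibly negative).
print(adjusted_rand_score(true_labels, [0, 1, 0, 1, 0, 1]))   # <= 0
```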

6.2.2  Artificial Data

First, we conduct experiments on artificial data. The density of the artificial data is a mixture of three d-dimensional gaussian densities with mixing coefficients 0.4, 0.3, and 0.3. The hyperparameters $\sigma$, $\lambda$, and $\nu$ are chosen by cross-validation from fixed candidate lists.

The results are shown in Table 3. We can see that MT-LSLDGC performs well, especially for the largest dimensionality, d = 20.

Table 3:
Averages and Standard Errors of ARIs on Artificial Data.
d     MT-LSLDGC      S-LSLDGC       C-LSLDGC       Mean-Shift
      (0.035)        (0.125)        (0.036)        (0.044)
10    (0.004)        (0.003)        (0.004)        0.042 (0.022)
15    (0.023)        (0.054)        0.877 (0.217)  0.000 (0.000)
20    (0.190)        0.586 (0.208)  0.716 (0.352)  0.036 (0.037)

Notes: In each row, the best and comparable to the best ARI in terms of unpaired t-test with significance level 5% is emphasized in bold. The number of trials is 100.

6.2.3  Real Data

Next, we perform clustering on real data. The following four data sets are used:

  • Accelerometry data: Five-dimensional data used in Hachiya, Sugiyama, and Ueda (2012) for human activity recognition, extracted from mobile sensing data available from http://alkan.mns.kyutech.ac.jp/web/data. The number of classes is 3. In each run of the experiment, we use 100 samples randomly chosen from each class, for a total of 300 samples.

  • Vowel data: Ten-dimensional data of recorded British English vowel sounds available from https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Vowel+Recognition+-+Deterding+Data). The number of classes is 11. In each run of the experiment, we use 500 randomly chosen samples.

  • Sat-image data: Thirty-six-dimensional multispectral satellite image data available from https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite). The number of classes is 6. In each run of the experiment, we use 2000 randomly chosen samples.

  • Speech data: Fifty-dimensional voice data from two French speakers (Sugiyama, Niu, Yamada, Kimura, & Hachiya, 2014). The number of classes is 2. In each run of the experiment, we use 200 samples randomly chosen from each class, for a total of 400 samples.

For MT-LSLDG, the hyperparameters are cross-validated from fixed candidate lists of $\sigma$, $\lambda$, and $\nu$, except that we use relatively small candidate lists for the speech data because its dimensionality is large and optimization is computationally expensive. For S- and C-LSLDG, we use the same candidates as MT-LSLDG for $\sigma$ and $\lambda$. For mean shift clustering, the gaussian kernel is employed in KDE, and its bandwidth is selected by five-fold cross-validation with respect to the log-likelihood of the density estimate, using the same candidates as MT-LSLDG for $\sigma$.

The results are shown in Table 4. For the accelerometry data, whose dimensionality is only five, S-LSLDGC gives the best performance. MT-LSLDGC does not improve on it, although it performs better than C-LSLDGC.

Table 4:
Averages and Standard Errors of ARIs on Real Data.
Data Set (d, n)           MT-LSLDGC     S-LSLDGC      C-LSLDGC      Mean-Shift
Accelerometry (5, 300)    0.40 (0.01)   (0.02)        0.24 (0.01)   0.26 (0.04)
Vowel (10, 500)           (0.00)        (0.00)        (0.00)        0.04 (0.00)
Sat-image (36, 2000)      (0.00)        0.43 (0.01)   0.35 (0.00)   0.00 (0.00)
Speech (50, 400)          (0.02)        0.00 (0.00)   0.15 (0.01)   0.00 (0.00)

Notes: In each row, the best and comparable-to-the-best ARI in terms of paired t-test with significance level 5% is in bold. The number of trials is 100 for the accelerometry data and the sat-image data and 20 for the speech data.

For the higher-dimensional data sets, the vowel data, the sat-image data, and the speech data, the performance of MT-LSLDGC is the best or comparable to the best. These results indicate that MT-LSLDG is a promising method for mode-seeking clustering, especially when the dimensionality of the data is relatively large.

7  Conclusion

We proposed a multitask log-density gradient estimator in order to improve the existing estimator in higher-dimensional cases. Our fundamental idea is to exploit the relatedness inherent in the partial derivatives through regularized multitask learning. As a result, we experimentally confirmed that our method significantly improves the accuracy of log-density gradient estimation. Finally, we demonstrated the usefulness of the proposed log-density gradient estimator in mode-seeking clustering.

Although fixing the similarity parameters to be 1 worked reasonably well in our experiments, carefully tuning them may improve the estimation accuracy. A good practice may be to use the heuristic procedure given in algorithm 1, whose properties have yet to be analyzed. Another way is to use Bayesian optimization techniques such as the gaussian process approaches studied in Bergstra, Bardenet, Bengio, and Kégl (2011) and Snoek, Larochelle, and Adams (2012), which have been experimentally shown to be reasonably fast even in large-scale hyperparameter tuning tasks.

Since log-density gradient estimation is a vector-valued function learning problem, one may consider applying kernel-based methods for such problems (Micchelli & Pontil, 2005; Caponnetto, Micchelli, Pontil, & Ying, 2008). Micchelli and Pontil (2005) showed a representer theorem for a wide class of optimization problems in a reproducing kernel Hilbert space of vector-valued functions whose objective functional depends only on the outputs of the function to be optimized. However, it may not be possible to directly employ those methods in the LSLDG framework, since the LSLDG objective functional also depends on the gradients in addition to the outputs of the function. Whether LSLDG admits a similar representer theorem is an important open question.

Log-density gradient estimation would also be useful for measuring nongaussianity (Huber, 1985) and for other fundamental statistical topics (Singh, 1977). In future work, we will investigate the performance of our proposed method on these topics.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. I.Y. was supported by KAKENHI 26280054, H.S. was supported by KAKENHI 15H06103, and M.S. was supported by KAKENHI 25700022.

References

Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.
Bandera, A., Pérez-Lorenzo, J. M., Bandera, J., & Sandoval, F. (2006). Mean shift based clustering of Hough domain for fast line segment detection. Pattern Recognition Letters, 27(6), 578–586.
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149–198.
Beran, R. (1976). Adaptive estimates for autoregressive processes. Annals of the Institute of Statistical Mathematics, 28(1), 77–89.
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 21 (pp. 2546–2554). Red Hook, NY: Curran.
Bickel, S., Bogojeska, J., Lengauer, T., & Scheffer, T. (2008). Multi-task learning for HIV therapy screening. In Proceedings of the 25th International Conference on Machine Learning (pp. 56–63). New York: ACM.
Caponnetto, A., Micchelli, C. A., Pontil, M., & Ying, Y. (2008). Universal multi-task kernels. Journal of Machine Learning Research, 9, 1615–1646.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
Catlett, J. (1991). Inductive learning from subsets or disposal of excess training data considered harmful. Sydney: Basser Department of Computer Science, University of Sydney.
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790–799.
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619.
Comaniciu, D., Ramesh, V., & Meer, P. (2000). Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on Computer Vision and Pattern Recognition 2000 (vol. 2, pp. 142–149). Washington, DC: IEEE.
Cox, D. D. (1985). A penalty method for nonparametric estimation of the logarithmic derivative of a density function. Annals of the Institute of Statistical Mathematics, 37(1), 271–288.
Evgeniou, T., Micchelli, C. A., & Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6, 615–637.
Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 109–117). New York: ACM.
Fukunaga, K., & Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1), 32–40.
Hachiya, H., Sugiyama, M., & Ueda, N. (2012). Importance-weighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition. Neurocomputing, 80, 93–101.
Huber, P. J. (1985). Projection pursuit. Annals of Statistics, 13(2), 435–475.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Kaya, H., Tüfekci, P., & Gürgen, F. S. (2012). Local and global learning methods for predicting power of a combined gas and steam turbine. In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering. N.p.: Planetary Scientific Research Center.
Lichman, M. (2013). UCI machine learning repository. Irvine: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml
Lucas, D. D., Klein, R., Tannahill, J., Ivanova, D., Brandon, S., Domyancic, D., & Zhang, Y. (2013). Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development, 6(4), 1157–1171.
Micchelli, C. A., & Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1), 177–204.
Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for Adaboost. Machine Learning, 42(3), 287–320.
Sasaki, H., Hyvärinen, A., & Sugiyama, M. (2014). Clustering via mode seeking by direct estimation of the gradient of a log-density. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (pp. 19–34). New York: Springer.
Singh, R. S. (1977). Applications of estimators of a density and its derivatives to certain statistical problems. Journal of the Royal Statistical Society, Series B, 39(3), 357–363.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25 (pp. 2951–2959). Red Hook, NY: Curran.
Sugiyama, M., Niu, G., Yamada, M., Kimura, M., & Hachiya, H. (2014). Information-maximization clustering based on squared-loss mutual information. Neural Computation, 26(1), 84–131.
Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In D. S. Touretzky & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 640–646). Cambridge, MA: MIT Press.
Tüfekci, P. (2014). Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power and Energy Systems, 60, 126–140.
Vapnik, V., Golowich, S. E., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. In M. J. Jordan & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 281–287). Cambridge, MA: MIT Press.
Wang, X., Zhang, C., & Zhang, Z. (2009). Boosted multi-task learning for face verification with applications to web image and video search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009 (pp. 142–149). Washington, DC: IEEE.
Warga, J. (1963). Minimizing certain convex functions. Journal of the Society for Industrial and Applied Mathematics, 11(3), 588–593.
Zhang, Y. (2013). Heterogeneous-neighborhood-based multi-task local learning algorithms. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 1896–1904). Red Hook, NY: Curran.