## Abstract

Log-density gradient estimation is a fundamental statistical problem with various practical applications, such as clustering and measuring nongaussianity. A naive two-step approach of first estimating the density and then taking its log gradient is unreliable because an accurate density estimate does not necessarily lead to an accurate log-density gradient estimate. To cope with this problem, a method to directly estimate the log-density gradient without density estimation has been explored and demonstrated to work much better than the two-step method. The objective of this letter is to improve the performance of this direct method in multidimensional cases. Our idea is to regard the problem of log-density gradient estimation in each dimension as a task and apply regularized multitask learning to the direct log-density gradient estimator. We experimentally demonstrate the usefulness of the proposed multitask method in log-density gradient estimation and mode-seeking clustering.

## 1 Introduction

Multitask learning is a paradigm of machine learning for solving multiple related learning tasks simultaneously with the expectation that information brought by other related tasks can be mutually exploited to improve accuracy (Caruana, 1997). Multitask learning is particularly useful when one has many related learning tasks to solve but only a few training samples are available for each task, which is often the case in many real-world problems such as therapy screening (Bickel, Bogojeska, Lengauer, & Scheffer, 2008) and face verification (Wang, Zhang, & Zhang, 2009).

Multitask learning has been attracting a great deal of attention, and extensive studies have been conducted both theoretically and experimentally (Thrun, 1996; Evgeniou & Pontil, 2004; Ando & Zhang, 2005; Zhang, 2013; Baxter, 2000). Thrun (1996) proposed the lifelong learning framework, which transfers the knowledge obtained from tasks experienced in the past to a newly given task, and it was demonstrated to improve the performance of image recognition. Baxter (2000) defined a multitask learning framework, inductive bias learning, and derived a generalization error bound. The semisupervised multitask learning method proposed by Ando and Zhang (2005) generates many auxiliary learning tasks from unlabeled data and seeks good feature mapping for target learning tasks. Among various methods of multitask learning, one of the simplest and most practical approaches is regularized multitask learning (Evgeniou & Pontil, 2004; Evgeniou, Micchelli, & Pontil, 2005), which uses a regularizer that encourages the solutions of related tasks to be close to each other. Thanks to its generic and simple formulation, regularized multitask learning has been applied to various types of learning problems, including regression and classification (Evgeniou & Pontil, 2004; Evgeniou et al., 2005). In this letter, we explore a novel application of regularized multitask learning to the problem of log-density gradient estimation (Beran, 1976; Cox, 1985; Sasaki, Hyvärinen, & Sugiyama, 2014).

The goal of log-density gradient estimation is to estimate the gradient of the logarithm of an unknown probability density function using samples following it. Log-density gradient estimation has various applications, such as clustering (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002; Sasaki et al., 2014), measuring nongaussianity (Huber, 1985), and other fundamental statistical topics (Singh, 1977).

Beran (1976) proposed a method for directly estimating gradients without going through density estimation, which we refer to as least-squares log-density gradients (LSLDG). This direct method was experimentally shown to outperform the naive one consisting of density estimation followed by log-gradient computation, and it was demonstrated to be useful in clustering (Sasaki et al., 2014).

The objective of this letter is to estimate log-density gradients more accurately in multidimensional cases, which remains challenging even for LSLDG. It is important to note that since the output dimensionality of the log-density gradient is the same as its input dimensionality *d*, multidimensional log-density gradient estimation can be regarded as consisting of multiple learning tasks if we regard estimation of each output dimension as a task. Based on this view, we propose in this letter to apply regularized multitask learning to LSLDG. We also provide a practically useful design of parametric models for successfully applying regularized multitask learning to log-density gradient estimation. We experimentally demonstrate that the accuracy of LSLDG can be significantly improved by the proposed multitask method in multidimensional log-density gradient estimation problems and that a mode-seeking clustering method based on the proposed method outperforms existing methods.

The organization of this letter is as follows. In section 2, we formulate the problem of log-density gradient estimation and review LSLDG. Section 3 reviews the core idea of regularized multitask learning. Section 4 presents our proposed log-density gradient estimator and algorithms for computing the solution. In section 5, we experimentally demonstrate that the proposed method performs well on both artificial and benchmark data. Application to mode-seeking clustering is given in section 6. Section 7 concludes this letter with potential extensions of this work.

## 2 Log-Density Gradient Estimation

In this section, we formulate the problem of log-density gradient estimation and then review LSLDG.

### 2.1 Problem Formulation and a Naive Method

Suppose that we are given a set of samples independently drawn from a distribution with unknown density $p(\boldsymbol{x})$ on $\mathbb{R}^d$. The goal is to estimate the log-density gradient $\nabla_{\boldsymbol{x}} \log p(\boldsymbol{x})$ from these samples. A naive approach is to first estimate the density, for example, by kernel density estimation (KDE), and then compute the gradient of the logarithm of the estimate. However, this two-step method does not work well because an accurate density estimate does not necessarily provide an accurate log-density gradient estimate. For example, Figure 1 illustrates that a worse (or better) density estimate can produce a better (or worse) gradient estimate.
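To make the naive approach concrete, here is a minimal sketch (our own illustration, not code from the letter) of the two-step estimate with a gaussian KDE; the function name and bandwidth default are assumptions:

```python
import numpy as np

def kde_log_density_gradient(x, X, sigma=1.0):
    """Naive two-step estimate of the log-density gradient at a point x:
    form a Gaussian KDE p_hat from samples X, then return grad p_hat / p_hat.
    The normalization constant of the KDE cancels in the ratio, so it is omitted."""
    diff = x[None, :] - X                                   # (n, d)
    K = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))   # kernel values
    p = K.mean()                                            # KDE density at x
    grad_p = (-(diff / sigma**2) * K[:, None]).mean(axis=0) # KDE gradient at x
    return grad_p / p
```

Even when the KDE itself fits the density well, the ratio amplifies its errors, especially in low-density regions, which is exactly the weakness the direct method targets.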

### 2.2 Direct Estimation of Log-Density Gradients

LSLDG fits a model $g_j(\boldsymbol{x})$ directly to the $j$th partial derivative $\partial_j \log p(\boldsymbol{x})$ under the squared loss and uses the linear-in-parameters model
$$g_j(\boldsymbol{x}) = \boldsymbol{\theta}_j^\top \boldsymbol{\psi}_j(\boldsymbol{x}), \tag{2.2}$$
where $\boldsymbol{\theta}_j \in \mathbb{R}^b$ is a parameter vector, $\boldsymbol{\psi}_j(\boldsymbol{x})$ is a vector of basis functions, and $b$ is the number of the basis functions. By substituting equation 2.2 into equation 2.1 and adding an $\ell_2$-regularizer, we can analytically obtain the optimal solution as
$$\widehat{\boldsymbol{\theta}}_j = -\big(\widehat{\boldsymbol{G}}_j + \lambda \boldsymbol{I}_b\big)^{-1} \widehat{\boldsymbol{h}}_j,$$
where $\lambda \ge 0$ is the regularization parameter, $\boldsymbol{I}_b$ is the $b \times b$ identity matrix, and
$$\widehat{\boldsymbol{G}}_j = \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{\psi}_j(\boldsymbol{x}_i)\, \boldsymbol{\psi}_j(\boldsymbol{x}_i)^\top, \qquad \widehat{\boldsymbol{h}}_j = \frac{1}{n} \sum_{i=1}^{n} \partial_j \boldsymbol{\psi}_j(\boldsymbol{x}_i).$$
Finally, an estimator of the log-density gradient is obtained by $\widehat{g}_j(\boldsymbol{x}) = \widehat{\boldsymbol{\theta}}_j^\top \boldsymbol{\psi}_j(\boldsymbol{x})$.

It has been experimentally shown that LSLDG produces much more accurate estimates of log-density gradients than the KDE-based gradient estimator and that the clustering method based on LSLDG performs well (Sasaki et al., 2014).
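A minimal numerical sketch of the LSLDG closed-form fit may be helpful; this is our own illustration with gaussian-kernel derivative basis functions centered at the samples (function names, defaults, and the choice of centers are assumptions, not the authors' implementation):

```python
import numpy as np

def lsldg_fit(X, sigma=1.0, lam=0.1):
    """LSLDG sketch: for each dimension j, fit g_j(x) = theta_j^T psi_j(x),
    where psi_j collects d/dx_j of Gaussian kernels centered at the samples.
    theta_j = -(G_j + lam I)^{-1} h_j with G_j, h_j sample averages."""
    n, d = X.shape
    C = X  # kernel centers (here: all samples, so b = n)
    diff = X[:, None, :] - C[None, :, :]                    # (n, b, d)
    K = np.exp(-np.sum(diff**2, axis=2) / (2 * sigma**2))   # (n, b)
    thetas = []
    for j in range(d):
        psi = -(diff[:, :, j] / sigma**2) * K               # basis values
        # d/dx_j of psi: the integration-by-parts term in the objective
        dpsi = (diff[:, :, j]**2 / sigma**4 - 1.0 / sigma**2) * K
        G = psi.T @ psi / n
        h = dpsi.mean(axis=0)
        thetas.append(-np.linalg.solve(G + lam * np.eye(n), h))
    return C, np.array(thetas)

def lsldg_predict(x, C, thetas, sigma=1.0):
    """Evaluate the fitted log-density gradient model at a single point x."""
    diff = x[None, :] - C                                   # (b, d)
    K = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))
    return np.array([thetas[j] @ (-(diff[:, j] / sigma**2) * K)
                     for j in range(C.shape[1])])
```

With samples from a standard normal, `lsldg_predict` should return roughly $-\boldsymbol{x}$, the true log-density gradient.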

## 3 Regularized Multitask Learning

In this section, we review a multitask learning framework, regularized multitask learning (Evgeniou & Pontil, 2004; Evgeniou et al., 2005), which is powerful and widely applicable to many machine learning methods.

Consider $T$ tasks of supervised learning as follows. The task $t$ is to learn an unknown function $f_t$ from samples of input-output pairs $\{(\boldsymbol{x}_i^{(t)}, y_i^{(t)})\}_{i=1}^{n_t}$, where $y_i^{(t)}$ is the output with noise at the input $\boldsymbol{x}_i^{(t)}$. When $f_t$ is modeled by a parameterized function $f_t(\boldsymbol{x}; \boldsymbol{w}_t)$, learning is performed by finding the parameters $\{\boldsymbol{w}_t\}_{t=1}^{T}$ that minimize the empirical risk associated with some loss function $L$, together with a regularizer that encourages the solutions of related tasks to be close to each other:
$$\min_{\boldsymbol{w}_1, \dots, \boldsymbol{w}_T} \; \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} L\big(y_i^{(t)}, f_t(\boldsymbol{x}_i^{(t)}; \boldsymbol{w}_t)\big) + R(\boldsymbol{w}_1, \dots, \boldsymbol{w}_T).$$
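As a concrete instance of this framework, the following sketch (our illustration; the function name and the specific pairwise-closeness regularizer are assumptions) solves regularized multitask least squares exactly by stacking the normal equations of all tasks:

```python
import numpy as np

def multitask_least_squares(Xs, ys, lam):
    """Regularized multitask least squares (Evgeniou & Pontil style) sketch:
    jointly minimize  sum_t ||X_t w_t - y_t||^2 / n_t
                    + lam * sum_{t<t'} ||w_t - w_t'||^2
    by solving the stacked (T*d x T*d) normal equations."""
    T, d = len(Xs), Xs[0].shape[1]
    A = np.zeros((T * d, T * d))
    b = np.zeros(T * d)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        n = X.shape[0]
        # diagonal block: task t's data term plus coupling to T-1 other tasks
        A[t*d:(t+1)*d, t*d:(t+1)*d] = X.T @ X / n + lam * (T - 1) * np.eye(d)
        b[t*d:(t+1)*d] = X.T @ y / n
        for s in range(T):
            if s != t:
                # off-diagonal block from the pairwise closeness penalty
                A[t*d:(t+1)*d, s*d:(s+1)*d] = -lam * np.eye(d)
    return np.linalg.solve(A, b).reshape(T, d)
```

Setting `lam = 0` decouples the tasks into independent least-squares fits; letting `lam` grow forces all task parameters toward a common vector.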

## 4 Proposed Method

In this section, we present our proposed method and algorithms.

### 4.1 Basic Idea

Our goal in this letter is to improve the performance of LSLDG in multidimensional cases. For multidimensional input $\boldsymbol{x} \in \mathbb{R}^d$, the log-density gradient has multiple output dimensions, meaning that its estimation consists of multiple learning tasks. Our basic idea is to apply regularized multitask learning to solve these tasks simultaneously instead of learning them independently.

This idea is supported by the fact that the target functions of these tasks, $\partial_1 \log p(\boldsymbol{x}), \dots, \partial_d \log p(\boldsymbol{x})$, are all derived from the same log-density $\log p(\boldsymbol{x})$, and thus they must be strongly related to each other. Under such strong relatedness, jointly learning the tasks while sharing information with each other would improve estimation accuracy, as has been observed in other multitask learning work.

### 4.2 Regularized Multitask Learning for Least-Squares Log-density Gradients
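To make the coupling concrete, here is a hedged sketch of a multitask objective in the spirit of regularized multitask learning (our notational reconstruction, not necessarily the letter's exact formulation; $\widehat{\boldsymbol{G}}_j$ and $\widehat{\boldsymbol{h}}_j$ denote the sample moment matrix and vector from the single-task LSLDG objective, $\lambda$ the $\ell_2$-regularization parameter, and $\gamma_{jj'}$ the similarity parameters):

```latex
\min_{\boldsymbol{\theta}_1,\dots,\boldsymbol{\theta}_d}\;
\sum_{j=1}^{d}\Big(
  \boldsymbol{\theta}_j^{\top}\widehat{\boldsymbol{G}}_j\,\boldsymbol{\theta}_j
  + 2\,\widehat{\boldsymbol{h}}_j^{\top}\boldsymbol{\theta}_j
  + \lambda\,\|\boldsymbol{\theta}_j\|^{2}\Big)
+ \sum_{j<j'}\gamma_{jj'}\,\|\boldsymbol{\theta}_j-\boldsymbol{\theta}_{j'}\|^{2}
```

The first sum collects the single-task LSLDG criteria; the second encourages the solutions of related dimensions to be close, exactly in the spirit of Evgeniou and Pontil (2004). Setting all $\gamma_{jj'}$ to zero recovers independent LSLDG fits.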

### 4.3 Design of the Basis Functions

### 4.4 Hyperparameter Tuning

$K$-fold cross-validation proceeds as follows. First, we randomly partition the set of training samples into $K$ disjoint folds $\{\mathcal{F}_k\}_{k=1}^{K}$. Next, for each $k$, we estimate the log-density gradient using the samples in all folds other than $\mathcal{F}_k$, which is denoted by $\widehat{g}^{(k)}$, and then calculate the LSLDG score for the samples in $\mathcal{F}_k$ as
$$\widehat{J}_k = \frac{1}{|\mathcal{F}_k|} \sum_{\boldsymbol{x} \in \mathcal{F}_k} \sum_{j=1}^{d} \Big( \widehat{g}_j^{(k)}(\boldsymbol{x})^2 + 2\, \partial_j \widehat{g}_j^{(k)}(\boldsymbol{x}) \Big).$$
We average these LSLDG scores to obtain the $K$-fold cross-validated LSLDG score $\frac{1}{K} \sum_{k=1}^{K} \widehat{J}_k$. Finally, we choose the hyperparameters that minimize this cross-validated score. Throughout this letter, we set $K = 5$.
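Generically, the K-fold selection loop can be sketched as follows (our illustration; `fit` and `score` are hypothetical callables standing in for the LSLDG estimator and the held-out LSLDG score):

```python
import numpy as np

def kfold_cv_score(X, fit, score, K=5, seed=0):
    """Generic K-fold cross-validated score, as used for hyperparameter
    selection: fit on K-1 folds, score the held-out fold, average."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    scores = []
    for k in range(K):
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        model = fit(X[train])          # estimate on the other K-1 folds
        scores.append(score(model, X[folds[k]]))  # score the held-out fold
    return float(np.mean(scores))
```

In practice, one evaluates this cross-validated score for each candidate hyperparameter combination and keeps the minimizer.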

When the number of similarity parameters is large (i.e., the data dimensionality is high), cross-validation may be computationally inefficient. In this case, we can use a heuristic procedure that alternately updates the parameter estimates and the similarity parameters, as described in algorithm 1, with one additional hyperparameter selected by cross-validation. In the update formula, equation 4.3, the procedure determines the similarity parameter to be used in the next iteration depending on how close the estimated parameter vectors of the corresponding pair of tasks are.

### 4.5 Optimization Algorithms in MT-LSLDG

#### 4.5.1 Analytic Solution

#### 4.5.2 Block Coordinate Descent Method

Direct computation of the analytic solution, equation 4.4, involves inverting a matrix whose size grows with the dimensionality. This may be not only expensive in terms of computation time but also infeasible in terms of memory space when the dimensionality *d* is very large.

Alternatively, we propose an algorithm based on block coordinate descent (BCD) (Warga, 1963). It is an iterative algorithm that needs only manipulation of a relatively small matrix at each iteration. This alleviates the memory size requirement and, it is hoped, reduces computation time if the number of iterations is not large.

Pseudocode of the algorithm is shown in algorithm 2. At each update, equation 4.5 in the algorithm, only one parameter vector is optimized in closed form while the other parameters are fixed. Each update requires inverting only a single relatively small matrix, which is advantageous over evaluating the analytic solution in terms of both computation cost and memory requirement.

Another important technique to reduce the overall computation time is to use warm start initialization: when the optimal value of is searched for by cross-validation, we may use the solutions obtained with as initial values for another .
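A sketch of such a block coordinate descent, written for a regularized multitask least-squares surrogate rather than the actual MT-LSLDG objective (names and the surrogate objective are our assumptions), with the warm-start hook exposed via `W0`:

```python
import numpy as np

def bcd_multitask(Xs, ys, lam, n_iter=100, W0=None):
    """Block coordinate descent sketch for the multitask quadratic
      sum_t ||X_t w_t - y_t||^2 / n_t + lam * sum_{t<t'} ||w_t - w_t'||^2:
    each inner step solves for one task's parameter block in closed form,
    holding the others fixed. W0 enables warm-start initialization when
    sweeping over lam during cross-validation."""
    T, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((T, d)) if W0 is None else W0.copy()
    As, bs = [], []
    for X, y in zip(Xs, ys):
        n = X.shape[0]
        As.append(X.T @ X / n + lam * (T - 1) * np.eye(d))
        bs.append(X.T @ y / n)
    for _ in range(n_iter):
        for t in range(T):
            # closed-form update of block t given the other blocks
            W[t] = np.linalg.solve(As[t], bs[t] + lam * (W.sum(axis=0) - W[t]))
    return W
```

Passing the solution obtained for one `lam` as `W0` for the next candidate value typically reduces the number of sweeps needed to converge.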

## 5 Experiments on Log-Density Gradient Estimation

In this section, we illustrate the behavior of the proposed method and experimentally investigate its performance.

### 5.1 Experimental Setting

As the evaluation measure, we use the LSLDG score without its constant term *C* (see section 2.2), and a smaller score means a better estimate.

We compare the following three methods:

In all the methods, we fix the number of basis functions and randomly choose the kernel centers uniformly from the training samples without replacement. For the hyperparameters, we use a common $\ell_2$-regularization parameter and a common bandwidth parameter among all the dimensions. We also set all the similarity parameters to a common value, which assumes that all dimensions are equally related to each other. In order to examine whether this assumption is reasonable, section 5.2 experimentally compares MT-LSLDG under this assumption with MT-LSLDG with similarity parameter tuning (see algorithm 1).

### 5.2 Artificial Data

We conduct numerical experiments on artificial data to investigate the basic behavior of MT-LSLDG. As the data density $p(\boldsymbol{x})$, we consider the following two cases:

- Single gaussian: the *d*-dimensional gaussian density whose mean is zero and whose covariance matrix is the diagonal matrix in which the first half of the diagonal elements are 1 and the others are 5.
- Double gaussian: a mixture of two *d*-dimensional gaussian densities with distinct mean vectors and identity covariance matrices, mixed with fixed coefficients.

The dimensionality *d* and sample size *n* are specified later. First, we investigate whether MT-LSLDG improves the estimation accuracy of LSLDG at appropriate values of the similarity parameter. We prepare data sets with different dimensionalities and sample sizes, and apply MT-LSLDG to each data set over a range of similarity parameter values. The gaussian bandwidth and the $\ell_2$-regularization parameter are chosen by fivefold cross-validation, as described in section 4, from fixed candidate lists. The solution of MT-LSLDG is computed analytically as in equation 4.4.

The results are plotted in Figure 2. In the figure, the relative test score is defined as the test score minus the test score of S-LSLDG, so negative relative scores indicate that MT-LSLDG improved the performance of S-LSLDG. For the lowest-dimensional data, MT-LSLDG does not improve the performance for any similarity parameter value (see Figures 2a and 2b). However, for higher-dimensional data, the performance is improved at appropriate similarity parameter values. Similar improvement is also observed for the smaller sample sizes (e.g., *n* = 30) in Figures 2c and 2d.

These results confirm that MT-LSLDG improves the performance of S-LSLDG at an appropriate similarity parameter value when the data are relatively high-dimensional and the sample size is small. Since such a value is usually unknown in advance, we need a way to find a reasonable one in practice.

Next, we investigate whether an appropriate similarity parameter value can be chosen by cross-validation. In this experiment, the cross-validation method in section 4.4 is performed to choose the similarity parameter as well, from a fixed candidate list. The other experimental settings, such as data generation and all the LSLDG variants, are the same as in the previous experiment, except that we also run MT-LSLDG with similarity parameter tuning by algorithm 1 with its hyperparameter cross-validated.

Table 1 shows that MT-LSLDG improves performance, especially when the dimensionality of the data is relatively high and the sample size is small. These results indicate that the proposed cross-validation method allows us to choose a reasonable similarity parameter value. In most cases of the experiments, MT-LSLDG with all similarity parameters fixed to 1 gives estimation accuracy comparable to that with the similarity parameter tuning procedure. In the remaining experiments, we use cross-validated hyperparameters and similarity parameters fixed to 1.

| Density | n | MT-LSLDG | MT-LSLDG-T | S-LSLDG | C-LSLDG |
|---|---|---|---|---|---|
| Single gaussian | 10 | −2.87 (0.22) | −2.33 (0.37) | 0.37 (0.31) | −2.58 (0.07) |
| | 30 | −5.34 (0.04) | −5.38 (0.02) | −4.97 (0.08) | −3.29 (0.03) |
| | 50 | −5.63 (0.02) | −5.64 (0.01) | −5.55 (0.02) | −4.13 (0.04) |
| Double gaussian | 10 | −6.83 (0.14) | −6.80 (0.17) | 1.01 (0.54) | −5.02 (0.12) |
| | 30 | −8.45 (0.03) | −8.45 (0.07) | −7.63 (0.10) | −7.84 (0.04) |
| | 50 | −8.67 (0.02) | −8.71 (0.03) | −8.29 (0.10) | −8.48 (0.02) |

| Density | d | MT-LSLDG | MT-LSLDG-T | S-LSLDG | C-LSLDG |
|---|---|---|---|---|---|
| Single gaussian | 2 | (0.19) | 0.21 (0.15) | −0.11 (0.15) | (0.15) |
| | 10 | −5.34 (0.04) | −5.38 (0.02) | −4.97 (0.08) | −3.29 (0.03) |
| | 20 | −10.77 (0.03) | −10.76 (0.04) | −9.98 (0.13) | −6.39 (0.01) |
| Double gaussian | 2 | (0.22) | (0.25) | (0.27) | (0.22) |
| | 10 | −8.45 (0.03) | −8.45 (0.07) | −7.63 (0.10) | −7.84 (0.04) |
| | 20 | −16.88 (0.14) | −17.06 (0.125) | −14.90 (0.10) | −15.26 (0.06) |


Notes: MT-LSLDG-T in the table refers to MT-LSLDG with similarity parameter tuning by algorithm 1. In each row, the best and comparable-to-the-best scores in terms of paired *t*-test with significance level 5% are in bold.

### 5.3 Benchmark Data

In this section, we demonstrate the usefulness of MT-LSLDG in gradient estimation on various benchmark data sets. This experiment uses some IDA benchmark data sets (Rätsch, Onoda, & Müller, 2001) and UCI benchmark data sets (Catlett, 1991; Kaya, Tüfekci, & Gürgen, 2012; Lucas et al., 2013; Tüfekci, 2014; Lichman, 2013). All the data sets are standardized in advance.

For MT-LSLDG, the hyperparameters are chosen by cross-validation from fixed candidate lists. For S-LSLDG and C-LSLDG, the candidate lists of the bandwidth and regularization parameters are the same as those for MT-LSLDG. The solution of MT-LSLDG is computed by the BCD algorithm described in section 4.5.2.

The results are presented in Table 2. MT-LSLDG significantly improves the performance of either S-LSLDG or C-LSLDG on most of the data sets.

| Data Set (d, n) | MT-LSLDG | S-LSLDG | C-LSLDG |
|---|---|---|---|
| Thyroid (5, 140) | | | |
| CCPP (5, 200) | | | |
| Diabetes (8, 468) | | | |
| Flare-solar (9, 666) | | | |
| Breast-cancer (9, 200) | | | |
| Shuttle (9, 1000) | | | |
| Image (18, 1300) | | | |
| Popfailures (18, 50) | | | |
| German | | | |
| Twonorm (20, 400) | | | |
| Waveform (21, 400) | | | |
| Splice (60, 1000) | | | |


Notes: In each data set, the best and comparable-to-the-best-scores in terms of paired *t*-test with significance level 5% are in bold. The number of trials is 20 for the image and splice data set and 100 for the other data sets.

We also run MT-LSLDG for three different sample sizes *n* on each of the data sets, varying the similarity parameter upward from 0. Figure 3 summarizes the results. The plots show that performance depends highly on the similarity parameter and that the best value differs from one data set to another, which means that selecting a good value by cross-validation is very important. Also, the blue plot, for the smallest *n*, shows the largest improvement over S-LSLDG on most of the data sets. This implies that MT-LSLDG is especially advantageous when the sample size is small.

## 6 Application to Mode-Seeking Clustering

In this section, we apply MT-LSLDG to mode-seeking clustering and experimentally demonstrate its usefulness.

### 6.1 Mode-Seeking Clustering

A practical application of log-density gradient estimation is mode-seeking clustering (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002; Sasaki et al., 2014). Mode-seeking clustering methods update each data point toward a nearby mode by gradient ascent and assign the same clustering label to the data points that converged to the same mode (see Figure 4). Their notable advantage is that we need not specify the number of clusters in advance. Mode-seeking clustering has been successfully applied to a variety of real-world problems such as object tracking (Comaniciu, Ramesh, & Meer, 2000), image segmentation (Comaniciu & Meer, 2002; Sasaki et al., 2014), and line edge detection in images (Bandera, Pérez-Lorenzo, Bandera, & Sandoval, 2006).

In mode seeking, the essential ingredient is the gradient of the data density. To estimate the gradients, mean shift clustering (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002), one of the most popular mode-seeking clustering methods, employs the two-step method of first estimating the data density by kernel density estimation and then taking its gradient. However, as we mentioned earlier, this two-step method does not work well since accurately estimating the density does not necessarily lead to an accurate estimate of the gradient.

*LSLDG clustering* (Sasaki et al., 2014) adopts LSLDG instead of the two-step method. Sasaki et al. (2014) also provided a practically useful fixed-point algorithm for mode seeking, as in mean shift clustering (Cheng, 1995): when the partial derivatives of a vector of gaussian kernels are used as the basis functions, the model can be transformed into a form whose stationarity condition yields a fixed-point update formula, provided that the denominator of the resulting expression is nonzero. It has been experimentally shown that LSLDG clustering performs significantly better than mean shift clustering (Sasaki et al., 2014).
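For intuition, here is a compact sketch of classic KDE-based mean shift, the method LSLDG clustering improves on (our illustration; names, defaults, and the mode-grouping tolerance are assumptions):

```python
import numpy as np

def mean_shift(X, sigma=1.0, n_iter=50, tol=1e-1):
    """Mode-seeking sketch via the classic KDE-based mean shift fixed point
    (not the LSLDG-based update): each point moves to the Gaussian-weighted
    average of all points until it settles at a mode; points whose final
    positions coincide (within tol) share a cluster label."""
    Z = X.copy()
    for _ in range(n_iter):
        for i in range(len(Z)):
            w = np.exp(-np.sum((X - Z[i])**2, axis=1) / (2 * sigma**2))
            Z[i] = (w[:, None] * X).sum(axis=0) / w.sum()
    labels = np.full(len(Z), -1)
    modes = []
    for i, z in enumerate(Z):
        for m, mode in enumerate(modes):
            if np.linalg.norm(z - mode) < tol:
                labels[i] = m
                break
        else:
            modes.append(z)
            labels[i] = len(modes) - 1
    return labels
```

The inner update is a fixed-point iteration on the KDE gradient; LSLDG clustering keeps this mode-seeking structure but replaces the KDE-based gradient with the directly estimated one.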

Here, we apply MT-LSLDG to LSLDG clustering and investigate whether it improves mode-seeking clustering performance as well, particularly for relatively high-dimensional data.

### 6.2 Experiments

Next, we conduct numerical experiments for mode-seeking clustering.

#### 6.2.1 Experimental Setting

We apply the following four clustering methods to various data sets:

For MT-, S-, and C-LSLDG, all the hyperparameters are cross-validated as described in section 4.4; for mean shift, cross-validation with respect to the log likelihood is used.

We evaluate the clustering performance by the adjusted Rand index (ARI) (Hubert & Arabie, 1985). ARI takes the value one for a perfect clustering assignment and zero on average for a random clustering assignment; a larger ARI value means a better clustering result.
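For reference, ARI can be computed from the pair-counting contingency table; the following is a self-contained sketch (our illustration; scikit-learn's `adjusted_rand_score` offers an equivalent, tested implementation):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(a, b):
    """Adjusted Rand index: chance-corrected pair-counting agreement between
    two clusterings of the same items; 1 for identical partitions and 0 on
    average for random ones."""
    n = len(a)
    pairs = comb(n, 2)
    nij = Counter(zip(a, b))          # contingency table counts
    ai, bj = Counter(a), Counter(b)   # marginal cluster sizes
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in ai.values())
    sum_b = sum(comb(c, 2) for c in bj.values())
    expected = sum_a * sum_b / pairs          # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that relabeling clusters leaves ARI unchanged, so a prediction matching the true partition up to a permutation of labels still scores 1.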

#### 6.2.2 Artificial Data

First, we conduct experiments on artificial data. The density of the artificial data is a mixture of three *d*-dimensional gaussian densities with distinct means and fixed covariance matrices, and with mixing coefficients 0.4, 0.3, and 0.3. The hyperparameters are chosen by cross-validation from fixed candidate lists.

The results are shown in Table 3. We can see that MT-LSLDGC performs well, especially for the largest dimensionalities.

| d | MT-LSLDGC | S-LSLDGC | C-LSLDGC | Mean-Shift |
|---|---|---|---|---|
| 2 | (0.035) | (0.125) | (0.036) | (0.044) |
| 10 | (0.004) | (0.003) | (0.004) | 0.042 (0.022) |
| 15 | (0.023) | (0.054) | 0.877 (0.217) | 0.000 (0.000) |
| 20 | (0.190) | 0.586 (0.208) | 0.716 (0.352) | 0.036 (0.037) |


Notes: In each row, the best and comparable to the best ARI in terms of unpaired *t*-test with significance level 5% is emphasized in bold. The number of trials is 100.

#### 6.2.3 Real Data

Next, we perform clustering on real data. The following four data sets are used:

- Accelerometry data: Five-dimensional data used in Hachiya, Sugiyama, and Ueda (2012) for human activity recognition, extracted from mobile sensing data available from http://alkan.mns.kyutech.ac.jp/web/data. The number of classes is 3. In each run of the experiment, we use 100 samples randomly chosen from each class, for a total of 300 samples.
- Vowel data: Ten-dimensional data of recorded British English vowel sounds available from https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Vowel+Recognition+-+Deterding+Data). The number of classes is 11. In each run of the experiment, we use 500 randomly chosen samples.
- Sat-image data: Thirty-six-dimensional multispectral satellite image data available from https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite). The number of classes is 6. In each run of the experiment, we use 2000 randomly chosen samples.
- Speech data: Fifty-dimensional voice data from two French speakers (Sugiyama, Niu, Yamada, Kimura, & Hachiya, 2014). The number of classes is 2. In each run of the experiment, we use 200 samples randomly chosen from each class, for a total of 400 samples.

For MT-LSLDG, the hyperparameters are cross-validated from fixed candidate lists, except that we use relatively small candidate lists for the speech data, since its dimensionality is large and the optimization is computationally expensive. For S- and C-LSLDG, we use the same candidate lists as for MT-LSLDG. For mean shift clustering, the gaussian kernel is employed in KDE, and the bandwidth parameter of the kernel is selected by fivefold cross-validation with respect to the log likelihood of the density estimate, using the same bandwidth candidates as for MT-LSLDG.

The results are shown in Table 4. For the accelerometry data, whose dimensionality is only five, S-LSLDGC gives the best performance. MT-LSLDGC does not improve on it, although it performs better than C-LSLDGC.

| Data Set (d, n) | MT-LSLDGC | S-LSLDGC | C-LSLDGC | Mean-Shift |
|---|---|---|---|---|
| Accelerometry (5, 300) | 0.40 (0.01) | (0.02) | 0.24 (0.01) | 0.26 (0.04) |
| Vowel (10, 500) | (0.00) | (0.00) | (0.00) | 0.04 (0.00) |
| Sat-image (36, 2000) | (0.00) | 0.43 (0.01) | 0.35 (0.00) | 0.00 (0.00) |
| Speech (50, 400) | (0.02) | 0.00 (0.00) | 0.15 (0.01) | 0.00 (0.00) |


Notes: In each row, the best and comparable-to-the-best ARI in terms of paired *t*-test with significance level 5% is in bold. The number of trials is 100 for the accelerometry data and the sat-image data and 20 for the speech data.

For the higher-dimensional data sets (the vowel, sat-image, and speech data), the performance of MT-LSLDGC is the best or comparable to the best. These results indicate that MT-LSLDG is a promising method for mode-seeking clustering, especially when the dimensionality of the data is relatively large.

## 7 Conclusion

We proposed a multitask log-density gradient estimator in order to improve the existing estimator in higher-dimensional cases. Our fundamental idea is to exploit the relatedness inherent in the partial derivatives through regularized multitask learning. As a result, we experimentally confirmed that our method significantly improves the accuracy of log-density gradient estimation. Finally, we demonstrated the usefulness of the proposed log-density gradient estimator in mode-seeking clustering.

Although fixing the similarity parameters to be 1 worked reasonably well in our experiments, carefully tuning them may improve the estimation accuracy. A good practice may be to use the heuristic procedure given in algorithm 1, whose properties have yet to be analyzed. Another way is to use Bayesian optimization techniques such as the gaussian process approaches studied in Bergstra, Bardenet, Bengio, and Kégl (2011) and Snoek, Larochelle, and Adams (2012), which have been experimentally shown to be reasonably fast even in large-scale hyperparameter tuning tasks.

As log-density gradient estimation is a vector-valued function learning problem, one may consider applying kernel-based methods for such problems (Micchelli & Pontil, 2005; Caponnetto, Micchelli, Pontil, & Ying, 2008). Micchelli and Pontil (2005) showed a representer theorem for a wide class of optimization problems in a reproducing kernel Hilbert space of vector-valued functions whose objective functional depends only on the outputs of the function to be optimized. However, it may not be possible to directly employ those methods in the LSLDG framework, since the LSLDG objective functional depends on the derivatives of the function in addition to its outputs. It is an important open question whether LSLDG also admits a similar representer theorem.

## Acknowledgments

We thank the anonymous reviewers for their valuable comments. I.Y. was supported by KAKENHI 26280054, H.S. was supported by KAKENHI 15H06103, and M.S. was supported by KAKENHI 25700022.

## References

Thrun, S. (1996). Is learning the *n*-th thing any easier than learning the first? In *Advances in Neural Information Processing Systems, 8*. Cambridge, MA: MIT Press.