## Abstract

Estimating the derivatives of probability density functions is an essential step in statistical data analysis. A naive approach to estimate the derivatives is to first perform density estimation and then compute its derivatives. However, this approach can be unreliable because a good density estimator does not necessarily mean a good density derivative estimator. To cope with this problem, in this letter, we propose a novel method that directly estimates density derivatives without going through density estimation. The proposed method provides computationally efficient estimation for the derivatives of any order on multidimensional data with a hyperparameter tuning method and achieves the optimal parametric convergence rate. We further discuss an extension of the proposed method by applying regularized multitask learning and a general framework for density derivative estimation based on Bregman divergences. Applications of the proposed method to nonparametric Kullback-Leibler divergence approximation and bandwidth matrix selection in kernel density estimation are also explored.

## 1 Introduction

The derivatives of probability density functions are important quantities in various problems of statistical data analysis—for instance:

*Mean shift clustering* updates data points toward the modes of an estimated data density by gradient ascent (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002; Sasaki, Hyvärinen, & Sugiyama, 2014). The first-order density derivative is the essential tool.

The second-order density derivative is the key ingredient of the optimal bandwidth parameter (Silverman, 1986) and the optimal bandwidth matrix (Wand & Jones, 1994b) in kernel density estimation (KDE) for uni- and multivariate data, respectively.

The bias of nearest-neighbor Kullback-Leibler (KL) divergence approximation depends on the second-order density derivatives (Noh et al., 2014).

An algorithm for independent component analysis was proposed based on the outer products of density gradients (Samarov & Tsybakov, 2004).

More statistical problems relying on density derivatives are discussed in Singh (1977a).

Therefore, accurate estimation of density derivatives is a fundamental research topic in statistical data analysis.

A naive approach to density derivative estimation is to first estimate the density by KDE and then differentiate it: given a kernel function *K* and a bandwidth parameter *h*, the *k*th order density derivative is computed by differentiating the kernel density estimate *k* times. However, this approach can be unreliable because a good density estimator does not necessarily mean a good density derivative estimator. This problem seems to be more critical if higher-order derivatives are computed from lower-order derivative estimates. Another challenge is the choice of the bandwidth parameter *h*, because the optimal value requires estimating a higher-order derivative than the target derivative (Singh, 1981, 1987).
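As a concrete illustration of this naive plug-in approach, the sketch below differentiates a one-dimensional gaussian KDE analytically; the bandwidth `h` and the sample setup are illustrative assumptions, not tuned values:

```python
import numpy as np

def kde_derivative(x, t, h, k=1):
    """Naive plug-in estimate of the kth density derivative (k = 0, 1, or 2):
    build a gaussian KDE from samples x and differentiate it analytically
    at the points t. h is an illustrative bandwidth, not a tuned value."""
    u = (t[:, None] - x[None, :]) / h                  # (len(t), n)
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)     # standard gaussian kernel
    if k == 0:
        w = phi                                        # the KDE itself
    elif k == 1:
        w = -u * phi / h                               # first derivative of the kernel
    elif k == 2:
        w = (u**2 - 1) * phi / h**2                    # second derivative of the kernel
    return w.mean(axis=1) / h
```

On standard normal data, this reproduces the qualitative behavior described above: the estimate is reasonable for the density itself but degrades for higher-order derivatives and mischosen bandwidths.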

In this letter, we propose a novel density derivative estimator with the following properties:

Multidimensional density derivatives of any order can be directly estimated without density estimation.

The solution of the estimator can be efficiently computed in a closed form.

A cross-validation method is available for tuning all hyperparameters in the estimator.

The estimator is proposed as a minimizer of the integrated squared error (ISE) with respect to the true density derivative and is called ISE for density derivatives (ISED). We theoretically prove that ISED achieves the optimal parametric convergence rate and experimentally demonstrate that ISED significantly outperforms a KDE-based density derivative estimator, especially for higher-dimensional data. Furthermore, as future directions of this work, we discuss an extension of ISED based on regularized multitask learning (Caruana, 1997; Evgeniou & Pontil, 2004; Argyriou, Evgeniou, & Pontil, 2007) and a more general framework of density derivative estimation based on Bregman divergences (Bregman, 1967), in which another estimator for density derivatives is derived as a generalization of ISED.

We further apply ISED to KL-divergence approximation and the optimal bandwidth matrix selection for KDE. Unlike previous work (Noh et al., 2014; Duong & Hazelton, 2005; Chacón & Duong, 2010), in both applications, we estimate the second-order derivatives directly by ISED without making use of any parametric models or any assumptions on the data density. The experimental results confirm the usefulness of ISED on various tasks, such as change detection, feature selection, and density estimation.

This letter is organized as follows. Section 2 formulates the problem of density derivative estimation and derives a novel estimator. A theoretical analysis of the proposed estimator is also provided. The performance of the proposed estimator is numerically illustrated in section 3. Applications to KL-divergence approximation and bandwidth matrix selection for KDE are explored in sections 4 and 5, respectively. Future directions of this work are discussed in section 6. Section 7 concludes this letter.

## 2 Direct Density Derivative Estimation

In this section, we propose a novel direct estimator for density derivatives.

### 2.1 Problem Formulation

Suppose that we are given i.i.d. samples drawn from a probability density *p* on a *d*-dimensional domain. Our goal is to estimate the *k*th order partial derivative of *p* from these samples. When *k* equals 1 (or 2), the estimand corresponds to a single element in the gradient vector (or the Hessian matrix) of *p*.

### 2.2 Integrated Squared Error for Density Derivatives

Let *g* denote a model of the *k*th order density derivative. ISED starts from the integrated squared error between *g* and the true derivative. Expanding this error, the cross term can be transformed by integration by parts, where the boundary term vanishes under a mild assumption that the tails of *p* approach zero. Each application of integration by parts decreases the order of the true density derivative by one while differentiating the model once. Thus, repeatedly applying integration by parts *k* times replaces the unknown *k*th derivative of *p* with an expectation, under *p*, of the *k*th derivative of the model (up to sign). By approximating this expectation with the sample average, we reach the empirical version of ISED.

As the model *g*, we use a linear-in-parameter model with gaussian basis functions, where *b* denotes the number of basis functions and the center points are fixed to *b* data samples randomly chosen from the training set. The *k*th order derivative of this model can be computed analytically. Substituting these formulas into equation 2.3 and adding the ℓ₂-regularizer yield a quadratic objective function. The minimizer of equation 2.4 can be obtained analytically in closed form, where *I*_{b} denotes the *b*-by-*b* identity matrix. Finally, a density derivative estimator is obtained by plugging the minimizer into the model.

We call this method the integrated squared error for density derivatives (ISED), which is a generalization of least-squares density-difference for density-difference estimation (Kim & Scott, 2010; Sugiyama et al., 2013), and is an extension of the direct estimator for log-density gradients (Cox, 1985; Sasaki et al., 2014) to higher-order derivatives.
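To make the closed-form computation concrete, here is a minimal one-dimensional sketch for the first-order derivative (*k* = 1) with gaussian basis functions: the Gram matrix of basis-function integrals and the sample-average term are both analytic, so the solution is a single regularized linear solve. The values of `sigma`, `lam`, and `b` are illustrative assumptions, not the letter's cross-validated choices.

```python
import numpy as np

def ised_first_derivative(x, sigma=0.6, lam=1e-3, b=100, seed=0):
    """Direct estimation of the first density derivative in 1D,
    a minimal sketch of the ISED idea with gaussian basis functions."""
    rng = np.random.default_rng(seed)
    b = min(b, len(x))
    c = rng.choice(x, size=b, replace=False)            # basis centers
    # Gram matrix of exact gaussian-basis integrals: G_jl = ∫ psi_j psi_l dx
    D = c[:, None] - c[None, :]
    G = sigma * np.sqrt(np.pi) * np.exp(-D**2 / (4 * sigma**2))
    # Sample average of the first derivative of each basis function
    U = x[None, :] - c[:, None]                         # (b, n)
    Psi = np.exp(-U**2 / (2 * sigma**2))
    h = (-U / sigma**2 * Psi).mean(axis=1)
    # Closed-form solution: theta = (-1)^k (G + lam I)^{-1} h with k = 1
    theta = -np.linalg.solve(G + lam * np.eye(b), h)

    def g(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        Phi = np.exp(-(t[:, None] - c[None, :])**2 / (2 * sigma**2))
        return Phi @ theta
    return g
```

For standard normal data, the returned function approximates p′(x) = −x·φ(x): negative for x > 0, positive for x < 0, and near zero at the origin.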

### 2.3 Model Selection by Cross-Validation

The performance of the proposed method depends on the choice of model parameters such as the gaussian width and regularization parameter . Below, we describe a cross-validation method to select the optimal model:

Step 1. Divide the samples into *T* disjoint subsets.

Step 2. For each candidate model, compute the hold-out ISED criterion on each subset with the estimator trained on the remaining subsets, and average the hold-out values over the *T* subsets.

Step 3. Choose the model that minimizes the averaged hold-out criterion.

This CV method is essentially the same as the one proposed by Hardle, Marron, and Wand (1990) for density derivative estimation based on KDE.
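The cross-validation procedure can be sketched as follows for the first-order, one-dimensional case. The hold-out score is the ISE criterion up to a constant, and the candidate grids `sigmas` and `lams` are placeholders for the logarithmically spaced candidates used in the letter:

```python
import numpy as np
from itertools import product

def fit_theta(x, c, sigma, lam):
    """Closed-form ISED solution for the first derivative (k = 1)."""
    D = c[:, None] - c[None, :]
    G = sigma * np.sqrt(np.pi) * np.exp(-D**2 / (4 * sigma**2))
    U = x[None, :] - c[:, None]
    Psi = np.exp(-U**2 / (2 * sigma**2))
    h = (-U / sigma**2 * Psi).mean(axis=1)
    return -np.linalg.solve(G + lam * np.eye(len(c)), h), G

def holdout_score(theta, G, x_val, c, sigma):
    """Hold-out ISE criterion up to a constant: ∫g² + 2·mean g'(x_val)."""
    U = x_val[None, :] - c[:, None]
    Psi = np.exp(-U**2 / (2 * sigma**2))
    g_prime = theta @ (-U / sigma**2 * Psi)      # g'(x) at hold-out points
    return theta @ G @ theta + 2 * g_prime.mean()

def cv_select(x, sigmas, lams, T=5, seed=0):
    """T-fold cross-validation over a grid of (sigma, lambda) candidates."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), T)
    best, best_score = None, np.inf
    for sigma, lam in product(sigmas, lams):
        score = 0.0
        for t in range(T):
            val = x[folds[t]]
            trn = x[np.concatenate([folds[s] for s in range(T) if s != t])]
            c = trn[:min(100, len(trn))]         # basis centers from training fold
            theta, G = fit_theta(trn, c, sigma, lam)
            score += holdout_score(theta, G, val, c, sigma) / T
        if score < best_score:
            best, best_score = (sigma, lam), score
    return best
```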

### 2.4 Theoretical Analysis

Our result of theoretical analysis is given in the following theorem:

The proof is found in the appendix.

Theorem 1 asserts that ISED is a consistent estimator and achieves the optimal parametric convergence rate under some assumptions. In practice, however, the positive definiteness assumption might not always be satisfied. This problem can be avoided by adding the regularizer with an appropriate regularization parameter, and under this setting we define the optimal solution and estimator accordingly. Then we obtain a similar statement as a corollary of theorem 1:

## 3 Illustration of Density Derivative Estimation

In this section, we numerically illustrate the behavior of ISED on artificial data and compare it with a KDE-based density derivative estimator.

In this illustration, data samples were drawn from the standard normal density. We performed model selection based on cross-validation as described in section 2.3: the gaussian bandwidth parameter was chosen from the nine candidates from to at the regular interval in logarithmic scale, and the regularization parameter was chosen from the nine candidates from to at the regular interval in logarithmic scale. For KDE, the gaussian kernel was employed, and the bandwidth parameter was chosen as follows:

**KDE**_{S}. The bandwidth parameter was cross-validated with respect to the hold-out least-squared error of the estimated density to the true density.

**KDE**_{I}. The bandwidth parameter was cross-validated with respect to the hold-out ISED criterion, equation 2.5.

For KDE, we chose the bandwidth parameter from the nine candidates from to at the regular interval in logarithmic scale. The cross-validation method of KDE_{I} is the same as the one that Hardle et al. (1990) proposed, and in its cross-validation, we used the same candidates as ISED.

Figures 1a and 1b show that ISED produces more accurate estimates than both KDE_{S} and KDE_{I} in density derivative estimation. Figure 1c depicts the density estimate obtained by integrating the ISED estimate of the first-order density derivative, indicating that ISED approximates the density function well up to an unknown constant. KDE_{S} works well in density estimation but produces considerable errors in density derivative estimation, particularly for the second-order derivative. This result highlights that a good density estimator is not necessarily a good density derivative estimator. The density derivative estimates from KDE_{I} are more accurate than those from KDE_{S}, but they are not as accurate as those from ISED. Furthermore, KDE_{I} provides an oversmoothed density estimate, which is less accurate than ISED's.

In addition to the standard normal density, we performed the same experiments using a mixture of two gaussians and an independent log-normal density:

For the mixture of two gaussians, all elements in one mean vector were 2, while the elements in the other mean vector were . The covariance matrices in the two gaussians were the identity matrix, and the mixing coefficients were both .

For cross-validation of ISED and KDE_{I} in the log-normal density, we set nine candidates for their bandwidth parameters from to at the regular interval in logarithmic scale because the scale is different from other densities. For KDE_{S}, we set the nine candidates from to in the log-normal density. As presented in Figures 2 and 3, the results further substantiate the superior performance of ISED.

## 4 KL-Divergence Approximation

In this section, we apply ISED to KL-divergence approximation.

### 4.1 Nearest-Neighbor KL-Divergence Approximation
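The starting point of this section, nearest-neighbor KL-divergence approximation, can be sketched as follows: the plain *k*-nearest-neighbor estimator in the style of Wang et al. (2006), without the metric-learning correction discussed below, and with an illustrative choice of `k`:

```python
import numpy as np

def knn_kl(x, y, k=1):
    """k-nearest-neighbor estimate of KL(p1 || p2) from samples x ~ p1 and
    y ~ p2. A sketch of the NNDE-style estimator; the metric-learning
    refinements of the letter are not included."""
    x, y = np.atleast_2d(x.T).T, np.atleast_2d(y.T).T   # ensure (n, d) shape
    n, d = x.shape
    m = y.shape[0]
    # rho: distance from each x_i to its kth NN within x (excluding itself)
    dxx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(dxx, np.inf)
    rho = np.sort(dxx, axis=1)[:, k - 1]
    # nu: distance from each x_i to its kth NN within y
    dxy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    nu = np.sort(dxy, axis=1)[:, k - 1]
    # Plug NN density estimates into the log-ratio and average
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
```

For example, for x ~ N(0, 1) and y ~ N(1, 1), the true KL-divergence is 0.5, and the estimate converges toward it as the sample size grows.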

### 4.2 Metric Learning with Parametric Bias Estimation for NNDE-Based KL-Divergence Approximation

To compute the bias matrix , Noh et al. (2014) assumed that *p*_{1} and *p*_{2} are both nearly gaussian, and estimated densities *p*_{1} and *p*_{2} as well as their Hessian matrices and using the gaussian parametric model with maximum likelihood estimation. The accuracy of NNDE-based KL-divergence approximation was significantly improved when *p*_{1} and *p*_{2} are nearly gaussian.

### 4.3 Metric Learning with Nonparametric Bias Estimation by ISED for NNDE-Based KL-Divergence Approximation

The evident limitation of the above method is the gaussian assumption on *p*_{1} and *p*_{2} when estimating . Thus, its applicability would be rather limited. To extend the applicability, a naive approach to nonparametrically estimate is to first perform nonparametric density estimation for *p*_{1} and *p*_{2} separately, then compute their Hessian matrices and , and finally plug them into the definition of . However, such a plug-in approach can be unreliable because a good density estimator does not necessarily mean a good estimator of its Hessian matrix, as already demonstrated in section 3. In addition, division by estimated densities in could significantly magnify the estimation errors of Hessian matrices.

### 4.4 Illustration of KL-Divergence Approximation

Here, we compare the performance of the proposed KL-divergence approximator based on ISED with the following approximators:

NN: NNDE-based KL-divergence approximator without metric learning (Wang et al., 2006)

NNG: NNDE-based KL-divergence approximator with gaussian-based metric learning (Noh et al., 2014)

KDE: NNDE-based KL-divergence approximator with KDE-based metric learning

Ratio: Density-ratio-based nonparametric KL-divergence approximator (Nguyen, Wainwright, & Jordan, 2010)

fRisk: Risk-based nearest-neighbor KL-divergence approximator (Garcia-Garcia, von Luxburg, & Santos-Rodriguez, 2011)

GP: Gaussian parametric KL-divergence approximator with maximum likelihood estimation.

As the densities *p*_{1} and *p*_{2}, we used generalized gaussian densities whose parameter value is selected so that the variance is one. For the cross-validation of ISED, we used the nine candidates from to for and to for at the regular interval in logarithmic scale. For KDE, the density derivative estimation and density estimation were separately performed by KDE_{I} and KDE_{S}, respectively, and these estimates were substituted into equation 4.2. The bandwidth parameters in both KDE_{I} and KDE_{S} were cross-validated as in the previous experiment. The true KL-divergence value was approximated from samples by the sample average.

Figure 4 shows the experimental results of KL-divergence approximation for the generalized gaussian density as functions of the data dimension *d*. In the figures, the relative KL-divergence is defined as the approximated KL-divergence value minus the true KL-divergence value. For all data, ISED works well over a wide range of dimensions. NN, NNG, and GP produce good performance for supergaussian and gaussian data, but they strongly underestimate the KL-divergence for subgaussian data. KDE does not perform well on any of the data or shows high variance, especially for high-dimensional data. fRisk is comparable to ISED for subgaussian data but overestimates the KL-divergence for the other data. Ratio systematically underestimates it for all data. Figure 5 shows that as the sample size *n* increases, ISED tends to produce more accurate estimates for all data. For gaussian data, ISED is comparable even to GP and NNG, both of which assume the gaussian density. NN and Ratio underestimate the KL-divergence for all data, while fRisk overestimates it for supergaussian and gaussian data. Figure 6 further illustrates that ISED performs well with different mean parameter values.

Moreover, we investigate the performance of ISED using mixtures of two gaussians and log-normal densities:

- For the log-normal densities, where denotes the log-normal density defined by equation 3.1.

For mixtures of two gaussians, the hyperparameters in ISED were cross-validated as in the previous experiments, but for the log-normal densities, we used the nine candidates from to for because the scale is different from the other densities. The results presented in Figure 7 indicate that the KL-divergence approximator based on ISED performs well for both of these densities as well.

### 4.5 Distributional Change Detection

Let a time window vector be a vector of consecutive samples ending at time *t*; a collection of time window vectors is obtained by sliding the windows along the series. Following Liu et al. (2013), we assume the existence of an underlying density function that generates the *r* retrospective window vectors at each time point. To find change points, as a change score, we use the KL-divergence between the underlying density functions of the two adjacent sets of window vectors for every *t*: if the KL-divergence for the two sets is greater than a threshold, the point is determined as a change point. In this experiment, we set and .

We use the Human Activity Sensing Consortium (HASC) Challenge 2011 data collection,^{1} which provides human activity information collected by a portable three-axis accelerometer. Our task is to segment different activities such as “stay,” “walk,” “jog,” and “skip.” As in Liu et al. (2013), we took the -norm of three-dimensional accelerometer data and obtained one-dimensional data because the orientation of the accelerometer is not necessarily fixed.
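A minimal sketch of this sliding-window change-score pipeline is given below. For self-containment, a closed-form gaussian KL-divergence between the two window sets stands in for the ISED-based approximator, and the window sizes `w` and `r` are illustrative, not the values used in the experiment:

```python
import numpy as np

def gauss_kl_1d(a, b):
    """Closed-form KL-divergence between 1D gaussians fit to samples a and b;
    a simple stand-in for the ISED-based KL approximator."""
    m1, v1 = a.mean(), a.var() + 1e-12
    m2, v2 = b.mean(), b.var() + 1e-12
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def change_scores(y, w=20, r=10):
    """For every t, compare the r retrospective windows of width w before t
    with the r windows after t; returns one change score per valid t."""
    scores = np.full(len(y), np.nan)
    span = w + r - 1          # samples covered by r sliding windows of width w
    for t in range(span, len(y) - span):
        past = y[t - span:t]
        future = y[t:t + span]
        scores[t] = gauss_kl_1d(future, past)
    return scores
```

On a synthetic signal whose variance changes abruptly, the score peaks near the change point, mirroring the behavior of the KL-based change score described above.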

Two examples of time series data and the KL-divergence approximated by ISED are illustrated in Figure 8. The figure shows that the approximated KL-divergence tends to be large around the change points, and thus the KL-divergence can be used as a change score. Table 1 summarizes the results over 10 runs, where a point was classified as a change point if it was within a small tolerance region ( ms around an "exact" change point), and the performance was measured by the area under the ROC curve (AUC). From the table, we can confirm that ISED significantly outperforms GP and NNG and is comparable to fRisk. The strong performance of fRisk could be interpreted in the following way: as reported by Noh et al. (2014), fRisk keeps the values of the estimated KL-divergence similar across different data densities. This property works as a "regularizer" that stabilizes the change scores, although fRisk is poor as a KL-divergence approximator.

| GP | NNG | fRisk | ISED |
|---|---|---|---|
| 0.747(0.050) | 0.822(0.030) | 0.858(0.022) | 0.837(0.027) |

Note: The best method and methods comparable to the best one in terms of the mean AUC by the unpaired *t*-test with significance level are highlighted in bold.

### 4.6 Information Theoretical Feature Selection

A classical criterion is the *t*-score, defined from the mean values and standard deviations of the two classes for each feature. However, the limitation of the *t*-score is that it does not take into account the correlation and redundancy between features. As a recent alternative criterion, the Jensen-Shannon (JS) divergence has been used (Bontempi & Meyer, 2010). The JS divergence is known to be the Shannon mutual information between binary class labels *y* and features. Previously, the JS divergence has been approximated parametrically (Bontempi & Meyer, 2010). In contrast, here we approximate it nonparametrically by using the proposed KL-divergence approximator.
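For concreteness, the standard form of the JS divergence between the class-conditional densities *p*_{1} and *p*_{2} with class priors π_{1} and π_{2} (our notation) is

```latex
\mathrm{JS}(p_1, p_2)
  = \pi_1\,\mathrm{KL}\!\left(p_1 \,\middle\|\, \bar{p}\right)
  + \pi_2\,\mathrm{KL}\!\left(p_2 \,\middle\|\, \bar{p}\right),
\qquad
\bar{p} = \pi_1 p_1 + \pi_2 p_2,
```

so a nonparametric KL-divergence approximator can be applied to each of the two KL terms.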

In this experiment, two gene expression data sets, SMK-CAN-187 (Freije et al., 2004) and VANTVEER (van’t Veer et al., 2002), were used. The SMK-CAN-187 data set contains 90 positive (alive) and 97 negative (dead after five years) samples with 19,993 features. We randomly selected 65 samples per class for training and used the rest for evaluating the classification performance. The VANTVEER data set contains 46 positive and 51 negative samples with 24,481 features. Thirty-five samples per class were randomly selected for training, and the rest were used for testing.

For comparison, we applied the following methods to the data sets:

ISED: A nonparametric JS divergence approximator based on the proposed KL-divergence approximator

NNG: A nonparametric JS divergence approximator based on the NNDE-based KL-divergence approximator with gaussian-based metric learning (Noh et al., 2014)

NN: A nonparametric JS divergence approximator based on the NNDE-based KL-divergence approximator (Wang et al., 2006)

mIMR: A parametric JS divergence approximator (Bontempi & Meyer, 2010)

*t*-score: A conventional feature selection criterion

We chose features based on the forward selection strategy and measured the performance by the AUC of classification. The results are shown in Table 2, confirming the reasonable performance of the proposed method in feature selection.
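The forward selection strategy itself is generic and can be sketched as follows; `score_fn` stands in for the JS-divergence approximator, and `mean_gap_score` is our illustrative toy criterion, not a method from the letter:

```python
import numpy as np

def forward_select(X, y, score_fn, n_features=3):
    """Greedy forward feature selection: at each step, add the feature whose
    inclusion maximizes score_fn(X[:, subset], y). Any relevance criterion,
    such as a JS-divergence approximator, can be plugged in as score_fn."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best = max(remaining, key=lambda j: score_fn(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def mean_gap_score(Xs, y):
    """Toy criterion: distance between the class-conditional mean vectors."""
    gap = Xs[y == 1].mean(axis=0) - Xs[y == 0].mean(axis=0)
    return float(np.linalg.norm(gap))
```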

| Data Set | ISED | NNG | NN | mIMR | *t*-score |
|---|---|---|---|---|---|
| SMK-CAN-187 | 0.721(0.008) | 0.739(0.008) | 0.739(0.007) | 0.732(0.008) | 0.707(0.010) |
| VANTVEER | 0.753(0.008) | 0.723(0.011) | 0.729(0.010) | 0.718(0.011) | 0.702(0.011) |

Note: The best method and methods comparable to the best one in terms of the mean AUC by the unpaired *t*-test with significance level are highlighted in bold.

## 5 Kernel Density Estimation with the Bandwidth Matrix

In this section, we compare the proposed method to KDE with the bandwidth matrix in terms of both density and density derivative estimation.

### 5.1 Density Estimation

Here, we first review the existing methods for bandwidth matrix selection in KDE and then propose a method for bandwidth matrix selection based on ISED.

#### 5.1.1 Existing Methods for Bandwidth Matrix Selection

Multivariate KDE employs a kernel function *L* with a bandwidth matrix, and a key challenge in this approach is to estimate the bandwidth matrix. Chacón and Duong (2010) estimated it as the minimizer of the following mean squared error (MSE) between the vector of density derivative functionals and its KDE-based estimate. To improve pilot estimates, Chacón and Duong (2010) further performed two-stage estimation: at the first stage, the higher-order functional is estimated by minimizing its own MSE, and this estimate is used to approximate the criterion at the second stage; the final bandwidth matrix estimate is then obtained by minimizing MSE, equation 5.4. However, the first-stage estimation itself requires unknown quantities. To cope with this problem, Chacón and Duong (2010) computed them under the assumption that the data density is gaussian, but the gaussian assumption does not always work well in general.

#### 5.1.2 Bandwidth Matrix Selection Based on ISED

As reviewed above, the existing methods for bandwidth matrix selection rely on the gaussian assumption at a certain phase. The method proposed in Horová, Koláček, and Vopatová (2013) does not employ gaussian assumptions, but its practical algorithm restricts the structure of the bandwidth matrix and reduces bandwidth matrix selection to the selection of a single bandwidth parameter. Unlike these methods, we estimate the bandwidth matrix without making any strong assumptions on the data density or restricting the structure of the bandwidth matrix.

We follow the approach of minimizing AMISE but remove the gaussian assumption made in previous works: the Hessian matrix is estimated by ISED, and the estimated Hessian matrix is plugged in to compute AMISE, equation 5.2. Since the bandwidth matrix must be symmetric positive definite (SPD), we perform manifold optimization on the SPD manifold, using the MATLAB software *Manopt* (Boumal, Mishra, Absil, & Sepulchre, 2014), to minimize AMISE, equation 5.2, with respect to the bandwidth matrix.
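As an alternative to manifold optimization with Manopt, the SPD constraint can also be handled by an unconstrained Cholesky-style parameterization, sketched below with SciPy; `objective` is a stand-in for the AMISE of equation 5.2, so this is an illustration of the optimization step, not the letter's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_spd(objective, d):
    """Minimize objective(H) over symmetric positive-definite d x d matrices
    via the unconstrained parameterization H = L L^T + eps*I, where L is
    lower triangular. A simple alternative to SPD-manifold optimization."""
    tril = np.tril_indices(d)

    def to_spd(v):
        L = np.zeros((d, d))
        L[tril] = v
        return L @ L.T + 1e-10 * np.eye(d)   # always SPD by construction

    v0 = np.eye(d)[tril]                      # start from the identity matrix
    res = minimize(lambda v: objective(to_spd(v)), v0, method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-10, "fatol": 1e-12})
    return to_spd(res.x)
```

As a sanity check, minimizing the Frobenius distance to a known SPD matrix recovers that matrix.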

#### 5.1.3 Numerical Illustration on Artificial and Benchmark Data

Here, we experimentally investigate the performance of the proposed bandwidth matrix selection method. For comparison, we apply the following methods to the same data sets:

ISED: A bandwidth matrix selector minimizing AMISE with the ISED plug-in estimator for

SCV: A bandwidth matrix selector based on minimizing the smoothed cross-validation criterion (Hall et al., 1992; Duong & Hazelton, 2005)

NR: A bandwidth matrix selector minimizing AMISE based on the normal reference rule (gaussian assumption) for the data density (Chacón et al., 2011)

PI: A bandwidth matrix selector minimizing AMISE with pilot estimates for (Chacón & Duong, 2010)

Except for ISED, we used the R library *ks* whose details are available in Duong (2007).

Data samples were drawn from the following six densities:

Density 1: A gaussian density with mean and covariance matrix , where the covariance matrix has a tridiagonal shape: the diagonals are 1, the elements adjacent to the diagonal take a common value for all *i*, and the other off-diagonals are 0.

Density 2: A mixture of two gaussians with means and and covariance matrices .

Density 3: A mixture of two gaussians with means and and covariance matrices and . The diagonals in and are 1, the th and th elements in and are , and the others are 0.

Density 4: A mixture of two gaussians with means and and covariance matrices and . The diagonals in are 1, the th and th elements in are , and the others are 0.

Density 5: A mixture of three gaussians with means , , and and covariance matrices .

Density 6: A mixture of three gaussians with means , , and and covariance matrices , and . The diagonals in and are 1, the th and th elements are for and for , and the others are 0.

For each density, we drew *n* training samples and test samples. Using the training samples, the bandwidth matrix was estimated. We measured the performance by the empirical ISE using the test data:
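For a gaussian KDE with bandwidth matrix `H`, the empirical ISE on test data (up to the constant ∫p², which does not depend on `H`) can be computed in closed form, since ∫p̂² is a double sum of gaussians with covariance 2H. A sketch under these assumptions:

```python
import numpy as np

def gauss_pdf(X, mean, cov):
    """Multivariate gaussian density evaluated at the rows of X."""
    d = X.shape[1]
    diff = X - mean
    sol = np.linalg.solve(cov, diff.T).T
    quad = np.einsum("ij,ij->i", diff, sol)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def empirical_ise(x_train, x_test, H):
    """Empirical ISE (up to the constant ∫p²) of a gaussian KDE with
    bandwidth matrix H: ∫p̂² − 2·mean_test p̂, where ∫p̂² is the closed-form
    double sum of gaussians with covariance 2H."""
    term1 = np.mean([gauss_pdf(x_train, xi, 2 * H) for xi in x_train])  # ∫ p̂²
    p_hat = np.mean([gauss_pdf(x_test, xi, H) for xi in x_train], axis=0)
    return term1 - 2 * p_hat.mean()
```

A reasonable bandwidth matrix yields a smaller empirical ISE than a grossly oversmoothed one, which is exactly the comparison the tables below report (in log scale).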

The results over runs on artificial data are summarized in Tables 3 to 8. For the gaussian density (see Table 3), SCV or NR is the best method because these methods make use of the gaussian assumption; however, the performance of ISED is not much worse. Tables 4 to 8 show that when the data density is a mixture of gaussians, ISED always outperforms the other methods. A likely reason is that ISED does not impose any strong assumptions on the data density, while the other methods assume that the data density is gaussian at a certain phase. Thus, the proposed method based on ISED is quite promising. Furthermore, Table 9 confirms that the proposed method also works reasonably well on benchmark data sets.

| | ISED | SCV | NR | PI |
|---|---|---|---|---|
| | −6.073(0.254) | −6.098(0.275) | −6.110(0.241) | −5.940(0.228) |
| | −6.429(0.166) | −6.535(0.201) | −6.504(0.159) | −6.182(0.128) |
| | −7.296(0.131) | −7.361(0.173) | −7.286(0.125) | −6.782(0.101) |
| | −8.040(0.109) | −8.137(0.137) | −8.014(0.090) | −7.295(0.095) |
| | −6.584(0.212) | −6.594(0.246) | −6.609(0.205) | −6.493(0.177) |
| | −6.853(0.153) | −6.932(0.184) | −6.931(0.151) | −6.706(0.117) |
| | −7.647(0.110) | −7.682(0.139) | −7.658(0.109) | −7.284(0.070) |
| | −8.334(0.085) | −8.397(0.109) | −8.341(0.079) | −7.782(0.055) |
| | −6.962(0.207) | −6.962(0.231) | −6.981(0.205) | −6.900(0.183) |
| | −7.198(0.135) | −7.256(0.165) | −7.260(0.135) | −7.089(0.103) |
| | −7.934(0.104) | −7.960(0.134) | −7.950(0.107) | −7.651(0.068) |
| | −8.579(0.076) | −8.628(0.100) | −8.597(0.078) | −8.132(0.045) |

Notes: The numbers in parentheses are their standard deviations. The best method and methods judged to be comparable to the best one by the unpaired *t*-test at the significance level are in bold. *d* and *n* denote the dimensionality of data and number of samples, respectively.

| | ISED | SCV | NR | PI |
|---|---|---|---|---|
| | −6.662(0.206) | −6.383(0.170) | −6.402(0.139) | −6.509(0.183) |
| | −7.441(0.131) | −7.215(0.119) | −7.274(0.106) | −7.178(0.102) |
| | −8.330(0.105) | −8.171(0.096) | −8.205(0.082) | −7.913(0.075) |
| | −9.313(0.123) | −9.207(0.077) | −9.183(0.061) | −8.654(0.082) |
| | −7.126(0.172) | −6.900(0.184) | −6.775(0.124) | −6.992(0.161) |
| | −7.817(0.111) | −7.600(0.124) | −7.597(0.096) | −7.615(0.095) |
| | −8.649(0.085) | −8.459(0.087) | −8.498(0.072) | −8.338(0.059) |
| | −9.590(0.070) | −9.427(0.072) | −9.451(0.060) | −9.082(0.049) |
| | −7.509(0.153) | −7.322(0.172) | −7.070(0.113) | −7.378(0.143) |
| | −8.153(0.108) | −7.948(0.128) | −7.863(0.091) | −7.967(0.094) |
| | −8.915(0.078) | −8.723(0.085) | −8.724(0.067) | −8.647(0.057) |
| | −9.804(0.058) | −9.625(0.061) | −9.652(0.048) | −9.384(0.036) |

Note: See the Table 3 note.

| | ISED | SCV | NR | PI |
|---|---|---|---|---|
| | −6.016(0.190) | −4.834(0.038) | −4.431(0.014) | −4.976(0.045) |
| | −6.790(0.136) | −5.832(0.034) | −5.624(0.016) | −5.924(0.035) |
| | −7.693(0.116) | −6.924(0.027) | −6.826(0.015) | −6.978(0.024) |
| | −8.684(0.129) | −8.072(0.020) | −8.034(0.013) | −8.067(0.020) |
| | −6.495(0.174) | −5.217(0.043) | −4.523(0.011) | −5.354(0.048) |
| | −7.209(0.113) | −6.109(0.034) | −5.704(0.013) | −6.184(0.035) |
| | −8.036(0.085) | −7.124(0.024) | −6.896(0.011) | −7.166(0.022) |
| | −8.973(0.076) | −8.220(0.021) | −8.099(0.012) | −8.228(0.016) |
| | −6.885(0.142) | −5.576(0.046) | −4.599(0.009) | −5.712(0.051) |
| | −7.553(0.108) | −6.360(0.034) | −5.767(0.010) | −6.415(0.033) |
| | −8.330(0.079) | −7.309(0.025) | −6.952(0.010) | −7.332(0.023) |
| | −9.208(0.059) | −8.357(0.019) | −8.148(0.009) | −8.357(0.016) |

Note: See the Table 3 note.

| | ISED | SCV | NR | PI |
|---|---|---|---|---|
| | −6.177(0.193) | −5.233(0.058) | −4.837(0.023) | −5.378(0.065) |
| | −6.982(0.138) | −6.199(0.048) | −5.999(0.025) | −6.301(0.047) |
| | −7.906(0.116) | −7.260(0.037) | −7.174(0.023) | −7.313(0.031) |
| | −8.918(0.102) | −8.382(0.027) | −8.357(0.020) | −8.351(0.029) |
| | −6.641(0.170) | −5.622(0.062) | −4.961(0.018) | −5.746(0.069) |
| | −7.375(0.113) | −6.502(0.047) | −6.110(0.020) | −6.580(0.045) |
| | −8.227(0.082) | −7.488(0.032) | −7.272(0.017) | −7.535(0.028) |
| | −9.185(0.074) | −8.559(0.029) | −8.448(0.017) | −8.556(0.019) |
| | −7.039(0.143) | −5.960(0.058) | −5.059(0.014) | −6.071(0.064) |
| | −7.716(0.103) | −6.757(0.044) | −6.193(0.016) | −6.810(0.039) |
| | −8.501(0.078) | −7.689(0.033) | −7.347(0.015) | −7.716(0.029) |
| | −9.401(0.060) | −8.710(0.025) | −8.515(0.014) | −8.708(0.020) |

Note: See the Table 3 note.

| ISED | SCV | NR | PI |
|---|---|---|---|
| −6.855(0.153) | −6.467(0.131) | −6.008(0.059) | −6.712(0.146) |
| −7.668(0.115) | −7.263(0.097) | −7.060(0.052) | −7.455(0.101) |
| −8.598(0.091) | −8.206(0.067) | −8.146(0.047) | −8.313(0.068) |
| −9.602(0.141) | −9.246(0.049) | −9.257(0.040) | −9.227(0.051) |
| −7.314(0.134) | −7.013(0.136) | −6.278(0.056) | −7.181(0.135) |
| −8.046(0.110) | −7.697(0.101) | −7.277(0.049) | −7.850(0.100) |
| −8.908(0.083) | −8.540(0.075) | −8.329(0.045) | −8.650(0.070) |
| −9.859(0.078) | −9.502(0.055) | −9.418(0.037) | −9.529(0.042) |
| −7.678(0.149) | −7.442(0.162) | −6.501(0.059) | −7.559(0.159) |
| −8.355(0.105) | −8.059(0.113) | −7.463(0.048) | −8.165(0.109) |
| −9.157(0.076) | −8.825(0.070) | −8.480(0.037) | −8.911(0.065) |
| −10.072(0.061) | −9.727(0.055) | −9.548(0.034) | −9.758(0.042) |

Note: See the Table 3 note.

| ISED | SCV | NR | PI |
|---|---|---|---|
| −6.276(0.128) | −5.706(0.077) | −5.350(0.036) | −5.959(0.091) |
| −7.150(0.109) | −6.644(0.061) | −6.483(0.033) | −6.836(0.065) |
| −8.117(0.118) | −7.684(0.043) | −7.633(0.031) | −7.805(0.048) |
| −9.164(0.142) | −8.794(0.034) | −8.797(0.028) | −8.818(0.037) |
| −6.699(0.114) | −6.138(0.081) | −5.524(0.033) | −6.362(0.090) |
| −7.515(0.095) | −6.972(0.059) | −6.625(0.028) | −7.149(0.063) |
| −8.426(0.069) | −7.932(0.045) | −7.755(0.027) | −8.061(0.044) |
| −9.406(0.065) | −8.980(0.035) | −8.907(0.024) | −9.044(0.029) |
| −7.066(0.109) | −6.510(0.083) | −5.668(0.029) | −6.715(0.093) |
| −7.808(0.082) | −7.267(0.061) | −6.747(0.025) | −7.418(0.064) |
| −8.659(0.062) | −8.153(0.043) | −7.855(0.022) | −8.268(0.042) |
| −9.605(0.051) | −9.149(0.033) | −8.994(0.021) | −9.219(0.028) |

Note: See the Table 3 note.

| Dataset | ISED | SCV | NR | PI |
|---|---|---|---|---|
| Housing | −0.049(0.096) | −0.001(0.000) | −0.001(0.000) | −0.012(0.002) |
| Concrete | −0.138(0.016) | −0.012(0.000) | −0.007(0.000) | −0.091(0.004) |
| Yacht | −2.346(0.084) | −0.046(0.003) | −0.028(0.002) | −0.277(0.023) |
| Auto MPG | −0.189(0.046) | −0.029(0.002) | −0.031(0.002) | −0.096(0.008) |
| Servo | −0.008(0.001) | −0.009(0.001) | −0.008(0.001) | −0.010(0.002) |
| Air Foil | −0.770(0.043) | −0.036(0.001) | −0.020(0.001) | −0.087(0.002) |
| Power Plant | −0.036(0.002) | −0.028(0.000) | −0.026(0.000) | −0.032(0.000) |

Note: See the Table 3 note.

### 5.2 Density Derivative Estimation: ISED versus KDE with the Bandwidth Matrix

Recently, several methods to optimize the bandwidth matrix for density derivative estimation have been proposed (Chacón & Duong, 2013). As in density estimation, a conventional approach is a direct extension of the normal reference rule (NR) to density derivative estimation (Chacón et al., 2011), while other approaches select the bandwidth matrix by optimizing cross-validation criteria extended to density derivative estimation, such as unbiased cross-validation (UCV) and smoothed cross-validation (SCV), or by plug-in methods (PI). Here, we compare ISED with KDE using these bandwidth matrix selectors in terms of density derivative estimation by performing the same experiments as in section 3.

The results are presented in Figures 10, 11, and 12. In these experiments, all methods except ISED were run with the R library *ks*, whose details are available in Duong (2007). When the dimensionality of the data is small, the performance of PI and SCV is fairly good, but these methods tend to incur large errors on higher-dimensional data, particularly in second-order derivative estimation. On higher-dimensional data, UCV and NR are comparable to or slightly better than ISED in second-order derivative estimation, while for lower dimensions and for first-order derivative estimation, ISED works better than these methods. Overall, ISED performs reasonably well over a wide range of dimensions and for both first- and second-order density derivative estimation.
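As a reference point for the KDE side of this comparison, the gradient of a gaussian KDE with a full bandwidth matrix H has a simple closed form. The following is a minimal sketch, not the *ks* implementation; the function name and interface are illustrative:

```python
import numpy as np

def kde_gradient(x, data, H):
    """Gradient of a gaussian KDE with bandwidth matrix H at point x.

    The KDE is p(x) = (1/n) * sum_i N(x; x_i, H), and its gradient is
    grad p(x) = -(1/n) * sum_i H^{-1} (x - x_i) N(x; x_i, H).
    """
    n, d = data.shape
    Hinv = np.linalg.inv(H)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(H))
    diffs = x - data                                  # (n, d)
    quad = np.einsum('nd,de,ne->n', diffs, Hinv, diffs)
    kvals = norm * np.exp(-0.5 * quad)                # kernel values, (n,)
    return -(Hinv @ (diffs * kvals[:, None]).sum(axis=0)) / n
```

A bandwidth selector such as NR, SCV, or PI supplies the matrix `H`; the estimator itself is the same for all of them.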

## 6 Future Directions

### 6.1 A Multitask Learning Extension

Here, we extend ISED by applying regularized multitask learning (Caruana, 1997; Evgeniou & Pontil, 2004; Argyriou et al., 2007). We first formulate the objective function according to the extension, and then propose two optimization algorithms. Finally, we discuss a variant of ISED as a limit of the MTL extension.

#### 6.1.1 An Extended Objective Function and Its Usefulness

So far, the density derivative estimator has been derived independently for each partial derivative. However, the partial derivatives share common information because they are computed from the same density function, and exploiting this information might provide better estimates. To this end, Yamane, Sasaki, and Sugiyama (2015) applied regularized multitask learning (MTL) (Caruana, 1997; Evgeniou & Pontil, 2004; Argyriou et al., 2007) to log-density gradient estimation and improved its performance when the sample size is small and the dimensionality of the data is relatively high. Here, we follow their approach and extend it to higher-order density derivative estimation.

In the extended objective function, equation 6.1, the per-derivative ISED objectives are summed over all partial derivatives and augmented with an MTL regularizer that couples their coefficient vectors: the coefficient vectors of all partial derivatives are concatenated into a single vector, the corresponding design matrices are arranged into a block diagonal matrix with one diagonal block per partial derivative, and the regularizer penalizes the differences between the coefficient vectors of different partial derivatives so that related derivatives share information. Minimizing equation 6.1 provides the solution, and minimization algorithms are described next.
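As a concrete illustration of this kind of coupling (a sketch in the Evgeniou-Pontil style, not the exact equation 6.1; `mtl_ridge`, `Gs`, and `hs` are illustrative names), the following solves several ISED-style quadratic objectives jointly, with a regularizer that pulls the per-derivative coefficient vectors toward their mean:

```python
import numpy as np

def mtl_ridge(Gs, hs, lam, gam):
    """Jointly solve T coupled quadratic tasks.

    Each task t has the ISED-style loss theta_t' G_t theta_t - 2 h_t' theta_t.
    The regularizer lam * sum_t ||theta_t||^2
    + gam * sum_t ||theta_t - mean_s(theta_s)||^2 couples the tasks.
    Setting the gradient to zero gives one stacked linear system.
    """
    T, b = len(Gs), Gs[0].shape[0]
    A = np.zeros((T * b, T * b))
    I = np.eye(b)
    for t in range(T):
        for s in range(T):
            if t == s:
                block = Gs[t] + lam * I + gam * (1.0 - 1.0 / T) * I
            else:
                block = -(gam / T) * I
            A[t * b:(t + 1) * b, s * b:(s + 1) * b] = block
    theta = np.linalg.solve(A, np.concatenate(hs))
    return theta.reshape(T, b)
```

With `gam = 0`, the system decouples into independent ridge problems; as `gam` grows, all coefficient vectors are pulled toward a common value.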

#### 6.1.2 Optimization Algorithm

The computational costs of naively minimizing equation 6.1 are high because the coupled system involves a matrix whose size grows with both the number of basis functions and the number of partial derivatives, and we have to compute the inverse of a matrix of that size. An alternative is an iterative algorithm that updates the coefficient vector of each partial derivative in turn; each update involves only a b-by-b matrix, where b is the number of basis functions, which significantly saves memory and, it is hoped, alleviates the computational burden when the number of iterations is not too large.

#### 6.1.3 A Variant of the MTL-Regularized ISED

The advantage of this variant is that all derivatives of any order can be estimated by solving a linear system of equations only once. Thus, it is computationally much more efficient than the original ISED and its MTL extension. In exchange for this computational efficiency, however, the variant sacrifices model flexibility because the number of coefficients in equation 6.4 is reduced from that of the original linear-in-parameter model.

### 6.2 A General Framework for Density Derivative Estimation

Here, we discuss a general framework for density derivative estimation based on Bregman divergences (Bregman, 1967) and show another criterion for density derivative estimation, which includes the ISED criterion as a special case.

#### 6.2.1 Density Derivative Estimation Based on Bregman Divergences

The Bregman (BR) divergence from $t$ to $t^{\ast}$ is defined as
$$\mathrm{BR}_f(t^{\ast}\,\|\,t) = f(t^{\ast}) - f(t) - f'(t)(t^{\ast} - t), \qquad (6.5)$$
where $f$ is a differentiable and strictly convex function and $f'$ denotes the derivative of $f$. Since $f(t) + f'(t)(t^{\ast} - t)$ on the right-hand side of equation 6.5 is the first-order Taylor expansion of $f$ around $t$ evaluated at $t^{\ast}$, the BR divergence measures the discrepancy between $f(t^{\ast})$ and this linear interpolation from $t$, and it is minimized when $t^{\ast} = t$. The BR divergence includes a variety of divergences as special cases by specifying the form of $f$ (Banerjee, Merugu, Dhillon, & Ghosh, 2005).

A suitable choice of $f$ yields a discrepancy whose minimizer is an estimator for density derivatives. Below, we provide an example of $f$ and derive a discrepancy for density derivative estimation.
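To make the definition concrete, here is a small numeric check of equation 6.5; the two choices of $f$ below are standard textbook examples used for this sketch, not choices made in the letter:

```python
import math

def bregman(f, fprime, t_star, t):
    """Pointwise Bregman divergence BR_f(t* || t) = f(t*) - f(t) - f'(t)(t* - t)."""
    return f(t_star) - f(t) - fprime(t) * (t_star - t)

# f(t) = t^2 recovers the squared distance: BR(a || b) = (a - b)^2.
sq = bregman(lambda t: t * t, lambda t: 2 * t, 3.0, 1.0)            # (3 - 1)^2 = 4
# f(t) = t*log(t) - t gives the generalized KL term: nonnegative,
# and zero exactly when t* = t.
kl = bregman(lambda t: t * math.log(t) - t, lambda t: math.log(t), 2.0, 1.0)
zero = bregman(lambda t: t * t, lambda t: 2 * t, 1.5, 1.5)          # 0.0
```

Both examples are nonnegative by the strict convexity of $f$, and both vanish only at $t^{\ast} = t$, as the text states.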

#### 6.2.2 Basu’s Power Divergence for Density Derivatives

When $f(t) = \{t^{1+\alpha} - t\}/\alpha$ with $\alpha > 0$, the BR divergence is reduced to Basu's power divergence (Basu, Harris, Hjort, & Jones, 1998):
$$\mathrm{BR}_{\alpha}(t^{\ast}\,\|\,t) = \frac{1}{\alpha}\left\{(t^{\ast})^{1+\alpha} + \alpha\, t^{1+\alpha} - (1+\alpha)\, t^{\alpha}\, t^{\ast}\right\}. \qquad (6.6)$$
Then the discrepancy for the density derivative $\partial_{j} p$ based on Basu's power divergence is given by
$$D_{\alpha}(g) = \int g(x)^{1+\alpha}\,\mathrm{d}x - \frac{1+\alpha}{\alpha}\int \partial_{j} p(x)\, g(x)^{\alpha}\,\mathrm{d}x, \qquad (6.7)$$
where $g$ is a model of $\partial_{j} p$ and terms independent of $g$ are ignored. Applying integration by parts to the second term, as in equation 2.3, and replacing the resulting expectation over $p$ with the sample average yields the empirical discrepancy
$$\widehat{D}_{\alpha}(g) = \int g(x)^{1+\alpha}\,\mathrm{d}x - \frac{1+\alpha}{\alpha}\,\frac{(-1)^{|j|}}{n}\sum_{i=1}^{n} \partial_{j}\!\left\{g(x_{i})^{\alpha}\right\}, \qquad (6.8)$$
where $|j|$ denotes the order of the derivative. Therefore, minimizing equation 6.8 provides another estimator for density derivatives.

Here, we show a connection to ISED. Interestingly, when $\alpha = 1$, Basu's power divergence is equivalent to the squared distance, and equation 6.8 is reduced to the ISED criterion given by equation 2.3. Thus, ISED is a special case of the minimizer of equation 6.8.
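The reduction at $\alpha = 1$ can be verified numerically. In this sketch, $f(t) = (t^{1+\alpha} - t)/\alpha$ is one standard convex function generating Basu's power divergence (an assumption of the sketch, since the letter's exact parameterization is not shown here):

```python
def basu_f(t, alpha):
    """Convex generator whose Bregman divergence is Basu's power divergence
    (a standard parameterization; see Basu et al., 1998)."""
    return (t ** (1 + alpha) - t) / alpha

def basu_bregman(t_star, t, alpha, eps=1e-6):
    """Bregman divergence of basu_f, using a central-difference derivative."""
    fp = (basu_f(t + eps, alpha) - basu_f(t - eps, alpha)) / (2 * eps)
    return basu_f(t_star, alpha) - basu_f(t, alpha) - fp * (t_star - t)

# At alpha = 1, f(t) = t^2 - t, and the divergence collapses to (t* - t)^2.
val = basu_bregman(3.0, 1.0, alpha=1.0)   # approx (3 - 1)^2 = 4
```

For other values of $\alpha > 0$ the divergence remains nonnegative, consistent with the strict convexity of $f$.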

For larger values of $\alpha$, using the empirical discrepancy, equation 6.8, would emphasize the regions where the density derivative takes large values. However, when $\alpha \neq 1$, equation 6.8 is no longer convex (Basu et al., 1998; Sugiyama et al., 2012), and the solution we obtain might be a local minimum. Thus, in this letter, we focused on the convex case, $\alpha = 1$; the challenging case with $\alpha \neq 1$ will be our future work.

## 7 Conclusion

In this letter, we proposed a method to directly estimate density derivatives without performing density estimation. The proposed method, which we call the integrated squared error for density derivatives (ISED), was formulated as a least-squares problem, and thus its solution can be computed analytically. In addition, a cross-validation method was provided for tuning all hyperparameters in ISED. We theoretically proved that ISED achieves the optimal parametric convergence rate and experimentally showed that ISED tends to outperform a KDE-based method for density derivative estimation. Furthermore, we applied ISED to KL-divergence approximation and bandwidth matrix selection for KDE, and the usefulness of ISED was demonstrated through applications to change detection, feature selection, and density estimation.

Density derivatives are useful quantities and could potentially have more applications in statistical data analysis. Therefore, in addition to the future directions discussed in section 6, we will look for novel applications of ISED.

## Appendix: Proof of Theorem 1

In this appendix, we give the proof of theorem 1 by following Sasaki, Niu, and Sugiyama (2016), which is based on a theory of perturbed optimization (Bonnans & Shapiro, 1998). The proof employs proposition 6.1 in Bonnans and Shapiro (1998), which requires a growth condition (see definition 6.1 in Bonnans & Shapiro, 1998) and Lipschitz continuity. Therefore, we first establish these in lemmas 3 and 4, respectively. Using these lemmas and the proposition, theorem 1 is finally proved.

### A.1 The Second-Order Growth Condition

We establish a second-order growth condition by the following lemma:

### A.2 Lipschitz Continuity

where $M$ is a positive constant. Given an $\epsilon$-ball around the optimal solution, the stated inequality holds for all points in the ball and bounds the norm of the gradient of the difference function; that is, the difference function is Lipschitz continuous on the $\epsilon$-ball with a Lipschitz constant of the same order as the gradient-norm bound.