Abstract

Estimating the derivatives of probability density functions is an essential step in statistical data analysis. A naive approach to estimating the derivatives is to first perform density estimation and then compute the derivatives of the estimate. However, this approach can be unreliable because a good density estimator does not necessarily yield a good density derivative estimator. To cope with this problem, in this letter, we propose a novel method that directly estimates density derivatives without going through density estimation. The proposed method provides computationally efficient estimation of derivatives of any order for multidimensional data, is equipped with a hyperparameter tuning method, and achieves the optimal parametric convergence rate. We further discuss an extension of the proposed method based on regularized multitask learning and a general framework for density derivative estimation based on Bregman divergences. Applications of the proposed method to nonparametric Kullback-Leibler divergence approximation and bandwidth matrix selection in kernel density estimation are also explored.

1  Introduction

The derivatives of probability density functions are important quantities in various problems of statistical data analysis—for instance:

  • Mean shift clustering updates data points toward the modes of an estimated data density by gradient ascent (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002; Sasaki, Hyvärinen, & Sugiyama, 2014). The first-order density-derivative is the essential tool.

  • The second-order density-derivative is the key ingredient of the optimal bandwidth parameter (Silverman, 1986) and the optimal bandwidth matrix (Wand & Jones, 1994b) in kernel density estimation (KDE) for uni- and multivariate data, respectively.

  • The bias of nearest-neighbor Kullback-Leibler (KL) divergence approximation depends on the second-order density derivatives (Noh et al., 2014).

  • An algorithm for independent component analysis was proposed based on the outer products of density gradients (Samarov & Tsybakov, 2004).

  • More statistical problems relying on density derivatives are discussed in Singh (1977a).

Therefore, accurate estimation of density derivatives is a fundamental research topic in statistical data analysis.

Given independent samples from the probability density on , a naive approach to estimate the density derivatives is to first perform density estimation and then to compute its derivatives (Bhattacharya, 1967; Schuster, 1969). By performing KDE with a kernel function K and the bandwidth parameter h,
formula
1.1
the kth order density derivative is computed by
formula
However, this approach can be unreliable because a good density estimator does not necessarily yield a good density derivative estimator. This problem becomes even more critical when higher-order derivatives are computed from lower-order derivative estimates.
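To make this plug-in route concrete, the following sketch differentiates a one-dimensional gaussian KDE analytically; the gaussian kernel, the bandwidth value, and the sample size are illustrative choices rather than part of the original formulation.

    import numpy as np

    def gaussian_kde_with_derivatives(x_eval, samples, h):
        """1-D gaussian KDE and its first two derivatives, obtained by
        differentiating the kernel analytically (the naive plug-in approach)."""
        u = (x_eval[:, None] - samples[None, :]) / h         # scaled differences, shape (m, n)
        phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)     # gaussian kernel values
        n = samples.size
        p = phi.sum(axis=1) / (n * h)                        # density estimate
        dp = (-u * phi).sum(axis=1) / (n * h**2)             # first derivative of the estimate
        d2p = ((u**2 - 1.0) * phi).sum(axis=1) / (n * h**3)  # second derivative of the estimate
        return p, dp, d2p

    rng = np.random.default_rng(0)
    samples = rng.standard_normal(500)
    p, dp, d2p = gaussian_kde_with_derivatives(np.linspace(-3, 3, 61), samples, h=0.4)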
An alternative approach is to directly estimate density derivatives without density estimation. For instance, the estimation is performed by
formula
where is a function with certain conditions (Singh, 1977b). However, this method suffers from the difficulty of selecting the bandwidth parameter h, because the optimal value requires estimating a derivative of even higher order than the target derivative (Singh, 1981, 1987).

In this letter, we propose a novel density derivative estimator with the following utilities:

  • Multidimensional density derivatives of any order can be directly estimated without density estimation.

  • The solution of the estimator can be efficiently computed in a closed form.

  • A cross-validation method is available for tuning all hyperparameters in the estimator.

The estimator is proposed as the minimizer of the integrated squared error (ISE) to the true density derivative and is called ISE for density derivatives (ISED). We theoretically prove that ISED achieves the optimal parametric convergence rate and experimentally demonstrate that ISED significantly outperforms a KDE-based density derivative estimator, especially for higher-dimensional data. Furthermore, as future directions of this work, we discuss an extension of ISED based on regularized multitask learning (Caruana, 1997; Evgeniou & Pontil, 2004; Argyriou, Evgeniou, & Pontil, 2007) and a rather general framework for density derivative estimation based on Bregman divergences (Bregman, 1967), in which another estimator for density derivatives is derived as a generalization of ISED.

We further apply ISED to KL-divergence approximation and the optimal bandwidth matrix selection for KDE. Unlike previous work (Noh et al., 2014; Duong & Hazelton, 2005; Chacón & Duong, 2010), in both applications, we estimate the second-order derivatives directly by ISED without making use of any parametric models or any assumptions on the data density. The experimental results confirm the usefulness of ISED on various tasks, such as change detection, feature selection, and density estimation.

This letter is organized as follows. Section 2 formulates the problem of density derivative estimation and derives a novel estimator. A theoretical analysis of the proposed estimator is also provided. The performance of the proposed estimator is numerically illustrated in section 3. Applications to KL-divergence approximation and bandwidth matrix selection for KDE are explored in sections 4 and 5, respectively. Future directions of this work are discussed in section 6. Section 7 concludes this letter.

2  Direct Density Derivative Estimation

In this section, we propose a novel direct estimator for density derivatives.

2.1  Problem Formulation

Suppose that independent and identically distributed samples from the unknown density on are available. Our goal is to estimate the kth order partial derivative of ,
formula
2.1
where , denotes the transpose, for , and . When (or ), corresponds to a single element in the gradient vector (or the Hessian matrix) of .

2.2  Integrated Squared Error for Density Derivatives

Our approach is to directly fit a model to the true density derivative under the integrated squared error for density derivatives (ISED),
formula
2.2
where . To estimate , we derive the empirical version of . The first term in equation 2.2, which only includes the model specified by the user, is accessible. The second term seems inaccessible at first glance, but integration by parts allows us to transform it as
formula
where denotes the integration except for x1 and the last equality holds under the mild assumption that the tails of approach zero. The right-hand side of the last equality shows that integration by parts decreases the order of the true density derivative by one by differentiating the model instead. Thus, repeatedly applying integration by parts k times gives the following equation:
formula
By approximating the second term by the sample average, we reach the empirical version of ISED as
formula
2.3
As a density derivative model , we use a linear-in-parameter model with the gaussian kernel,
formula
where b denotes the number of basis functions and equals in the experiments of this letter, and center points are fixed to be b data samples randomly chosen from . Then the kth order derivative of the model is given by
formula
Substituting these formulas into equation 2.3 and adding the -regularizer yield a quadratic objective function,
formula
2.4
where
formula
The minimizer of equation 2.4 can be obtained analytically as
formula
where Ib denotes the b by b identity matrix. Finally, a density derivative estimator is obtained as
formula

We call this method the integrated squared error for density derivatives (ISED). It generalizes least-squares density-difference estimation (Kim & Scott, 2010; Sugiyama et al., 2013) and extends the direct estimator for log-density gradients (Cox, 1985; Sasaki et al., 2014) to higher-order derivatives.
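Since the displayed formulas are not reproduced in this version, the following sketch fixes one concrete instantiation of the construction above: a gaussian-kernel linear-in-parameter model, estimation of a single first-order partial derivative (k = 1), and the closed-form ridge solution obtained from the integration-by-parts trick. Normalization constants may differ from the letter's exact expressions.

    import numpy as np

    def ised_first_derivative(X, dim, sigma, lam, n_centers=100, seed=0):
        """Directly estimate the partial derivative of p(x) with respect to x_dim.

        Model: g(x) = sum_l theta_l * exp(-||x - c_l||^2 / (2 sigma^2)).
        Empirical objective (up to a constant): theta' G theta - 2 h' theta + lam ||theta||^2,
        with G_{ll'} = integral of psi_l(x) psi_{l'}(x) dx  (closed-form gaussian integral)
        and  h_l    = (-1)^k * (1/n) sum_i d^k psi_l(x_i), here with k = 1.
        """
        rng = np.random.default_rng(seed)
        n, d = X.shape
        centers = X[rng.choice(n, size=min(n_centers, n), replace=False)]
        b = len(centers)

        # Gram matrix of the gaussian basis functions (analytic product integral).
        diff_cc = centers[:, None, :] - centers[None, :, :]
        G = (np.pi * sigma**2) ** (d / 2) * np.exp(-np.sum(diff_cc**2, axis=2) / (4 * sigma**2))

        # Sample average of the first-order derivative of each basis function.
        diff_xc = X[:, None, :] - centers[None, :, :]                # shape (n, b, d)
        psi = np.exp(-np.sum(diff_xc**2, axis=2) / (2 * sigma**2))   # shape (n, b)
        dpsi = -(diff_xc[:, :, dim] / sigma**2) * psi                # d psi_l / d x_dim at x_i
        h = -dpsi.mean(axis=0)                                       # sign factor (-1)^k with k = 1

        theta = np.linalg.solve(G + lam * np.eye(b), h)              # closed-form ridge solution

        def derivative_estimate(Xq):
            dq = Xq[:, None, :] - centers[None, :, :]
            return np.exp(-np.sum(dq**2, axis=2) / (2 * sigma**2)) @ theta

        return derivative_estimate

    # Example: one gradient component of the standard normal (true value: -x_dim * p(x)).
    X = np.random.default_rng(1).standard_normal((500, 2))
    g_hat = ised_first_derivative(X, dim=0, sigma=1.0, lam=1e-3)

For higher-order derivatives, only the sample-average term changes: the kth derivative of the gaussian basis function is used together with the sign factor (-1)^k, while the Gram matrix and the ridge solution keep the same form.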

2.3  Model Selection by Cross-Validation

The performance of the proposed method depends on the choice of model parameters such as the gaussian width and regularization parameter . Below, we describe a cross-validation method to select the optimal model:

  • Step 1. Divide the samples into T disjoint subsets .

  • Step 2. Obtain an estimator using (i.e., without ), and then compute the holdout ISED criterion for as
    formula
    2.5
    where denotes the number of elements in .
  • Step 3. Choose the model that minimizes .

This CV method is essentially the same as the one proposed by Härdle, Marron, and Wand (1990) for density derivative estimation based on KDE.
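Under the same first-order instantiation as the sketch in section 2.2, the holdout ISED criterion of equation 2.5 can be evaluated as below; the fold construction and the candidate grids are placeholders to be supplied by the user.

    import numpy as np

    def holdout_ised_score(theta, centers, sigma, dim, X_holdout):
        """Holdout ISED criterion for a first-order derivative estimate:
        theta' G theta - 2 h_holdout' theta, with h_holdout computed from holdout samples only."""
        d = centers.shape[1]
        diff_cc = centers[:, None, :] - centers[None, :, :]
        G = (np.pi * sigma**2) ** (d / 2) * np.exp(-np.sum(diff_cc**2, axis=2) / (4 * sigma**2))
        diff_xc = X_holdout[:, None, :] - centers[None, :, :]
        psi = np.exp(-np.sum(diff_xc**2, axis=2) / (2 * sigma**2))
        dpsi = -(diff_xc[:, :, dim] / sigma**2) * psi
        h = -dpsi.mean(axis=0)
        return theta @ G @ theta - 2.0 * h @ theta

In practice, theta and centers come from the estimator fitted on the remaining folds, the score is averaged over the T folds for each candidate pair of gaussian width and regularization parameter, and the pair with the smallest average score is selected.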

2.4  Theoretical Analysis

Here, we theoretically investigate the behavior of ISED. To proceed with the theoretical analysis, we define the optimal solution and estimator by
formula
where
formula

Our result of theoretical analysis is given in the following theorem:

Theorem 1.
Assume that is a positive definite matrix and converges to zero in . Given the optimal derivative and estimated derivative from a set of data samples of size n, then
formula
as , where denotes the probabilistic order.

The proof is found in the appendix.

Theorem 1 asserts that ISED is a consistent estimator and achieves the optimal parametric convergence rate under some assumptions. However, in practice, the positive definite assumption on might not always be satisfied. This problem can be avoided by adding the regularizer with an appropriate regularization parameter . Under this setting, we define the optimal solution and estimator as
formula
Then, we obtain a similar statement as a corollary of theorem 1:
Corollary 1.
Given the optimal derivative and estimated derivative from a set of data samples of size n, then as ,
formula
provided that converges in to the nonzero limits such that , and there exists such that .

3  Illustration of Density Derivative Estimation

In this section, we numerically illustrate the behavior of ISED on artificial data and compare it with a KDE-based density derivative estimator.

In this illustration, data samples were drawn from the standard normal density. We performed model selection based on cross-validation as described in section 2.3: the gaussian bandwidth parameter was chosen from nine candidates from to at regular intervals on a logarithmic scale, and the regularization parameter was chosen from nine candidates from to at regular intervals on a logarithmic scale. For KDE, the gaussian kernel was employed, and the bandwidth parameter was chosen as follows:

  • KDE_S. The bandwidth parameter was cross-validated with respect to the holdout least-squares error of the estimated density to the true density.

  • KDE_I. The bandwidth parameter was cross-validated with respect to the holdout ISED criterion, equation 2.5.

For both KDE variants, we chose the bandwidth parameter from nine candidates from to at regular intervals on a logarithmic scale. The cross-validation method of KDE_I is the same as the one that Härdle et al. (1990) proposed, and in the cross-validation, we used the same candidates as for ISED.

Figures 1a and 1b show that ISED produces more accurate estimates than both KDE_S and KDE_I in density derivative estimation. Figure 1c depicts the density estimate obtained by integrating the ISED estimate of the first-order density derivative, indicating that ISED approximates the density function well up to an unknown constant. KDE_S works well in density estimation but produces considerable errors in density derivative estimation, particularly for the second-order derivative. This result highlights that a good density estimator is not necessarily a good density derivative estimator. The density derivative estimates from KDE_I are more accurate than those from KDE_S, but they are not as accurate as the ISED estimates. Furthermore, KDE_I provides an oversmoothed density estimate, which is less accurate than the ISED-based one.

Figure 1:

Illustrative examples of (a,b) density derivative estimation and (c) density estimation. (d,e) Normalized mean squared errors in density derivative estimation as functions of the data dimension.

Next, we investigate the scalability of ISED with respect to the data dimension for the standard normal as the target density. The performance was measured by the normalized mean squared error (MSE),
formula
where denotes the summation over all elements in the gradient vector (or the Hessian matrix) when (or ). We used common and values in ISED for all elements in the gradient vector or Hessian matrix, which were selected by cross-validation with respect to the holdout ISED criterion, equation 2.5, summed over all elements. Figures 1d and 1e show that ISED provides more accurate density derivative estimates than KDE, especially in higher-dimensional problems.
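For reference, the true gradient of the standard normal density used in this benchmark is grad p(x) = -x p(x), and one plausible form of the normalized MSE is sketched below; the letter's exact normalization is not reproduced in this version.

    import numpy as np

    def std_normal_gradient(X):
        """True gradient of the d-dimensional standard normal density: -x * p(x)."""
        d = X.shape[1]
        p = np.exp(-0.5 * np.sum(X**2, axis=1)) / (2.0 * np.pi) ** (d / 2)
        return -X * p[:, None]

    def normalized_mse(estimates, truths):
        """Squared error summed over all gradient elements and evaluation points,
        normalized by the squared norm of the true values (one plausible convention)."""
        return np.sum((estimates - truths) ** 2) / np.sum(truths ** 2)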

In addition to the standard normal density, we performed the same experiments using a mixture of two gaussians and an independent log-normal density:

  • For the mixture of two gaussians, all elements in one mean vector were 2, while the elements in the other mean vector were . The covariance matrices in the two gaussians were the identity matrix, and the mixing coefficients were both .

  • For the independent log-normal density, the joint density was given by
    formula
    3.1
    In this experiment, we set and for all j.

For cross-validation of ISED and KDE_I on the log-normal density, we set nine candidates for their bandwidth parameters from to at regular intervals on a logarithmic scale because the scale differs from that of the other densities. For KDE_S, we set nine candidates from to for the log-normal density. As presented in Figures 2 and 3, the results further substantiate the superior performance of ISED.

Figure 2:

Normalized mean squared errors in density derivative estimation as functions of the data dimension when the data density is a mixture of two gaussians.

Figure 3:

Normalized mean squared errors in density derivative estimation as functions of the data dimension when the data density is a log-normal density.

4  KL-Divergence Approximation

In this section, we apply ISED to KL-divergence approximation.

4.1  Nearest-Neighbor KL-Divergence Approximation

We consider the problem of approximating the KL-divergence between two densities and ,
formula
from the two sets of samples and drawn independently from and , respectively. The KL-divergence plays a key role in various tasks such as two-sample homogeneity testing (Kanamori, Suzuki, & Sugiyama, 2012), feature selection (Brown, 2009), and change detection (Liu, Yamada, Collier, & Sugiyama, 2013). In this letter, we focus on a nonparametric KL-divergence approximator based on nearest-neighbor density estimation (NNDE) (Wang, Kulkarni, & Verdu, 2006) given by
formula
4.1
where and denote the distance from the nearest sample except for in and , respectively.
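In its one-nearest-neighbor form, this approximator can be written down compactly, as in the sketch below with a Euclidean metric; it is exactly this metric that the learning procedure in the following sections replaces, and the additive constant follows the standard form of the estimator.

    import numpy as np
    from scipy.spatial import cKDTree

    def nn_kl_divergence(X1, X2):
        """1-nearest-neighbor KL-divergence approximator of KL(p1 || p2):
        (d/n) * sum_i log(nu_i / rho_i) + log(m / (n - 1)),
        where rho_i is the nearest-neighbor distance of x_i within X1 (excluding itself)
        and nu_i is its nearest-neighbor distance to X2."""
        n, d = X1.shape
        m = X2.shape[0]
        rho = cKDTree(X1).query(X1, k=2)[0][:, 1]   # k=2 skips the zero self-distance
        nu = cKDTree(X2).query(X1, k=1)[0]
        return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))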

4.2  Metric Learning with Parametric Bias Estimation for NNDE-Based KL-Divergence Approximation

In practical situations, nonparametric approximators tend to suffer from high bias because of the finite number of samples. Indeed, Noh et al. (2014) showed that the bias of the NNDE-based KL-divergence approximator at is approximately proportional to
formula
where and are the Hessian matrices.
One way to reduce the bias is to learn an appropriate distance metric in the KL-divergence approximator, equation 4.1; since the KL-divergence itself is metric invariant, changing the metric alters only the bias of the approximator, not the target quantity. In Noh et al. (2014), the best local Mahalanobis metric at that minimizes the estimated bias was given as the solution of the following optimization problem:
formula
where denotes the determinant of , means that is positive definite, and
formula
It was shown that the solution is given analytically up to a scaling factor as
formula
where and are the diagonal matrices that contain positive and negative eigenvalues of , respectively. The matrices and share the same eigenvectors, and and are collections of the eigenvectors corresponding to the eigenvalues in and , respectively.

To compute the bias matrix , Noh et al. (2014) assumed that p1 and p2 are both nearly gaussian, and estimated densities p1 and p2 as well as their Hessian matrices and using the gaussian parametric model with maximum likelihood estimation. The accuracy of NNDE-based KL-divergence approximation was significantly improved when p1 and p2 are nearly gaussian.

4.3  Metric Learning with Nonparametric Bias Estimation by ISED for NNDE-Based KL-Divergence Approximation

The evident limitation of the above method is the gaussian assumption on p1 and p2 when estimating . Thus, its applicability would be rather limited. To extend the applicability, a naive approach to nonparametrically estimate is to first perform nonparametric density estimation for p1 and p2 separately, then compute their Hessian matrices and , and finally plug them into the definition of . However, such a plug-in approach can be unreliable because a good density estimator does not necessarily mean a good estimator of its Hessian matrix, as already demonstrated in section 3. In addition, division by estimated densities in could significantly magnify the estimation errors of Hessian matrices.

To avoid density estimation and division, we use the following rescaled matrix because the scale of does not affect the estimation of :
formula
4.2
We then estimate the Hessian matrices and by ISED and the density ratio by the unconstrained least-squares density-ratio estimator (Kanamori, Hido, & Sugiyama, 2009) that directly estimates the density ratio in a nonparametric way without estimating each density and computing the division. Thus, we perform metric learning based on in a completely nonparametric way without going through density estimation.
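A minimal sketch of the direct density-ratio step is given below in the style of unconstrained least-squares fitting (Kanamori et al., 2009); the gaussian kernel, the centering scheme, and the nonnegativity clipping are illustrative choices, not the letter's exact settings.

    import numpy as np

    def least_squares_density_ratio(X1, X2, sigma, lam, n_centers=100, seed=0):
        """Directly fit r(x) = p1(x)/p2(x) by unconstrained least squares:
        minimize (1/2) theta' H theta - h' theta + (lam/2) ||theta||^2, where
        H is the second moment of the basis under p2 and h its mean under p1."""
        rng = np.random.default_rng(seed)
        centers = X1[rng.choice(len(X1), size=min(n_centers, len(X1)), replace=False)]

        def design(X):
            diff = X[:, None, :] - centers[None, :, :]
            return np.exp(-np.sum(diff**2, axis=2) / (2 * sigma**2))

        Phi1, Phi2 = design(X1), design(X2)
        H = Phi2.T @ Phi2 / len(X2)
        h = Phi1.mean(axis=0)
        theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
        return lambda X: np.maximum(design(X) @ theta, 0.0)   # density ratios are nonnegative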

4.4  Illustration of KL-Divergence Approximation

Here, we compare the performance of the proposed KL-divergence approximator based on ISED with the following approximators:

  • NN: NNDE-based KL-divergence approximator without metric learning (Wang et al., 2006)

  • NNG: NNDE-based KL-divergence approximator with gaussian-based metric learning (Noh et al., 2014)

  • KDE: NNDE-based KL-divergence approximator with KDE-based metric learning

  • Ratio: Density-ratio-based nonparametric KL-divergence approximator (Nguyen, Wainwright, & Jordan, 2010)

  • fRisk: Risk-based nearest-neighbor KL-divergence approximator (Garcia-Garcia, von Luxburg, & Santos-Rodriguez, 2011)

  • GP: Gaussian parametric KL-divergence approximator with maximum likelihood estimation.

We generated data samples using the generalized gaussian distribution, which is defined by
formula
where denotes the mean, controls the variance, and controls the gaussianity: , , and correspond to supergaussian, gaussian, and subgaussian distributions, respectively. For p1 and p2, we set
formula
where the value of is selected so that the variance is one. For the cross-validation of ISED, we used nine candidates from to for and from to for at regular intervals on a logarithmic scale. For KDE, the density derivative estimation and the density estimation were separately performed by KDE_I and KDE_S, respectively, and these estimates were substituted into equation 4.2. The bandwidth parameters in both KDE_I and KDE_S were cross-validated as in the previous experiment. The true KL-divergence value was approximated from samples by the sample average.
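For completeness, unit-variance samples from the generalized gaussian family can be drawn with scipy.stats.gennorm as sketched below; the particular shape and mean values are examples, and the variance normalization follows the standard formula for this family.

    import numpy as np
    from scipy.stats import gennorm
    from scipy.special import gamma

    def sample_generalized_gaussian(n, d, beta, mu=0.0, seed=0):
        """Draw n samples of dimension d whose coordinates independently follow the
        generalized gaussian with shape beta, mean mu, and unit variance."""
        scale = np.sqrt(gamma(1.0 / beta) / gamma(3.0 / beta))  # makes each coordinate's variance one
        return gennorm.rvs(beta, loc=mu, scale=scale, size=(n, d), random_state=seed)

    X1 = sample_generalized_gaussian(500, 5, beta=2.0)            # gaussian case
    X2 = sample_generalized_gaussian(500, 5, beta=1.0, mu=0.5)    # supergaussian (Laplace) case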

Figure 4 shows the experimental results of KL-divergence approximation for the generalized gaussian density as functions of the data dimension d when . In the figures, the relative KL-divergence is defined as the approximated KL-divergence minus the true KL-divergence value. For all data, ISED works well over a wide range of dimensions. NN, NNG, and GP produce good performance for supergaussian and gaussian data, but they strongly underestimate the KL-divergence for subgaussian data. The performance of KDE is not good for any of the data or shows high variance, especially for high-dimensional data. fRisk is comparable to ISED for subgaussian data but overestimates the KL-divergence for the other data. Ratio systematically underestimates it for all data. Figure 5 shows that as the sample size n increases, ISED tends to produce more accurate estimates for all data. For gaussian data, ISED is comparable even to GP and NNG, both of which assume the gaussian density. NN and Ratio underestimate the KL-divergence for all data, while fRisk overestimates it for supergaussian and gaussian data. Figure 6 further illustrates that ISED performs well for different mean parameter values.

Figure 4:

KL-divergence approximation as functions of the data dimension d over runs when and . The relative KL-divergence means the approximated KL-divergence minus the true KL-divergence value.

Figure 5:

KL-divergence approximation as functions of the sample size n over runs when and .

Figure 6:

KL-divergence approximation as functions of the mean parameter over runs when and .

Moreover, we investigate the performance of ISED using mixtures of two gaussians and log-normal densities:

  • Using a mixture of two gaussians, data are generated from the following densities,
    formula
    where denotes a mixture of two gaussians whose means are 3 and and covariance matrices are identity. is also a mixture of two gaussians, but their means are 2 and . The mixing coefficients are both in and .
  • For the log-normal densities,
    formula
    where denotes the log-normal density defined by equation 3.1.

For the mixtures of two gaussians, the hyperparameters in ISED were cross-validated as in the previous experiments, but for the log-normal densities, we used nine candidates from to for because the scale differs from that of the other densities. The results presented in Figure 7 indicate that the ISED-based KL-divergence approximator performs well for both densities.

Figure 7:

KL-divergence approximation for (a) mixtures of two gaussians and (b) log-normal densities as functions of the data dimension d over runs when .

4.5  Distributional Change Detection

The goal of change detection is to find abrupt changes in time series data. Here, we extract a time window vector at time t, and a collection of time window vectors is obtained by sliding the windows as
formula
Following Liu et al. (2013), we assume the existence of an underlying density function that generates r retrospective vectors in . To find change points, we use as a change score the KL-divergence between the underlying density functions of the two sets and for every t: if the KL-divergence between and exceeds a threshold, the point is declared a change point. In this experiment, we set and .
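The retrospective-window construction and scoring can be sketched as follows; kl_divergence is a placeholder standing in for the ISED-based approximator of the previous sections, and the thresholding step is omitted.

    import numpy as np

    def change_scores(series, k, r, kl_divergence):
        """Slide a length-k window over a 1-D series, group r consecutive window
        vectors into a sample set, and score each time point by the KL-divergence
        between the sample sets immediately before and after it."""
        windows = np.stack([series[t:t + k] for t in range(len(series) - k + 1)])
        scores = []
        for t in range(r, len(windows) - r + 1):
            past, future = windows[t - r:t], windows[t:t + r]
            scores.append(kl_divergence(past, future))
        return np.array(scores)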

We use the Human Activity Sensing Consortium (HASC) Challenge 2011 data collection,1 which provides human activity information collected by a portable three-axis accelerometer. Our task is to segment different activities such as “stay,” “walk,” “jog,” and “skip.” As in Liu et al. (2013), we took the -norm of three-dimensional accelerometer data and obtained one-dimensional data because the orientation of the accelerometer is not necessarily fixed.

Two examples of time series data and the KL-divergence approximated by ISED are illustrated in Figure 8. The figure shows that the approximated KL-divergence tends to be large around the change points, and thus the KL-divergence can be used as a change score. Table 1 summarizes the results over 10 runs, where a point was classified as a change point if it was within a small tolerance region ( ms around an “exact” change point) and the performance was measured by the area under the ROC curve (AUC). From the table, we can confirm that ISED significantly outperforms GP and NNG and is comparable to fRisk. The best performance of fRisk could be interpreted in the following way: as reported by Noh et al. (2014), fRisk produces similar estimated KL-divergence values across different data densities. This property works as a “regularizer” that stabilizes the change scores, although fRisk itself is a poor KL-divergence approximator.

Figure 8:

HASC time series data (top) and the KL-divergence approximated by ISED (bottom). The green cross-symbols indicate the true change points.

Table 1:
Means and Standard Deviations of the Area under the ROC Curve (AUC) over Ten Runs.
GP            NNG           fRisk         ISED
0.747(0.050) 0.822(0.030) 0.858(0.022) 0.837(0.027) 

Note: The best method and methods comparable to the best one in terms of the mean AUC by the unpaired t-test with significance level are highlighted in bold.

4.6  Information Theoretical Feature Selection

Next, we apply the proposed KL-divergence approximator to select relevant features in classification. One of the conventional criteria for selecting features from data with binary class labels is the t-score, defined by
formula
where (and ) and (and ) are the mean value and standard deviation of the class (and ), respectively. However, the limitation of the t-score is that it does not take into account the correlation and redundancy between features. As a more recent alternative criterion, the Jensen-Shannon (JS) divergence has been used (Bontempi & Meyer, 2010) as
formula
where . The JS divergence equals the Shannon mutual information between the binary class label y and the features . Previously, the JS divergence was approximated parametrically (Bontempi & Meyer, 2010); here, we approximate it nonparametrically using the proposed KL-divergence approximator.
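One way to assemble the JS divergence from KL terms is sketched below, where kl_divergence stands in for the proposed approximator and the pooled sample is used to represent the class-marginal density; the exact construction used in the experiments may differ.

    import numpy as np

    def js_divergence(X_pos, X_neg, kl_divergence):
        """JS divergence (mutual information between the binary label and the features),
        assembled from KL terms measured against the pooled, class-marginal sample."""
        n_pos, n_neg = len(X_pos), len(X_neg)
        pi_pos = n_pos / (n_pos + n_neg)
        X_all = np.vstack([X_pos, X_neg])                 # samples representing the marginal
        return (pi_pos * kl_divergence(X_pos, X_all)
                + (1.0 - pi_pos) * kl_divergence(X_neg, X_all))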

In this experiment, two gene expression data sets of breast cancer prognosis studies, SMK-CAN-187 (Freije et al., 2004) and VANTVEER (van’t Veer et al., 2002), were used. The SMK-CAN-187 data set contains 90 positive (alive) and 97 negative (dead after five years) samples with 19,993 features. We randomly selected 65 samples per class for training and used the rest for evaluating the classification performance. The VANTVEER data set contains 46 positive and 51 negative samples with 24,481 features. Thirty-five samples per class were randomly selected for training, and the rest were used for testing.

For comparison, we applied the following methods to the data sets:

  • ISED: A nonparametric JS divergence approximator based on the proposed KL-divergence approximator

  • NNG: A nonparametric JS divergence approximator based on the NNDE-based KL-divergence approximator with gaussian-based metric learning (Noh et al., 2014)

  • NN: A nonparametric JS divergence approximator based on the NNDE-based KL-divergence approximator (Wang et al., 2006)

  • mIMR: A parametric JS divergence approximator (Bontempi & Meyer, 2010)

  • t-score: A conventional feature selection criterion

We chose features based on the forward selection strategy and measured the performance by the AUC of classification. The results are shown in Table 2, confirming the reasonable performance of the proposed method in feature selection.

Table 2:
Means and Standard Deviations of the Area under the ROC Curve (AUC) for Two Data Sets over 50 Runs.
Data Set      ISED          NNG           NN            mIMR          t-score
SMK-CAN-187 0.721(0.008) 0.739(0.008) 0.739(0.007) 0.732(0.008) 0.707(0.010) 
VANTVEER 0.753(0.008) 0.723(0.011) 0.729(0.010) 0.718(0.011) 0.702(0.011) 

Note: The best method and methods comparable to the best one in terms of the mean AUC by the unpaired t-test with significance level are highlighted in bold.

5  Kernel Density Estimation with the Bandwidth Matrix

In this section, we compare the proposed method to KDE with the bandwidth matrix in terms of both density and density derivative estimation.

5.1  Density Estimation

Here, we first review the existing methods for bandwidth matrix selection in KDE and then propose a method for bandwidth matrix selection based on ISED.

5.1.1  Existing Methods for Bandwidth Matrix Selection

Estimating probability density functions is a general approach to solving various problems in statistical data analysis. KDE is one of the most popular methods for density estimation and has been used in a wide range of fields. A general and flexible form of KDE includes the bandwidth matrix as
formula
5.1
where is a symmetric positive definite matrix. The key challenge is the bandwidth matrix selection, which is even more difficult than the single bandwidth parameter selection.
Compared with single-bandwidth selection, the literature on bandwidth matrix selection is relatively sparse. One approach is to directly extend cross-validation criteria for the single bandwidth parameter to the bandwidth matrix. For instance, Duong and Hazelton (2005) derived an unbiased CV criterion (Rudemo, 1982; Bowman, 1984) for the bandwidth matrix as
formula
where denotes convolution, , and
formula
Furthermore, the biased CV (Scott & Terrell, 1987; Jones & Kappenman, 1992) and smoothed CV (SCV) (Hall, Marron, & Park, 1992) criteria for the single bandwidth parameter were also extended in Duong and Hazelton (2005). To estimate the bandwidth matrix, each criterion is numerically minimized, which generally depends on initialization of . In Duong and Hazelton (2005), based on the normal reference rule (Silverman, 1986), an initial guess for is computed as
formula
where is the sample covariance matrix. However, when the true data density is far from the gaussian density, this initialization scheme may not work well.
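For the gaussian kernel, the unbiased CV criterion has a well-known closed form in which the convolution term becomes a gaussian with covariance 2H; the sketch below evaluates it for one candidate bandwidth matrix (the numerical search over H, and the letter's exact notation, are not reproduced here).

    import numpy as np
    from scipy.stats import multivariate_normal

    def ucv_criterion(X, H):
        """Unbiased (least-squares) CV criterion for a gaussian KDE with bandwidth matrix H:
        (1/n^2) sum_{i,j} phi_{2H}(xi - xj) - 2/(n(n-1)) sum_{i != j} phi_H(xi - xj)."""
        n, d = X.shape
        diff = (X[:, None, :] - X[None, :, :]).reshape(-1, d)
        phi_2H = multivariate_normal(np.zeros(d), 2 * H).pdf(diff).reshape(n, n)
        phi_H = multivariate_normal(np.zeros(d), H).pdf(diff).reshape(n, n)
        off_diag_sum = phi_H.sum() - np.trace(phi_H)
        return phi_2H.sum() / n**2 - 2.0 * off_diag_sum / (n * (n - 1))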
Another way to select the bandwidth matrix is to minimize the asymptotic mean integrated squared error (AMISE), which was derived in Wand and Jones (1994a):
formula
5.2
where , denotes the vectorization of , and
formula
Equation 5.2 can be equivalently expressed as
formula
5.3
where denotes the vectorization of the elements in the lower-triangle part of , is a by matrix with elements given by
formula
The difficulty in using AMISE is that it includes the derivatives of the true data density. Chacón, Duong, and Wand (2011) assumed that the data density is gaussian, computed the derivatives of the gaussian density, and estimated in a closed form. However, the gaussian assumption can be easily violated in practice.
A more flexible approach to minimize AMISE is to employ pilot estimates of density derivatives in (Chacón & Duong, 2010). The pilot estimates are obtained by computing the derivatives of another KDE using a kernel function L with a bandwidth matrix . A key challenge in this approach is to estimate . Chacón and Duong (2010) estimated as a minimizer of the following mean squared error (MSE),
formula
5.4
where is the vector including as an element and is its estimate by the KDE. To improve pilot estimates, Chacón and Duong (2010) further performed two-stage estimation for : at the first stage, is estimated by minimizing MSE for the higher-order , and the estimated , denoted by , is used to approximate at the second stage. Then, , the final estimate of , is obtained by minimizing MSE, equation 5.4. However, to obtain at the first stage, needs to be known. To cope with this problem, Chacón and Duong (2010) computed under the assumption that is gaussian, but the gaussian assumption does not always work well in general.

5.1.2  Bandwidth Matrix Selection Based on ISED

As reviewed above, the existing methods for bandwidth matrix selection rely on the gaussian assumption at some stage. The method proposed in Horová, Koláček, and Vopatová (2013) does not employ gaussian assumptions, but its practical algorithm restricts the structure of the bandwidth matrix and reduces bandwidth matrix selection to single-bandwidth selection. Unlike these methods, we estimate the bandwidth matrix without making any strong assumptions on the data density or restricting the structure of .

We follow the approach of minimizing AMISE but remove the gaussian assumption in previous works: the Hessian matrix is estimated by ISED, and the estimated Hessian matrix is plugged into to compute AMISE, equation 5.2. Since the bandwidth matrix must be symmetric positive definite (SPD), using the MATLAB software Manopt (Boumal, Mishra, Absil, & Sepulchre, 2014), we perform manifold optimization to minimize AMISE, equation 5.2, with respect to on the SPD manifold.
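If a manifold optimization toolbox is not at hand, the SPD constraint can also be handled by optimizing an unconstrained Cholesky factor, as sketched below; amise_of_H is a placeholder for the AMISE objective of equation 5.2 assembled from the ISED Hessian estimates.

    import numpy as np
    from scipy.optimize import minimize

    def optimize_bandwidth_matrix(amise_of_H, H_init):
        """Minimize an AMISE-type objective over symmetric positive (semi)definite
        bandwidth matrices by parameterizing H = L L^T with a lower-triangular L."""
        d = H_init.shape[0]
        tril = np.tril_indices(d)

        def unpack(params):
            L = np.zeros((d, d))
            L[tril] = params
            return L @ L.T

        x0 = np.linalg.cholesky(H_init)[tril]
        res = minimize(lambda p: amise_of_H(unpack(p)), x0, method="Nelder-Mead")
        return unpack(res.x)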

5.1.3  Numerical Illustration on Artificial and Benchmark Data

Here, we experimentally investigate the performance of the proposed bandwidth matrix selection method. For comparison, we apply the following methods to the same data sets:

  • ISED: A bandwidth matrix selector minimizing AMISE with the ISED plug-in estimator for

  • SCV: A bandwidth matrix selector based on minimizing the smoothed cross-validation criterion (Hall et al., 1992; Duong & Hazelton, 2005)

  • NR: A bandwidth matrix selector minimizing AMISE based on the normal reference rule (gaussian assumption) for the data density (Chacón et al., 2011)

  • PI: A bandwidth matrix selector minimizing AMISE with pilot estimates for (Chacón & Duong, 2010)

Except for ISED, we used the R library ks whose details are available in Duong (2007).

Data samples were drawn from the following six densities:

  • Density 1: A gaussian density with mean and covariance matrix . has a tridiagonal shape: the diagonals are 1, for all i are and the other off-diagonals are 0.

  • Density 2: A mixture of two gaussians with means and and covariance matrices .

  • Density 3: A mixture of two gaussians with means and and covariance matrices and . The diagonals in and are 1, the th and th elements in and are , and the others are 0.

  • Density 4: A mixture of two gaussians with means and and covariance matrices and . The diagonals in are 1, the th and th elements in are , and the others are 0.

  • Density 5: A mixture of three gaussians with means , , and and covariance matrices .

  • Density 6: A mixture of three gaussians with means , , and and covariance matrices , and . The diagonals in and are 1, the th and th elements are for and for , and the others are 0.

The mixing coefficients from density 2 to density 4 were and , while for density 5 and density 6, they were , and , respectively. The two-dimensional distributions for the data densities are depicted in Figure 9. Since the true density is known for these artificial data, the performance was measured by the integrated squared error (ISE) to the true density,
formula
where denotes an estimated bandwidth matrix.
Figure 9:

The two-dimensional distributions of the six data densities.

In addition to artificial data, we applied the methods to benchmark data sets (Bache & Lichman, 2013). Since these data sets were originally intended for regression, we used only the input data. First, we randomly divided the data set into n training samples and test samples. Using the training samples, the bandwidth matrix was estimated. We measured the performance by the empirical ISE using the test data:
formula

The results over 100 runs on artificial data are summarized in Tables 3 to 8. For the gaussian density (see Table 3), SCV or NR is the best method because both make use of the gaussian assumption; even so, the performance of ISED is not far behind. Tables 4 to 8 show that when the data density is a mixture of gaussians, ISED always outperforms the other methods. The likely reason is that ISED does not impose any strong assumptions on the data density, while the other methods assume gaussianity at some stage. Thus, the proposed method based on ISED is quite promising. Furthermore, Table 9 confirms that the proposed method also works reasonably well on the benchmark data sets.

Table 3:
Averages of the Logarithms of ISE Values over 100 Runs for Density 1.
ISED            SCV             NR              PI
 −6.073(0.254) −6.098(0.275) −6.110(0.241) −5.940(0.228) 
 −6.429(0.166) −6.535(0.201) −6.504(0.159) −6.182(0.128) 
 −7.296(0.131) −7.361(0.173) −7.286(0.125) −6.782(0.101) 
 −8.040(0.109) −8.137(0.137) −8.014(0.090) −7.295(0.095) 
 −6.584(0.212) −6.594(0.246) −6.609(0.205) −6.493(0.177) 
 −6.853(0.153) −6.932(0.184) −6.931(0.151) −6.706(0.117) 
 −7.647(0.110) −7.682(0.139) −7.658(0.109) −7.284(0.070) 
 −8.334(0.085) −8.397(0.109) −8.341(0.079) −7.782(0.055) 
 −6.962(0.207) −6.962(0.231) −6.981(0.205) −6.900(0.183) 
 −7.198(0.135) −7.256(0.165) −7.260(0.135) −7.089(0.103) 
 −7.934(0.104) −7.960(0.134) −7.950(0.107) −7.651(0.068) 
 −8.579(0.076) −8.628(0.100) −8.597(0.078) −8.132(0.045) 

Notes: The numbers in parentheses are their standard deviations. The best method and methods judged to be comparable to the best one by the unpaired t-test at the significance level are in bold. d and n denote the dimensionality of data and number of samples, respectively.

Table 4:
Averages of the Logarithms of ISE Values over 100 Runs for Density 2.
ISED            SCV             NR              PI
 −6.662(0.206) −6.383(0.170) −6.402(0.139) −6.509(0.183) 
 −7.441(0.131) −7.215(0.119) −7.274(0.106) −7.178(0.102) 
 −8.330(0.105) −8.171(0.096) −8.205(0.082) −7.913(0.075) 
 −9.313(0.123) −9.207(0.077) −9.183(0.061) −8.654(0.082) 
 −7.126(0.172) −6.900(0.184) −6.775(0.124) −6.992(0.161) 
 −7.817(0.111) −7.600(0.124) −7.597(0.096) −7.615(0.095) 
 −8.649(0.085) −8.459(0.087) −8.498(0.072) −8.338(0.059) 
 −9.590(0.070) −9.427(0.072) −9.451(0.060) −9.082(0.049) 
 −7.509(0.153) −7.322(0.172) −7.070(0.113) −7.378(0.143) 
 −8.153(0.108) −7.948(0.128) −7.863(0.091) −7.967(0.094) 
 −8.915(0.078) −8.723(0.085) −8.724(0.067) −8.647(0.057) 
 −9.804(0.058) −9.625(0.061) −9.652(0.048) −9.384(0.036) 

Note: See the Table 3 note.

Table 5:
Averages of the Logarithms of ISE Values over 100 Runs for Density 3.
ISED            SCV             NR              PI
 −6.016(0.190) −4.834(0.038) −4.431(0.014) −4.976(0.045) 
 −6.790(0.136) −5.832(0.034) −5.624(0.016) −5.924(0.035) 
 −7.693(0.116) −6.924(0.027) −6.826(0.015) −6.978(0.024) 
 −8.684(0.129) −8.072(0.020) −8.034(0.013) −8.067(0.020) 
 −6.495(0.174) −5.217(0.043) −4.523(0.011) −5.354(0.048) 
 −7.209(0.113) −6.109(0.034) −5.704(0.013) −6.184(0.035) 
 −8.036(0.085) −7.124(0.024) −6.896(0.011) −7.166(0.022) 
 −8.973(0.076) −8.220(0.021) −8.099(0.012) −8.228(0.016) 
 −6.885(0.142) −5.576(0.046) −4.599(0.009) −5.712(0.051) 
 −7.553(0.108) −6.360(0.034) −5.767(0.010) −6.415(0.033) 
 −8.330(0.079) −7.309(0.025) −6.952(0.010) −7.332(0.023) 
 −9.208(0.059) −8.357(0.019) −8.148(0.009) −8.357(0.016) 

Note: See the Table 3 note.

Table 6:
Averages of the Logarithms of ISE Values over 100 Runs for Density 4.
ISED            SCV             NR              PI
 −6.177(0.193) −5.233(0.058) −4.837(0.023) −5.378(0.065) 
 −6.982(0.138) −6.199(0.048) −5.999(0.025) −6.301(0.047) 
 −7.906(0.116) −7.260(0.037) −7.174(0.023) −7.313(0.031) 
 −8.918(0.102) −8.382(0.027) −8.357(0.020) −8.351(0.029) 
 −6.641(0.170) −5.622(0.062) −4.961(0.018) −5.746(0.069) 
 −7.375(0.113) −6.502(0.047) −6.110(0.020) −6.580(0.045) 
 −8.227(0.082) −7.488(0.032) −7.272(0.017) −7.535(0.028) 
 −9.185(0.074) −8.559(0.029) −8.448(0.017) −8.556(0.019) 
 −7.039(0.143) −5.960(0.058) −5.059(0.014) −6.071(0.064) 
 −7.716(0.103) −6.757(0.044) −6.193(0.016) −6.810(0.039) 
 −8.501(0.078) −7.689(0.033) −7.347(0.015) −7.716(0.029) 
 −9.401(0.060) −8.710(0.025) −8.515(0.014) −8.708(0.020) 

Note: See the Table 3 note.

Table 7:
Averages of the Logarithms of ISE Values over 100 Runs for Density 5.
ISED            SCV             NR              PI
 −6.855(0.153) −6.467(0.131) −6.008(0.059) −6.712(0.146) 
 −7.668(0.115) −7.263(0.097) −7.060(0.052) −7.455(0.101) 
 −8.598(0.091) −8.206(0.067) −8.146(0.047) −8.313(0.068) 
 −9.602(0.141) −9.246(0.049) −9.257(0.040) −9.227(0.051) 
 −7.314(0.134) −7.013(0.136) −6.278(0.056) −7.181(0.135) 
 −8.046(0.110) −7.697(0.101) −7.277(0.049) −7.850(0.100) 
 −8.908(0.083) −8.540(0.075) −8.329(0.045) −8.650(0.070) 
 −9.859(0.078) −9.502(0.055) −9.418(0.037) −9.529(0.042) 
 −7.678(0.149) −7.442(0.162) −6.501(0.059) −7.559(0.159) 
 −8.355(0.105) −8.059(0.113) −7.463(0.048) −8.165(0.109) 
 −9.157(0.076) −8.825(0.070) −8.480(0.037) −8.911(0.065) 
 −10.072(0.061) −9.727(0.055) −9.548(0.034) −9.758(0.042) 

Note: See the Table 3 note.

Table 8:
Averages of the Logarithms of ISE Values over 100 Runs for Density 6.
ISED            SCV             NR              PI
 −6.276(0.128) −5.706(0.077) −5.350(0.036) −5.959(0.091) 
 −7.150(0.109) −6.644(0.061) −6.483(0.033) −6.836(0.065) 
 −8.117(0.118) −7.684(0.043) −7.633(0.031) −7.805(0.048) 
 −9.164(0.142) −8.794(0.034) −8.797(0.028) −8.818(0.037) 
 −6.699(0.114) −6.138(0.081) −5.524(0.033) −6.362(0.090) 
 −7.515(0.095) −6.972(0.059) −6.625(0.028) −7.149(0.063) 
 −8.426(0.069) −7.932(0.045) −7.755(0.027) −8.061(0.044) 
 −9.406(0.065) −8.980(0.035) −8.907(0.024) −9.044(0.029) 
 −7.066(0.109) −6.510(0.083) −5.668(0.029) −6.715(0.093) 
 −7.808(0.082) −7.267(0.061) −6.747(0.025) −7.418(0.064) 
 −8.659(0.062) −8.153(0.043) −7.855(0.022) −8.268(0.042) 
 −9.605(0.051) −9.149(0.033) −8.994(0.021) −9.219(0.028) 

Note: See the Table 3 note.

Table 9:
Averages of Empirical ISE Values over 30 Runs for Benchmark Data Sets.
Data Set      ISED            SCV             NR              PI
Housing       −0.049(0.096)   −0.001(0.000)   −0.001(0.000)   −0.012(0.002)
Concrete      −0.138(0.016)   −0.012(0.000)   −0.007(0.000)   −0.091(0.004)
Yacht         −2.346(0.084)   −0.046(0.003)   −0.028(0.002)   −0.277(0.023)
Auto MPG      −0.189(0.046)   −0.029(0.002)   −0.031(0.002)   −0.096(0.008)
Servo         −0.008(0.001)   −0.009(0.001)   −0.008(0.001)   −0.010(0.002)
Air Foil      −0.770(0.043)   −0.036(0.001)   −0.020(0.001)   −0.087(0.002)
Power Plant   −0.036(0.002)   −0.028(0.000)   −0.026(0.000)   −0.032(0.000)

Note: See the Table 3 note.

5.2  Density Derivative Estimation: ISED versus KDE with the Bandwidth Matrix

Recently, several methods to optimize for density derivative estimation have been proposed (Chacón & Duong, 2013). As in density estimation, a conventional idea is a direct extension of the normal reference rule (NR) for density derivative estimation (Chacón et al., 2011), while other ideas are to estimate by optimizing cross-validation criteria extended for density derivative estimation such as the unbiased cross-validation (UCV), smoothed cross-validation (SCV), and plug-in methods (PI). Here, we compare ISED to KDE with the bandwidth matrix in terms of density derivative estimation by performing the same experiments in section 3.

The results are presented in Figures 10, 11, and 12. In these experiments, except for ISED, we used the R library ks, whose details are available in Duong (2007). When the dimensionality of the data is small, the performance of PI and SCV is fairly good, but these methods tend to incur large errors for higher-dimensional data, particularly in second-order derivative estimation. For UCV and NR, the performance on higher-dimensional data is comparable to or slightly better than that of ISED in second-order derivative estimation, while in lower dimensions and in first-order derivative estimation, ISED works better than these methods. Overall, the performance of ISED is reasonably good over a wide range of dimensions and in both first- and second-order density derivative estimation.

Figure 10:

Normalized mean squared errors in density derivative estimation when the data density is the standard normal density.

Figure 11:

Normalized mean squared errors in density derivative estimation when the data density is a mixture of two gaussians.

Figure 12:

Normalized mean squared errors in density derivative estimation when the data density is a log-normal density.

6  Future Directions

In this section, we discuss future directions of our work based on regularized multitask learning (Caruana, 1997; Evgeniou & Pontil, 2004; Argyriou et al., 2007) and Bregman divergences (Bregman, 1967).

6.1  A Multitask Learning Extension

Here, we extend ISED by applying regularized multitask learning (Caruana, 1997; Evgeniou & Pontil, 2004; Argyriou et al., 2007). We first formulate the objective function according to the extension, and then propose two optimization algorithms. Finally, we discuss a variant of ISED as a limit of the MTL extension.

6.1.1  An Extended Objective Function and Its Usefulness

So far, the density derivative estimator has been derived independently for each partial derivative. However, the partial derivatives share common information because they are computed from the same density function. Thus, making use of this information might provide better estimates. To exploit the common information, Yamane, Sasaki, and Sugiyama (2015) applied regularized multitask learning (MTL) (Caruana, 1997; Evgeniou & Pontil, 2004; Argyriou et al., 2007) to log-density gradient estimation and improved the performance when the sample size is small and the data dimensionality is relatively high. Here, we follow their approach and extend it to higher-order density derivative estimation.

The two key modifications from the plain ISED are to add a regularizer and use another basis function. By employing the following function as the basis function,
formula
the objective function extended by MTL is formulated as
formula
6.1
where nk denotes the total number of partial derivatives,2 denotes the tensor product, and the lth element in and th elements in and are given by
formula
and are the vectors constructed by concatenating and , respectively, and is a block diagonal matrix whose diagonal blocks are for all . Minimizing equation 6.1 provides the solution , and minimization algorithms are described later.
Next, we discuss why this MTL extension is useful in density derivative estimation. Since
formula
corresponds to a density estimate up to an additive constant, the regularizer brings the two density estimates associated with and closer to each other. Thus, the combination of regularized MTL and the modified basis function is useful for exploiting the common information among the partial derivatives.

6.1.2  Optimization Algorithm

A simple minimizer of equation 6.1 has the analytic form as
formula
6.2
Equation 6.2 appears convenient because it is in closed form. However, it can be problematic in terms of both computation time and memory when the data dimensionality and the derivative order k are high because the size of is by , and we have to compute the inverse of a matrix of the same size.
To cope with this problem, as done in Yamane et al. (2015, algorithm 1), we propose using a block coordinate descent (BCD) method to minimize equation 6.1, where only one vector is optimized at each update while fixing the other vectors for . At each update, the solution of is analytically computed by
formula
6.3
In the BCD-based algorithm, all are cyclically and repeatedly updated until convergence. The key point is that equation 6.3 only requires computing the inverse of a b by b matrix, which significantly reduces memory consumption and, as long as the number of iterations is not too large, also alleviates the computational burden.
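Since equations 6.1 and 6.3 are not reproduced here, the computational point, that each block update solves only a b-by-b linear system, is illustrated below on a simplified multitask quadratic objective with a pairwise coupling regularizer; the modified basis function and the tensor-product structure of the full formulation are omitted.

    import numpy as np

    def block_coordinate_descent(G, H, lam, gamma, n_iters=100):
        """Cyclic BCD for the simplified multitask objective
          sum_j (theta_j' G theta_j - 2 h_j' theta_j) + lam * sum_j ||theta_j||^2
                                                      + gamma * sum_{j<j'} ||theta_j - theta_j'||^2,
        where column j of H stores h_j. Each block update solves only a b-by-b system."""
        b, K = H.shape
        Theta = np.zeros((b, K))
        A_inv = np.linalg.inv(G + (lam + gamma * (K - 1)) * np.eye(b))  # shared factor, computed once
        for _ in range(n_iters):
            for j in range(K):
                others = Theta.sum(axis=1) - Theta[:, j]                # sum over the other blocks
                Theta[:, j] = A_inv @ (H[:, j] + gamma * others)
        return Theta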

6.1.3  A Variant of the MTL-Regularized ISED

Next, we discuss a variant of ISED obtained as a limit of the MTL regularization. As the MTL regularization becomes infinitely strong, the coefficient vectors become the same for all : for all . Under this situation, the linear-in-parameter model for the MTL extension can be expressed as
formula
6.4
Then, the optimal solution can be computed by
formula

The advantage of this variant is that all derivatives of any order can be estimated by solving a single linear system of equations. Thus, it is computationally much more efficient than the original ISED and the MTL extension. However, in exchange for this computational efficiency, the variant sacrifices model flexibility because equation 6.4 has fewer coefficients than the original linear-in-parameter model.

6.2  A General Framework for Density Derivative Estimation

Here, we discuss a general framework for density derivative estimation based on Bregman divergences (Bregman, 1967) and show another criterion for density derivative estimation, which includes the ISED criterion as a special case.

6.2.1  Density Derivative Estimation Based on Bregman Divergences

A general framework for density derivative estimation can be constructed based on the Bregman (BR) divergence (Bregman, 1967):
formula
6.5
where f is a differentiable and strictly convex function and denotes the derivative of f. Since the subtracted terms on the right-hand side of equation 6.5 form the first-order Taylor expansion of f around t, the BR divergence measures the discrepancy between f and this linear approximation and is minimized when its two arguments coincide. The BR divergence includes a variety of divergences as special cases, obtained by specifying the form of f (Banerjee, Merugu, Dhillon, & Ghosh, 2005).
The discrepancy from the true density derivative to a density derivative model using the BR divergence is measured by
formula
6.6
where . To approximate equation 6.6 as done in ISED, repeatedly applying integration by parts to the second term results in
formula
Then, the empirical version of equation 6.6 is obtained as
formula
6.7
Specifying the form of f yields a discrepancy whose minimizer is an estimator for density derivatives. Below, we provide an example of f and derive a discrepancy for density derivative estimation.

6.2.2  Basu’s Power Divergence for Density Derivatives

For a positive integer ,3 when
formula
is reduced to Basu’s power divergence (Basu, Harris, Hjort, & Jones, 1998):
formula
Then, the discrepancy for density derivatives based on Basu’s power divergence is given by
formula
The discrepancy is empirically approximated by
formula
6.8
Therefore, minimizing equation 6.8 provides another estimator for density derivatives.

Here, we show a connection to ISED. Interestingly, when , Basu’s power divergence is equivalent to the squared distance, and then is reduced to given by equation 2.3. Thus, ISED is a special case of the minimizer for equation 6.8.

As discussed in Basu et al. (1998) and Sugiyama, Suzuki, and Kanamori (2012), the parameter controls the robustness to outliers. For density derivative estimation, the empirical discrepancy, equation 6.8, also possesses a similar property. For the first term in that equation, as gets larger, the region where has less impact on the integral. To discuss the behavior of the second term in equation 6.8, we expand it as
formula
6.9
where denotes the coefficient related to the partial derivatives of and we assume . For instance, when , equation 6.9 takes the following form:
formula
When and , the sample tends to be less influential in equation 6.9. Therefore, when is larger than k, using the empirical discrepancy, equation 6.8, would highlight the region where has large values. However, when , equation 6.8 is no longer convex (Basu et al., 1998; Sugiyama et al., 2012), and the solution we obtain might be a local minimum. Thus, in this letter, we focused on the case where equation 6.8 is convex, namely, , and the challenging case with will be our future work.

7  Conclusion

In this letter, we proposed a method to directly estimate density derivatives without performing density estimation. The proposed method, which we call the integrated squared error for density derivatives (ISED), was formulated as a least-squares problem, and thus its solution can be computed analytically. In addition, a cross-validation method is provided for tuning all hyperparameters in ISED. We theoretically proved that ISED achieves the optimal parametric convergence rate and experimentally showed that ISED tends to outperform a KDE-based method for density derivative estimation. Furthermore, we applied ISED to KL-divergence approximation and bandwidth matrix selection for KDE, and the usefulness of ISED was shown through applications to change detection, feature selection, and density estimation.

Density derivatives are useful quantities and could potentially have more applications in statistical data analysis. Therefore, in addition to the future directions discussed in section 6, we will look for novel applications of ISED.

Appendix:  Proof of Theorem 1

In this appendix, we give the proof of theorem 1 by following Sasaki, Niu, and Sugiyama (2016), which is based on a theory of perturbed optimization (Bonnans & Shapiro, 1998). The proof employs proposition 6.1 in Bonnans and Shapiro (1998), which requires a growth condition (see definition 6.1 in Bonnans & Shapiro, 1998) and Lipschitz continuity. Therefore, we first establish them in lemmas 1 and 2, respectively. Using these lemmas and the proposition, theorem 1 is finally proved.

A.1  The Second-Order Growth Condition

We establish a second-order growth condition by the following lemma:

Lemma 1.
The following second-order growth condition holds:
formula
where is a nonnegative parameter.
Proof.
The strong convexity of with a parameter provides the following inequality:
formula
where we used the optimality condition .

A.2  Lipschitz Continuity

Here, we analyze the Lipschitz continuity of the difference function between a perturbed objective function and . Denoting perturbation parameters by
formula
we define a perturbed version of the objective function as
formula
It is obvious that , and then the difference function is characterized by the following lemma:
Lemma 2.
The difference function is Lipschitz continuous with modulus
formula
on a sufficiently small neighborhood of .
Proof.
The gradient of the difference function is computed by
formula
Since is a positive definite matrix, where M is a positive constant. Given a -ball of which is , for , the following inequality holds:
formula
This inequality bounds the norm of the gradient as
formula
The above inequality shows that the order of the norm of the gradient is , that is, the difference function is Lipschitz continuous on with a Lipschitz constant of the same order.

A.3  The Optimal Parametric Convergence Rate of ISED

Based on lemmas 1 and 2 and proposition 6.1 in Bonnans and Shapiro (1998), we have
formula
where