## Abstract

Gaussian processes (GPs) are promising Bayesian methods for classification and regression problems. Designing a GP classifier and making predictions with it are, however, computationally demanding, especially when the training set is large. Sparse GP classifiers are known to overcome this limitation. In this letter, we propose and study a validation-based method for sparse GP classifier design. The proposed method uses a negative log predictive (NLP) loss measure, which is easy to compute for GP models. We use this measure for both basis vector selection and hyperparameter adaptation. The experimental results on several real-world benchmark data sets show generalization performance better than or comparable to that of existing methods.

## 1. Introduction

Gaussian process (GP) models are nonparametric Bayesian models and have been successfully applied to a wide variety of machine learning tasks such as classification and regression (Rasmussen & Williams, 2006). GPs for regression model the density of the target values as a multivariate gaussian density. Inference in the GP regression model with gaussian noise can be done analytically. Probabilistic classification using GPs is, however, analytically intractable. In the case of a classification problem, the target values, being class labels, cannot be modeled using a multivariate gaussian density. Thus, the likelihood is nongaussian, resulting in a nongaussian posterior process. Several approaches to approximate the posterior process have been suggested, including Laplace's method, application of the variational method, assumed density filtering (ADF), and expectation propagation (EP) (Williams & Barber, 1998; Gibbs & MacKay, 2000; Minka, 2001; Kuss & Rasmussen, 2005). The approximation of Opper and Winther (2000) is closely related to the EP method. Kim and Ghahramani (2006) proposed an approximate algorithm for GP classifier design. This approach is based on the EP algorithm and uses a variational lower bound on the marginal likelihood for model selection. Neal (1998) described the use of Markov chain Monte Carlo (MCMC) approximation for GPs. Use of these approximations for GP classifier (GPC) design is computationally expensive, as their computational and memory requirements grow, respectively, as the cube and the square of the training set size. Moreover, they make use of all the training set examples for prediction, resulting in slow inference, especially when the training data set is large. Hence, the development of a good sparse GP classifier (SGPC) is important from a practical viewpoint.

Sparse approximate GP classifiers aim at performing all the operations using a representative data set, called the *basis vector* set or *active* set, from the input space. Note that the active set need not be constrained to be a subset of the training or test set. This has been demonstrated by Snelson and Ghahramani (2006) for sparse GP regression model design. In this way, the computational and memory requirements are reduced to *O*(*Nd*^{2}_{max}) and *O*(*Nd*_{max}), respectively, where *N* is the size of the training set and *d*_{max} is the size of the representative set (*d*_{max} ≪ *N*). Further, the computations of predictive mean and variance require *O*(*d*_{max}) and *O*(*d*^{2}_{max}) effort, respectively. Such parsimonious classifiers are preferable in many engineering applications because of lower computational complexity and ease of interpretation.

In this work, we focus on finding a sparse solution of a binary classification problem using GPs. The active set is assumed to be a subset of the data in order to simplify the optimization problem. Several approaches have been proposed to design sparse GP classifiers: relevance vector machine (RVM) (Tipping, 2001), online GP learning (Csató & Opper, 2002; Csató, 2002), and informative vector machine (IVM) (Lawrence, Seeger, & Herbrich, 2003). Particularly relevant to the work in this letter is the IVM method, which is inspired by the technique of ADF (Minka, 2001).

In addition to the sparse GP classifiers, various approaches to develop sparse probabilistic classifiers not based on the GP models were suggested. These include the import vector machine, built on kernel logistic regression (Zhu & Hastie, 2005), and orthogonal forward selection procedure, designed in conjunction with minimization of leave-one-out cross-validation (LOO-CV) error (Hong, Chen, & Harris, 2006).

The main aim of this letter is to give an effective method to design a sparse GP classifier that has good generalization ability. Using ideas from validation methodology and the sparse GP literature, we propose and study a method that combines an ADF approximation technique with a validation-based measure for SGPC design. Specifically, we use the negative log predictive (NLP) loss measure, which takes into account a moderated probability score, easily calculated for a GP model. Use of the NLP loss measure alleviates the tendency of assigning overly high confidence to wrong predictions or predictions in the regions where data are sparse. The algorithm in the proposed method alternates between a basis vector set selection step and a hyperparameter adaptation step and uses the NLP loss measure for both tasks. This is unlike the IVM method for SGPC design, where different measures are used for these two purposes. We tested the proposed method on various benchmark data sets. The sparse GP classifier designed was observed to give generalization performance comparable to the full model GP classifier. Though the proposed method is computationally more expensive in the basis vector set selection step compared to the IVM method, the classifier designed was observed to exhibit better generalization ability using a smaller number of basis vectors. Thus, the proposed method provides a useful solution to the SGPC design problem.

In the next section, we discuss the details of the validation-based method for SGPC design. A discussion about the proposed method and a brief overview of the related work are presented in section 3. Section 4 gives experimental results that demonstrate the effectiveness of our method compared to the full model and other sparse GP classifiers. Section 5 concludes the letter.

## 2. Validation-Based Sparse GP Classifier

Given a training data set {(**x**_{i}, *y*_{i})}^{N}_{i=1}, with *y*_{i} ∈ {+1, −1}, which is independent and identically distributed, our goal is to design a sparse GP classifier using this data set. The classifier designed should have good generalization ability. In addition to the classification label *y* of an unseen example **x**, we are also interested in estimating *p*(*y*|**x**).

### 2.1. GP Classifier.

Let **X** = (**x**_{1}, **x**_{2}, …, **x**_{N}) and **y** = (*y*_{1}, *y*_{2}, …, *y*_{N})^{T} denote the inputs and the class labels, respectively. Unlike other parametric models, learning in a GP involves learning a latent function *f* of the input variable **x**. In classification, the latent function is mapped to the unit interval by a sigmoidal transformation, like logistic regression (logit) or probit regression, to get *p*(*y*|**x**). We assume that the probability of the class label *y* is independent of **x** and the hyperparameters, given *f* (Kim & Ghahramani, 2006). In this work, we use the probit noise model, *p*(*y*|*f*) = Φ(λ*y*(*f* + *b*)), where Φ(·) is the cumulative distribution function of the standard gaussian, whose slope is controlled by λ (> 0), and *b* is a bias parameter.

Let *f*_{i} = *f*(**x**_{i}) and **f** = (*f*_{1}, *f*_{2}, …, *f*_{N})^{T} denote the vector of latent function values. In the GP context, where the joint distribution of **f** is assumed to be a multivariate gaussian, we use a zero-mean GP prior (O'Hagan, 1978) as a prior over **f**. Specifically, *p*(**f**) = *N*(**f**; **0**, **K**), where **K** is a positive definite kernel or covariance matrix whose (*i*, *j*)th entry is defined as *k*(**x**_{i}, **x**_{j}) and θ denotes the set of hyperparameters associated with the kernel function (Rasmussen & Williams, 2006). A common choice of the kernel function *k* is the gaussian kernel, *k*(**x**_{i}, **x**_{j}) = *v* exp(−‖**x**_{i} − **x**_{j}‖^{2}/2σ^{2}), where σ is the width parameter and *v* is the scaling constant.

Given **f**, the class labels are independent variables, and so the joint likelihood function is written as *p*(**y**|**f**) = ∏^{N}_{i=1} *p*(*y*_{i}|*f*_{i}). The individual likelihood term depends on *f* through its value evaluated at **x**_{i}. The posterior distribution over the latent function values **f** for the given set of hyperparameters is obtained by using Bayes' rule:

*p*(**f**|**y**) = *p*(**y**|**f**) *p*(**f**) / *p*(**y**). (2.1)

The goal is to predict the class label *y*_{*} for an unseen test input **x**_{*}. This can be done by finding the distribution of *f*_{*} as

*p*(*f*_{*}|**y**) = ∫ *p*(*f*_{*}|**f**) *p*(**f**|**y**) d**f** (2.2)

and then using it to compute the predictive distribution,

*p*(*y*_{*}|**y**) = ∫ *p*(*y*_{*}|*f*_{*}) *p*(*f*_{*}|**y**) d*f*_{*}. (2.3)

Neither the posterior distribution in equation 2.1 nor the predictive distribution in equation 2.3 can be computed analytically, so approximations like a gaussian approximation to the posterior are needed for GP classifier design. The use of the probit noise model then results in mathematical tractability of equation 2.3 (Lawrence, Platt, & Jordan, 2005). Note that the probit noise model allows expressing the predictive distribution for a single data point (see equation 2.3), but the same predictive distribution cannot be expressed for multiple data points.
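As a concrete illustration, the gaussian kernel and the probit noise model above can be written in a few lines. This is a minimal sketch; `gaussian_kernel` and `probit_likelihood` are our own illustrative names, not code from the letter:

```python
import numpy as np
from math import erf, sqrt

def gaussian_kernel(X1, X2, v=1.0, sigma=1.0):
    """k(x, x') = v * exp(-||x - x'||^2 / (2 * sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return v * np.exp(-d2 / (2.0 * sigma ** 2))

def probit_likelihood(y, f, lam=1.0, b=0.0):
    """Probit noise model p(y | f) = Phi(lam * y * (f + b))."""
    z = lam * y * (f + b)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # Phi via the error function

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = gaussian_kernel(X, X)        # 3x3 prior covariance; diagonal equals v
p = probit_likelihood(+1, 0.0)   # at f = 0 both labels get probability 0.5
```

The zero-mean GP prior over the latent values is then the multivariate gaussian with covariance `K`.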

Neal (1998) proposed using Markov chain Monte Carlo (MCMC) sampling methods to sample from the GP posterior for approximate GP model learning. This technique is, however, computationally demanding. Several approaches to find a gaussian approximation to the posterior have been suggested. These include Laplace's method and the expectation propagation (EP) algorithm (Williams & Barber, 1998; Minka, 2001). Kuss and Rasmussen (2005) presented a comprehensive comparison of these approaches in terms of their predictive probabilities and marginal likelihood estimates and concluded that the EP algorithm is preferable over Laplace's method for approximate inference in binary GP classifiers. Note that the GP classifier designed using the EP algorithm makes use of the entire training data for making predictions on unseen examples. We call such a GP classifier a *full model* GP classifier. Any approach for full model GP classifier design is computationally expensive as *N* increases. Thus, there is a need to develop an efficient algorithm that designs a sparse GP classifier having generalization ability comparable to that of a full model GP classifier.

### 2.2. ADF Approximation.

Let the likelihood term *p*(*y*_{i}|*f*_{i}) for each data point (**x**_{i}, *y*_{i}) be approximated by an unnormalized gaussian (also called the *site function*) with appropriately chosen parameters, mean *m*_{i}, and variance *p*^{−1}_{i}. Then the posterior distribution in equation 2.1 can be approximated as

*q*(**f**) ∝ *p*(**f**) ∏^{N}_{i=1} *N*(*f*_{i}; *m*_{i}, *p*^{−1}_{i}). (2.4)

This can be rewritten as

*q*(**f**) = *N*(**f**; μ̄, **A**), (2.5)

where **A** = (**K**^{−1} + **P**)^{−1}, μ̄ = **APm**, **m** = (*m*_{1}, …, *m*_{N})^{T}, and **P** = diag(*p*_{1}, …, *p*_{N}). Here μ̄ and **A** denote the posterior mean and posterior covariance matrix, respectively. The ADF (assumed density filtering) technique (Minka, 2001) takes advantage of the factorized form in equation 2.4 to build an approximation to the posterior. Gaussian process classifier learning using the ADF approximation thus involves finding the site parameters *m*_{i} and *p*_{i} for every *i* ∈ {1, 2, …, *N*} and the hyperparameters.
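The gaussian form of the ADF approximation can be checked numerically. The following sketch, with hypothetical site values, builds **A** = (**K**^{−1} + **P**)^{−1} and μ̄ = **APm** by direct matrix inversion, which is exactly the *O*(*N*^{3}) route the sparse scheme of section 2.3 avoids:

```python
import numpy as np

def posterior_approx(K, p, m):
    """q(f) = N(f; mu_bar, A) with A = (K^{-1} + P)^{-1} and mu_bar = A P m,
    where P = diag(p) holds the site precisions and m the site means."""
    P = np.diag(p)
    A = np.linalg.inv(np.linalg.inv(K) + P)  # O(N^3); fine only for tiny N
    mu_bar = A @ P @ m
    return mu_bar, A

# Two latent values; the second site is "switched off" (p_2 = 0), as happens
# for non-basis-vector points in the sparse approximation.
K = np.array([[1.0, 0.5], [0.5, 1.0]])
mu_bar, A = posterior_approx(K, p=np.array([2.0, 0.0]), m=np.array([1.0, 0.0]))
```

Note how including a site shrinks the corresponding posterior variance below the prior variance.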

#### 2.2.1. Site Parameter Estimation.

Minka (2001) proposed the EP algorithm to optimize the site parameters. This algorithm finds a gaussian approximation to the posterior by moment matching of approximate marginal distributions. It iteratively updates the site parameters, thereby updating the approximation. Each update of the site parameters requires the computation of **A** and μ̄, which is *O*(*N*^{2}) effort using a rank-one matrix update. The algorithm, however, revisits all the data points in every sweep through the data. Hence, the computational complexity of updating μ̄ and **A** is *O*(*N*^{3}) for one sweep (Kuss & Rasmussen, 2005).

#### 2.2.2. Hyperparameter Estimation.

Estimation of the hyperparameters of the kernel function *k*(·, ·) involves maximizing the approximate log marginal likelihood log *q*(**y**), where *Z*_{i} is a normalizing constant corresponding to the site function *i* (Kuss & Rasmussen, 2005). Note that the partial derivatives of log *q*(**y**) with respect to the hyperparameters after convergence of the EP algorithm do not depend on the normalizing constants *Z*_{i} (Seeger, 2005); therefore, the *Z*_{i}'s are not taken into account during hyperparameter optimization. Clearly, the computation of the marginal likelihood requires an *N* × *N* matrix inversion. It is thus important to get a sparse representation of the data set to reduce this complexity.

### 2.3. Sparse GPC Design Using ADF.

If **u** denotes the index set of training set examples that are included in the approximation (i.e., *p*_{i} = 0, *m*_{i} = 0 for all *i* ∉ **u**), then we have an approximation of the posterior as in equation 2.5, with the site parameters restricted to the set **u**. The set **u** is called the *active* or *basis vector* set (Lawrence et al., 2003).^{1} Let **u**^{c} = {1, 2, …, *N*} ∖ **u**. We refer to the set **u**^{c} as the nonactive vector set. For many classification problems, the size of the active set is restricted to a user-specified parameter, *d*_{max}, depending on the classifier complexity and generalization performance requirements. In this work, a sparse GP classifier model refers to the set **u**, the corresponding site parameters, and the hyperparameters. It is important to choose the model such that the generalization performance of the resulting approximate GP classifier is as good as that of a full model GP classifier.

The sparse GP classifier design method using the ADF scheme uses a two-loop approach where it alternates between the basis vector set selection and site parameter estimation loop (inner loop) and the hyperparameter adaptation loop (outer loop) until a stopping condition is satisfied. The inner loop starts with an empty basis vector set. A winner vector is chosen from the set **u**^{c} using a suitable scoring function and is added to the current model with appropriate site parameters. The index of this winner is removed from the set **u**^{c} and added to the set **u**. This procedure in the inner loop is repeated until *d*_{max} basis vectors are added. By keeping the basis vector set **u** and the corresponding site parameters (obtained in the inner loop) fixed, we determine the hyperparameters in the outer loop by optimizing a suitable measure. Informative vector machine (IVM) suggested by Lawrence et al. (2003) uses this approach, where the scoring function used is the entropy measure and the hyperparameters are determined by maximizing the marginal likelihood.

### 2.4. Validation-Based Method.

Given a basis vector set **u** selected from the training set, one can use the following validation-based measure, which represents the predictive misclassification loss (PML):

PML(**u**) = (1/|**u**^{c}|) ∑_{j∈**u**^{c}} *I*(*y*_{j} ≠ ŷ_{j}), (2.6)

where ŷ_{j} = sgn(μ̄_{j} + *b*) and *I*(·) is the indicator function. Note that the PML measure is evaluated on the training set examples that are not in the basis vector set. It can be used for basis vector selection in the inner loop. However, because of its nondifferentiability, it cannot be used for hyperparameter adaptation in the outer loop. It is possible to use a grid-based approach for hyperparameter determination in the outer loop. This approach is, however, computationally expensive if the number of hyperparameters is large.
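The PML measure takes only a few lines to compute. The following is a sketch with hypothetical predictive means; the indicator function is realized by a boolean comparison:

```python
import numpy as np

def pml(y, mu_bar, b, nonactive):
    """Fraction of non-basis-vector training points misclassified by
    the sign of (mu_bar_j + b)."""
    pred = np.sign(mu_bar[nonactive] + b)
    return float(np.mean(pred != y[nonactive]))

y = np.array([1.0, -1.0, 1.0, -1.0])
mu = np.array([0.8, -0.3, -0.1, 0.2])        # hypothetical predictive means
loss = pml(y, mu, 0.0, nonactive=[1, 2, 3])  # indices outside the basis set
```

Because this loss is a step function of the hyperparameters, it cannot be minimized by gradient methods, which motivates the NLP loss below.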

#### 2.4.1. NLP Loss.

The predictive distribution for an unseen input **x**_{*} is given by

*p*(*y*_{*}|**y**) = Φ(λ*y*_{*}(μ̄_{*} + *b*)/√(1 + λ^{2}σ^{2}_{*})), (2.7)

where μ̄_{*} and σ^{2}_{*} are the predictive mean and variance, respectively, for the unseen input **x**_{*}, λ (> 0) is a scaling parameter, and *b* is a bias parameter (Lawrence, Seeger, & Herbrich, 2005). Here Φ(·) denotes the cumulative distribution function of the standard normal distribution. Note that the dependencies of μ̄_{*} and σ^{2}_{*} on **u** and other hyperparameters are not shown explicitly. The class label of an unseen input **x**_{*} is predicted using ŷ_{*} = sgn(μ̄_{*} + *b*) (corresponding to the probability threshold of 0.5), while the confidence about the prediction can be estimated using equation 2.7.

The predictive probability of *y*_{*} (see equation 2.7) explicitly uses the predictive variance,

σ^{2}_{*} = *k*(**x**_{*}, **x**_{*}) − *k*(**x**_{*})^{T}(**K**_{(**u**,**u**)} + **P**^{−1}_{(**u**,**u**)})^{−1}*k*(**x**_{*}),

where *k*(**x**_{*}) = (*k*(**x**_{*}, **x**_{i}))_{i∈**u**}. Here the subscript (**u**, **u**) of a matrix is used to represent the rows and columns of the matrix corresponding to the elements of the set **u**. The predictive variance for an example **x**_{*} far from all training data tends to *k*(**x**_{*}, **x**_{*}), the prior variance, as *k*(**x**_{*}) → **0** if the gaussian kernel function is used. In the regions where the basis vectors are reasonably dense, the predictive variance is expected to be small, and *p*(*y*_{*}|**y**) ≈ Φ(λ*y*_{*}(μ̄_{*} + *b*)). The presence of the predictive variance term in the denominator moderates the probability score (see equation 2.7). MacKay (1992) suggested using a moderated probability score while computing the predictive performance of a model. Since the predictive mean and variance are easily available for GP models (during training), a moderated probability score can be used effectively for SGPC design. Estimation of this score during the inference stage, however, requires extra computational effort of *O*(*d*^{2}_{max}) because of the predictive variance calculation.
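The moderating effect of the variance in equation 2.7 is easy to see numerically: growing predictive variance pulls the probability toward 0.5, whatever the predictive mean. A sketch, with the function name our own:

```python
from math import erf, sqrt

def moderated_prob(y, mu, var, lam=1.0, b=0.0):
    """p(y | x) = Phi(lam * y * (mu + b) / sqrt(1 + lam^2 * var)); the
    variance term in the denominator moderates the probability score."""
    z = lam * y * (mu + b) / sqrt(1.0 + lam ** 2 * var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

confident = moderated_prob(+1, 2.0, 0.0)   # small variance: close to 1
moderated = moderated_prob(+1, 2.0, 25.0)  # large variance: pulled toward 0.5
```

This is why a model is prevented from being overconfident in regions where data (and hence basis vectors) are sparse.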

The NLP loss measure for a given basis vector set **u** is defined as

NLP(**u**) = −(1/|**u**^{c}|) ∑_{j∈**u**^{c}} log Φ(λ*y*_{j}(μ̄_{j} + *b*)/√(1 + λ^{2}**A**_{jj})), (2.8)

where μ̄_{j} and **A**_{jj}, respectively, denote the predictive mean and predictive variance for the *j*th training set sample. Zero is the minimum value of this loss, attained if the predictions are correct with 100% confidence (λ|μ̄_{j} + *b*| large and **A**_{jj} ≈ 0). In general, if λ*y*_{j}(μ̄_{j} + *b*) is large enough compared to **A**_{jj}, the NLP loss value is small. On the other hand, the loss value is large if the predictor is certain about a wrong prediction. The advantage of using the NLP loss measure over the PML measure is that it estimates the predictive ability of the model by using both the predictive mean, μ̄, and the predictive variance, diag(**A**). More important, the NLP loss measure is a differentiable function, so it can be optimized using any standard nonlinear optimization technique for hyperparameter adaptation. Thus, the NLP loss measure can be used for both basis vector selection in the inner loop and hyperparameter adaptation in the outer loop. This makes the proposed algorithm coherent, as the same measure is used for both basis vector selection and hyperparameter adaptation.

We emphasize that the NLP loss measure (see equation 2.8) may not be a suitable criterion if |**u**^{c}| is very small compared to *N*. For practical purposes, it can be used by choosing *d*_{max} ≪ *N*. Alternatively, a leave-one-out cross-validation-based NLP loss measure for the full model can be used, as given in Sundararajan and Keerthi (2008).
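Combining the moderated probability with equation 2.8, the NLP loss over the nonactive set can be sketched as follows (variable names are our own; the inputs are hypothetical predictive means and variances):

```python
from math import erf, log, sqrt

def nlp_loss(y, mu_bar, A_diag, nonactive, lam=1.0, b=0.0):
    """NLP(u) = -(1/|u^c|) * sum_j log Phi(lam*y_j*(mu_j+b)/sqrt(1+lam^2*A_jj)),
    summed over the nonactive (validation) indices j."""
    total = 0.0
    for j in nonactive:
        z = lam * y[j] * (mu_bar[j] + b) / sqrt(1.0 + lam ** 2 * A_diag[j])
        total -= log(0.5 * (1.0 + erf(z / sqrt(2.0))))
    return total / len(nonactive)

y = [1.0, -1.0, 1.0]
# Same correct predictive means; small versus large predictive variances.
loss_sharp = nlp_loss(y, [2.0, -2.0, 2.0], [0.0, 0.0, 0.0], [0, 1, 2])
loss_vague = nlp_loss(y, [2.0, -2.0, 2.0], [9.0, 9.0, 9.0], [0, 1, 2])
```

For correctly classified points, shrinking the predictive variance lowers the NLP loss, which is the behavior exploited during basis vector selection.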

### 2.5. Implementation Aspects.

The algorithm in the proposed method alternates between the inner and outer loops until a stopping condition is satisfied. We now describe the procedure for basis vector addition in the inner loop. The corresponding site parameters are determined by matching moments of the posterior approximation in equation 2.5. More details about the necessary calculations can be found in Lawrence, Seeger, et al. (2005).

Initially, **u** = ∅, and all the components of **p** and **m** are set to zero. The algorithm progresses by selecting an appropriate basis vector from the set **u**^{c}. Suppose that an example index *i* ∈ **u**^{c} is added to the set **u**, giving **ũ** = **u** ∪ {*i*}. We then require evaluating NLP(**ũ**) using equation 2.8. This necessitates the calculation of μ̄ and the diagonal entries of the matrix **A** corresponding to **ũ**. As described below, this can be done efficiently by using the readily available μ̄ and **A** corresponding to **u**. In the following discussion, we use the subscript (**u**, **u**) to denote the rows and columns of the matrix corresponding to the elements of the set **u**. The subscript (·, **u**) denotes the columns of the matrix corresponding to the elements of the set **u**. Recall **A** = (**K**^{−1} + **P**)^{−1} and μ̄ = **APm**. Since the site parameters corresponding to the nonbasis vector set are zero, we can write **A** = **K** − **M**^{T}**M**, where **M** = **L**^{−1}**P**^{1/2}_{(**u**,**u**)}(**K**_{(·,**u**)})^{T} (Lawrence et al., 2003). Here, **L** is the lower-triangular Cholesky factor of **I** + **P**^{1/2}_{(**u**,**u**)}**K**_{(**u**,**u**)}**P**^{1/2}_{(**u**,**u**)}. Note that the matrix **M** is of size |**u**| × *N*. So the addition of a basis vector into the set **u** requires adding an extra row and column to the matrix **L** and adding an extra row to the matrix **M** (equations 2.9 and 2.10). The necessary variables in equation 2.9 are computed from the current model, and the site parameters are determined by moment matching, following Lawrence et al. (2003) (equations 2.11 and 2.12). Since **A** = **K** − **M**^{T}**M**, updating the matrix **A** needs only a simple rank-one downdate: using the fact that the inclusion appends a single row to **M**, the updates of diag(**A**) and μ̄ involve only that new row (equation 2.13), with the squaring in equation 2.13 done elementwise. The computations in equations 2.9 to 2.13 have *O*(*Nd*_{max}) complexity.
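The identity **A** = **K** − **M**^{T}**M** can be verified numerically against the direct (dense) computation. The following is a sketch with one basis vector and a hypothetical site precision; the variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

i, p_i = 2, 1.5                      # one included basis vector, site precision
u = [i]
P_uu = np.array([[p_i]])
B = np.eye(1) + np.sqrt(P_uu) @ K[np.ix_(u, u)] @ np.sqrt(P_uu)
L = np.linalg.cholesky(B)            # lower-triangular Cholesky factor
M = np.linalg.solve(L, np.sqrt(P_uu) @ K[u, :])  # M is |u| x N

A = K - M.T @ M                      # sparse-form posterior covariance

# Direct O(N^3) computation of the same quantity, for verification only.
P = np.zeros((5, 5))
P[i, i] = p_i
A_direct = np.linalg.inv(np.linalg.inv(K) + P)
```

Adding another basis vector appends one row and column to `L` and one row to `M`, so `diag(A)` changes by a rank-one downdate rather than being recomputed.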

In principle, every *j* ∈ **u**^{c} can be tested for possible inclusion in the basis vector index set **u**. The *i*th example is then selected for inclusion in **u** if *i* = arg min_{j∈**u**^{c}} NLP(**u** ∪ {*j*}). However, this procedure of basis vector selection requires a computational effort of *O*(*N*^{2}*d*_{max}). Since searching for an appropriate basis vector over the whole set **u**^{c} is expensive due to memory and computational cost requirements, the proposed method maintains a set **J** ⊆ **u**^{c} of fixed size κ; we refer to this as the *working set*. Smola and Bartlett (2000) suggested constructing this working set in each iteration by randomly choosing κ elements from **u**^{c} and set κ to 59. In our implementation, we set κ to min(|**u**^{c}|, 59). The *i*th example is selected for inclusion in **u** if

*i* = arg min_{j∈**J**} NLP(**u** ∪ {*j*}). (2.14)

Thus, the computational effort needed to select a basis vector is reduced to *O*(κ*Nd*_{max}).
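The working-set selection rule in equation 2.14 is generic and can be sketched independently of the loss; here the `score` argument stands in for the NLP evaluation of **u** ∪ {*j*} (the function and its arguments are our own names):

```python
import random

def select_basis_vector(u_c, score, kappa=59, seed=0):
    """Pick the candidate in a random working set J of size min(|u^c|, kappa)
    that minimizes the given score, as in equation 2.14."""
    rng = random.Random(seed)
    J = rng.sample(sorted(u_c), min(len(u_c), kappa))
    return min(J, key=score)

u_c = set(range(100))
# A toy score standing in for NLP(u ∪ {j}); kappa = |u^c| searches everything.
winner = select_basis_vector(u_c, score=lambda j: abs(j - 42))
winner_full = select_basis_vector(u_c, score=lambda j: abs(j - 42), kappa=100)
```

Restricting the search to κ random candidates trades a slightly worse greedy choice per step for a κ/*N* reduction in selection cost.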

When the winner example index *i* is included in the set **u**, the corresponding site parameters *p*_{i} and *m*_{i} are updated according to equations 2.11 and 2.12. The matrices **L** and **M** are updated according to equations 2.9 and 2.10. The predictive variance and mean are updated using equation 2.13. This procedure is repeated until *d*_{max} basis vectors are added to the model.

The method then switches to the outer loop, where the hyperparameters are determined by minimizing the NLP measure, keeping the basis vectors and site parameters fixed. Ideally, a small change in the hyperparameters would require an appropriate change in the site parameters. However, in our method, we fix the site parameters during the adaptation of hyperparameters to reduce the computational cost. Any nonlinear optimization technique like conjugate gradient method with one or two line searches is suitable for hyperparameter adaptation. In our implementation, this two-loop algorithm terminates when there is no significant reduction in the NLP loss over five consecutive iterations. The iterative algorithm is as follows:

1. Initialize the hyperparameters λ, *b*, and θ.
2. Initialize **u** = ∅, **u**^{c} = {1, 2, …, *N*}, **p** = **0**, and **m** = **0**.
3. Select a random basis vector *i* from **u**^{c}.
4. Update the site parameters *p*_{i} and *m*_{i} according to equations 2.11 and 2.12 and the matrices **M** and **L** according to equations 2.9 and 2.10. Update μ̄ and diag(**A**) for the basis vector set using equation 2.13. Set **u** = **u** ∪ {*i*} and **u**^{c} = **u**^{c} ∖ {*i*}.
5. If |**u**| < *d*_{max}, create a working set **J** ⊆ **u**^{c}, find *i* according to equation 2.14, and go to step 4.
6. Reestimate the hyperparameters λ, *b*, and θ by minimizing the NLP loss in equation 2.8, keeping **u** and the corresponding site parameters fixed.
7. Terminate if the stopping criterion is satisfied. Otherwise, go to step 2.
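The two-loop structure of the algorithm can be sketched as a skeleton with the expensive pieces abstracted behind callbacks. All names here are our own, and `fake_nlp` is a dummy stand-in for the training-set NLP, used only to exercise the stopping rule:

```python
def sparse_gpc_fit(N, d_max, select, add_to_model, adapt_hyperparameters,
                   training_nlp, patience=5, tol=1e-4):
    """Two-loop skeleton: the inner loop grows the basis vector set to d_max,
    the outer loop re-adapts hyperparameters, and the method stops when the
    training NLP fails to improve for `patience` consecutive iterations."""
    best, stalled = float("inf"), 0
    while stalled < patience:
        u, u_c = [], set(range(N))        # inner loop: rebuild the basis set
        while len(u) < d_max and u_c:
            i = select(u, u_c)            # steps 3/5: pick a basis vector
            add_to_model(i)               # step 4: site/matrix updates
            u.append(u_c.discard(i) or i)
        adapt_hyperparameters(u)          # step 6: minimize NLP (eq. 2.8)
        nlp = training_nlp(u)             # stopping check on training NLP
        stalled = stalled + 1 if best - nlp < tol else 0
        best = min(best, nlp)
    return u

# Dummy callbacks: the training NLP improves, then plateaus, so the
# five-iteration stopping rule fires and the loop terminates.
calls = [0]
def fake_nlp(u):
    calls[0] += 1
    return max(0.4, 1.0 / calls[0])

basis = sparse_gpc_fit(N=20, d_max=4, select=lambda u, u_c: min(u_c),
                       add_to_model=lambda i: None,
                       adapt_hyperparameters=lambda u: None,
                       training_nlp=fake_nlp)
```

In a real implementation, `select` would apply equation 2.14 over a working set, and `adapt_hyperparameters` would run a few conjugate-gradient line searches on the NLP loss.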

## 3. Discussion

In this section, we discuss the key aspects of the proposed method and related methods. The data sets referred to in this section are real-world data sets, and their details can be found in the next section.

### 3.1. NLP Loss and Sparsity.

An alternative approach for basis vector selection in the inner loop would be the use of predictive misclassification loss defined in equation 2.6. The basis vector selection is done using equation 2.14 with the NLP loss replaced by the PML. We compared the behaviors of training set PML and NLP loss as the basis vectors are added to the model. The results on the Banana and Ringnorm data sets are shown, respectively, in the left and right panels of Figure 1. It can be seen that the first local minimum of PML measure is attained using a very small number of basis vectors, and there is no significant change in this measure even if more basis vectors are added to the model. So the use of the PML measure for basis vector set selection resulted in a sparser model. But this model exhibited a large test set NLP loss. Note that at this local minimum of PML measure, the training set NLP loss is still large and continues to decrease as more basis vectors are added. The reason is evident from equation 2.8. As more basis vectors are added to the model in the proposed method, the variance term, **A**_{jj}, in the denominator is reduced. This is useful in reducing the NLP loss function for all the correctly classified data points, thereby improving the generalization performance.

We also conducted an experiment where the NLP loss measure was used for both basis vector selection and hyperparameter adaptation. The behaviors of the training set NLP loss and the test set NLP loss were monitored as the number of outer loop iterations progresses. This is depicted in Figure 2 for the Banana data set. It is observed from the figure that the training set NLP loss tracks the test set NLP loss reasonably well. This observation provides a good stopping criterion for the proposed algorithm. The algorithm can terminate if there is no significant reduction in the training set NLP loss over a few consecutive iterations.

The proposed method is thus more suitable when the generalization error is assessed in terms of the NLP loss. Further, it is very effective for both basis vector set selection and hyperparameter adaptation.

### 3.2. Choice of Basis Vectors.

The choice of basis vectors plays an important role in the sparseness and generalization performance of a classifier. To illustrate graphically the selection of basis vectors using the proposed method, we conducted an experiment on the Banana data set consisting of 400 training set samples in a two-dimensional space. Figure 3a shows the training data set with two overlapping classes. The basis vectors are shown in Figure 3b. The decision boundary produced by the proposed classifier with *d*_{max} = *N*/5 is shown in Figure 3 by a solid line. It can be seen that the proposed model selection method chooses the basis vector set carefully so as to represent the input data distribution quite well. Due to this, the predictive variances of most of the data points become small. This results in better performance in terms of NLP loss for all the correctly classified data points. Further, some basis vectors lie away from the boundary and are also selected from the regions where data are sparse. This observation was also made in the case of the relevance vector machine by Tipping (2001).

At this juncture, it is interesting to compare the choice of basis vector set of a sparse GP classifier with that of another popular sparse classifier, the support vector machine (SVM; Schölkopf & Smola, 2002). In an SVM, the basis vectors (called support vectors) lie either on the supporting hyperplanes of the two classes or on the wrong sides of these hyperplanes. There are no support vectors in the regions away from the supporting hyperplanes on the correct sides. Hence, the support vectors are unlikely to represent the input distribution reasonably well. Due to this, even if one attempts to estimate the predictive variance using the SVM model, it will be large in certain regions. Although Platt (1999) presented a method for extracting predictive probabilities from SVM outputs by postprocessing, this method does not provide predictive variance information, which is essential in various problem domains (Seeger, 2005). The weaknesses of Platt's method for predictive probability estimation were pointed out in Tipping (2001) and Seeger (2005).

### 3.3. Hyperparameter Adaptation.

To study the importance of hyperparameter adaptation, we conducted the following experiments on two different benchmark data sets. First, the behavior of the NLP loss on the training set as basis vectors are added to the model was observed in the first iteration of the inner loop (where the hyperparameters were initialized using techniques described in Schölkopf and Smola (2002)). Next, the proposed method was used for basis vector set selection and hyperparameter adaptation for a few outer loop iterations. The behavior of the NLP loss on the training set was observed during the last inner loop iteration. The effect of hyperparameter adaptation on the training set NLP and the generalization performance is clearly evident from Figure 4. The hyperparameter adaptation has resulted in better training set and test set NLP loss. The performance in terms of test set error is also better with hyperparameter adaptation. The training set NLP loss is typically observed to track the test set NLP loss. Therefore, the hyperparameter adaptation in conjunction with basis vector set selection results in better generalization performance.

### 3.4. Related Work.

We now briefly discuss the related methods for sparse GP classifier design suggested in the literature. The relevance vector machine (RVM) introduced by Tipping (2001) produces a sparse probabilistic model by using an improper hierarchical prior and optimizing over hyperparameters. It can be seen as a degenerate GP; however, it suffers from the drawback that the certainty about its prediction increases as one moves further away from the training data. Rasmussen and Quiñonero-Candela (2005) suggested a solution to overcome this problem, but the resulting model was not truly a sparse model. Csató and Opper (2002) proposed a greedy algorithm that computes a sparse gaussian approximation for the posterior of GP models by minimizing the Kullback-Leibler divergence between the approximate posterior (obtained using a sparse representation) and the true GP posterior. This is an online algorithm and builds a sparse model as new samples are seen. It was especially suggested for large data sets in an online setting and was designed to make a single sweep through the data set. Thus, it did not address the problem of hyperparameter adaptation. As we mentioned earlier in this section, hyperparameter adaptation is important and can result in a better solution.

The IVM method (Lawrence et al., 2003) uses the same two-loop design but selects the basis vector *i* (in step 5 of the algorithm) such that the differential entropy score is maximized. The basis vector selection costs *O*(*N*) for the IVM method as compared to *O*(κ*Nd*_{max}) for the proposed method. Note that the proposed method selects a basis vector in every iteration from κ examples in the set **u**^{c}, whereas the IVM method selects it from all the examples in **u**^{c} owing to its inexpensive score evaluation per example. This makes the IVM method more appropriate for large data sets. We also observed that the extra effort needed by the proposed method in the inner loop is not significantly high, as the outer loop typically needs about 10 iterations. The proposed method requires *O*(*Nd*^{2}_{max}) effort for hyperparameter adaptation, and its overall computational complexity is *O*(κ*Nd*^{2}_{max}). In the next section, we compare the performance of the proposed method with the IVM method and empirically find that using the NLP loss measure for both basis vector selection and hyperparameter adaptation is useful from the viewpoints of sparse model design and good generalization.
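For contrast, IVM's per-candidate score needs only scalar quantities already maintained by the ADF updates, which is why it can afford to scan all of **u**^{c}. A sketch, assuming the −½ log(1 − ν_{j}**A**_{jj}) form of the differential entropy score given for IVM by Lawrence et al. (2003); the numeric values are hypothetical:

```python
from math import log

def entropy_score(nu_j, A_jj):
    """Reduction in differential entropy from including point j; IVM picks
    the argmax of this score over u^c, an O(1) evaluation per candidate."""
    return -0.5 * log(1.0 - nu_j * A_jj)

# Hypothetical (nu_j, A_jj) pairs for three candidate points.
scores = {j: entropy_score(nu, a)
          for j, (nu, a) in enumerate([(0.5, 0.9), (0.1, 0.2), (0.8, 0.5)])}
winner = max(scores, key=scores.get)
```

Evaluating the NLP score of equation 2.14 instead costs *O*(*N*) per candidate, which is the price the proposed method pays for a validation-driven choice.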

## 4. Numerical Experiments

In this section, we compare our method against other GP classifiers on various real-world benchmark data sets to verify the efficacy of our method. Twelve benchmark data sets, summarized in Table 1, were used for this purpose.^{2} Detailed information about these data sets can be found in Rätsch, Onoda, and Müller (2001). All the data sets except the Image and Splice data sets have 100 realizations of the training and test data. The Image and Splice data sets have 20 realizations of the training and test data. Further, for the Splice data set, we used only variables 28 to 34 instead of the original 60 input variables, since they are the relevant ones. For all experiments, we used the gaussian kernel function defined in section 2.1, with *v*, σ > 0.

| Data Set | *N* | *d* | *N*_{test} |
|---|---|---|---|
| Banana | 400 | 2 | 4900 |
| Breast Cancer | 200 | 9 | 77 |
| Diabetes | 468 | 8 | 300 |
| German | 700 | 20 | 300 |
| Heart | 170 | 13 | 100 |
| Image | 1300 | 18 | 1010 |
| Ringnorm | 400 | 20 | 7000 |
| Splice | 1000 | 7 | 2175 |
| Thyroid | 140 | 5 | 75 |
| Titanic | 150 | 3 | 2051 |
| Twonorm | 400 | 20 | 7000 |
| Waveform | 400 | 21 | 4600 |

Note: *N* is the size of the training data, *d* is the input dimension, and *N _{test}* is the size of the test data.

We compare the following methods:

- The baseline method (baseline), which selects **u** at random (κ = 1)
- The IVM method (IVM)^{3}
- Our method (NLP) discussed in section 2, with κ = |**u**^{c}| and κ = 59
- The full model GP classification algorithm (full model GP)

For the baseline and NLP methods, *d*_{max} was set to *N*/5. The IVM method was tried over a few values of *d*_{max}, and the results are reported for the value of *d*_{max} that resulted in the least test set NLP loss.

In all the experiments, the hyperparameters were initialized using techniques described in Schölkopf and Smola (2002). In the full model GP classifier design, the site parameters associated with the training set examples were updated using expectation propagation ideas (Minka, 2001; Kuss & Rasmussen, 2005), and the hyperparameters were adapted by minimizing the NLP loss (Sundararajan & Keerthi, 2008). The comparative test set NLP loss values for these methods on various data sets are reported in Table 2. From this table, it is clear that the generalization performance of the proposed method using *N*/5 basis vectors is close to that of the full model GP classifier on several data sets. Further, the performance of the classifier designed using κ = |**u**^{c}| is not significantly better than that of the classifier designed using κ = 59.^{4} For data sets such as Breast Cancer, Thyroid, and Heart, where the sizes of the training and test data are small, performance close to that of the full model GP was observed even with *d*_{max} = *N*/10. This was also the case for the Banana, German, and Titanic data sets. Notice also the large standard deviations in the NLP loss for the Breast Cancer, Thyroid, and Heart data sets, possibly due to the small test set sizes. The remaining data sets, however, needed higher values of *d*_{max} to achieve a better test set NLP loss. This is demonstrated for the Image and Thyroid data sets in Table 3. For the Thyroid data set, the test set NLP loss decreases until *d*_{max} = *N*/3 and starts increasing as *d*_{max} grows beyond this value. This data set contains only 140 training set examples; therefore, if *N*/2 basis vectors are used, the model overfits the data. This behavior was not observed for the Image data set. In general, overfitting can take place as the validation set size decreases; the exact validation set size at which overfitting occurs depends on the data set.
For large data sets, *d*_{max} can be chosen such that *d*_{max} ≪ *N*, and therefore one can reliably predict the performance of the model. However, if *N* is small, the full model GP classifier designed using leave-one-out-based NLP loss can be used.
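For reference, the test set NLP loss reported in the tables below is the average negative log predictive probability over held-out examples. A minimal sketch (the function name and the {+1, −1} label convention are our assumptions):

```python
import math

def nlp_loss(p_pred, y):
    """Average negative log predictive probability.

    p_pred[i] is the model's predictive probability that example i
    belongs to class +1; y[i] is the true label in {+1, -1}.  Lower
    values indicate better-calibrated, more confident predictions.
    """
    total = 0.0
    for p, label in zip(p_pred, y):
        q = p if label == +1 else 1.0 - p
        total += -math.log(q)
    return total / len(y)
```

For instance, a classifier that outputs probability 0.5 on every example incurs an NLP loss of ln 2 ≈ 0.693 regardless of the labels.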

| Data Set | Full Model GP | Baseline | NLP (κ = \|**u**^{c}\|) | NLP (κ = 59) | IVM |
|---|---|---|---|---|---|
| Banana | 24.62 ± 1.78 | 30.33 ± 2.42 | 25.06 ± 1.29 | 25.08 ± 1.24 | 24.65 ± 0.79 |
| Breast Cancer | 53.95 ± 4.91 | 55.75 ± 4.23 | 54.22 ± 7.99 | 55.13 ± 6.81 | 63.19 ± 4.34 |
| Diabetes | 47.83 ± 2.04 | 50.18 ± 2.36 | 49.49 ± 2.46 | 49.24 ± 2.34 | 56.61 ± 6.46 |
| German | 48.86 ± 2.76 | 51.37 ± 2.46 | 49.09 ± 3.02 | 49.20 ± 2.99 | 60.03 ± 6.87 |
| Heart | 39.94 ± 5.95 | 43.43 ± 4.72 | 40.24 ± 5.37 | 40.64 ± 5.73 | 48.19 ± 9.82 |
| Image | 8.74 ± 1.23 | 19.76 ± 2.79 | 12.06 ± 0.94 | 12.14 ± 0.99 | 15.29 ± 1.45 |
| Ringnorm | 16.38 ± 1.30 | 17.34 ± 1.99 | 12.31 ± 1.57 | 11.98 ± 1.50 | 13.24 ± 0.61 |
| Splice | 18.36 ± 1.12 | 26.11 ± 1.58 | 18.72 ± 0.80 | 18.92 ± 0.76 | 19.78 ± 2.30 |
| Thyroid | 9.97 ± 3.69 | 18.24 ± 5.13 | 14.15 ± 3.14 | 14.38 ± 4.12 | 12.23 ± 3.49 |
| Titanic | 51.57 ± 1.61 | 52.97 ± 1.59 | 52.10 ± 1.62 | 51.98 ± 1.64 | 57.68 ± 5.51 |
| Twonorm | 7.59 ± 1.46 | 12.30 ± 1.29 | 8.16 ± 0.61 | 8.06 ± 0.46 | 9.60 ± 0.93 |
| Waveform | 22.11 ± 0.74 | 27.61 ± 1.48 | 22.91 ± 0.69 | 23.01 ± 0.70 | 23.18 ± 0.74 |

Notes: Test set NLP loss values (×10^{−2}) reported are the averages over 100 trials (20 trials for the Splice and Image data sets), along with the standard deviations. The results for the baseline and NLP methods are reported for *N*/5 basis vectors. For the IVM method, the number of basis vectors for all the data sets is *N*/2 except for the Image and Titanic data sets, for which the numbers of basis vectors are *N*/3 and 2*N*/3, respectively.

| *d*_{max} | Image (Test Set NLP ×10^{−2}) | Thyroid (Test Set NLP ×10^{−2}) |
|---|---|---|
| *N*/20 | 23.92 ± 1.06 | 20.96 ± 4.06 |
| *N*/10 | 17.70 ± 1.38 | 15.21 ± 4.18 |
| *N*/5 | 12.14 ± 0.99 | 14.38 ± 4.12 |
| *N*/4 | 11.20 ± 0.80 | 13.33 ± 2.81 |
| *N*/3 | 9.97 ± 1.40 | 13.23 ± 4.85 |
| *N*/2 | 9.76 ± 1.08 | 20.62 ± 3.59 |

Table 4 reports the comparative test set error (%) values for the same methods. It is clear from this table that the generalization performance (in terms of test set error) of the proposed method using *N*/5 basis vectors is close to that of the full model GP classifier on most of the data sets. Further, the performance of the classifier designed using κ = |**u**^{c}| is close to that of the classifier designed using κ = 59.

| Data Set | Full Model GP | Baseline | NLP (κ = \|**u**^{c}\|) | NLP (κ = 59) | IVM |
|---|---|---|---|---|---|
| Banana | 10.60 ± 0.57 | 12.74 ± 1.36 | 10.56 ± 0.50 | 10.58 ± 0.52 | 10.36 ± 0.48 |
| Breast Cancer | 26.21 ± 4.71 | 28.04 ± 4.76 | 26.53 ± 5.50 | 26.88 ± 4.64 | 35.10 ± 12.26 |
| Diabetes | 23.43 ± 1.86 | 25.04 ± 2.24 | 24.44 ± 1.79 | 24.22 ± 1.76 | 28.87 ± 8.60 |
| German | 23.44 ± 2.03 | 25.40 ± 2.35 | 23.76 ± 2.22 | 24.07 ± 2.19 | 27.06 ± 5.75 |
| Heart | 16.27 ± 3.90 | 18.71 ± 3.71 | 16.32 ± 2.77 | 16.37 ± 2.68 | 21.07 ± 9.30 |
| Image | 3.11 ± 0.49 | 8.43 ± 1.63 | 4.59 ± 0.45 | 4.51 ± 0.53 | 3.99 ± 0.83 |
| Ringnorm | 4.59 ± 0.78 | 2.32 ± 0.64 | 4.37 ± 1.02 | 4.26 ± 1.09 | 1.83 ± 0.18 |
| Splice | 5.75 ± 0.59 | 9.75 ± 1.15 | 5.96 ± 0.59 | 6.13 ± 0.55 | 5.61 ± 0.67 |
| Thyroid | 4.08 ± 2.10 | 6.01 ± 2.72 | 4.84 ± 2.39 | 4.97 ± 2.29 | 4.48 ± 2.13 |
| Titanic | 22.61 ± 1.41 | 23.33 ± 1.30 | 22.80 ± 1.05 | 22.74 ± 1.12 | 25.54 ± 9.20 |
| Twonorm | 2.74 ± 0.40 | 3.32 ± 0.43 | 2.88 ± 0.27 | 2.87 ± 0.24 | 2.64 ± 0.25 |
| Waveform | 9.74 ± 0.43 | 12.14 ± 1.00 | 10.06 ± 0.42 | 10.13 ± 0.46 | 9.97 ± 0.44 |

Notes: Test set error (%) values reported are the averages over 100 trials (20 trials for the Splice and Image data sets), along with the standard deviations. The results for the baseline and NLP methods are reported for *N*/5 basis vectors. For the IVM method, the number of basis vectors for all the data sets is *N*/2 except for the Image and Titanic data sets, for which the numbers of basis vectors are *N*/3 and 2*N*/3, respectively.

For the Ringnorm data set, though the NLP measure-based classifiers exhibited better performance in terms of NLP loss (see Table 2), their test set errors were inferior to those of the IVM and baseline methods. We also note that the test set error of the proposed method with *d*_{max} = *N*/20 was observed to be 1.99%, with an NLP loss of 17.86 × 10^{−2}. This behavior is possibly due to the presence of a large number of examples near the boundary. In such a case, the NLP loss may decrease as more basis vectors are added, but this can result in a slightly higher test set error due to the change in the boundary.
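A tiny synthetic illustration of this effect (the numbers are ours, chosen only to show that NLP loss and error rate can rank two classifiers differently when many examples sit near the boundary):

```python
import math

def nlp(ps, ys):
    """Average negative log predictive probability (labels in {+1, -1})."""
    return sum(-math.log(p if y == 1 else 1 - p)
               for p, y in zip(ps, ys)) / len(ys)

def err(ps, ys):
    """Fraction of examples misclassified at the 0.5 threshold."""
    return sum((p >= 0.5) != (y == 1) for p, y in zip(ps, ys)) / len(ys)

y = [1, 1]
a = [0.51, 0.51]   # barely right on both boundary examples
b = [0.49, 0.95]   # barely wrong on one, very confident on the other

# Classifier b wins on NLP loss yet loses on error rate.
assert nlp(b, y) < nlp(a, y) and err(b, y) > err(a, y)
```

The two measures disagree because the error rate only sees which side of 0.5 a prediction falls on, while the NLP loss rewards confident correct probabilities and only mildly penalizes a barely wrong one.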

The IVM method needed more basis vectors (*N*/2 in the majority of cases) to achieve a similar level of generalization performance. We observed that the hyperparameter adaptation step in the IVM method resulted in small width parameter values for the gaussian kernel. It can also be seen that the performance of the IVM method is not good, especially when the data set is relatively noisy (e.g., Breast Cancer, Diabetes, German, and Titanic). Such data sets require a larger number of basis vectors to reach the desired generalization performance. It is interesting to note that the baseline method performed reasonably well on data sets like Breast Cancer, Diabetes, and Titanic.

We also compared the computation time of the proposed method (κ = 59) with that of the IVM method (both using *N*/5 basis vectors and implemented in Matlab) on a single realization of the Banana and Image data sets. The experiments were done on a machine with a 2.4 GHz dual-core AMD CPU and 4 GB of main memory running Red Hat Linux. For the Banana data set, the CPU times for the IVM and the proposed method were 3.96 and 29.24 seconds, respectively. The corresponding values for the Image data set were 37.1 and 1209.2 seconds. Though slower in terms of training time, the proposed method with *N*/5 basis vectors resulted in a model with better generalization performance than that obtained using the IVM method, as is also clear from Tables 2 and 4. Further, the computation time does not scale linearly with κ, because the hyperparameter adaptation is computationally more expensive than the basis vector selection. Note that we did not carry out a detailed timing comparison of the different methods, as that was not the aim of this letter.

From these experiments, we can see that the sparse GP classifier designed using the proposed method typically uses a smaller number of basis vectors than that designed using the IVM method and gives generalization performance comparable to that of the full model GP classifier. We also note that compared to the full model classifier, the difference in performance of the proposed classifier is not statistically significant for most of the data sets. Thus, the proposed method provides a useful practical solution to the SGPC design problem.

## 5. Conclusion

In this letter, we proposed an NLP loss-based method for basis vector selection and hyperparameter adaptation in SGPC design. This loss function makes use of a moderated predictive probability score, which is readily available for GP models. Numerical experiments on various real-world benchmark data sets demonstrate that the generalization performance of the sparse GP classifier designed using the proposed method is comparable to that of the full model GP classifier. Further, compared to the IVM method, it uses a smaller number of basis vectors while yielding similar generalization performance. This improvement, however, comes at extra computational cost relative to the IVM method. We are investigating ways to reduce this computational complexity, and the results will be reported elsewhere. It will also be interesting to study the generalization performance of the proposed method when it is allowed to delete some of the basis vectors from the model and recompute the site parameters of “all” the basis vectors currently present in the model.
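As a sketch of the moderated score, for a probit likelihood the predictive probability is commonly obtained by averaging the probit over the approximate gaussian latent posterior, giving Φ(μ/√(1 + σ²)); we assume that standard form here, and the letter's exact expression may differ:

```python
import math

def moderated_prob(mu, var):
    """Moderated predictive probability for a probit-likelihood GP classifier:
    p(y = +1 | x) = Phi(mu / sqrt(1 + var)), where mu and var are the mean and
    variance of the approximate gaussian latent posterior at the test input.
    The variance inflates the denominator, pulling the probability toward 0.5
    where the latent function is uncertain."""
    z = mu / math.sqrt(1.0 + var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

With zero latent variance this reduces to the plain probit of the mean, and increasing the variance moderates the same mean toward 0.5.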

## Acknowledgments

The work of the first author was partially supported by DST, government of India, grant SR/S3/EECE/61/2006. We thank anonymous referees for useful comments and suggestions.

## Notes

^{1}

Though **u** represents the index set of basis vectors, we also use it to denote the actual basis vector set **X**_{u}.

^{2}

These data sets are publicly available online at http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.

^{3}

The software is available online at http://www.cs.man.ac.uk/~neill/ivm/.

^{4}

Though theoretical justification exists for setting κ to 59 (Smola & Bartlett, 2000), it was observed that the computational complexity of the algorithm can be improved (without significant change in the generalization performance) by setting κ to smaller values. In particular, we observed that the generalization performances on these data sets using κ = 59 and κ = 30 were not significantly different. We are, however, not aware of any automatic technique to select κ for a given data set.
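The standard argument of Smola and Bartlett (2000) behind κ = 59 can be checked in a few lines: the best of κ randomly drawn candidates lies in the top 5% of all candidates with probability 1 − 0.95^{κ}, and requiring this probability to be at least 0.95 gives the smallest sufficient κ.

```python
import math

# Smallest kappa such that 1 - 0.95**kappa >= 0.95, i.e. the best of kappa
# random candidates lands in the top 5% with at least 95% probability.
kappa = math.ceil(math.log(0.05) / math.log(0.95))
print(kappa)  # 59

assert 1 - 0.95 ** kappa >= 0.95       # kappa = 59 suffices
assert 1 - 0.95 ** (kappa - 1) < 0.95  # kappa = 58 does not
```

The same calculation explains why the guarantee is independent of the data set size: only the quantile (5%) and the confidence level (95%) enter, which is also why no automatic, data-dependent choice of κ follows from it.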