## Abstract

The aim of this letter is to propose a theory of deep restricted kernel machines offering new foundations for deep learning with kernel machines. From the viewpoint of deep learning, it is partially related to restricted Boltzmann machines, which are characterized by visible and hidden units in a bipartite graph without hidden-to-hidden connections, and to deep learning extensions such as deep belief networks and deep Boltzmann machines. From the viewpoint of kernel machines, it includes least squares support vector machines for classification and regression, kernel principal component analysis (PCA), matrix singular value decomposition, and Parzen-type models. A key element is to first characterize these kernel machines in terms of so-called conjugate feature duality, yielding a representation with visible and hidden units. It is shown how this is related to the energy form in restricted Boltzmann machines, with continuous variables in a nonprobabilistic setting. In this new framework of so-called restricted kernel machine (RKM) representations, the dual variables correspond to hidden features. Deep RKMs are obtained by coupling the RKMs. The method is illustrated for a deep RKM consisting of three levels with a least squares support vector machine regression level and two kernel PCA levels. In its primal form, deep feedforward neural networks can also be trained within this framework.

## 1 Introduction

Deep learning has become an important method of choice in several research areas including computer vision, speech recognition, and language processing (LeCun, Bengio, & Hinton, 2015). Among the existing techniques in deep learning are deep belief networks, deep Boltzmann machines, convolutional neural networks, stacked autoencoders with pretraining and fine-tuning, and others (Bengio, 2009; Goodfellow, Bengio, & Courville, 2016; Hinton, 2005; Hinton, Osindero, & Teh, 2006; LeCun et al., 2015; Lee, Grosse, Ranganath, & Ng, 2009; Salakhutdinov, 2015; Schmidhuber, 2015; Srivastava & Salakhutdinov, 2014; Chen, Schwing, Yuille, & Urtasun, 2015; Jaderberg, Simonyan, Vedaldi, & Zisserman, 2014; Schwing & Urtasun, 2015; Zheng et al., 2015). Support vector machines (SVM) and kernel-based methods have made a large impact on a wide range of application fields, together with finding strong foundations in optimization and learning theory (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995; Rasmussen & Williams, 2006; Schölkopf & Smola, 2002; Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002; Vapnik, 1998; Wahba, 1990). Therefore, one can pose the question: Which synergies or common foundations could be developed between these different directions? Such synergies have already been explored, for example, in kernel methods for deep learning (Cho & Saul, 2009), deep gaussian processes (Damianou & Lawrence, 2013; Salakhutdinov & Hinton, 2007), convolutional kernel networks (Mairal, Koniusz, Harchaoui, & Schmid, 2014), multilayer support vector machines (Wiering & Schomaker, 2014), and mathematics of the neural response (Smale, Rosasco, Bouvrie, Caponnetto, & Poggio, 2010), among others.

In this letter, we present a new theory of deep restricted kernel machines (deep RKM), offering foundations for deep learning with kernel machines. It partially relates to restricted Boltzmann machines (RBMs), which are used within deep belief networks (Hinton, 2005; Hinton et al., 2006). In RBMs, one considers a specific type of Markov random field, characterized by a bipartite graph consisting of a layer of visible units and another layer of hidden units (Bengio, 2009; Fisher & Igel, 2014; Hinton et al., 2006; Salakhutdinov, 2015). In RBMs, which are related to harmoniums (Smolensky, 1986; Welling, Rosen-Zvi, & Hinton, 2004), there are no connections between the hidden units (Hinton, 2005), and often also no visible-to-visible connections. In deep belief networks, the hidden units of a layer are mapped to a next layer in order to create a deep architecture. In RBM, one considers stochastic binary variables (Ackley, Hinton, & Sejnowski, 1985; Hertz, Krogh, & Palmer, 1991), and extensions have been made to gaussian-Bernoulli variants (Salakhutdinov, 2015). Hopfield networks (Hopfield, 1982) take continuous values, and a class of Hamiltonian neural networks has been studied in DeWilde (1993). Also, discriminative RBMs have been studied where the class labels are considered at the level of visible units (Fisher & Igel, 2014; Larochelle & Bengio, 2008). In all of these methods the energy function plays an important role, as it also does in energy-based learning methods (LeCun, Chopra, Hadsell, Ranzato, & Huang, 2006).

Representation learning issues are considered to be important in deep learning (Bengio, Courville, & Vincent, 2013). The method proposed in this letter makes a link to restricted Boltzmann machines by characterizing several kernel machines by means of so-called conjugate feature duality. Duality is important in the context of support vector machines (Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1998; Suykens et al., 2002; Suykens, Alzate, & Pelckmans, 2010), optimization (Boyd & Vandenberghe, 2004; Rockafellar, 1987), and in mathematics and physics in general. Here we consider hidden features conjugated to part of the unknown variables. This part of the formulation is linked to a restricted Boltzmann machine energy expression, though with continuous variables in a nonprobabilistic setting. In this way, a model can be expressed in both its primal representation and its dual representation and can be given an interpretation in terms of visible and hidden units, in analogy with RBM. The primal representation contains the feature map, while the dual model representation is expressed in terms of the kernel function and the conjugated features.

The class of kernel machines discussed in this letter includes least squares support vector machines (LS-SVM) for classification and regression, kernel principal component analysis (kernel PCA), matrix singular value decomposition (matrix SVD), and Parzen-type models. These have been previously conceived within a primal and Lagrange dual setting in Suykens and Vandewalle (1999b), Suykens et al. (2002), Suykens, Van Gestel, Vandewalle, and De Moor (2003), and Suykens (2013, 2016). Other examples are kernel spectral clustering (Alzate & Suykens, 2010; Mall, Langone, & Suykens, 2014), kernel canonical correlation analysis (Suykens et al., 2002), and several others, which will not be addressed in this letter, but can be the subject of future work. In this letter, we give a different characterization for these models, based on a property of quadratic forms, which can be verified through the Schur complement form. The property relates to a specific case of Legendre-Fenchel duality (Rockafellar, 1987). Also note that in classical mechanics, converting a Lagrangian into Hamiltonian formulation is by Legendre transformation (Goldstein, Poole, & Safko, 2002).

The kernel machines with conjugate feature representations are then used as building blocks to obtain the deep RKM by coupling the RKMs. The deep RKM becomes unrestricted after coupling the RKMs. The approach is explained for a model with three levels, consisting of two kernel PCA levels and a level with LS-SVM classification or regression. The conjugate features of level 1 are taken as input of level 2 and, subsequently, the features of level 2 as input for level 3. The objective of the deep RKM is the sum of the objectives of the RKMs in the different levels. The characterization of the stationary points leads to solving a set of nonlinear equations in the unknowns, which is computationally expensive. However, for the case of linear kernels in part of the levels, it reveals how kernel fusion is taking place over the different levels. For this case, a heuristic algorithm is obtained with level-wise solving. For the general nonlinear case, a reduced-set algorithm with estimation in the primal is proposed.

In this letter, we make a distinction between levels and layers. We use the terminology of levels to indicate the depth of the model. The terminology of *layers* is used here in connection to the feature map. Suykens and Vandewalle (1999a) showed how a multilayer perceptron can be trained by a support vector machine method. It is done by defining the hidden layer to be equal to the feature map. In this way, the hidden layer is treated at the feature map and the kernel parameters level. Suykens et al. (2002) explained that in SVM and LS-SVM models, one can have a neural networks interpretation in both the primal and the dual. The number of hidden units in the primal equals the dimension of the feature space, while in the dual representation, it equals the number of support vectors. In this way, it provides a setting to work with parametric models in the primal and kernel-based models in the dual. Therefore, we also illustrate in this letter how deep multilayer feedforward neural networks can be trained within the deep RKM framework. While in classical backpropagation (Rumelhart, Hinton, & Williams, 1986), one typically learns the model by specifying a single objective (unless, for example, additional stability constraints are imposed to obtain stable multilayer recurrent networks with dynamic backpropagation; Suykens, Vandewalle, & De Moor, 1995), in the deep RKM the objective function consists of the different objectives related to the different levels.

In summary, we aim at contributing to the following challenging questions in this letter:

- Can we find new synergies and foundations between SVM and kernel methods and deep learning architectures?
- Can we extend primal and dual model representations, as occurring in SVM and LS-SVM models, from shallow to deep architectures?
- Can we handle deep feedforward neural networks and deep kernel machines within a common setting?

In order to address these questions, this letter is organized as follows. Section 2 outlines the context of this letter with a brief introductory part on restricted Boltzmann machines, SVMs, LS-SVMs, kernel PCA, and SVD. In section 3 we explain how these kernel machines can be characterized by conjugate feature duality with visible and hidden units. In section 4 deep restricted kernel machines are explained for three levels: an LS-SVM regression level and two additional kernel PCA levels. In section 5, different algorithms are proposed for solving in either the primal or the dual, where the former will be related to deep feedforward neural networks and the latter to kernel-based models. Illustrations with numerical examples are given in section 6. Section 7 concludes the letter.

## 2 Preliminaries and Context

In this section, we explain basic principles of restricted Boltzmann machines, SVMs, LS-SVMs, and related formulations for kernel PCA, and SVD. These are basic ingredients needed before introducing restricted kernel machines in section 3.

### 2.1 Restricted Boltzmann Machines

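The layers of visible and hidden variables, *v* and *h*, respectively, have stochastic binary units with value 0 or 1. A joint state $\{v, h\}$ is defined for these visible and hidden variables with energy (see Figure 1)

$$E(v, h; \theta) = -v^T W h - c^T v - a^T h,$$

where $\theta = \{W, c, a\}$ are the model parameters, *W* is an interaction weight matrix, and *c*, *a* contain bias terms.

In the contrastive divergence algorithm (equation 2.4), the model expectation is approximated by running a Gibbs chain for *T* steps initialized at the data. Often one takes $T = 1$, while $T \to \infty$ recovers the maximum likelihood approach (Salakhutdinov, 2015).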
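As a small illustration of the energy expression and of the absence of hidden-to-hidden connections, the following sketch evaluates the energy of a joint state and the conditional activation probabilities of the hidden units. It assumes the standard binary RBM form with bias vectors named `c` (visible) and `a` (hidden); the function names and the toy dimensions are chosen here for illustration only.

```python
import numpy as np

def rbm_energy(v, h, W, c, a):
    """Energy of a joint state: E(v, h) = -v^T W h - c^T v - a^T h."""
    return float(-(v @ W @ h) - c @ v - a @ h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_hidden_given_visible(v, W, a):
    """P(h_j = 1 | v); the hidden units are independent given v
    because the graph has no hidden-to-hidden connections."""
    return sigmoid(W.T @ v + a)

def p_visible_given_hidden(h, W, c):
    """P(v_i = 1 | h), by the same bipartite argument."""
    return sigmoid(W @ h + c)

# Toy example with 4 visible and 3 hidden binary units (illustrative values).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))
c = np.zeros(4)
a = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0, 1.0])

energy = rbm_energy(v, h, W, c, a)
p_h = p_hidden_given_visible(v, W, a)
```

The conditional probability of a single hidden unit can be checked against the energy directly, since flipping one hidden unit changes the energy only through its own row of interactions.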

### 2.2 Least Squares Support Vector Machines and Related Kernel Machines

#### 2.2.1 SVM and LS-SVM

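A positive-definite kernel *K* is used with $K(x, z) = \varphi(x)^T \varphi(z)$. The SVM classifier is expressed in the dual as

$$\hat{y} = \mathrm{sign}\Big[\sum_{i \in \mathcal{S}_{SV}} \alpha_i y_i K(x_i, x) + b\Big],$$

where $\mathcal{S}_{SV}$ denotes the set of support vectors, corresponding to the nonzero $\alpha_i$ values. Common choices are, for example, to take a linear $K(x, z) = x^T z$, polynomial $K(x, z) = (\tau + x^T z)^m$ with $\tau \geq 0$, or gaussian RBF kernel $K(x, z) = \exp(-\|x - z\|_2^2 / \sigma^2)$.

In LS-SVM classification, the dual problem is a linear system in the dual variables $\alpha_i$ and the bias term *b*, which gives the predicted output $\hat{y} = \mathrm{sign}[\sum_{i=1}^N \alpha_i y_i K(x_i, x) + b]$. The classifier formulation can also be transformed into the regression formulation by multiplying the constraints in equation 2.9 by the class labels and considering new error variables (Suykens et al., 2002). In the zero bias term case, this corresponds to kernel ridge regression (Saunders, Gammerman, & Vovk, 1998), which is also related to function estimation in reproducing kernel Hilbert spaces, regularization networks, and gaussian processes, within a different setting (Poggio & Girosi, 1990; Wahba, 1990; Rasmussen & Williams, 2006; Suykens et al., 2002).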
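To make the regression formulation concrete, the sketch below solves an LS-SVM regression problem in the dual as one square linear system. It assumes the commonly used form of that system, with a regularization constant `gamma`, an RBF kernel with squared bandwidth `sigma2`, and a bias term; the function names and the toy data are illustrative and not taken from this letter.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    """Gaussian RBF kernel matrix with entries exp(-||x - z||^2 / sigma2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)

def lssvm_regression_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma2)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, dual variables alpha

def lssvm_regression_predict(X_new, X, alpha, b, sigma2):
    """Dual model: y_hat(x) = sum_i alpha_i K(x_i, x) + b."""
    return rbf_kernel(X_new, X, sigma2) @ alpha + b

# Toy one-dimensional example (illustrative data).
X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X[:, 0])
gamma, sigma2 = 100.0, 1.0
b, alpha = lssvm_regression_fit(X, y, gamma, sigma2)
y_hat = lssvm_regression_predict(X, X, alpha, b, sigma2)
```

In this form the training residuals are proportional to the dual variables, and dropping the first row and column of the system (zero bias term) gives kernel ridge regression.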

#### 2.2.2 Kernel PCA and Matrix SVD

Using an *L*_{2} loss function, typical for LS-SVMs, one can characterize the kernel PCA problem (Schölkopf, Smola, & Müller, 1998) as a constrained optimization problem, equation 2.15, as shown in Suykens et al. (2002, 2003). From the KKT conditions, one obtains an eigenvalue problem, equation 2.16, in the Lagrange multipliers $\alpha_i$, involving the centered kernel matrix. In equation 2.15, maximizing instead of minimizing also leads to equation 2.16. The centering of the kernel matrix is obtained as a result of taking a bias term *b* in the model. The regularization constant is treated at a selection level and is chosen so as to correspond to the eigenvalues of the centered kernel matrix. In the zero bias term case, the centered kernel matrix becomes the kernel matrix itself. Also, kernel spectral clustering (Alzate & Suykens, 2010) was obtained in this setting by considering a weighted version of the *L*_{2} loss part, weighted by the inverse of the degree matrix of the graph in the clustering problem.

For a given matrix *A*, by choosing linear feature maps together with a compatibility matrix *C* that satisfies a compatibility condition, the eigenvalue problem resulting from the objective of equation 2.17 corresponds to the SVD of matrix *A* (Suykens, 2016) in connection with Lanczos’s decomposition theorem. One can also see that for a symmetric matrix, the two data sources coincide, and the objective of equation 2.17 reduces to the kernel PCA objective, equation 2.15 (Suykens, 2016), involving only one feature map instead of two feature maps.
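The role of the centered kernel matrix can be illustrated numerically. The following sketch centers a kernel matrix, solves the eigenvalue problem, and keeps the leading components. It is a minimal illustration under the usual kernel PCA conventions; the function names are hypothetical and the normalization of the eigenvectors is left as returned by the solver.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix: Kc = M K M with M = I - (1/N) 1 1^T."""
    N = K.shape[0]
    M = np.eye(N) - np.ones((N, N)) / N
    return M @ K @ M

def kernel_pca(K, n_components):
    """Leading eigenvalues and eigenvectors of the centered kernel matrix."""
    Kc = center_kernel(K)
    eigvals, eigvecs = np.linalg.eigh(Kc)          # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]

# With a linear kernel, the nonzero eigenvalues coincide with those of the
# scatter matrix of the centered data, i.e. with classical PCA.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = X @ X.T
eigvals, H = kernel_pca(K, n_components=2)
```

Replacing `X @ X.T` by any positive-definite kernel matrix gives the nonlinear case without changing the rest of the code.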

## 3 Restricted Kernel Machines and Conjugate Feature Duality

### 3.1 LS-SVM Regression as a Restricted Kernel Machine: Linear Case

A training data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ is assumed to be given with input data $x_i \in \mathbb{R}^d$ and output data $y_i \in \mathbb{R}^p$ (now with *p* outputs), where the data are assumed to be independent and identically distributed and drawn from an unknown but fixed underlying distribution *P*(*x*,*y*), a common assumption made in statistical learning theory (Vapnik, 1998).

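The linear regression model $\hat{y} = W^T x + b$ is evaluated at a point *x*, where $W \in \mathbb{R}^{d \times p}$, $b \in \mathbb{R}^p$. Note that *b* is treated as part of the interconnection matrix by adding a constant 1 within the vector *v* of visible units, which is also frequently done in the area of neural networks (Suykens et al., 1995). While in RBM the units are binary valued, in the RKM, they are continuous valued. The notation *R* in $R_{\mathrm{RKM}}$ refers to the fact that the expression is restricted; there are no hidden-to-hidden connections.

With errors $e_i = y_i - W^T x_i - b$ and positive constants $\eta$, $\lambda$, consider a regularized least squares objective of the form $J = \frac{\eta}{2} \mathrm{Tr}(W^T W) + \frac{1}{2\lambda} \sum_{i=1}^N e_i^T e_i$. Hidden features $h_i \in \mathbb{R}^p$ are introduced through a lower bound on *J*,

$$J \geq \sum_{i=1}^N e_i^T h_i - \frac{\lambda}{2} \sum_{i=1}^N h_i^T h_i + \frac{\eta}{2} \mathrm{Tr}(W^T W) \triangleq \underline{J}. \tag{3.4}$$

^{1} This is based on the property that for two arbitrary vectors $e, h \in \mathbb{R}^p$ and $\lambda > 0$, one has

$$\frac{1}{2\lambda} e^T e \geq e^T h - \frac{\lambda}{2} h^T h. \tag{3.5}$$

The maximal value of the right-hand side in equation 3.5 is obtained for $h = e/\lambda$, which follows from $\partial (e^T h - \frac{\lambda}{2} h^T h)/\partial h = e - \lambda h = 0$ and $\partial^2 (e^T h - \frac{\lambda}{2} h^T h)/\partial h^2 = -\lambda I < 0$. The maximal value that can be obtained for the right-hand side equals the left-hand side, $\frac{1}{2\lambda} e^T e$. The property 3.5 can also be verified by writing it in quadratic form,

$$\frac{1}{2} \begin{bmatrix} e^T & h^T \end{bmatrix} \begin{bmatrix} \frac{1}{\lambda} I & -I \\ -I & \lambda I \end{bmatrix} \begin{bmatrix} e \\ h \end{bmatrix} \geq 0,$$

which holds. This follows immediately from the Schur complement form,^{2} which results in the condition $\lambda I - I (\lambda I) I \geq 0$, which holds. Writing equation 3.5 as $\frac{1}{2\lambda} e^T e + \frac{\lambda}{2} h^T h \geq e^T h$ gives a property that is also known in Legendre-Fenchel duality for the case of a quadratic function (Rockafellar, 1987). Furthermore, it also follows from equation 3.5 that $\frac{1}{2\lambda} e^T e = \max_h \left( e^T h - \frac{\lambda}{2} h^T h \right)$. We will call the method of introducing the hidden features *h*_{i} into equation 3.4 *conjugate feature duality*, where the hidden features *h*_{i} are conjugated to the *e*_{i}. Here, $e_i^T h_i$ will be called an *inner pairing* between the *e*_{i} and the hidden features *h*_{i} (see Figure 2).

The stationary points of $\underline{J}$ are characterized by^{3}

$$\frac{\partial \underline{J}}{\partial h_i} = 0 \Rightarrow y_i = W^T x_i + b + \lambda h_i, \qquad \frac{\partial \underline{J}}{\partial W} = 0 \Rightarrow W = \frac{1}{\eta} \sum_{i=1}^N x_i h_i^T, \qquad \frac{\partial \underline{J}}{\partial b} = 0 \Rightarrow \sum_{i=1}^N h_i = 0.$$

The first condition yields $h_i = e_i/\lambda$, which means that the maximal value of $\underline{J}$ is reached. Therefore, $\underline{J} = J$ at the stationary points. Also note the similarity between the condition $W = \frac{1}{\eta} \sum_i x_i h_i^T$ and equation 2.4 in the contrastive divergence algorithm. Elimination of *h*_{i} from this set of conditions gives the solution in *W*, *b*:

$$\begin{bmatrix} \sum_i x_i x_i^T + \lambda \eta I_d & \sum_i x_i \\ \sum_i x_i^T & N \end{bmatrix} \begin{bmatrix} W \\ b^T \end{bmatrix} = \begin{bmatrix} \sum_i x_i y_i^T \\ \sum_i y_i^T \end{bmatrix}.$$

Elimination of *W* from the set of conditions gives the solution in *h*_{i}, *b*:

$$\begin{bmatrix} \frac{1}{\eta} K + \lambda I_N & 1_N \\ 1_N^T & 0 \end{bmatrix} \begin{bmatrix} H^T \\ b^T \end{bmatrix} = \begin{bmatrix} Y^T \\ 0 \end{bmatrix}, \tag{3.11}$$

with *K* denoting the kernel matrix with *ij*-entry $x_i^T x_j$, $H = [h_1 \ldots h_N] \in \mathbb{R}^{p \times N}$, and $Y = [y_1 \ldots y_N] \in \mathbb{R}^{p \times N}$. From this square linear system, one can solve $H^T$ and *b*. $1_N$ denotes a vector of all ones of size *N* and *I*_{N} the identity matrix of size $N \times N$.

The hidden features *h*_{i} take the same role as the Lagrange dual variables $\alpha_i$ in the LS-SVM formulation based on Lagrange duality, equation 2.13, when taking $\eta = 1$ and $\lambda = 1/\gamma$, with $\gamma$ the LS-SVM regularization constant. For the estimated values on the training data, one can express the model in terms of *W*, *b* or in terms of *h*_{i}, *b*. In the restricted kernel machine interpretation of the LS-SVM regression, one has the following primal and dual model representations:

$$(P): \ \hat{y} = W^T x + b, \qquad (D): \ \hat{y} = \frac{1}{\eta} \sum_{j=1}^N h_j x_j^T x + b,$$

evaluated at a point *x*, where the primal representation is in terms of *W*, *b* and the dual representation is in the hidden features *h*_{i}. The primal representation is suitable for handling the “large *N*, small *d*” case, while the dual representation is suitable for “small *N*, large *d*” (Suykens et al., 2002).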
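The equivalence of the primal and dual representations in the linear case can be checked numerically. The sketch below is written under an explicit assumption: the objective is taken to be $\frac{\eta}{2}\mathrm{Tr}(W^T W) + \frac{1}{2\lambda}\sum_i e_i^T e_i$ with $e_i = y_i - W^T x_i - b$, so that the dual is a square linear system in the hidden features and the bias. Data are stored row-wise (`X` is $N \times d$, `Y` is $N \times p$), and all function names are illustrative.

```python
import numpy as np

def conjugate_gap(e, h, lam):
    """Left-hand side minus right-hand side of the quadratic bound
    (1/(2 lam)) e^T e >= e^T h - (lam/2) h^T h; zero exactly at h = e/lam."""
    return e @ e / (2 * lam) - (e @ h - lam / 2 * (h @ h))

def rkm_regression_dual(X, Y, eta, lam):
    """Solve [[K/eta + lam I, 1], [1^T, 0]] [H^T; b^T] = [Y; 0] with K = X X^T."""
    N, p = X.shape[0], Y.shape[1]
    K = X @ X.T
    ones = np.ones((N, 1))
    A = np.block([[K / eta + lam * np.eye(N), ones],
                  [ones.T, np.zeros((1, 1))]])
    sol = np.linalg.solve(A, np.vstack([Y, np.zeros((1, p))]))
    return sol[:N], sol[N]  # H^T (N x p), b (p,)

def rkm_regression_primal(X, Y, eta, lam):
    """Solve the primal system obtained by eliminating the hidden features."""
    N, d = X.shape
    ones = np.ones((N, 1))
    A = np.block([[X.T @ X + eta * lam * np.eye(d), X.T @ ones],
                  [ones.T @ X, np.array([[float(N)]])]])
    sol = np.linalg.solve(A, np.vstack([X.T @ Y, ones.T @ Y]))
    return sol[:d], sol[d]  # W (d x p), b (p,)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = rng.normal(size=(30, 2))
eta, lam = 1.0, 0.5

Ht, b_dual = rkm_regression_dual(X, Y, eta, lam)
W_primal, b_primal = rkm_regression_primal(X, Y, eta, lam)
W_from_dual = X.T @ Ht / eta  # W = (1/eta) sum_i x_i h_i^T

x_new = rng.normal(size=4)
y_primal = W_primal.T @ x_new + b_primal
y_dual = (x_new @ X.T) @ Ht / eta + b_dual
```

The primal system has size $d + 1$ and the dual system size $N + 1$, which is the practical reason for preferring one representation over the other depending on the data dimensions.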

### 3.2 Nonlinear Case

The nonlinear case is obtained by replacing *x*_{i} by $\varphi(x_i)$, where $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{n_f}$ denotes the feature map, with *n*_{f} the dimension of the feature space. Therefore, in the objective function for the RKM interpretation, the vector of visible units is built from $\varphi(x)$ instead of *x*. Following the same approach as in the linear case, one then obtains the solution in the primal. In the conjugate feature dual, one obtains the same linear system as equation 3.11, but with the positive-definite kernel $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ instead of the linear kernel $x_i^T x_j$. We also employ the notation $[K(x_j, x_i)]$ to denote the kernel matrix *K* with the *ij*th entry equal to $K(x_j, x_i)$.

As Suykens et al. (2002) explained, one can also give a neural network interpretation to both the primal and the dual representation, with a number of hidden units equal to the dimension of the feature space for the primal representation and the number of support vectors in the dual representation, respectively. For the case of a gaussian RBF kernel, one has a one-hidden-layer interpretation with an infinite number of hidden units in the primal, while in the dual, the number of hidden units equals the number of support vectors.

### 3.3 Classifier Formulation

### 3.4 Kernel PCA

Elimination of *W* gives the following solution in the conjugated features, as an eigenvalue problem for the kernel matrix, with *s* the number of selected components. One can verify that the solutions corresponding to the different eigenvectors *h*_{i} and their corresponding eigenvalues all lead to the same value of the objective.

Here the number of hidden units equals *s*.

### 3.5 Singular Value Decomposition

### 3.6 Kernel pmf

## 4 Deep Restricted Kernel Machines

In this section we couple different restricted kernel machines within a deep architecture. Several coupling configurations are possible at this point. We illustrate deep restricted kernel machines here for an architecture consisting of three levels. We discuss two configurations:

- Two kernel PCA levels followed by an LS-SVM regression level
- LS-SVM regression level followed by two kernel PCA levels

In the first architecture, the first two levels extract features that are used within the last level for classification or regression. Related types of architectures are stacked autoencoders (Bengio, 2009), where a pretraining phase provides a good initialization for training the deep neural network in the fine-tuning phase. The deep RKM will consider an objective function jointly related to the kernel PCA feature extractions and the classification or regression. We explain how the insights of the RKM kernel PCA representations can be employed for combined supervised training and feature selection. A difference with other methods is also that conjugated features are used within the layered architecture.

In the second architecture, one starts with regression and then lets two kernel PCA levels further act on the residuals. In this case connections will be shown with deep Boltzmann machines (Salakhutdinov, 2015; Salakhutdinov & Hinton, 2009) when considering the special case of linear feature maps, though for the RKMs in a nonprobabilistic setting.

### 4.1 Two Kernel PCA Levels Followed by Regression Level

We focus here on a deep RKM architecture consisting of three levels:

- Level 1 consists of kernel PCA with given input data *x*_{i} and is characterized by conjugated features $h_i^{(1)}$.
- Level 2 consists of kernel PCA by taking $h_i^{(1)}$ as input and is characterized by conjugated features $h_i^{(2)}$.
- Level 3 consists of LS-SVM regression on $h_i^{(2)}$ with output data *y*_{i} and is characterized by conjugated features $h_i^{(3)}$.

Here *c*_{stab} is a positive constant that weights a stabilization term in the objective. The role of this stabilization term for the kernel PCA levels is explained in the appendix. While in stacked autoencoders one has an unsupervised pretraining and a supervised fine-tuning phase (Bengio, 2009), here we train the whole network at once.

Solving the set of nonlinear equations that characterizes the stationary points, including the bias term *b*, is computationally expensive. However, for the case of taking linear kernels *K*_{2} and *K*_{3}, equation 4.10 simplifies to equation 4.11. One sees that at levels 1 and 2, a data fusion is taking place between *K*_{1} and the linear kernels *K*_{lin}, where the corresponding constants specify the relative weight given to each of these kernels. In this way, one can choose to emphasize or deemphasize the levels with respect to each other.

### 4.2 Regression Level Followed by Two Kernel PCA Levels

In this case, we consider a deep RKM architecture with the following three levels:

- Level 1 consists of LS-SVM regression with given input data *x*_{i} and output data *y*_{i} and is characterized by conjugated features $h_i^{(1)}$.
- Level 2 consists of kernel PCA by taking $h_i^{(1)}$ as input and is characterized by conjugated features $h_i^{(2)}$.
- Level 3 consists of kernel PCA by taking $h_i^{(2)}$ as input and is characterized by conjugated features $h_i^{(3)}$.

In the objective *J*_{deep}, the sum of the three inner pairing terms is similar to the energy in deep Boltzmann machines (Salakhutdinov, 2015; Salakhutdinov & Hinton, 2009) for the particular case of linear feature maps and symmetric interaction terms. For the special case of linear feature maps, one has an expression *U*_{deep} consisting of bilinear terms between the visible units and the level 1 hidden features and between the hidden features of consecutive levels, which takes the same form as equation 29 in Salakhutdinov (2015), with the visible units and interconnection matrix of level 1 defined in the sense of equation 3.1 in this letter. The “*U*” in *U*_{deep} refers to the fact that the deep RKM is unrestricted after coupling because of the hidden-to-hidden connections between layers 1 and 2 and between layers 2 and 3, while the uncoupled RKMs are restricted.

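The stationary points are again characterized by a set of nonlinear equations in the hidden features of the three levels and the bias term *b*. When taking linear kernels *K*_{2} and *K*_{3}, the set of nonlinear equations simplifies to equation 4.18, with a similar data fusion interpretation as explained in the previous subsection.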

## 5 Algorithms for Deep RKM

The characterization of the stationary points for the objective functions in the different deep RKM models typically leads to solving large sets of nonlinear equations in the unknown variables, especially for large given data sets. Therefore, in this section, we outline a number of approaches and algorithms for working with the kernel-based models (in either the primal or the dual). We also outline algorithms for training deep feedforward neural networks in a parametric way in the primal within the deep RKM setting. The algorithms proposed in sections 5.2 and 5.3 are applicable also to large data sets.

### 5.1 Levelwise Solving for Kernel-Based Models

For the case of linear kernels in levels 2 and 3 in equations 4.11 and 4.18, we propose a heuristic algorithm that consists of solving linear systems and eigenvalue decompositions level by level, alternately fixing different unknown variables.

For equation 4.18, in order to solve level 1 as a linear system, one needs the input/output data, but also the knowledge of *H*_{2}. Therefore, an initialization phase is required. One can initialize *H*_{2} as zero or at random at level 1, obtain *H*_{1}, and propagate it to level 2. At level 2, after initializing *H*_{3}, one finds *H*_{2}, which is then propagated to level 3, where one computes *H*_{3}. After this forward phase, one can go backward from level 3 to level 1 in a backward phase.

Schematically this gives the following heuristic algorithm:

One can repeat the forward and backward phases a number of times, without the initialization step. Alternatively, one could also apply an algorithm with forward-only phases, which can then be applied a number of times after each other.
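The order of operations in the forward and backward phases can be sketched as a schedule over three level solvers. The code below only fixes that schedule. The solvers `solve_level1`, `solve_level2`, and `solve_level3` are hypothetical callables standing in for the level-wise linear systems and eigenvalue decompositions, whose exact form is not reproduced here, and the choice of which quantities each solver receives is an assumption read off from the description above.

```python
def levelwise_schedule(solve_level1, solve_level2, solve_level3,
                       H2_init, H3_init, n_passes=1):
    """Alternate forward and backward phases over three levels.

    solve_level1(H2)     -> H1   (level 1, given the current level 2 features)
    solve_level2(H1, H3) -> H2   (level 2, given levels 1 and 3)
    solve_level3(H2)     -> H3   (level 3, given level 2)
    """
    H2, H3 = H2_init, H3_init  # initialization is used in the first pass only
    H1 = None
    for _ in range(n_passes):
        # Forward phase: level 1 -> level 2 -> level 3.
        H1 = solve_level1(H2)
        H2 = solve_level2(H1, H3)
        H3 = solve_level3(H2)
        # Backward phase: level 3 -> level 2 -> level 1.
        H2 = solve_level2(H1, H3)
        H1 = solve_level1(H2)
    return H1, H2, H3

# Dummy solvers that only record the calling order (for illustration).
calls = []

def dummy_level1(H2):
    calls.append("level1")
    return "H1"

def dummy_level2(H1, H3):
    calls.append("level2")
    return "H2"

def dummy_level3(H2):
    calls.append("level3")
    return "H3"

result = levelwise_schedule(dummy_level1, dummy_level2, dummy_level3,
                            H2_init=None, H3_init=None, n_passes=1)
```

A forward-only variant would simply drop the two statements of the backward phase inside the loop.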

### 5.2 Deep Reduced Set Kernel-Based Models with Estimation in Primal

One can also maximize by adding a term to the objective, equation 5.3, with *c*_{0} a positive constant. Note that the components of in levels 1 and 2 do not possess an orthogonality property unless this is imposed as additional constraints to the objective function.

### 5.3 Training Deep Feedforward Neural Networks within the Deep RKM Framework

In order to further reduce the number of unknowns, and partially inspired by convolutional operations in convolutional neural networks (LeCun, Bottou, Bengio, & Haffner, 1998), we also consider the case where *U*_{1} and *U*_{2} are Toeplitz matrices. For a matrix of size $n \times m$, the number of unknowns is then reduced from $nm$ to $n + m - 1$.
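The reduction in the number of unknowns follows from the fact that a Toeplitz matrix is constant along each diagonal. The following sketch builds such a matrix from its free parameters; the function name and the ordering of the parameter vector are choices made for this illustration.

```python
import numpy as np

def toeplitz_from_params(params, n_rows, n_cols):
    """Build an n_rows x n_cols Toeplitz matrix from n_rows + n_cols - 1
    free parameters; entry (i, j) depends only on the difference i - j."""
    params = np.asarray(params)
    if params.shape != (n_rows + n_cols - 1,):
        raise ValueError("expected n_rows + n_cols - 1 parameters")
    i = np.arange(n_rows)[:, None]
    j = np.arange(n_cols)[None, :]
    # Shift i - j from [-(n_cols - 1), n_rows - 1] to a valid index range.
    return params[i - j + n_cols - 1]

# A 3 x 4 matrix has 12 entries but only 3 + 4 - 1 = 6 free parameters.
U = toeplitz_from_params(np.arange(6), n_rows=3, n_cols=4)
n_full = U.size
n_free = 3 + 4 - 1
```

When the matrix is estimated by gradient-based training, only the parameter vector would be optimized, and the gradient with respect to each parameter is the sum of the gradients over the corresponding diagonal.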

## 6 Numerical Examples

### 6.1 Two Kernel PCA Levels Followed by Regression Level: Examples

We define the following models and methods for comparison:

- *Deep reduced set kernel-based models* (with RBF kernel) with estimation in the primal according to equation 5.3, with the following choices:
  - with the additional term and the regularization terms
  - without the additional term
  - with only the level 3 regression objective as objective function
- *Deep feedforward neural networks* with estimation in the primal according to equation 5.5, with the same three choices as above. In one further variant of this model, Toeplitz matrices are taken for the *U* matrices in all levels, except for the last level.

We test and compare the proposed algorithms on a number of UCI data sets: the Pima indians diabetes, Bupa liver disorder, Johns Hopkins University ionosphere, and adult data sets. The numbers of inputs (*d*), outputs (*p*), training (*N*), validation (*N*_{val}), and test data (*N*_{test}) correspond to previous benchmarking studies in Van Gestel et al. (2004). In Table 1, *bestbmark* indicates the best result obtained in the benchmarking study of Van Gestel et al. (2004) from different classifiers, including SVM and LS-SVM classifiers with linear, polynomial, and RBF kernel; linear and quadratic discriminant analysis; decision tree algorithm C4.5; logistic regression; one-rule classifier; instance-based learners; and Naive Bayes.

The tuning parameters were selected at the validation level for each data set and each model.

In both model classes, the weight matrices and the interconnection matrices were initialized at random according to a normal distribution with zero mean and standard deviation 0.1 (100, 20, 10, and 3 initializations for the four data sets, respectively), and the diagonal matrices by the identity matrix. For the training, a quasi-Newton method was used in Matlab.

The following general observations from the experiments are shown in Table 1:

- Having the additional terms with kernel PCA objectives in levels 1 and 2, as opposed to the level 3 objective only, gives improved results on all tried data sets.
- The best selected value for *c*_{stab} varies among the data sets. In case this value is large, the value of the objective function terms related to the kernel PCA parts is close to zero.
- The use of Toeplitz matrices for the *U* matrices in the deep feedforward neural networks leads to competitive performance results and greatly reduces the number of unknowns.

Figure 4 illustrates the evolution of the objective function (in logarithmic scale) during training, for different values of *c*_{stab} and in comparison with a level 3 objective function only.

| Model | Pima indians diabetes | Bupa liver disorder | Ionosphere | Adult |
|---|---|---|---|---|
| | 19.53 [20.02(1.53)] | 26.09 [30.96(3.34)] | 0 [0.68(1.60)] | 16.99 [17.46(0.65)] |
| | 18.75 [19.39(0.89)] | 25.22 [31.48(4.11)] | 0 [5.38(12.0)] | 17.08 [17.48(0.56)] |
| | 21.88 [24.73(5.91)] | 28.69 [32.39(3.48)] | 0 [8.21(6.07)] | 17.83 [21.21(4.78)] |
| | 21.09 [20.20(1.51)] | 27.83 [28.86(2.83)] | 1.71 [5.68(2.22)] | 15.07 [15.15(0.15)] |
| | 18.75 [20.33(2.75)] | 28.69 [28.38(2.80)] | 10.23 [6.92(3.69)] | 14.91 [15.08(0.15)] |
| | 19.03 [19.16(1.10)] | 26.08 [27.74(9.40)] | 6.83 [6.50(8.31)] | 15.71 [15.97(0.07)] |
| | 24.61 [22.34(1.95)] | 32.17 [27.61(3.69)] | 3.42 [9.66(6.74)] | 15.21 [15.19(0.08)] |
| bestbmark | 22.7(2.2) | 29.6(3.7) | 4.0(2.1) | 14.4(0.3) |

Notes: Shown first is the test error corresponding to the selected model with minimal validation error from the different random initializations. Between brackets, the mean and standard deviation of the test errors related to all initializations are shown. The lowest test error is in bold.
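The way each table entry is assembled from several random initializations can be sketched as follows. The function name and the toy error values are hypothetical, and the use of the population standard deviation is an assumption, since the notes do not state which estimator was used.

```python
import numpy as np

def summarize_initializations(val_errors, test_errors):
    """Return the test error of the run with minimal validation error,
    together with the mean and standard deviation over all runs."""
    val_errors = np.asarray(val_errors, dtype=float)
    test_errors = np.asarray(test_errors, dtype=float)
    best = int(np.argmin(val_errors))      # model selection on validation data
    return test_errors[best], test_errors.mean(), test_errors.std()

# Illustrative values for three random initializations (not from Table 1).
selected, mean_err, std_err = summarize_initializations(
    val_errors=[0.21, 0.19, 0.25],
    test_errors=[0.20, 0.22, 0.18],
)
```

Note that the selected test error need not be the smallest test error among the runs, because the selection uses validation data only.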

### 6.2 Regression Level Followed by Two Kernel PCA Levels: Examples

#### 6.2.1 Regression Example on Synthetic Data Set

#### 6.2.2 Multiclass Example: USPS

In this example, the USPS handwritten digits data set is taken from http://www.cs.nyu.edu/~roweis/data.html. It contains 8-bit grayscale images of digits 0 through 9 with 1100 examples of each class. These data are used without additional scaling or preprocessing. We compare a basic LS-SVM model (with primal representation and with *p* = 10, that is, one output per class, and RBF kernel) with a deep RKM consisting of LS-SVM + KPCA + KPCA with RBF kernel in level 1 and linear kernels in levels 2 and 3, with a selected number of components in levels 2 and 3. In level 1 of the deep RKM, the same type of model is taken as in the basic LS-SVM model. In this way, we intend to study the effect of the two additional KPCA layers. Two training set sizes were taken (*N* = 2000 and *N* = 4000 data points, that is, 200 and 400 examples per class), with 2000 data points (200 per class) for validation and 5000 data points (500 per class) for testing. The tuning parameters are selected based on the validation set, both for the RBF kernel in the basic LS-SVM model and for the deep RKM. The number of forward-backward passes in the deep RKM is chosen equal to 2. The results are shown for the case of 2000 training data in Figure 5, showing the results on training, validation, and test data with the predicted class labels and the predicted output values for the different classes. For both training set sizes, the misclassification error on the test data set is compared between the deep RKM and the basic LS-SVM. The selected tuning parameter values illustrate that for deep RKM, levels 2 and 3 are given high relative importance.

#### 6.2.3 Multiclass Example: MNIST

The data set, which is used without additional scaling or preprocessing, is taken from http://www.cs.nyu.edu/~roweis/data.html. The dimensionality of the input data is *d* = 784 (images of size 28 × 28 for each of the 10 classes). In this case, we take an ensemble approach where the training set (*N* = 50,000 with 10 classes) has been partitioned into small nonoverlapping subsets of size 50 (5 data points per class). The choice for this subset size resulted from taking the last 10,000 points of this data set as validation data with the use of 40,000 data for training in that case. Other tuning parameters were selected in a similar way. The 1000 resulting submodels have been linearly combined after applying a function to their outputs. The linear combination is determined by solving an overdetermined linear system with ridge regression, following a similar approach as discussed in section 6.4 of Suykens et al. (2002). For the submodels, deep RKMs consisting of LS-SVM + KPCA + KPCA with RBF kernel in level 1 and linear kernels in levels 2 and 3 are taken. The number of forward-backward passes in the deep RKM is chosen equal to 2. The training data set has been extended with another 50,000 training data consisting of the same data points but corrupted with noise (random perturbations with zero mean and standard deviation 0.5, truncated to the range [0,1]), which is related to the method with random perturbations in Kurakin, Goodfellow, and Bengio (2016). The misclassification error on the test data set (10,000 data points) is comparable in performance to deep belief networks and lies in between the reported test performances of deep Boltzmann machines and SVM with gaussian kernel (Salakhutdinov, 2015) (see http://yann.lecun.com/exdb/mnist/ for an overview and comparison of performances obtained by different methods).
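The linear combination of submodels by ridge regression can be sketched as follows. This is a minimal illustration and not the exact procedure of the experiments: the function applied to the submodel outputs is assumed here to be `tanh`, the ridge constant `ridge` and the function name are hypothetical, and the toy data merely check that known combination weights are recovered.

```python
import numpy as np

def combine_submodels(outputs, targets, ridge):
    """Fit combination weights beta by ridge regression on an
    overdetermined system Z beta ~ targets, with Z = tanh(outputs).

    outputs: (n_samples, n_submodels) raw submodel outputs
    targets: (n_samples,) or (n_samples, n_classes) desired outputs
    """
    Z = np.tanh(outputs)  # assumed squashing function (not confirmed by the text)
    M = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + ridge * np.eye(M), Z.T @ targets)

# Toy check: 200 samples, 5 submodels, targets generated from known weights.
rng = np.random.default_rng(0)
outputs = rng.normal(size=(200, 5))
beta_true = np.array([0.5, -1.0, 0.25, 2.0, 0.0])
targets = np.tanh(outputs) @ beta_true
beta = combine_submodels(outputs, targets, ridge=1e-8)
```

With many submodels, the ridge term keeps the normal equations well conditioned when the submodel outputs are strongly correlated.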

## 7 Conclusion

In this letter, a theory of deep restricted kernel machines has been proposed. It is obtained by introducing a notion of conjugate feature duality where the conjugate features correspond to hidden features. Existing kernel machines such as least squares support vector machines for classification and regression, kernel PCA, matrix SVD, and Parzen-type models are considered as building blocks within a deep RKM and are characterized through the conjugate feature duality. By means of the inner pairing, one achieves a link with the energy expression of restricted Boltzmann machines, though with continuous variables in a nonprobabilistic setting. It also provides an interpretation of visible and hidden units. Therefore, this letter connects, on the one hand, to deep learning methods and, on the other hand, to least squares support vector machines and kernel methods. In this way, the insights and foundations achieved in these different research areas could possibly mutually reinforce each other in the future. Much future work is possible in different directions, including efficient methods and implementations for big data, the extension to other loss functions and regularization schemes, treating multimodal data, different coupling schemes, and models for clustering and semisupervised learning.

## Appendix: Stabilization Term for Kernel PCA

*w* and *e*_{i} yields with , which is the solution that is also obtained for the original formulation (corresponding to ).

## Notes

Note that the term also appears. In a Boltzmann machine energy, this would correspond to the matrix *G* being equal to the identity matrix. The term is an additional regularization term.

This states that for a matrix , one has if and only if and the Schur complement (Boyd & Vandenberghe, 2004).
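For reference, the Schur complement lemma in its standard form for a symmetric block matrix can be written as below. The block names $A$, $B$, $C$ are notation assumed here for illustration, with $A$ and $C$ symmetric, and the strict (positive definite) version is shown.

```latex
\[
Q = \begin{bmatrix} A & B \\ B^{T} & C \end{bmatrix} \succ 0
\quad \Longleftrightarrow \quad
A \succ 0 \ \text{ and } \ C - B^{T} A^{-1} B \succ 0 .
\]
```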

The following properties are used throughout this letter: for matrices and vectors (Petersen & Pedersen, 2012).

## Acknowledgments

The research leading to these results has received funding from the European Research Council (FP7/2007-2013) / ERC AdG A-DATADRIVE-B (290923) under the European Union’s Seventh Framework Programme. This letter reflects only my views; the EU is not liable for any use that may be made of the contained information; Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/Postdoc grants; Flemish government: FWO: PhD/postdoc grants, projects: G0A4917N (deep restricted kernel machines), G.0377.12 (Structured systems), G.088114N (tensor-based data similarity); IWT: PhD/postdoc grants, projects: SBO POM (100031); iMinds Medical Information Technologies SBO 2014; Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, dynamical systems, control and optimization, 2012–2017).