Recently, a new framework, Fredholm learning, was proposed for semisupervised learning problems based on solving a regularized Fredholm integral equation. It allows a natural way to incorporate unlabeled data into learning algorithms to improve their prediction performance. Despite rapid progress on implementable algorithms with theoretical guarantees, the generalization ability of Fredholm kernel learning has not been studied. In this letter, we focus on investigating the generalization performance of a family of classification algorithms, referred to as Fredholm kernel regularized classifiers. We prove that the corresponding learning rate can achieve ( is the number of labeled samples) in a limiting case. In addition, a representer theorem is provided for the proposed regularized scheme, which underlies its applications.
Many scientific problems (e.g., regression and classification) come down to learning a prediction rule from the given finite input-output samples. Kernel tricks and methods based on integral operators provide powerful tools for learning tasks and have become central to machine learning. In order to construct a good predictor, one usually chooses a function from a class of functions (hypothesis space) using regularized learning schemes associated with certain loss functions.
Regularized kernel learning has attracted much attention due to its solid theoretical foundations and successful practical applications. Recently a new kernel learning framework, Fredholm learning, has been proposed by reformulating the learning problem as a regularized Fredholm integral equation (Que, Belkin & Wang, 2014; Que & Belkin, 2013). This framework allows a way to incorporate unlabeled data into learning algorithms and can be interpreted as a special form of kernel method with a data-dependent kernel—the Fredholm kernel. It has been shown that the Fredholm classification algorithm can reduce the variance of kernel function evaluations at data noise and improve the prediction accuracy and robustness of kernel methods (Que et al., 2014).
Despite rapid progress on theoretical and empirical evaluations, the generalization performance of Fredholm kernel learning remains unknown. This letter addresses that question. Specifically, we focus on error analysis for a family of classification algorithms, Fredholm kernel regularized classifiers, and establish the corresponding generalization error bound. We show that the learning rate of Fredholm kernel regularized classifiers can achieve under mild conditions. In addition, we prove a representer theorem, which makes the solution of the Fredholm kernel learning model computationally accessible.
2.1 Classification in Learning Theory
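We say that $\phi: \mathbb{R} \to \mathbb{R}_+$ is a normalized classification loss function if it is convex, differentiable at 0 with $\phi'(0) < 0$, and the smallest zero of $\phi$ is 1.
Typical examples of classification loss include the hinge loss $\phi(t) = (1-t)_+$ for the support vector machine (SVM), $\phi(t) = (1-t)_+^q$ for the SVM $q$-norm ($q > 1$) soft margin classifier, and the least square loss $\phi(t) = (1-t)^2$.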
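The three losses named above can be sketched numerically as functions of the margin $t = y f(x)$. This is a minimal illustration, assuming the standard forms $(1-t)_+$, $(1-t)_+^q$, and $(1-t)^2$; the function names and the sample margins are hypothetical and chosen only for this example.

```python
import numpy as np

def hinge_loss(t):
    """Hinge loss (1 - t)_+ evaluated at margins t = y * f(x)."""
    return np.maximum(0.0, 1.0 - t)

def q_norm_hinge_loss(t, q=2.0):
    """q-norm soft margin loss (1 - t)_+^q, with q > 1 (q = 2 is an arbitrary choice)."""
    return np.maximum(0.0, 1.0 - t) ** q

def least_square_loss(t):
    """Least square loss (1 - t)^2 written in terms of the margin."""
    return (1.0 - t) ** 2

# Illustrative margins: negative values correspond to misclassified points.
margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for name, loss in [("hinge", hinge_loss),
                   ("q-norm hinge", q_norm_hinge_loss),
                   ("least square", least_square_loss)]:
    print(name, loss(margins))
```

All three functions vanish at $t = 1$ and are decreasing at $t = 0$, which is what the normalization requires. Only the least square loss also penalizes margins larger than 1.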
2.2 Fredholm Learning Framework
At first glance, the Fredholm learning framework defined in equation 2.8 looks similar to standard regularized learning, equation 2.5. However, makes a significant difference. The density-dependent hypothesis space enables us to integrate the information from unlabeled data. In contrast, most kernels used in a traditional kernel learning framework (for example, the linear, Gaussian, and polynomial kernels) are completely independent of the data distribution. In particular, when the kernel is set to be the $\delta$-function, formulation 2.8 reduces to the standard regularized learning framework.
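To make the data dependence concrete, the following sketch builds an empirical Fredholm kernel from labeled and unlabeled inputs. It assumes the discretized form commonly associated with the Fredholm learning framework, $K_F(x, z) = \frac{1}{(l+u)^2} \sum_{i,j} k(x, x_i)\, k_{\mathcal{H}}(x_i, x_j)\, k(z, x_j)$, where the sum runs over all $l$ labeled and $u$ unlabeled points. The Gaussian choices for both kernels, the bandwidths, and all function names are assumptions made for illustration, not the letter's own notation.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def fredholm_kernel(A, B, X_all, outer_bw=1.0, inner_bw=1.0):
    """Empirical Fredholm kernel between rows of A and rows of B.

    X_all stacks labeled and unlabeled inputs; the outer kernel k plays the
    role of the integral operator and the inner kernel k_H defines the RKHS.
    """
    n = X_all.shape[0]                              # n = l + u
    K_A = gaussian_kernel(A, X_all, outer_bw)       # k(x, x_i)
    K_B = gaussian_kernel(B, X_all, outer_bw)       # k(z, x_j)
    K_H = gaussian_kernel(X_all, X_all, inner_bw)   # k_H(x_i, x_j)
    return K_A @ K_H @ K_B.T / n**2

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(5, 2))      # l = 5 labeled inputs
X_unlabeled = rng.normal(size=(20, 2))   # u = 20 unlabeled inputs
X_all = np.vstack([X_labeled, X_unlabeled])

G = fredholm_kernel(X_labeled, X_labeled, X_all)
print(G.shape)
```

Changing `X_unlabeled` changes `G` even though the labeled points stay fixed, which is the sense in which the kernel depends on the data distribution. A fixed Gaussian kernel evaluated on `X_labeled` alone would not change.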
The main goal of this letter is to investigate the generalization performance of equation 2.10. Specifically, we expect to give an explicit convergence rate for Fredholm kernel regularized classifiers under some mild conditions. The following proposition states the representer theorem for Fredholm kernel regularized classifiers:
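A representer theorem reduces the search over a function space to a finite coefficient vector. As an illustration only, the sketch below takes the least square loss, for which the coefficients have a closed form, and treats the problem as kernel ridge regression on a precomputed Gram matrix `G` over the labeled points. The function names, the scaling `lam * l`, and the toy Gram matrix are assumptions for this example; the letter's proposition covers general classification losses and may be stated in a different form.

```python
import numpy as np

def fit_coefficients(G, y, lam):
    """Minimize (1/l) * ||G a - y||^2 + lam * a^T G a over coefficients a.

    G is an l-by-l symmetric positive semidefinite Gram matrix on the
    labeled points; one minimizer is a = (G + lam * l * I)^{-1} y.
    """
    l = G.shape[0]
    return np.linalg.solve(G + lam * l * np.eye(l), y)

def predict_labels(G_test, alpha):
    """Classify by the sign of the kernel expansion f(x) = sum_i alpha_i K(x, x_i)."""
    return np.sign(G_test @ alpha)

# Toy Gram matrix for two labeled points with opposite labels (hypothetical values).
G = np.array([[1.0, 0.2],
              [0.2, 1.0]])
y = np.array([1.0, -1.0])

alpha = fit_coefficients(G, y, lam=0.1)
labels = predict_labels(G, alpha)
print(alpha, labels)
```

For losses without a closed form, such as the hinge loss, the same finite expansion would be fitted with a convex solver instead of a linear solve.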
3 Bounds of Generalization Error
The generalization analysis aims at bounding the misclassification error . Nevertheless, the algorithm is constructed by minimizing regularized empirical error associated with the loss function . Hence it seems necessary to build a bridge between the excess misclassification error and the excess convex risk. Fortunately, researchers have established comparison theorems to solve this problem (Bartlett & Mendelson, 2002; Zhang, 2004; Chen, Wu, Yin & Zhou, 2004; Wu & Zhou, 2005). Here we mention some results that will be used in this letter.
Furthermore, we can get tight comparison bounds when the distribution satisfies the Tsybakov noise condition (Tsybakov, 2004).
Note that all the distributions satisfy equation 3.1 with and . Tsybakov (2004) considered the convergence rate of the risk of a function that minimizes empirical risk over a fixed class and demonstrated that a fast convergence rate can be achieved under the Tsybakov noise condition. We assume that satisfies the Tsybakov condition to obtain a fast convergence rate.
Since Fredholm kernel regularized classifiers are obtained by composing the sgn function with a real-valued function, we expect to improve the error estimates by projecting the estimator onto $[-1, 1]$.
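The projection step can be sketched as a clipping operation. This example assumes the target interval is $[-1, 1]$, the usual choice when labels take values in $\{-1, 1\}$; the function name `project` and the sample values are hypothetical.

```python
import numpy as np

def project(values):
    """Clip real-valued predictions to [-1, 1] (assumed projection interval)."""
    return np.clip(values, -1.0, 1.0)

def hinge_loss(t):
    """Hinge loss (1 - t)_+ at margin t."""
    return np.maximum(0.0, 1.0 - t)

f_values = np.array([-3.2, -0.4, 0.0, 0.7, 5.0])   # raw estimator outputs
labels = np.array([1.0, -1.0, 1.0, 1.0, 1.0])      # illustrative labels

projected = project(f_values)

# The induced classifier is unchanged because clipping preserves the sign.
same_sign = np.array_equal(np.sign(projected), np.sign(f_values))

# For the hinge loss, the projected values never incur a larger loss.
loss_raw = hinge_loss(labels * f_values)
loss_projected = hinge_loss(labels * projected)
print(same_sign, loss_raw, loss_projected)
```

Because the sign is preserved, the misclassification error of the classifier is the same before and after projection, while the projected values are uniformly bounded, which is convenient for concentration arguments.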
It is easy to check that . A well-developed approach for conducting a generalization analysis of the regularization algorithm in RKHS is error decomposition, which allows the excess generalization error to be decomposed into sample error and approximation error (Zhou, 2002; Cucker & Smale, 2001). In the Fredholm learning framework, we formulate error decomposition in a similar way by introducing a data-independent regularized function. We first introduce some conditions on the capacity of hypothesis space and the approximation ability of Fredholm learning framework. The covering number (Zhou, 2002, 2003; Shi, Feng, & Zhou, 2011) is used to describe the capacity of a function space.
It can be observed from theorem 13 that the generalization bound relies on the capacity condition, the approximation condition, and the choice of the regularization parameter. Specifically, the labeled data play a key role in the generalization bound without extra assumptions on the marginal distribution, which is consistent with the theoretical analysis for semisupervised learning (Belkin & Niyogi, 2006; Chen, Zhou, Tang, Li, & Pan, 2013).
In the semisupervised learning literature (Johnson & Zhang, 2007; Chen, Pan, Li, & Tang, 2013), the learning rate is essentially determined by the number of labeled data. Nevertheless, it does not mean that unlabeled data have no effect on the final result. In fact, the estimation of hypothesis error involves the unlabeled data, and some empirical theoretical results illustrate that the unlabeled data are helpful for improving learning performance. However, the effect on the learning rate is limited due to the fact that .
By theorem 15, a direct corollary can be obtained:
We are now in a position to present the proofs of the main results based on the error decomposition, equation 3.4.
3.1 Estimation of Hypothesis Error
The following lemmas are useful for estimating hypothesis error.
3.2 Estimation of Sample Error
In this section, we focus on bounding the sample error. It should be noted that the estimation of sample error involves the sample and thus runs over a set of functions. Hence, we introduce the following two inequalities (Wu et al., 2007; Zhou & Jetter, 2006) to measure the uniform concentration estimate.
According to the definition of , by taking in equation 2.10, we can see . Hence, .
This letter investigates the generalization performance of Fredholm kernel regularized classifiers. Convergence analysis shows that the fast learning rate with can be reached under mild conditions for a family of classification algorithms with a Fredholm kernel. It will be interesting to explore fast optimization and a distributed framework for Fredholm kernel learning with big data.
Appendix: Proof of Representer Theorem
Two anonymous referees carefully read the manuscript for this letter and provided numerous constructive suggestions, which noticeably improved its overall quality. We are much indebted to these referees and are grateful for their help. The research was partially supported by the National 973 Program (2013CB329404) and the National Natural Science Foundation of China (11671161, 61673015, 11131006).