Abstract

Recently, a new framework, Fredholm learning, was proposed for semisupervised learning problems based on solving a regularized Fredholm integral equation. It offers a natural way to incorporate unlabeled data into learning algorithms to improve their prediction performance. Despite rapid progress on implementable algorithms with theoretical guarantees, the generalization ability of Fredholm kernel learning has not been studied. In this letter, we investigate the generalization performance of a family of classification algorithms, referred to as Fredholm kernel regularized classifiers. We prove that the corresponding learning rate can achieve a fast order with respect to the number of labeled samples in a limiting case. In addition, a representer theorem is provided for the proposed regularized scheme, which underlies its applications.

1  Introduction

Many scientific problems (e.g., regression and classification) come down to learning a prediction rule from finitely many input-output samples. Kernel tricks and methods based on integral operators provide powerful tools for such learning tasks and have become central to machine learning. To construct a good predictor, one usually selects a function from a class of candidate functions (the hypothesis space) by means of regularized learning schemes associated with certain loss functions.

Regularized kernel learning has attracted much attention due to its solid theoretical foundations and successful practical applications. Recently, a new kernel learning framework, Fredholm learning, has been proposed by reformulating the learning problem as a regularized Fredholm integral equation (Que, Belkin, & Wang, 2014; Que & Belkin, 2013). This framework provides a way to incorporate unlabeled data into learning algorithms and can be interpreted as a special form of kernel method with a data-dependent kernel, the Fredholm kernel. It has been shown that the Fredholm classification algorithm can reduce the variance of kernel evaluations in the presence of data noise and improve the prediction accuracy and robustness of kernel methods (Que et al., 2014).

Despite rapid progress on theoretical and empirical evaluations, the generalization performance of Fredholm kernel learning remains unknown. This letter takes a step toward answering this question. Specifically, we focus on the error analysis of a family of classification algorithms, Fredholm kernel regularized classifiers, and establish the corresponding generalization error bound. We show that the learning rate of Fredholm kernel regularized classifiers can achieve a fast order with respect to the number of labeled samples under mild conditions. In addition, we also establish a representer theorem, which makes the solution of the Fredholm kernel learning model computationally accessible.

The rest of the letter is organized as follows. Section 2 presents basic definitions and some necessary background. Section 3 focuses on establishing the generalization bounds for Fredholm kernel regularized classifiers. We conclude in Section 4.

2  Preliminaries

2.1  Classification in Learning Theory

We begin with a brief review of the binary classification problem. Let $X$ be a compact metric space and $Y = \{-1, 1\}$ be the output space. A classifier is a map $\mathcal{C}\colon X \to Y$ that makes a prediction $\mathcal{C}(x) \in Y$ for each $x \in X$. The relationship between the input and the output can be modeled by a probability measure $\rho$ on $Z := X \times Y$. Suppose $\rho$ admits the decomposition $\rho(x, y) = \rho_X(x)\,\rho(y \,|\, x)$, in which $\rho_X$ denotes the marginal probability measure on $X$ and $\rho(\cdot \,|\, x)$ denotes the conditional probability measure (given $x$) on $Y$. Then the prediction ability of a classifier can be measured by its misclassification error, defined as the probability of an incorrect prediction,
formula
2.1
Define the regression function
formula
The best classifier that minimizes the misclassification error is the Bayes rule, given by
formula
2.2
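For reference, these quantities can be written in the standard form used throughout learning theory; the notation below is ours and may differ superficially from the exact displays in equations 2.1 and 2.2.

```latex
% Misclassification error, regression function, and Bayes rule (standard forms).
\mathcal{R}(\mathcal{C}) = \operatorname{Prob}_{\rho}\{\mathcal{C}(x) \neq y\}
  = \int_X \rho\big(y \neq \mathcal{C}(x) \,\big|\, x\big)\, d\rho_X(x),
\qquad
f_\rho(x) = \int_Y y \, d\rho(y \,|\, x),
\qquad
f_c = \operatorname{sgn}(f_\rho).
```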
To formulate the error analysis for Fredholm kernel regularized classifiers, we assume the loss function satisfies the following condition:
Definition 1.

We say that $\phi\colon \mathbb{R} \to [0, \infty)$ is a normalized classification loss function if it is convex, differentiable at 0 with $\phi'(0) < 0$, and the smallest zero of $\phi$ is 1.

Typical examples of classification losses include the hinge loss $\phi(t) = \max\{1 - t, 0\}$ for the support vector machine (SVM) 1-norm soft margin classifier, $\phi(t) = (\max\{1 - t, 0\})^q$ for the SVM $q$-norm ($q > 1$) soft margin classifier, and the least squares loss $\phi(t) = (1 - t)^2$.

Define the expected risk associated with $\phi$ as
formula
2.3
However, it cannot be computed directly since $\rho$ is usually unknown. Its discretization, which is computable in terms of the finite sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$, is therefore often used instead and is defined as
formula
2.4
and is called the empirical risk. Regularized learning schemes, implemented by minimizing a penalized version of the empirical risk, aim to find a good approximation of the Bayes rule. The minimization is taken over a set of functions, known as the hypothesis space, which is usually chosen to be a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel. Such a kernel $K$ is a continuous, symmetric function on $X \times X$ such that the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite for any finite subset $\{x_1, \dots, x_{\ell}\} \subset X$. It is well known that each Mercer kernel corresponds to a unique RKHS $(\mathcal{H}_K, \|\cdot\|_K)$ (Aronszajn, 1950). The reproducing property takes the form $f(x) = \langle f, K(x, \cdot)\rangle_K$ for all $f \in \mathcal{H}_K$ and $x \in X$. The standard regularized classifier is defined by
formula
2.5
where $\lambda > 0$ is a regularization parameter that controls the trade-off between the empirical risk and the regularization term.
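For concreteness, the expected risk, the empirical risk, and the standard regularized scheme of equations 2.3 to 2.5 can be written in the usual form; the $1/m$ normalization is the common convention and is assumed here.

```latex
% Expected risk, empirical risk, and the standard regularized classifier.
\mathcal{E}(f) = \int_Z \phi\big(y f(x)\big)\, d\rho, \qquad
\mathcal{E}_{\mathbf z}(f) = \frac{1}{m}\sum_{i=1}^{m} \phi\big(y_i f(x_i)\big), \qquad
\operatorname*{arg\,min}_{f \in \mathcal{H}_K}
  \Big\{ \mathcal{E}_{\mathbf z}(f) + \lambda \|f\|_K^2 \Big\}.
```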

2.2  Fredholm Learning Framework

Let $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$ be the labeled pairs drawn from the distribution $\rho$ and $\{v_j\}_{j=1}^{u}$ be the unlabeled data points drawn from the marginal distribution $\rho_X$. The goal of semisupervised classification is to construct a reliable classifier by combining the information in the labeled and unlabeled data. To this end, we introduce an integral operator associated with a kernel function, defined by
formula
2.6
where $p$ denotes the density function of $\rho_X$ and the operator acts on the space of square-integrable functions. By the law of large numbers, this operator can be approximated using the unlabeled data drawn from $\rho_X$ as
formula
2.7
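Following the notation of Que et al. (2014), which we adopt as an assumption about the form of equations 2.6 and 2.7, the operator and its empirical counterpart can be written as

```latex
% Integral operator with kernel k and its empirical version on the unlabeled points.
(L_k f)(x) = \int_X k(x, t)\, f(t)\, p(t)\, dt, \qquad
(L_{k,u} f)(x) = \frac{1}{u} \sum_{j=1}^{u} k(x, v_j)\, f(v_j),
```

where $k$ denotes the kernel of the operator (the outer kernel in the terminology below) and $v_1, \dots, v_u$ are the unlabeled points.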
The target of the Fredholm learning framework is then to solve the following optimization problem:
formula
2.8
The final classifier is the sign of the resulting real-valued function. Notice that equation 2.8 can be considered as an empirical, regularized version of a Fredholm integral equation. This is why such a learning scheme is called the Fredholm learning framework. In this framework, the intrinsic hypothesis space is density dependent.

At first glance, the Fredholm learning framework defined in equation 2.8 looks similar to the standard regularized learning scheme, equation 2.5. However, there is a significant difference: the density-dependent hypothesis space enables us to integrate the information carried by the unlabeled data. In contrast, most kernels used in the traditional kernel learning framework (e.g., the linear, gaussian, and polynomial kernels) are completely independent of the data distribution. In particular, when the outer kernel is taken to be the $\delta$-function, formulation 2.8 reduces to the standard regularized learning framework.

In addition, the solution of the optimization problem, equation 2.8, is computationally accessible thanks to the well-known representer theorem in an RKHS. That theorem (Que et al., 2014) allows us to transform equation 2.8 into a quadratic optimization problem in a finite-dimensional space. The solution can be represented as
formula
where the expansion coefficients and kernel evaluations involve both the labeled and the unlabeled samples. Let $\mathbf{K}_F$ denote the matrix associated with a new kernel, named the Fredholm kernel, defined by
formula
2.9
Then the solution of equation 2.8 can be rewritten as
formula
Equation 2.9 involves an "inner" kernel and an "outer" kernel. It can be proved that the Fredholm kernel defined in equation 2.9 is positive semidefinite provided the inner kernel is positive semidefinite. Note that the outer kernel need not be positive definite or even symmetric and can be selected flexibly based on the user's preferences.
Inspired by the Fredholm kernel (Que et al., 2014), we propose a family of Fredholm kernel regularized classifiers given by
formula
2.10
where $\lambda > 0$ is a regularization parameter.

The main goal of this letter is to investigate the generalization performance of equation 2.10. Specifically, we expect to give an explicit convergence rate for Fredholm kernel regularized classifiers under some mild conditions. The following proposition states the representer theorem for Fredholm kernel regularized classifiers:

Proposition 1.
Assume the empirical operator is defined as in equation 2.7 and $\phi$ is a classification loss. Then the solution of equation 2.10 is of the form
formula
for some real coefficients determined by the data.

The proof of proposition 1 is similar to the analysis of the representer theorem for kernel methods (e.g., Belkin & Niyogi, 2006). We provide its proof in the appendix.
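To make the computational content of proposition 1 concrete, the sketch below builds the Fredholm kernel matrix from an inner and an outer Gram matrix and solves the least-squares special case of the regularized problem. The $1/u^2$ normalization and the closed-form ridge solution follow the construction of Que et al. (2014); the gaussian kernels, bandwidths, and toy data are our illustrative assumptions rather than the exact algorithm analyzed in this letter.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gaussian Gram matrix between the rows of A and B (illustrative kernel choice)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fredholm_kernel(X1, X2, U, inner, outer):
    """Fredholm kernel matrix K_F(x, x') = (1/u^2) * sum_{i,j} k(x, v_i) K_H(v_i, v_j) k(x', v_j),
    with outer kernel k and inner kernel K_H evaluated on the unlabeled points U
    (the form given in Que et al., 2014)."""
    u = U.shape[0]
    K_out1 = outer(X1, U)          # shape (n1, u)
    K_out2 = outer(X2, U)          # shape (n2, u)
    K_in = inner(U, U)             # shape (u, u)
    return K_out1 @ K_in @ K_out2.T / (u ** 2)

# Toy data: m labeled points and u unlabeled points (assumed setup).
rng = np.random.default_rng(0)
m, u, lam = 40, 200, 1e-2
X = rng.normal(size=(m, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=m))
U = rng.normal(size=(u, 2))

inner = lambda A, B: gaussian_gram(A, B, sigma=1.0)
outer = lambda A, B: gaussian_gram(A, B, sigma=0.5)

# Least-squares special case: solve (K_F + lam * m * I) alpha = y.
K_F = fredholm_kernel(X, X, U, inner, outer)
alpha = np.linalg.solve(K_F + lam * m * np.eye(m), y)

# Predict on new points via the same Fredholm kernel and take the sign.
X_test = rng.normal(size=(5, 2))
scores = fredholm_kernel(X_test, X, U, inner, outer) @ alpha
print(np.sign(scores))
```

For a general classification loss, the same Fredholm kernel matrix can be passed to any off-the-shelf kernel solver; only the closed-form ridge step above would change.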

3  Bounds of Generalization Error

The generalization analysis aims at bounding the excess misclassification error. Nevertheless, the algorithm is constructed by minimizing a regularized empirical risk associated with the convex loss $\phi$. Hence it is necessary to build a bridge between the excess misclassification error and the excess convex risk. Fortunately, researchers have established comparison theorems to solve this problem (Bartlett & Mendelson, 2002; Zhang, 2004; Chen, Wu, Ying, & Zhou, 2004; Wu & Zhou, 2005). Here we mention some results that will be used in this letter.

Lemma 1.
(Chen et al., 2004). If an activating loss satisfies , then there exists a constant such that for any measurable function , there holds
formula

Furthermore, we can get tight comparison bounds when the distribution satisfies the Tsybakov noise condition (Tsybakov, 2004).

Definition 2.
We say that $\rho$ has Tsybakov noise exponent $q \ge 0$ if there exists a constant $c_q > 0$ such that for every measurable function $f$,
formula
3.1
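A commonly used statement of this condition, which we take to be the content of equation 3.1, is

```latex
% Tsybakov noise condition with exponent q and constant c_q.
\rho_X\big(\{x \in X : \operatorname{sgn}(f(x)) \neq f_c(x)\}\big)
  \le c_q \big(\mathcal{R}(\operatorname{sgn}(f)) - \mathcal{R}(f_c)\big)^{q}
  \qquad \text{for every measurable } f \colon X \to \mathbb{R}.
```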

Note that every distribution satisfies equation 3.1 with $q = 0$ and $c_q = 1$. Tsybakov (2004) considered the convergence rate of the risk of a function that minimizes the empirical risk over a fixed class and demonstrated that a fast convergence rate can be achieved under this noise condition. We assume that $\rho$ satisfies the Tsybakov condition in order to obtain a fast convergence rate.

Lemma 2
(Wu, Ying, & Zhou, 2007). Let the classification loss satisfy . If $\rho$ satisfies the Tsybakov noise condition, equation 3.1, for some $q > 0$ and $c_q > 0$, then
formula

Since Fredholm kernel regularized classifiers are obtained by composing the sgn function with a real-valued function, we can improve the error estimates by projecting the estimator onto the interval $[-1, 1]$.

Definition 3.
The projection operator is defined on the space of measurable functions as
formula
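Explicitly, the projection is the usual truncation onto the interval $[-1, 1]$:

```latex
\pi(f)(x) =
\begin{cases}
  1,    & \text{if } f(x) > 1, \\
  f(x), & \text{if } -1 \le f(x) \le 1, \\
  -1,   & \text{if } f(x) < -1.
\end{cases}
```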

It is easy to check that $\operatorname{sgn}(\pi(f)) = \operatorname{sgn}(f)$. A well-developed approach for the generalization analysis of regularization algorithms in an RKHS is error decomposition, which splits the excess generalization error into a sample error and an approximation error (Zhou, 2002; Cucker & Smale, 2001). In the Fredholm learning framework, we formulate the error decomposition in a similar way by introducing a data-independent regularized function. We first introduce some conditions on the capacity of the hypothesis space and the approximation ability of the Fredholm learning framework. The covering number (Zhou, 2002, 2003; Shi, Feng, & Zhou, 2011) is used to describe the capacity of a function space.

Definition 4.
Let $(\mathcal{M}, d)$ be a pseudometric space and $S \subset \mathcal{M}$ a subset. For every $\varepsilon > 0$, the covering number $\mathcal{N}(S, \varepsilon, d)$ of $S$ with respect to $\varepsilon$ and $d$ is defined as the minimal number of balls of radius $\varepsilon$ whose union covers $S$, that is,
formula
where is a ball in .
Definition 5.
Let be a class of functions on sample set . The -metric is defined on by
formula
For every , the covering number of with -metric is defined as
formula
and the covering number of with -metric is denoted by . Note that for any function set , there exists .
For , denote
formula
and
formula
Assumption 1
(capacity condition). For the inner kernel and outer kernel , there exist positive constants and such that for any ,
formula
where are positive constants independent of .
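A typical instance of such a capacity condition, given only to illustrate what assumption 1 requires (the exponent and constant are placeholders rather than the authors' exact quantities), is a polynomial bound on the log covering numbers of the unit ball associated with each kernel:

```latex
\log \mathcal{N}\big(B_1, \varepsilon\big) \le c_s\, \varepsilon^{-s}
  \qquad \text{for all } \varepsilon > 0 \text{ and some } s > 0,
```

with an analogous bound (possibly with a different exponent) for the ball induced by the outer kernel.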
The data-independent regularized function is defined by
formula
3.2
Denote
formula
3.3
Then the approximation ability of the Fredholm kernel learning scheme can be characterized by
formula
Assumption 2
(approximation condition). There exists a constant such that
formula
where is a constant independent of .
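For orientation, a typical approximation condition of this kind, stated with placeholder constants rather than the authors' exact quantities, is a power-law decay of the approximation quantity characterized above, here denoted $\mathcal{D}(\lambda)$:

```latex
\mathcal{D}(\lambda) \le c_{\beta}\, \lambda^{\beta}
  \qquad \text{for some } 0 < \beta \le 1 \text{ and every } \lambda > 0.
```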
Assumption 3.
Suppose the distribution $\rho$ satisfies the Tsybakov noise condition, equation 3.1, and there exist an exponent and a constant such that
formula
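The condition required here is a variance-expectation bound; in the form commonly used for regularized classifiers (Wu, Ying, & Zhou, 2007), which we take as the intended reading, it requires that for some $\tau \in [0, 1]$ and constant $C_0 > 0$,

```latex
\mathbb{E}\Big[\big(\phi(y f(x)) - \phi(y f_c(x))\big)^{2}\Big]
  \le C_0\, \big(\mathcal{E}(f) - \mathcal{E}(f_c)\big)^{\tau}
  \qquad \text{for every measurable } f \colon X \to [-1, 1].
```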

Following the ideas in Wu and Zhou (2005, 2008), we obtain the following error decomposition:

Proposition 2.
Let the estimator $f_{\mathbf z}$ be defined as in equation 2.10 with the given samples and regularization parameter $\lambda$. Then
formula
3.4
where
formula
are named sample error, approximation error, and hypothesis error, respectively.
Proof.
A direct computation shows that
formula
Then the conclusion holds true since both the second term and the last term of the above equality are at most 0.
Theorem 1.
Let $\phi$ be a normalized classification loss, and suppose it satisfies the increment condition with a given exponent and constant. Under assumptions 1, 2, and 3, for any $0 < \delta < 1$, the following inequality,
formula
holds with probability at least $1 - \delta$, where the constant involved is independent of the sample size.
Remark 1.

It can be observed from theorem 1 that the generalization bound relies on the capacity condition, the approximation condition, and the choice of the regularization parameter $\lambda$. Specifically, the labeled data play a key role in the generalization bound without extra assumptions on the marginal distribution, which is consistent with the theoretical analysis for semisupervised learning (Belkin & Niyogi, 2006; Chen, Zhou, Tang, Li, & Pan, 2013).

Theorem 2.
Under the same conditions as in theorem 1 and with a suitably chosen regularization parameter $\lambda$, we get
formula
3.5
which holds with probability at least $1 - \delta$, where
formula
3.6
and the constant is independent of the sample sizes and the confidence level $\delta$.
Remark 2.
Under favorable capacity and approximation exponents, the learning rate in equation 3.5 becomes arbitrarily close to the order that is regarded as optimal in theory. We give two examples to show how the distribution and the variance-expectation bound influence the learning rate. For the hinge loss, if $\rho$ satisfies the Tsybakov noise condition, equation 3.1, then assumption 3 is valid with an exponent determined by the noise exponent $q$ (Steinwart & Scovel, 2005). For the least squares loss, a direct computation implies
formula
Assumption 3 then holds with the corresponding exponent and constant (Lee, Bartlett, & Williamson, 1996). Theorem 2 illustrates that Fredholm kernel regularized classifiers inherit the convergence characteristics of standard kernel-based regularized classification algorithms.
Remark 3.

In the semisupervised learning literature (Johnson & Zhang, 2007; Chen, Pan, Li, & Tang, 2013), the learning rate is essentially determined by the number of labeled examples. Nevertheless, this does not mean that unlabeled data have no effect on the final result. In fact, the estimation of the hypothesis error involves the unlabeled data, and both empirical and theoretical results illustrate that unlabeled data help to improve learning performance. However, the effect on the learning rate is limited, as the unlabeled samples affect only the hypothesis error term.

From theorem 2, a direct corollary can be obtained:

Corollary 1.
Under the same conditions as in theorem 2, for any $0 < \delta < 1$, there exists a constant independent of the sample size such that the following inequality,
formula
holds with confidence at least $1 - \delta$, where the exponent is given by equation 3.6. In addition, if $\rho$ satisfies the Tsybakov noise condition with a larger noise exponent, the power in the rate can be improved further.

We are now in a position to present the proofs of the main results, based on the error decomposition in equation 3.4.

3.1  Estimation of Hypothesis Error

The following lemmas are useful for estimating hypothesis error.

Lemma 3
(Smale & Zhou, 2007). Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable with values in $\mathcal{H}$. Assume that $\|\xi\| \le \widetilde{M} < \infty$ almost surely. Denote $\sigma^2(\xi) = \mathbb{E}\|\xi\|^2$. Then for any given set of independent and identically distributed samples and any $0 < \delta < 1$, there holds
formula
with confidence at least $1 - \delta$.
Lemma 4.
Let $\phi$ be a normalized classification loss function. Then
formula
Proof.
Define a univariate convex function for as
formula
It is easy to check that the one-side derivatives of exist, are nondecreasing, and satisfy for every . Denote
formula
Then for each , the univariate function is strictly decreasing in and strictly increasing in . For , we have . Hence, is a constant, which is the minimal value on .
With the function , we can rewrite as
formula
Since and , we derive and for . Let for . The only thing we need to prove is that
formula
holds for those with . By the deduction conducted above, such a point must satisfy .
, which means that is strictly increasing on ; hence, . Then we have
formula
Because is convex, the one-side derivatives and exist, are nondecreasing, and satisfy for every . Consider
formula
Then
formula
For , is strictly decreasing on . Hence, , and we have
formula
Because
formula
we find that the previous equation still holds.
Proposition 3.
For any $0 < \delta < 1$, with confidence at least $1 - \delta$, we have
formula
3.7
Proof.
According to lemma 4, we have
formula
Let , which is a continuous and bounded variable on . Then
formula
and
formula
Since , and applying lemma 3 to the random variable , we obtain
formula

3.2  Estimation of Sample Error

In this section, we focus on bounding the sample error. It should be noted that the estimation of the sample error involves the sample-dependent estimator and thus requires uniform bounds over a set of functions. Hence, we introduce the following two inequalities (Wu et al., 2007; Zhou & Jetter, 2006) to obtain uniform concentration estimates.

Lemma 5.
Let be a set of measurable functions on . Assume that there are constants and such that and for every . If for some and , for any , then there exists a constant such that
formula
holds with confidence at least , where
formula
Lemma 6.
Let , , and be a set of functions on such that for every , , and . Then for any ,
formula

According to the definition of the minimizer, by taking $f = 0$ in equation 2.10, we can see that its regularized empirical risk is at most $\phi(0)$. Hence the penalty term, and therefore the norm of $f_{\mathbf z}$, is bounded in terms of $\lambda$, as sketched below.
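The omitted bound is the standard one obtained by comparing the minimizer with the zero function; under the usual normalization of equation 2.10 it reads

```latex
\lambda \|f_{\mathbf z}\|_K^2
  \le \mathcal{E}_{\mathbf z}(f_{\mathbf z}) + \lambda \|f_{\mathbf z}\|_K^2
  \le \mathcal{E}_{\mathbf z}(0) = \phi(0),
\qquad \text{hence} \qquad
\|f_{\mathbf z}\|_K \le \sqrt{\phi(0)/\lambda}.
```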

Proposition 4.
Suppose the normalized classification loss satisfies the increment condition with a given exponent and constant. Under assumptions 1, 2, and 3, for any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
formula
3.8
where
formula
and
formula
Proof.
We divide the sample error into two parts,
formula
where , .
We first estimate . Denote . According to the definition of , we know that for , with . Let
formula
since is restricted in . Then we have and . Accordingly,
formula
In connection with assumption 1, this means that
formula
By lemma 6, we obtain
formula
Assume is the positive solution of the following equation:
formula
Then one can check that is strictly decreasing. Hence, if .
Denote . If we take to be a positive number that satisfies and
formula
we have .
Note that the inequality satisfied by can be written as
formula
According to lemma 7 (Cucker & Smale, 2002), we can choose
formula
which indicates that
formula
Applying lemma 6 with , we know that for all ,
formula
Recall the elementary inequality:
formula
Using this for , , , we can find
formula
By the fact that
formula
we get
formula
3.9
We now focus on estimating . Let , and . Then
formula
Since , the monotonicity of tells us that
formula
Let . Then for any , there exists
formula
which implies that
formula
Assumption 3 tells us that . Then by lemma 5, the following holds with confidence at least ,
formula
3.10
where
formula
Combining equations 3.9 and 3.10, we obtain the desired results.
Proof of Theorem 1.
Combining propositions 2, 3, and 4, with confidence at least , we have
formula
3.11
where is given in proposition 4. When , with confidence at least , there holds
formula
When , we find equation 3.11 still holds by the elementary inequality: if , then
formula
Considering with and setting , and , the following inequality holds,
formula
with probability at least , where is a positive constant independent of .
Proof of Theorem 2.
Theorem 1 tells us that with confidence ,
formula
Since , without loss of generality, assume that and let
formula
Taking , we can verify that
formula
Hence, with confidence , we have
formula

4  Conclusion

This letter investigates the generalization performance of Fredholm kernel regularized classifiers. Convergence analysis shows that a fast learning rate can be reached under mild conditions for a family of classification algorithms based on the Fredholm kernel. It will be interesting to explore fast optimization methods and distributed frameworks for Fredholm kernel learning with big data.

Appendix: Proof of Representer Theorem

Here we focus on only one representative case; the proofs for the other cases are similar. Define the empirical loss for the learning problem:
formula
We first project the candidate function onto the subspace of the RKHS that is spanned by the kernel functions centered at the data points. Then we obtain the orthogonal decomposition,
formula
where the first component lies in the subspace and the second is orthogonal to it. By definition, the orthogonal component vanishes at the data points. Thus, the empirical loss can be expressed as
formula
Hence, the orthogonal component plays no role in the empirical risk. For the regularization term, since
formula
then the regularization term can only decrease when the orthogonal component is dropped. By combining the results above, we have
formula
which implies that the objective is minimized when the solution lies in the subspace, that is, when the orthogonal component vanishes. The conclusion holds true.

Acknowledgments

Two anonymous referees carefully read the manuscript for this letter and provided numerous constructive suggestions. As a result, the overall quality of the letter has been noticeably enhanced; we are much indebted to these referees and are grateful for their help. The research was partially supported by the National 973 Program (2013CB329404) and the National Natural Science Foundation of China (grants 11671161, 61673015, and 11131006).

References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.

Bartlett, P., & Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 463–482.

Belkin, M., & Niyogi, P. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7, 2399–2434.

Chen, D., Wu, Q., Ying, Y., & Zhou, D. (2004). Support vector machine soft margin classifiers: Error analysis. J. Mach. Learn. Res., 5, 1143–1175.

Chen, H., Pan, Z., Li, L., & Tang, Y. (2013). Learning rates of coefficient-based regularized classifier for density level detection. Neural Comput., 25, 1107–1121.

Chen, H., Zhou, Y., Tang, Y., Li, L., & Pan, Z. (2013). Convergence rate of semisupervised greedy algorithm. Neural Networks, 44, 44–50.

Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39, 1–49.

Cucker, F., & Smale, S. (2002). Best choices for regularization parameters in learning theory: On the bias-variance problem. Found. Comput. Math., 1, 413–428.

Johnson, R., & Zhang, T. (2007). On the effectiveness of Laplacian normalization for graph semi-supervised learning. J. Mach. Learn. Res., 8, 1489–1517.

Lee, W., Bartlett, P., & Williamson, R. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inf. Theory, 42, 2118–2132.

Que, Q., & Belkin, M. (2013). Inverse density as an inverse problem: The Fredholm equation approach. In C. J. C. Burges, L. Bottou, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26. Red Hook, NY: Curran.

Que, Q., Belkin, M., & Wang, Y. (2014). Learning with Fredholm kernels. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27. Red Hook, NY: Curran.

Shi, L., Feng, Y., & Zhou, D. (2011). Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal., 31, 286–302.

Smale, S., & Zhou, D. (2007). Learning theory estimates via integral operators and their approximations. Constr. Approx., 26, 153–172.

Steinwart, I., & Scovel, C. (2005). Fast rates for support vector machines. In Proceedings of the 18th Conference on Learning Theory. New York: Springer.

Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32, 135–166.

Wu, Q., Ying, Y., & Zhou, D. (2007). Multi-kernel regularized classifiers. J. Complexity, 23, 108–134.

Wu, Q., & Zhou, D. (2005). SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Comput., 17, 1160–1187.

Wu, Q., & Zhou, D. (2008). Learning with sample dependent hypothesis spaces. Comput. Math. Appl., 56, 2896–2907.

Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56–134.

Zhou, D. (2002). The covering number in learning theory. J. Complexity, 18, 739–767.

Zhou, D. (2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inf. Theory, 49, 1743–1752.

Zhou, D., & Jetter, K. (2006). Approximation with polynomial kernels and SVM classifiers. Adv. Comput. Math., 25, 323–344.