Abstract

In this letter, we consider a density-level detection (DLD) problem via a coefficient-based classification framework with an ℓ¹-regularizer and data-dependent hypothesis spaces. Although the data-dependent nature of the algorithm provides flexibility and adaptivity for DLD, it complicates the generalization error analysis. To overcome this difficulty, an error decomposition is introduced based on an established classification framework. On the basis of this decomposition, an estimate of the learning rate is obtained by using the Rademacher average and a stepping-stone technique. In particular, the estimate is independent of the capacity assumptions used in the previous literature.

1.  Introduction

The aim of this letter is to provide an error analysis for density-level detection (DLD) by a classification method with ℓ¹-coefficient regularization and data-dependent hypothesis spaces. Our study is motivated by the increasing attention paid to the classification framework for DLD (Steinwart, Hush, & Scovel, 2005; Scovel, Hush, & Steinwart, 2005; Cao, Xing, & Zhao, 2012) and to error analysis with data-dependent hypothesis spaces (Wu, Ying, & Zhou, 2005; Wu & Zhou, 2008; Sun & Wu, 2011; Tong, Chen, & Yang, 2010; Shi, Feng, & Zhou, 2011; Xiao & Zhou, 2011; Feng & Lv, 2011; Song & Zhang, 2011).

Following the exposition in Steinwart et al. (2005), we introduce the preliminary background of the DLD problem in a Hilbert space. Let H be a separable Hilbert space (possibly infinite dimensional), and let X ⊂ H satisfy ‖x‖ ≤ B for all x ∈ X and a positive constant B. Let Q be an unknown data-generating distribution on X. Usually we describe data as anomalies if they are not concentrated (see Ripley, 1996; Schölkopf & Smola, 2002). A reference distribution μ on X is required to describe the concentration of Q. Assume that Q has a density h with respect to μ, that is, dQ = h dμ. Given a ρ > 0, we define the ρ-level set of the density h as {h > ρ} := {x ∈ X : h(x) > ρ}. This set describes the concentration of Q. To define anomalies in terms of the concentration, one has only to fix a threshold ρ so that a sample x is considered anomalous whenever h(x) ≤ ρ. We assume that {h = ρ} is a μ-zero set, a common assumption in the literature (see Polonik, 1995; Tsybakov, 1997). The goal of the DLD problem is to find the ρ-level set {h > ρ} as precisely as possible from empirical data.
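To fix ideas, the following minimal sketch flags a point as anomalous exactly when it lies outside the ρ-level set {h > ρ}. The one-dimensional Gaussian density and the particular threshold are illustrative assumptions, not part of the letter's setting.

```python
import numpy as np

def gaussian_density(x, mean=0.0, std=1.0):
    """Density h of Q with respect to Lebesgue measure (illustrative choice)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def is_anomalous(x, rho, density=gaussian_density):
    """A point is anomalous when it falls outside the rho-level set {h > rho}."""
    return density(x) <= rho

rho = 0.05                                  # illustrative threshold
for x in (0.0, 1.0, 3.5):
    print(x, bool(is_anomalous(x, rho)))    # 3.5 lies far in the tail: anomalous
```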

Steinwart et al. (2005) transformed the DLD problem into a binary classification problem. In this classification framework, the DLD task reduces to learning a measurable function f: X → ℝ such that the set {f > 0} is a good estimate of the ρ-level set {h > ρ}. The approximation performance of f is measured by
$$ S_{\mu,h,\rho}(f) := \mu\big(\{f > 0\} \,\triangle\, \{h > \rho\}\big), $$
where △ denotes the symmetric difference.

Since the empirical comparison in terms of S_{μ,h,ρ} is difficult, Steinwart et al. (2005) proposed a novel performance measure. Let s := 1/(1 + ρ).

Definition 1. 
Let Q and μ be probability measures on X and s ∈ (0, 1). Then the probability measure Q_s on X × {−1, 1} is defined by
$$ Q_s(A) := s\,\mathbb{E}_Q\big[\mathbf{1}_A(x, 1)\big] + (1 - s)\,\mathbb{E}_\mu\big[\mathbf{1}_A(x, -1)\big] $$
for all measurable subsets A ⊂ X × {−1, 1}. Here 1_A denotes the indicator function: it assumes 1 if its argument belongs to A and 0 otherwise.

From this definition, we know that Q_s can be associated with a binary classification problem in which positive samples are drawn from sQ and negative samples are drawn from (1 − s)μ. On the basis of this interpretation, Steinwart et al. (2005) proposed a kernel method. Recall that K: X × X → ℝ is a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The reproducing kernel Hilbert space (RKHS) H_K associated with a Mercer kernel K is defined (see Aronszajn, 1950) as the closure of the linear span of the set of functions {K_x := K(x, ·) : x ∈ X}, equipped with the inner product ⟨·, ·⟩_K defined by ⟨K_x, K_y⟩_K = K(x, y).
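The classification interpretation can be made concrete by a sampling routine: with probability s, draw a positive example from Q; otherwise draw a negative example from μ. The particular generators for Q and μ below, and the choice ρ = 0.3, are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Q():      # data-generating distribution Q (illustrative choice)
    return rng.normal(loc=1.0, scale=0.5)

def sample_mu():     # reference distribution mu (illustrative choice)
    return rng.normal(loc=0.0, scale=1.0)

def sample_Qs(n, s):
    """Draw n labeled pairs (x, y) from the mixture measure Q_s of definition 1:
    label +1 with probability s (sample from Q), label -1 otherwise (from mu)."""
    pairs = []
    for _ in range(n):
        if rng.random() < s:
            pairs.append((sample_Q(), +1))
        else:
            pairs.append((sample_mu(), -1))
    return pairs

print(sample_Qs(n=5, s=1.0 / (1.0 + 0.3)))   # s = 1/(1 + rho) with rho = 0.3
```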

When the sample set T⁺ is assumed to be positively labeled and drawn independently from Q, and infinitely many negatively labeled samples are available through knowledge of μ, Steinwart et al. (2005) considered the following empirical error:
formula
As pointed out by Steinwart et al. (2005), the expectation with respect to μ can be computed numerically through finitely many evaluations of f on samples drawn from μ. The empirical risk of f is then defined as
formula
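For concreteness, a minimal sketch of such an empirical risk is given below. It assumes the hinge-type losses (1 − f(x))₊ on the positively labeled samples and (1 + f(x))₊ on the samples drawn from μ, weighted by s and 1 − s; this matches the convex loss used in section 2 but is otherwise an assumption about the exact form of the displayed risk.

```python
import numpy as np

def empirical_risk(f, T_plus, T_minus, s):
    """Weighted hinge-type empirical risk: (1 - f(x))_+ on samples from Q and
    (1 + f(x))_+ on samples from mu, weighted by s and 1 - s (assumed form)."""
    pos = np.mean([max(0.0, 1.0 - f(x)) for x in T_plus])
    neg = np.mean([max(0.0, 1.0 + f(x)) for x in T_minus])
    return s * pos + (1.0 - s) * neg

# usage with a toy linear score function
f = lambda x: 2.0 * x - 1.0
print(empirical_risk(f, T_plus=[1.0, 1.5, 0.8], T_minus=[-0.2, 0.1, -1.0], s=0.75))
```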
Steinwart et al. (2005) and Scovel et al. (2005) proposed the following regularized algorithm:
formula
1.1
where λ > 0 is a regularization parameter.
The convergence of equation 1.1 is well understood thanks to the work of Scovel et al. (2005) and Cao et al. (2012). Recently, ℓ¹-coefficient regularization has attracted much attention in learning theory for classification and regression (Xiao & Zhou, 2011; Shi et al., 2011; Song & Zhang, 2011). The increasing interest stems mainly from the progress of the Lasso in statistics (Tibshirani, 1996) and compressive sensing (Chen, Donoho, & Saunders, 1998; Candès, Romberg, & Tao, 2006), where ℓ¹-regularization is able to yield a sparse representation of the resulting minimizer. Inspired by these analyses, in this letter we consider a new coefficient-based regularized scheme for DLD defined by
formula
where
formula
1.2
λ > 0 is a regularization parameter, and
formula
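A minimal numerical sketch of such a coefficient-based scheme is given below. It assumes the weighted hinge empirical risk sketched above, a Gaussian kernel, a kernel expansion over the positive samples only, and plain subgradient descent as the solver; none of these choices is prescribed by equation 1.2 itself.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.abs(x - z) ** 2 / (2.0 * sigma ** 2))

def fit_l1_coefficient_dld(T_plus, T_minus, s, lam, steps=2000, lr=0.01):
    """Minimize the weighted hinge risk plus lam * ||alpha||_1 over functions
    f(x) = sum_i alpha_i K(x, x_i) with centers x_i in T_plus (an assumption).
    Plain subgradient descent is an illustrative solver choice."""
    centers = np.asarray(T_plus, dtype=float)
    alpha = np.zeros(len(centers))
    Kp = np.array([[gaussian_kernel(x, c) for c in centers] for x in T_plus])
    Kn = np.array([[gaussian_kernel(x, c) for c in centers] for x in T_minus])
    for _ in range(steps):
        fp, fn = Kp @ alpha, Kn @ alpha
        # subgradient of s*mean((1-fp)_+) + (1-s)*mean((1+fn)_+) + lam*||alpha||_1
        g = (-s * Kp.T @ (fp < 1.0).astype(float) / len(T_plus)
             + (1.0 - s) * Kn.T @ (fn > -1.0).astype(float) / len(T_minus)
             + lam * np.sign(alpha))
        alpha -= lr * g
    return lambda x: float(sum(a * gaussian_kernel(x, c) for a, c in zip(alpha, centers)))

f_T = fit_l1_coefficient_dld(T_plus=[0.9, 1.2, 1.6], T_minus=[-1.0, -0.3, 0.1],
                             s=0.75, lam=0.01)
print(f_T(1.1), f_T(-0.8))   # scores; {f_T > 0} is the estimate of the level set
```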

Our main goal is to establish the generalization error estimate for equation 1.2. Compared with the error analysis of algorithm 1.1 (Scovel et al., 2005; Cao et al., 2012), the following three features highlight the main theoretical contributions of this letter:

•  Although the ℓ¹-regularizer has been successfully used for least square regression (Wu & Zhou, 2008; Xiao & Zhou, 2011; Shi et al., 2011; Song & Zhang, 2011) and linear programming SVMs (Vapnik, 1998; Wu et al., 2005; Wu & Zhou, 2008), there has been no such study for the DLD problem. Our method enriches the algorithm design for the DLD problem.

•  The regularization term in equation 1.2 is essentially different from that in equation 1.1 and depends closely on the empirical data. This leads to additional difficulty in the error analysis. A stepping-stone technique for the DLD problem is introduced to overcome this difficulty. We also note that all previous applications of the stepping-stone technique have been restricted to SVM and least square regression frameworks (Wu et al., 2005; Feng & Lv, 2011). Hence our hypothesis error estimate is novel and sheds new light on applications of the stepping-stone technique.

•  It is worth noting that the sample error estimates in Scovel et al. (2005) and Cao et al. (2012) depend on a capacity measure based on covering numbers. However, in some applications the input items are random functions (speech recordings, spectra, images), which casts the DLD problem into the general class of functional data analysis (Ramsay & Silverman, 1997; Biau, Devroye, & Lugosi, 2008; Chen & Li, 2012). In these cases, the covering number assumption may be invalid, since it usually depends on the dimension of the input data. The sample error estimate here is independent of covering numbers and is suitable for DLD problems whose inputs belong to an infinite-dimensional separable Hilbert space.

The rest of this letter is organized as follows. In section 2, we introduce the necessary notations and definitions. The estimates for hypothesis error and sample error are presented in sections 3 and 4, respectively. The main result on learning rate is proved in section 5. We close with a brief conclusion in section 6.

2.  Preliminaries

In this section we introduce some definitions and notations used throughout this letter.

In the binary classification setting, the misclassification risk of a measurable function f: X → ℝ with respect to a distribution P on X × {−1, 1} is defined by
$$ \mathcal{R}(f) := P\big(\{(x, y) : \operatorname{sign}(f(x)) \neq y\}\big), $$
where sign(f(x)) := 1 if f(x) > 0 and sign(f(x)) := −1 otherwise.

It is well known that the Bayes classifier f_c = sign(2P(y = 1|x) − 1) minimizes the misclassification risk. For P = Q_s with s = 1/(1 + ρ), the Bayes classifier takes the value 1 on the ρ-level set {h > ρ} and −1 outside it (up to a μ-zero set).
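To see why minimizing the misclassification risk with respect to Q_s solves the DLD problem, note the following short calculation (using the definition of Q_s above, dQ = h dμ, and the choice s = 1/(1 + ρ)): the marginal of Q_s on X has density sh + (1 − s) with respect to μ, so
$$ Q_s(y = 1 \mid x) = \frac{s\,h(x)}{s\,h(x) + (1 - s)} > \frac{1}{2} \iff s\,h(x) > 1 - s \iff h(x) > \frac{1 - s}{s} = \rho. $$
Hence sign(2Q_s(y = 1|x) − 1) is positive exactly on the ρ-level set {h > ρ}.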

To establish the relationship between S_{μ,h,ρ}(f) and the excess misclassification risk, we recall the following assumption (Steinwart et al., 2005; Scovel et al., 2005).

Definition 2. 
Let μ be a distribution on X, and let h: X → [0, ∞) be a measurable function with ∫_X h dμ = 1, that is, h is a density with respect to μ. For ρ > 0 and q ∈ (0, ∞], we say h has ρ-exponent q if there exists a constant C > 0 such that for all t > 0,
$$ \mu\big(\{x \in X : |h(x) - \rho| \le t\}\big) \le C\,t^{q}. $$
This condition on h is closely related to the Tsybakov noise condition (see Tsybakov, 2004) for binary classification. Steinwart et al. (2005) proved that if h has ρ-exponent q, then there exists a constant C > 0 such that
$$ S_{\mu,h,\rho}(f) \le C\,\big(\mathcal{R}(f) - \mathcal{R}(f_c)\big)^{\frac{q}{q+1}}. $$
2.1
This inequality justifies using learning algorithms designed to minimize the risk R(f) for the DLD problem. Owing to the nonconvexity of the 0–1 classification loss, minimizing the empirical version of R(f) is typically NP-hard (Steinwart et al., 2005). To alleviate this computational problem, we introduce the convex hinge loss (1 − yf(x))₊ as an upper bound of the 0–1 loss. The expected risk with the convex loss is defined by
$$ \mathcal{E}(f) := \int_{X \times \{-1, 1\}} (1 - y f(x))_+ \, dQ_s(x, y). $$
From theorem 2.1 in Zhang (2004) or theorem 9.21 in Cucker and Zhou (2007), we know that for every measurable function f: X → ℝ,
$$ \mathcal{R}(f) - \mathcal{R}(f_c) \le \mathcal{E}(f) - \mathcal{E}(f_c). $$
2.2
Inspired by previous analysis techniques with data-dependent hypothesis spaces (Wu et al., 2005; Wu & Zhou, 2008; Feng & Lv, 2011), we convert the theoretical analysis of algorithm 1.2 into a data-independent problem. Define the data-independent regularization function
formula
2.3
We will show that it plays a stepping-stone role in the error analysis of fT.
From the definitions of the minimizers in equations 1.1, 1.2, and 2.3, respectively, we have
formula
2.4
where
formula
In learning theory, these three terms are called the sample error, the hypothesis error, and the approximation error, respectively.
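Although the displayed decomposition 2.4 is specific to schemes 1.1, 1.2, and 2.3, it is helpful to keep in mind the schematic shape such decompositions take in the coefficient-regularization literature (Wu et al., 2005; Feng & Lv, 2011). Writing E for the expected convex risk, E_T for its empirical counterpart, Ω_T for the ℓ¹ coefficient functional in scheme 1.2, and f_{T,λ}, f_T, f_λ for the minimizers of equations 1.1, 1.2, and 2.3 (notation chosen here for illustration, and assuming that equation 1.1 penalizes the RKHS norm while equation 2.3 is its data-free analogue), one has, schematically,
$$
\begin{aligned}
\mathcal{E}(f_T) - \mathcal{E}(f_c) \le{}& \big[\mathcal{E}(f_T) - \mathcal{E}_T(f_T)\big] + \big[\mathcal{E}_T(f_\lambda) - \mathcal{E}(f_\lambda)\big] && \text{(sample error)} \\
&+ \big[\mathcal{E}_T(f_T) + \lambda\,\Omega_T(f_T)\big] - \big[\mathcal{E}_T(f_{T,\lambda}) + \lambda\,\|f_{T,\lambda}\|_K^2\big] && \text{(hypothesis error)} \\
&+ \mathcal{E}(f_\lambda) + \lambda\,\|f_\lambda\|_K^2 - \mathcal{E}(f_c), && \text{(approximation error)}
\end{aligned}
$$
which follows from the minimizing property of f_{T,λ} over H_K and the nonnegativity of λΩ_T(f_T).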

The bounding technique for the sample error usually relies on a capacity measure of the hypothesis space. Note that the generalization performance of regression and clustering algorithms in Hilbert spaces has been investigated on the basis of the Rademacher average (Biau et al., 2008; Chen & Li, 2012). In this letter, we also take the Rademacher complexity (Bartlett & Mendelson, 2002) as the measure of capacity.

Definition 3. 
Let P be a probability distribution on a set X, and suppose that x_1, …, x_n are independent samples selected according to P. Let F be a class of real-valued functions defined on X. The empirical Rademacher average of F is defined by
$$ \hat{R}_n(F) := \mathbb{E}_{\sigma}\Big[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\Big], $$
where σ_1, …, σ_n are independent random variables uniformly distributed on {−1, +1}. The Rademacher complexity of F is R_n(F) := E R̂_n(F).
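The empirical Rademacher average can be estimated numerically for a finite function class by Monte Carlo over random sign vectors. The sketch below uses the normalization 1/n without an absolute value; whether a factor 2 or an absolute value appears varies across references, so the normalization here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_rademacher(function_class, sample, n_draws=2000):
    """Monte Carlo estimate of the empirical Rademacher average of a finite
    function class on a fixed sample (normalization 1/n, an assumption)."""
    n = len(sample)
    values = np.array([[f(x) for x in sample] for f in function_class])  # |F| x n
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # independent Rademacher signs
        total += np.max(values @ sigma) / n       # sup over the finite class
    return total / n_draws

# toy class of three linear functions evaluated on a Gaussian sample
F = [lambda x, a=a: a * x for a in (-1.0, 0.5, 1.0)]
sample = rng.normal(size=20)
print(empirical_rademacher(F, sample))
```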

We adopt the following condition for the approximation error, which has been extensively used in the literature (Chen, Wu, Ying, & Zhou, 2004; Wu et al., 2005; Cucker & Zhou, 2007; Ying & Campbell, 2010; Cao et al., 2012).

Definition 4. 
We say the target function fc can be approximated with exponent β (0 < β ≤ 1) in the RKHS if there exists a constant c_β > 0 such that
formula

3.  Estimate of Hypothesis Error

Our estimate of the hypothesis error is inspired by the stepping-stone techniques in Wu et al. (2005) and Feng and Lv (2011). The reproducing property reads f(x) = ⟨f, K_x⟩_K for any f ∈ H_K and x ∈ X. Learning algorithm 1.1 can be rewritten as the following quadratic programming problem:
formula
To solve this optimization problem, we find the saddle point of the Lagrangian:
formula
The parameters that minimize the Lagrangian must satisfy
formula
From these conditions, one derives
formula
3.1
formula
3.2
formula
3.3

Based on equalities 3.1 to 3.3, we arrive at the following estimate of the hypothesis error:

Proposition 1. 

For every , there holds .

Proof.
Observe the optimality conditions above. From equation 3.1, there exists at least one minimizer admitting the expression
formula
From equation 3.2, we derive that
formula
Therefore,
formula
In the same way, from equation 3.3 we have
formula
Moreover,
formula
This enables us to get the following inequality:
formula
Then
formula
where the last inequality follows from the facts
formula
and
formula

4.  Estimate of Sample Error

Now we turn to the estimate of the sample error. McDiarmid's inequality and some properties of the Rademacher complexity (see Bartlett & Mendelson, 2002) are needed in our estimate.
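For reference, McDiarmid's bounded-difference inequality states that if Z_1, …, Z_n are independent random variables and a function g of them changes by at most c_k when only its kth argument is changed, then for every ε > 0,
$$ \Pr\Big(g(Z_1, \dots, Z_n) - \mathbb{E}\,g(Z_1, \dots, Z_n) \ge \varepsilon\Big) \le \exp\Big(-\frac{2\varepsilon^2}{\sum_{k=1}^{n} c_k^2}\Big). $$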

Lemma 1. 

Let F, F_1, …, F_k be classes of real-valued functions. Then:

  1. R̂_n(F_1 + ⋯ + F_k) ≤ Σ_{i=1}^k R̂_n(F_i), where F_1 + ⋯ + F_k := {f_1 + ⋯ + f_k : f_i ∈ F_i}.

  2. R̂_n(cF) = |c| R̂_n(F) for every c ∈ ℝ, where cF := {cf : f ∈ F}.

  3. If φ: ℝ → ℝ is Lipschitz with constant L and satisfies φ(0) = 0, then R̂_n(φ ∘ F) ≤ 2L R̂_n(F), where φ ∘ F := {φ ∘ f : f ∈ F}.

To derive an upper bound for the sample error, we establish a concentration estimate on the basis of the Rademacher average. The analysis technique used here follows lemma 2.1 in Chen and Li (2012).

Lemma 2. 
Let and denote . Then, with probability at least , there holds
formula
Proof. 
Define and define . Note that
formula
4.1
Now we consider the two terms on the right-hand side. For each k, consider the copy of T⁺ in which the kth sample is replaced by an independent sample drawn from Q. Then
formula
McDiarmid’s inequality (see Bartlett & Mendelson, 2002) implies that with probability at least ,
formula
4.2
Denote . By the standard symmetrization arguments (Bartlett & Mendelson, 2002) and lemma 1,
formula
4.3
Note that functions in the RKHS satisfy the reproducing property. Following the same analysis as for lemma 2.2 in Chen and Li (2012), we can bound the Rademacher average. Combining this with equation 4.3, we have, with probability at least ,
formula
4.4
In the same way, from equation 3.3, we have that
formula
holds with probability at least . This, combined with equations 4.1 and 4.4, yields the desired result.

To bound the sample error, we also need upper bounds in the RKHS for fT and for the regularization function defined in equation 2.3:

Lemma 3. 

For fT defined in equation 1.2 and the regularization function defined in equation 2.3, we have the following bounds on their RKHS norms.

Proof. 
The bound for the function defined in equation 2.3 can be derived directly from its definition. From the definition of fT in equation 1.2, we obtain
formula
The desired estimate of fT then follows.

Based on lemmas 2 and 3, the following estimate of sample error can be obtained directly:

Proposition 2. 
Let . Then with probability at least , there holds
formula

5.  Estimate of Learning Rate

We are now in a position to present the main result on the learning rate:

Theorem 1. 
Let μ and Q be distributions on X such that Q has a density h with respect to μ. Assume that h has ρ-exponent q and that fc can be approximated with exponent β as in definition 4. Then, for any 0 < δ < 1 and an appropriate choice of the regularization parameter λ, we have, with confidence 1 − δ,
formula
where C is a constant independent of the sample size and δ.
Proof. 
The total error estimate can be derived by substituting the results of propositions 1 and 2 into equation 2.4. For any λ > 0, there exists a constant C1, independent of the sample size, such that
formula
holds true with probability at least .
From the condition of , we have with probability at least ,
formula
5.1
where C2 is a constant independent of .

Choosing the parameters appropriately and connecting these choices with equations 5.1 and 2.1, we obtain the desired estimate.

As the exponents become favorable, the learning rate of fT approaches its fastest polynomial order in the sample size. This polynomial decay is usually fast enough for practical problems, where only a finite set of samples is available. Although the convergence rate is slower than the result in Cao et al. (2012), our result has two advantages over the previous one: first, the dual data-dependent character (through the data-dependent hypothesis space and the ℓ¹-regularizer) gives us more adaptivity and flexibility in searching for a sparse predictive function; second, the estimate is independent of any covering number capacity assumption, which makes it suitable for more general input data in separable Hilbert spaces.

As illustrated in Steinwart et al. (2005), the DLD algorithm 1.1 can be considered a quadratic programming SVM and implemented efficiently with the LIBSVM software (Chang & Lin, 2004). Note that the ℓ¹-norm algorithm, equation 1.2, can be transformed into a linear programming SVM formulation (e.g., using the optimization technique in Fung & Mangasarian, 2004). Many experiments demonstrate that linear programming SVMs are capable of handling problems with huge sample sizes and of improving computation speed (see Bradley & Mangasarian, 2000; Pedroso & Murata, 2001). Since our main concern in this letter is to establish the approximation analysis of the coefficient-based DLD method, equation 1.2, we leave the empirical evaluation for future study.
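The transformation into a linear program is standard: each hinge term is replaced by a nonnegative slack variable and each |α_i| by a split α_i = α_i⁺ − α_i⁻ with nonnegative parts. The sketch below, under the same illustrative modeling assumptions as the sketches in section 1 (Gaussian kernel, weighted hinge risk, expansion over the positive samples), solves the resulting linear program with a generic solver.

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.abs(x - z) ** 2 / (2.0 * sigma ** 2))

def dld_linear_program(T_plus, T_minus, s, lam):
    """Solve the l1-coefficient hinge scheme as a linear program.
    Variables: alpha_plus, alpha_minus (split coefficients), xi (slacks on
    positive samples), eta (slacks on reference samples); all nonnegative."""
    n, m = len(T_plus), len(T_minus)
    Kp = np.array([[gaussian_kernel(x, c) for c in T_plus] for x in T_plus])
    Kn = np.array([[gaussian_kernel(x, c) for c in T_plus] for x in T_minus])
    # objective: s/n * sum(xi) + (1-s)/m * sum(eta) + lam * sum(alpha_+ + alpha_-)
    c = np.concatenate([lam * np.ones(2 * n), (s / n) * np.ones(n),
                        ((1 - s) / m) * np.ones(m)])
    # xi_i >= 1 - f(x_i)  and  eta_j >= 1 + f(x'_j),  with f = K @ (alpha_+ - alpha_-)
    A_pos = np.hstack([-Kp, Kp, -np.eye(n), np.zeros((n, m))])
    A_neg = np.hstack([Kn, -Kn, np.zeros((m, n)), -np.eye(m)])
    res = linprog(c, A_ub=np.vstack([A_pos, A_neg]), b_ub=-np.ones(n + m),
                  bounds=(0, None), method="highs")
    alpha = res.x[:n] - res.x[n:2 * n]
    return lambda x: float(sum(a * gaussian_kernel(x, c) for a, c in zip(alpha, T_plus)))

f_T = dld_linear_program([0.9, 1.2, 1.6], [-1.0, -0.3, 0.1], s=0.75, lam=0.01)
print(f_T(1.1), f_T(-0.8))   # scores; {f_T > 0} estimates the level set
```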

6.  Conclusion

We considered the DLD problem with the coefficient-based regularization scheme 1.2. We introduced the stepping-stone and Rademacher average techniques to establish the error analysis, derived the generalization error bound, and obtained a satisfactory estimate of the learning rate. This provides a mathematical foundation for DLD with the ℓ¹-regularizer.

Acknowledgments

The authors thank the referees for their valuable comments and helpful suggestions. This work was supported partially by the National Natural Science Foundation of China under Grants No. 11001092 and No. 11071058, the Fundamental Research Funds for the Central Universities (Program Nos. 2011PY130, 2011QC022), and by the Start-up Research of University of Macau under Grant No. SRG010-FST11-TYY, Multi-Year Research of University of Macau under Grants No. MYRG187(Y1-L3)-FST11-TYY and No. MYRG205(Y1-L4)-FST11-TYY.

References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 463–482.
Biau, G., Devroye, L., & Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Trans. Inf. Theory, 54, 781–790.
Bradley, P. S., & Mangasarian, O. L. (2000). Massive data discrimination via linear support vector machines. Optimization Methods and Software, 13, 1–10.
Candès, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52, 489–509.
Cao, F. L., Xing, X., & Zhao, J. W. (2012). Learning rates of support vector machine classifier for density level detection. Neurocomputing, 82, 84–90.
Chang, C. C., & Lin, C. J. (2004). LIBSVM: A library for support vector machines.
Chen, D. R., Wu, Q., Ying, Y., & Zhou, D. X. (2004). Support vector machine soft margin classifiers: Error analysis. J. Mach. Learn. Res., 5, 1143–1175.
Chen, D. R., & Li, H. (2012). On the performance of regularized regression learning in Hilbert space. Neurocomputing, 93, 41–47.
Chen, S. S., Donoho, D. L., & Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20, 33–61.
Cucker, F., & Zhou, D. X. (2007). Learning theory: An approximation theory viewpoint. Cambridge: Cambridge University Press.
Feng, Y. L., & Lv, S. G. (2011). Unified approach to coefficient-based regularized regression. Comput. Math. Appl., 62, 506–515.
Fung, G. M., & Mangasarian, O. L. (2004). A feature selection Newton method for support vector machine classification. Comput. Optim. Appl., 28, 185–202.
Pedroso, J. P., & Murata, N. (2001). Support vector machine with different norms: Motivation, formulations and results. Pattern Recognition Letters, 22, 1263–1272.
Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Ann. Statist., 23, 855–881.
Ramsay, J. O., & Silverman, B. W. (1997). Functional data analysis. New York: Springer-Verlag.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Scovel, C., Hush, D., & Steinwart, I. (2005). Learning rates for density level detection. Anal. Appl., 3, 356–371.
Shi, L., Feng, Y. L., & Zhou, D. X. (2011). Concentration estimates for learning with ℓ¹-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal., 31, 286–302.
Song, G. H., & Zhang, H. Z. (2011). Reproducing kernel Banach spaces with the ℓ¹ norm II: Error analysis for regularized least square regression. Neural Computation, 23, 2713–2729.
Steinwart, I., Hush, D., & Scovel, C. (2005). A classification framework for anomaly detection. J. Mach. Learn. Res., 6, 211–232.
Sun, H. W., & Wu, Q. (2011). Least square regression with indefinite kernels and coefficient regularization. Appl. Comput. Harmon. Anal., 30, 96–109.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58, 267–288.
Tong, H. Z., Chen, D. R., & Yang, F. H. (2010). Least square regression with ℓᵖ-coefficient regularization. Neural Computation, 38, 526–565.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Ann. Statist., 25, 948–969.
Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32, 135–166.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wu, Q., Ying, Y., & Zhou, D. X. (2005). SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Computation, 17, 1160–1187.
Wu, Q., & Zhou, D. X. (2008). Learning with sample dependent hypothesis spaces. Comput. Math. Appl., 56, 2896–2907.
Xiao, Q. W., & Zhou, D. X. (2011). Learning by nonsymmetric kernels with data dependent spaces and ℓ¹-regularizer. Taiwanese J. Math., 14, 1821–1836.
Ying, Y., & Campbell, C. (2010). Rademacher chaos complexities for learning the kernel problem. Neural Computation, 22, 2858–2886.
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56–85.