## Abstract

In this letter, we consider a density-level detection (DLD) problem by a coefficient-based classification framework with $\ell^1$-regularizer and data-dependent hypothesis spaces. Although the data-dependent characteristic of the algorithm provides flexibility and adaptivity for DLD, it leads to difficulty in the generalization error analysis. To overcome this difficulty, an error decomposition is introduced based on an established classification framework. On the basis of this decomposition, the estimate of the learning rate is obtained by using the Rademacher average and stepping-stone techniques. In particular, the estimate is independent of the capacity assumption used in the previous literature.

## 1. Introduction

The aim of this letter is to provide an error analysis for density-level detection (DLD) by a classification method with $\ell^1$-coefficient regularization and data-dependent hypothesis spaces. Our study is motivated by the increasing attention being paid to the classification framework for DLD (Steinwart, Hush, & Scovel, 2005; Scovel, Hush, & Steinwart, 2005; Cao, Xing, & Zhao, 2012) and to error analysis with data-dependent hypothesis spaces (Wu, Ying, & Zhou, 2005; Wu & Zhou, 2008; Sun & Wu, 2011; Tong, Chen, & Yang, 2010; Shi, Feng, & Zhou, 2011; Xiao & Zhou, 2011; Feng & Lv, 2011; Song & Zhang, 2011).

Following the illustration in Steinwart et al. (2005), we introduce the preliminary background of the DLD problem in a Hilbert space. Let $H$ be a separable Hilbert space (possibly infinite dimensional), and let $X \subset H$ with $\|x\|_H \le B$ for all $x \in X$ and a positive constant *B*. Let *Q* be an unknown data-generating distribution on *X*. Usually we describe the data as anomalies if they are not concentrated (see Ripley, 1996; Schölkopf & Smola, 2002). A reference distribution $\mu$ on *X* is required to describe the concentration of *Q*. Assume that *Q* has a density *h* with respect to $\mu$, that is, $dQ = h\,d\mu$. Given a $\rho > 0$, we define the $\rho$-level set of the density *h* as $\{h > \rho\} := \{x \in X : h(x) > \rho\}$. This set describes the concentration of *Q*. To define anomalies in terms of the concentration, one has only to fix a threshold level so that a sample $x$ is considered to be anomalous whenever $x \notin \{h > \rho\}$. We assume that $\{h = \rho\}$ is a $\mu$-zero set, a common assumption used in many other papers (see Polonik, 1995; Tsybakov, 1997). The goal of the DLD problem is to find the $\rho$-level set $\{h > \rho\}$ as precisely as possible based on empirical data.

The performance of a decision function *f* is measured by $S_{\mu,h,\rho}(f) := \mu\big(\{f > 0\} \,\Delta\, \{h > \rho\}\big)$, where $\Delta$ denotes the symmetric difference.

Since the empirical comparison of estimates in terms of $S_{\mu,h,\rho}$ is difficult, Steinwart et al. (2005) proposed a novel performance measure. Let $s := 1/(1+\rho)$, and let $P$ be the probability measure on $X \times \{-1, 1\}$ defined by $P(A \times \{1\}) := sQ(A)$ and $P(A \times \{-1\}) := (1-s)\mu(A)$ for measurable $A \subset X$; the associated misclassification risk of $f$ is $\mathcal{R}(f) := P(\{(x, y) : \operatorname{sign} f(x) \neq y\})$.

From this definition, we know that $\mathcal{R}(f)$ can be associated with a binary classification problem where positive samples are drawn from *sQ* and negative samples are drawn from $(1-s)\mu$. On the basis of this interpretation, Steinwart et al. (2005) proposed the kernel method. Recall that $K: X \times X \to \mathbb{R}$ is a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The candidate reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel *K* has been defined (see Aronszajn, 1950) as the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$, equipped with the inner product $\langle \cdot, \cdot \rangle_K$ defined by $\langle K_x, K_y \rangle_K = K(x, y)$.
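To make the kernel expansions concrete, here is a minimal numerical sketch (our illustration, not part of the original letter) of functions of the form $f_\alpha = \sum_i \alpha_i K(x_i, \cdot)$ and of the reproducing property on their span; the Gaussian kernel and all identifiers are assumptions made for the example:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Mercer kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def f_alpha(x, alpha, centers):
    """Evaluate f_alpha = sum_i alpha_i K(x_i, .) at the point x."""
    return sum(a * gaussian_kernel(c, x) for a, c in zip(alpha, centers))

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 2))   # sample points x_1, ..., x_5
alpha = rng.normal(size=5)          # expansion coefficients

# Reproducing property on the span: since <K_{x_i}, K_{x_j}>_K = K(x_i, x_j),
# the inner product <f_alpha, K_{x_j}>_K equals the point evaluation f_alpha(x_j).
G = np.array([[gaussian_kernel(a, b) for b in centers] for a in centers])
assert np.allclose(G @ alpha,
                   [f_alpha(c, alpha, centers) for c in centers])
```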

Since only finitely many positively labeled samples $T = \{x_i\}_{i=1}^{n}$ are drawn from *Q*, and infinitely many negatively labeled samples are available by the knowledge of $\mu$, Steinwart et al. (2005) considered the following empirical error:
$$\mathcal{E}_{T,\mu}(f) := \frac{s}{n}\sum_{i=1}^{n}\big(1 - f(x_i)\big)_+ + (1-s)\,E_\mu\big(1 + f(x)\big)_+ .$$
As pointed out by Steinwart et al. (2005), the expectation $E_\mu$ can be numerically computed through finitely many evaluations of *f* on $T' = \{x'_j\}_{j=1}^{n'}$. Here $x'_1, \dots, x'_{n'}$ are randomly sampled from $\mu$. Then the empirical risk of *f* is defined as
$$\mathcal{E}_{T,T'}(f) := \frac{s}{n}\sum_{i=1}^{n}\big(1 - f(x_i)\big)_+ + \frac{1-s}{n'}\sum_{j=1}^{n'}\big(1 + f(x'_j)\big)_+ .$$
Steinwart et al. (2005) and Scovel et al. (2005) proposed the following regularized algorithm:
$$f_{T,\lambda} := \arg\min_{f \in \mathcal{H}_K}\Big\{\mathcal{E}_{T,T'}(f) + \lambda\|f\|_K^2\Big\}, \tag{1.1}$$
where $\lambda > 0$ is a regularization parameter. In this letter we study its coefficient-based counterpart with $\ell^1$-regularizer: with $\{z_1, \dots, z_{n+n'}\} := T \cup T'$, $f_\alpha := \sum_{i=1}^{n+n'} \alpha_i K(z_i, \cdot)$ for $\alpha \in \mathbb{R}^{n+n'}$, and the data-dependent hypothesis space $\mathcal{H}_T := \{f_\alpha : \alpha \in \mathbb{R}^{n+n'}\}$, define
$$f_T := f_{\alpha^T}, \qquad \alpha^T := \arg\min_{\alpha \in \mathbb{R}^{n+n'}}\Big\{\mathcal{E}_{T,T'}(f_\alpha) + \lambda\sum_{i=1}^{n+n'}|\alpha_i|\Big\}. \tag{1.2}$$
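Since the letter gives no implementation, the following sketch illustrates equation 1.2 as reconstructed above: the weighted empirical hinge risk plus an $\ell^1$ penalty on the coefficients, minimized by plain subgradient descent. The Gaussian kernel, the function names, and the step-size schedule are our assumptions for illustration, not the authors' method:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix G[i, j] = K(A[i], B[j]) for a Gaussian kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def dld_l1(T_pos, T_neg, rho=1.0, lam=0.1, lr=0.01, steps=2000):
    """Approximately minimize, over f_alpha = sum_k alpha_k K(z_k, .),
    (s/n) sum_i (1 - f(x_i))_+ + ((1-s)/n') sum_j (1 + f(x'_j))_+ + lam ||alpha||_1
    with z = T_pos u T_neg and s = 1/(1 + rho), as in equation 1.2."""
    Z = np.vstack([T_pos, T_neg])
    y = np.concatenate([np.ones(len(T_pos)), -np.ones(len(T_neg))])
    s = 1.0 / (1.0 + rho)
    w = np.where(y > 0, s / len(T_pos), (1.0 - s) / len(T_neg))
    G = gaussian_gram(Z, Z)
    alpha = np.zeros(len(Z))
    for _ in range(steps):
        margins = y * (G @ alpha)              # y_i * f_alpha(z_i)
        active = (margins < 1).astype(float)   # where the hinge is sloped
        grad = -G @ (w * y * active)           # subgradient of the risk part
        grad += lam * np.sign(alpha)           # subgradient of the l1 term
        alpha -= lr * grad
    return alpha, Z
```

A quick sanity check of the $\ell^1$ effect: increasing `lam` drives more coefficients to (near) zero, which is the sparsity motivation mentioned in section 5.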

Our main goal is to establish the generalization error estimate of equation 1.2. Compared with the error analysis of algorithm 1.1 (Scovel et al., 2005; Cao et al., 2012), the main theoretical contributions of this letter are highlighted by the following three features:

- •
Although the $\ell^1$-regularizer has been successfully used for least squares regression (Wu & Zhou, 2008; Xiao & Zhou, 2011; Shi et al., 2011; Song & Zhang, 2011) and the linear programming SVM (Vapnik, 1998; Wu et al., 2005; Wu & Zhou, 2008), there are no such studies on the DLD problem. Our method enriches the algorithm design for the DLD problem.

- •
The regularized part of equation 1.2 is essentially different from that of equation 1.1 and depends closely on the empirical data. This leads to additional difficulty in the error analysis. A stepping-stone technique for the DLD problem is introduced to overcome this difficulty. We also note that all previous applications of the stepping-stone technique have been restricted to SVM and least squares regression frameworks (Wu et al., 2005; Feng & Lv, 2011). Hence our hypothesis error estimate is novel and sheds some new light on applications of the stepping-stone technique.

- •
It is worth noting that the sample error estimates in Scovel et al. (2005) and Cao et al. (2012) depend on a capacity measure in terms of covering numbers. However, in some applications, input items are in the form of random functions (speech recordings, spectra, images), and this casts the DLD problem into the general class of functional data analysis (Ramsay & Silverman, 1997; Biau, Devroye, & Lugosi, 2008; Chen & Li, 2012). In these cases, the covering number assumption may be invalid since it usually depends on the dimension of the input data. The sample error estimate here is independent of covering numbers and is suitable for the DLD problem where input items belong to an infinite-dimensional separable Hilbert space.

The rest of this letter is organized as follows. In section 2, we introduce the necessary notations and definitions. The estimates for hypothesis error and sample error are presented in sections 3 and 4, respectively. The main result on learning rate is proved in section 5. We close with a brief conclusion in section 6.

## 2. Preliminaries

In this section we introduce some definitions and notations used throughout this letter.

It is well known that the Bayes classifier $f_c = \operatorname{sign}(2P(y=1|x) - 1)$ minimizes the misclassification risk $\mathcal{R}(f) = P(\{(x, y) : \operatorname{sign} f(x) \neq y\})$. For the measure $P$ constructed in section 1 and $s = 1/(1+\rho)$, this Bayes classifier is tied directly to the $\rho$-level set, as the following one-line computation shows.
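This computation is a reconstruction from the definitions above, in the spirit of Steinwart et al. (2005):

```latex
% Conditional probability of the positive label under P, with s = 1/(1+\rho):
\eta(x) \;=\; P(y = 1 \mid x) \;=\; \frac{s\,h(x)}{s\,h(x) + (1 - s)},
\qquad
2\eta(x) - 1 > 0 \;\Longleftrightarrow\; h(x) > \frac{1 - s}{s} \;=\; \rho .
```

Hence $f_c = \operatorname{sign}(2\eta - 1)$ agrees $\mu$-almost everywhere with $\operatorname{sign}(h - \rho)$, so a function with small excess risk $\mathcal{R}(f) - \mathcal{R}(f_c)$ also approximates the $\rho$-level set.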

To establish the relationship between $S_{\mu,h,\rho}(f)$ and the excess generalization error $\mathcal{R}(f) - \mathcal{R}(f_c)$, we recall the following assumption (Steinwart et al., 2005; Scovel et al., 2005): the density *h* is said to have $\rho$-exponent $q \in (0, \infty)$ if there exists a constant $c > 0$ such that $\mu(\{x \in X : |h(x) - \rho| \le t\}) \le c\,t^q$ for all $t > 0$.

This condition on *h* is closely related to the Tsybakov noise condition (see Tsybakov, 2004) for binary classification. Steinwart et al. (2005) proved that if *h* has $\rho$-exponent *q*, then there exists a constant $C > 0$ such that
$$S_{\mu,h,\rho}(f) \le C\big(\mathcal{R}(f) - \mathcal{R}(f_c)\big)^{q/(q+1)}. \tag{2.1}$$
This inequality justifies using learning algorithms designed to minimize the risk $\mathcal{R}(f)$ for the DLD problem. Due to the nonconvexity of the misclassification error function, minimizing the empirical version of $\mathcal{R}(f)$ is typically NP-hard (Steinwart et al., 2005). To alleviate this computational problem, we introduce the convex upper loss $(1 - yf(x))_+$ to replace the misclassification indicator *I*. Then the expected risk with the convex loss is defined by
$$\mathcal{E}(f) := \int_{X \times \{-1, 1\}} \big(1 - yf(x)\big)_+ \, dP(x, y)$$
for any measurable function *f*; the Bayes classifier $f_c$ also minimizes $\mathcal{E}$.

The bounding technique for the sample error usually relies on a capacity measurement of the hypothesis function space. Note that the generalization performance of regression and clustering algorithms in Hilbert spaces has been investigated based on the Rademacher average (Biau et al., 2008; Chen & Li, 2012). In this letter, we also take the Rademacher complexity (Bartlett & Mendelson, 2002) as the measure of capacity.

*Let $\mu$ be a probability distribution on a set X and suppose that $x_1, \dots, x_n$ are independent samples selected according to $\mu$. Let $\mathcal{F}$ be a class of real-valued functions defined on X. The empirical Rademacher average of $\mathcal{F}$ is defined by*
$$\hat{R}_n(\mathcal{F}) := E_\sigma \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \Big|,$$
*where $\sigma_1, \dots, \sigma_n$ are independent uniform $\{\pm 1\}$-valued random variables. The Rademacher complexity of $\mathcal{F}$ is $R_n(\mathcal{F}) := E_\mu \hat{R}_n(\mathcal{F})$.*
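As a concrete illustration of this definition (our own sketch, not from the letter), the empirical Rademacher average of a finite class can be estimated by Monte Carlo over random sign vectors; the example also exhibits the scaling property recalled in lemma 1 below:

```python
import numpy as np

def empirical_rademacher(F_values, n_trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{f in F} |(1/n) sum_i sigma_i f(x_i)|.

    F_values has shape (num_functions, n): row k holds f_k(x_1), ..., f_k(x_n).
    """
    rng = np.random.default_rng(seed)
    _, n = F_values.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)       # uniform {-1, +1} variables
        total += np.abs(F_values @ sigma / n).max()   # sup over the class
    return total / n_trials

rng = np.random.default_rng(1)
F = rng.normal(size=(10, 50))            # 10 functions evaluated on 50 samples
print(empirical_rademacher(F))           # estimate of R_hat(F)
print(empirical_rademacher(3.0 * F))     # R_hat(3F) = 3 R_hat(F) (scaling)
```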

## 3. Estimate of Hypothesis Error

*For every $\lambda > 0$, there holds*
$$\mathcal{E}_{T,T'}(f_T) + \lambda \sum_{i=1}^{n+n'} |\alpha^T_i| \;\le\; \mathcal{E}_{T,T'}(f_{T,\lambda}) + \lambda \sum_{i=1}^{n+n'} |\alpha^\lambda_i|,$$
*where $\alpha^\lambda$ denotes the coefficients of the representer expansion $f_{T,\lambda} = \sum_{i=1}^{n+n'} \alpha^\lambda_i K(z_i, \cdot)$ of the stepping-stone function $f_{T,\lambda}$ defined by equation 1.1.*

## 4. Estimate of Sample Error

Now we turn to the estimate of the sample error. McDiarmid's inequality and some properties of the Rademacher complexity (see Bartlett & Mendelson, 2002) are necessary in our analysis.

*Let $\mathcal{F}$, $\mathcal{G}$ be classes of real functions. Then:*

1. $R_n(\mathcal{F} \oplus \mathcal{G}) \le R_n(\mathcal{F}) + R_n(\mathcal{G})$, *where* $\mathcal{F} \oplus \mathcal{G} := \{f + g : f \in \mathcal{F}, g \in \mathcal{G}\}$.
2. $R_n(c\mathcal{F}) = |c|\, R_n(\mathcal{F})$ *for every* $c \in \mathbb{R}$, *where* $c\mathcal{F} := \{cf : f \in \mathcal{F}\}$.
3. *If* $\phi: \mathbb{R} \to \mathbb{R}$ *is Lipschitz with constant* $L$ *and satisfies* $\phi(0) = 0$, *then* $R_n(\phi \circ \mathcal{F}) \le 2L\, R_n(\mathcal{F})$.

To derive the upper bound of the sample error, we establish a concentration estimate for the empirical risk $\mathcal{E}_{T,T'}$ around the expected risk $\mathcal{E}$ on the basis of the Rademacher average. The analysis technique used here is that of lemma 2.1 in Chen and Li (2012).

**Proof.** Let $T^{k}$ denote the sample set *T* with the *k*th sample replaced by an independent sample $\tilde{x}_k$ drawn from the same distribution. Since each such replacement changes the quantity of interest by a bounded amount, McDiarmid's inequality (see Bartlett & Mendelson, 2002) implies that, with probability at least $1 - \delta$, the supremum of the deviation exceeds its expectation by at most a term of order $\sqrt{\log(1/\delta)/n}$. Denote $B_R := \{f \in \mathcal{H}_K : \|f\|_K \le R\}$. By the standard symmetrization arguments (Bartlett & Mendelson, 2002) and lemma 1, the expectation of the supremum is bounded by a multiple of the Rademacher complexity $R_n(B_R)$. Note that $f(x) = \langle f, K_x \rangle_K$, which satisfies the reproducing property. From the same analysis as for lemma 2.2 in Chen and Li (2012), we derive $R_n(B_R) \le \kappa R/\sqrt{n}$ with $\kappa := \sup_{x \in X} \sqrt{K(x, x)}$. Combining this with equation 4.3, we obtain the stated concentration estimate with probability at least $1 - \delta$.
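For the ball $B_R$, the supremum inside the Rademacher average has a closed form, $\sup_{\|f\|_K \le R} \frac{1}{n}\big|\sum_i \sigma_i f(x_i)\big| = \frac{R}{n}\sqrt{\sigma^\top G \sigma}$ with Gram matrix $G$, by the reproducing property. This makes the bound $R_n(B_R) \le \kappa R/\sqrt{n}$ easy to probe numerically; the sketch below is our illustration, with a Gaussian kernel assumed so that $\kappa = 1$:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
n, R = 200, 2.0
X_samp = rng.normal(size=(n, 2))
G = gaussian_gram(X_samp, X_samp)

# Monte Carlo over sign vectors: sup_{||f||_K <= R} (1/n) |sum_i s_i f(x_i)|
# equals (R/n) * sqrt(s^T G s), since the sup over the ball is R times the
# RKHS norm of (1/n) sum_i s_i K_{x_i}.
vals = []
for _ in range(1000):
    s_vec = rng.choice([-1.0, 1.0], size=n)
    vals.append(R / n * np.sqrt(s_vec @ G @ s_vec))
print(np.mean(vals), R / np.sqrt(n))   # MC estimate vs bound kappa * R / sqrt(n)
```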

To bound the sample error, we need the upper bounds of $f_T$ and $f_{T,\lambda}$ in the RKHS:

**Proof.** Taking $\alpha = 0$ in equation 1.2, we know $\mathcal{E}_{T,T'}(f_T) + \lambda \sum_i |\alpha^T_i| \le \mathcal{E}_{T,T'}(0) = 1$. Then $\sum_i |\alpha^T_i| \le 1/\lambda$. So the desired estimate of $f_T$ follows from the inequality $\|f_T\|_\infty \le \kappa^2 \sum_i |\alpha^T_i| \le \kappa^2/\lambda$.
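The uniform bound used here, $\|f_\alpha\|_\infty \le \kappa^2 \sum_i |\alpha_i|$, is easy to probe numerically; this small sketch (our addition, with a Gaussian kernel so that $\kappa = 1$) evaluates a random kernel expansion on many probe points and checks the bound:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
centers = rng.normal(size=(8, 2))              # expansion points z_1, ..., z_8
alpha = rng.normal(size=8)                     # coefficients alpha^T
probes = rng.uniform(-3, 3, size=(5000, 2))    # dense probe points for the sup

sup_f = np.abs(gaussian_gram(probes, centers) @ alpha).max()
kappa = 1.0                                    # sup_x sqrt(K(x, x)) for Gaussian K
# |f(x)| <= sum_i |alpha_i| * |K(x, z_i)| <= kappa^2 * ||alpha||_1
assert sup_f <= kappa ** 2 * np.abs(alpha).sum()
```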

Based on lemmas 2 and 3, the estimate of the sample error can be obtained directly.

## 5. Estimate of Learning Rate

We are now in a position to present the main result on the learning rate:

*Assume that h has $\rho$-exponent q. Then, with the regularization parameter $\lambda = \lambda(n)$ chosen as in the proof below, there exists a constant $C_1$, independent of $n$, such that the learning rate estimate
$$S_{\mu,h,\rho}(f_T) \le C_1\, n^{-\theta}$$
holds true with probability at least $1 - \delta$, where the exponent $\theta > 0$ depends only on the $\rho$-exponent q.*

**Proof.** Combining the hypothesis error estimate of section 3 with the sample error estimate of section 4, we obtain that, with probability at least $1 - \delta$,
$$\mathcal{E}(f_T) - \mathcal{E}(f_c) \le C_2 \Big( \frac{\sqrt{\log(2/\delta)}}{\lambda \sqrt{n}} + \mathcal{D}(\lambda) \Big), \tag{5.1}$$
where $\mathcal{D}(\lambda)$ denotes the approximation error associated with $\mathcal{H}_K$ and $C_2$ is a constant independent of $n$.

Setting $\lambda = \lambda(n)$ such that the two terms on the right-hand side of equation 5.1 are balanced, we obtain a bound for $\mathcal{E}(f_T) - \mathcal{E}(f_c)$ that decays polynomially in $n$. Connecting this choice of $\lambda$ with equations 5.1 and 2.1, we obtain the desired estimate.

When the $\rho$-exponent *q* tends to infinity, the learning rate of $f_T$ approaches its fastest polynomial order. This polynomial decay is usually fast enough for a practical problem where only a finite set of samples is available. Although the convergence rate is slower than the result in Cao et al. (2012), our result has two advantages compared with the previous one: first, the dual data-dependent characters (through the hypothesis space $\mathcal{H}_T$ spanned by the samples and the $\ell^1$-regularizer) give us more adaptivity and flexibility to search for a sparse predictive function; second, the estimate is independent of the capacity assumption of a covering number, which is suitable for more general input data in separable Hilbert spaces.

As illustrated in Steinwart et al. (2005), the DLD algorithm, equation 1.1, can be considered a quadratic programming SVM and implemented efficiently by the LIBSVM software (Chang & Lin, 2004). Note that the $\ell^1$-norm algorithm, equation 1.2, can be transformed into a linear programming SVM formulation (e.g., using the optimization technique in Fung & Mangasarian, 2004). Many experiments demonstrate that the linear programming SVM is capable of solving huge sample-size problems and improving computation speed (see Bradley & Mangasarian, 2000; Pedroso & Murata, 2001). Since our main concern in this letter is to establish the approximation analysis of the coefficient-based DLD method, equation 1.2, we leave the empirical evaluation for future study.
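To substantiate the linear programming remark, here is a sketch of equation 1.2 (as reconstructed in section 1) rewritten as a linear program: each hinge loss becomes a slack variable $\xi_i \ge 1 - y_i f(z_i)$, $\xi_i \ge 0$, and the $\ell^1$ term is linearized through the split $\alpha = \alpha^+ - \alpha^-$ with $\alpha^\pm \ge 0$. The use of `scipy.optimize.linprog` and all identifiers are our assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def dld_l1_lp(G, y, w, lam):
    """Solve min_alpha sum_i w_i (1 - y_i (G @ alpha)_i)_+ + lam * ||alpha||_1
    as a linear program. G is the Gram matrix over the combined sample,
    y the +-1 labels, and w the weights s/n and (1-s)/n' from section 1."""
    m = len(y)
    # Decision vector: [alpha_plus (m), alpha_minus (m), xi (m)], all >= 0.
    c = np.concatenate([lam * np.ones(m), lam * np.ones(m), w])
    YG = y[:, None] * G                        # row i is y_i * K(z_i, .)
    # Hinge constraints: xi_i >= 1 - y_i f(z_i), rewritten as
    # -y_i (G (alpha_plus - alpha_minus))_i - xi_i <= -1.
    A_ub = np.hstack([-YG, YG, -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (3 * m))
    return res.x[:m] - res.x[m:2 * m]          # alpha = alpha_plus - alpha_minus
```

For $\lambda > 0$ the optimum never keeps both $\alpha_i^+$ and $\alpha_i^-$ positive, so the objective value indeed equals the penalized risk of equation 1.2.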

## 6. Conclusion

We considered the DLD problem with the coefficient-based regularization algorithm, equation 1.2. We introduced the stepping-stone technique and the Rademacher average technique to establish the error analysis. We carried out the generalization error analysis and obtained a satisfactory estimate of the learning rate. This provides a mathematical analysis for DLD with the $\ell^1$-regularizer.

## Acknowledgments

The authors thank the referees for their valuable comments and helpful suggestions. This work was supported partially by the National Natural Science Foundation of China under grants 11001092 and 11071058, by the Fundamental Research Funds for the Central Universities (programs 2011PY130 and 2011QC022), by the Start-up Research Grant of the University of Macau under grant SRG010-FST11-TYY, and by the Multi-Year Research Grants of the University of Macau under grants MYRG187(Y1-L3)-FST11-TYY and MYRG205(Y1-L4)-FST11-TYY.