## Abstract

The k-dimensional coding schemes refer to a collection of methods that attempt to represent data using a set of representative k-dimensional vectors and include nonnegative matrix factorization, dictionary learning, sparse coding, k-means clustering, and vector quantization as special cases. Previous generalization bounds for the reconstruction error of the k-dimensional coding schemes are mainly dimensionality-independent. A major advantage of these bounds is that they can be used to analyze the generalization error when data are mapped into an infinite- or high-dimensional feature space. However, many applications use finite-dimensional data features. Can we obtain dimensionality-dependent generalization bounds for k-dimensional coding schemes that are tighter than dimensionality-independent bounds when data are in a finite-dimensional feature space? Yes. In this letter, we address this problem and derive a dimensionality-dependent generalization bound for k-dimensional coding schemes by bounding the covering number of the loss function class induced by the reconstruction error. The bound is of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where m is the dimension of features, k is the number of columns in the linear implementation of coding schemes, n is the sample size, and $\lambda_n > 0.5$ when n is finite, with $\lambda_n$ approaching 0.5 when n goes to infinity. We show that our bound can be tighter than previous results because it avoids inducing the worst-case upper bound on k of the loss function. The proposed generalization bound is also applied to some specific coding schemes to demonstrate that the dimensionality-dependent bound is an indispensable complement to the dimensionality-independent generalization bounds.

## 1  Introduction

The k-dimensional coding schemes (Maurer & Pontil, 2010) are abstract and general descriptions of a collection of methods, all of which encode a data point $x \in \mathcal{H}$ as a representative vector $y \in \mathbb{R}^k$ by a linear map T, where $\mathcal{H}$ denotes the Hilbert space. These coding schemes can be formulated as follows:
$$x \approx Ty, \qquad y \in Y \subset \mathbb{R}^k, \qquad T: \mathbb{R}^k \to \mathcal{H}, \tag{1.1}$$
where $Y$ is called the codebook and the linear map $T$ is called the implementation of the codebook. The implementation projects the codebook back to the data source space. The dimension of a data point x can be either finite or infinite. In this letter, we consider the data as having finite-dimensional features, that is, $x \in \mathbb{R}^m$.
Each data point in $\mathcal{H}$ can be exactly or approximately reconstructed by a code y in the codebook. The reconstruction error of a data point x is defined as
$$f_T(x) = \min_{y \in Y} \|x - Ty\|^2. \tag{1.2}$$
The function $f_T$, whose variables are x and T, is also called the loss function. Nonnegative matrix factorization (NMF) (Lee & Seung, 1999; Févotte, Bertin, & Durrieu, 2009; Guan, Tao, Luo, & Yuan, 2012), dictionary learning (Chen, Donoho, & Saunders, 1999; Ivana & Pascal, 2011), sparse coding (Olshausen & Field, 1996; Amiri & Haykin, 2014; Gui, Sun, Ji, Tao, & Tan, 2016), k-means clustering (MacQueen, 1967; Anderberg, 1973; Liu, Liu, Wu, Tao, & Fu, 2015), and vector quantization (Gray, 1984; Schneider, Biehl, & Hammer, 2009a) are specific forms of k-dimensional coding schemes because they share the same form of the reconstruction error as equation 1.2. They have achieved great success in the fields of pattern recognition and machine learning for their superior performance on a broad spectrum of applications (Liu, Shao, Li, & Fu, 2016; Pehlevan, Hu, & Chklovskii, 2015; Gong, Zhang, Schölkopf, Tao, & Geiger, 2015; Liu, Wu, Tao, Zhang, & Fu, 2015; Liu, Tao, Cheng, & Tang, 2014; Mairal, Bach, & Ponce, 2012; Hunt, Ibbotson, & Goodhill, 2012; Wright, Yang, Ganesh, Sastry, & Ma, 2009; Schneider, Biehl, & Hammer, 2009b; Dhillon, Guan, & Kulis, 2007; Quiroga, Nadasdy, & Ben-Shaul, 2004; Kanungo et al., 2002; Abbott & Dayan, 1999).
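As a concrete illustration, the reconstruction error in equation 1.2 can be computed numerically when the codebook is a finite candidate set; the helper `reconstruction_error` and the chosen dimensions below are our illustrative assumptions, not constructions from the letter. Restricting Y to the standard basis recovers the k-means special case.

```python
import numpy as np

def reconstruction_error(x, T, codes):
    """f_T(x) = min over a finite candidate set of codes y of ||x - T y||^2.

    x: (m,) data point; T: (m, k) linear implementation; codes: iterable of (k,) vectors.
    """
    return min(float(np.sum((x - T @ y) ** 2)) for y in codes)

rng = np.random.default_rng(0)
m, k = 5, 3
T = rng.standard_normal((m, k))
x = T[:, 1] + 0.01 * rng.standard_normal(m)   # a point near the second column of T
basis = list(np.eye(k))                        # Y = {e_1, ..., e_k}: the k-means case
err = reconstruction_error(x, T, basis)
# With basis codes, T e_j is the j-th column, so err equals the nearest-column distance:
nearest = min(float(np.sum((x - T[:, j]) ** 2)) for j in range(k))
```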
Any coding scheme should find a proper implementation T. A natural choice for T is the one that minimizes the expected reconstruction error. The expected reconstruction error with respect to the implementation T is defined as
$$R(T) = \mathbb{E} f_T(x) = \int f_T(x)\,\rho(x)\,d\mu(x), \tag{1.3}$$
where $\mu$ is a Borel measure of the data source and $\rho$ is the probability density function. However, in most cases, $\rho$ is unknown, and $R(T)$ cannot be directly minimized. An alternative approach is the empirical risk minimization (ERM) method (Vapnik, 2000; Cucker & Smale, 2002). Given a finite number of independent and identically distributed (i.i.d.) observations $\{x_1, \ldots, x_n\}$, the empirical reconstruction error with respect to T is defined as
$$R_n(T) = \frac{1}{n}\sum_{i=1}^{n} f_T(x_i). \tag{1.4}$$
The ERM method searches for a $T_n$ that minimizes $R_n(T)$, in the hope that $R(T_n)$ has a small distance to the optimal expected reconstruction error $R(T^*)$, where
$$T^* = \arg\min_{T \in \mathcal{T}} R(T), \tag{1.5}$$
and $\mathcal{T}$ denotes a particular class of linear operators T.
A probabilistic bound on the defect
$$R(T_n) - R_n(T_n)$$
is called the generalization (error) bound. This letter focuses on this error bound in the framework of k-dimensional coding schemes. Although different restrictions are imposed on the choices of $\mathcal{T}$ and Y for different concrete forms of k-dimensional coding schemes (e.g., NMF requires both T and Y to be nonnegative, and sparse coding requires sparsity in Y), they are closely related. For example, Ding, He, and Simon (2005) showed that NMF with orthogonal Y is identical to k-means clustering of the data. Analyzing the generalization bounds together in this context has the advantage of exploiting common properties and mutual cross-fertilization.
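The gap between the empirical and expected reconstruction errors that the bounds control can be simulated in a few lines; the mixture data, the candidate implementation `T_n`, and the fresh-sample proxy for the expected error are our illustrative assumptions, not the letter's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_T(x, T):
    # k-means-style loss: squared distance to the nearest column of T
    return float(np.min(np.sum((x[:, None] - T) ** 2, axis=0)))

def empirical_risk(X, T):
    # R_n(T) = (1/n) * sum_i f_T(x_i), as in equation 1.4
    return float(np.mean([f_T(x, T) for x in X]))

m, k, n = 2, 3, 200
centers = rng.standard_normal((m, k))

def sample(n):
    idx = rng.integers(k, size=n)
    return (centers[:, idx] + 0.1 * rng.standard_normal((m, n))).T

X_train = sample(n)
T_n = centers + 0.1 * rng.standard_normal((m, k))  # a candidate implementation
train_err = empirical_risk(X_train, T_n)           # empirical error R_n(T_n)
fresh_err = empirical_risk(sample(n), T_n)         # Monte Carlo proxy for R(T_n)
gap = abs(fresh_err - train_err)                   # the quantity the bounds control
```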

### 1.1  Related Work

Maurer and Pontil (2010) and Gribonval, Jenatton, Bach, Kleinsteuber, and Seibert (2015) have performed the only known theoretical analyses of the generalization error in the framework of k-dimensional coding schemes. Other work has concentrated only on specific k-dimensional coding schemes. Since some previous work has studied consistency performance, which considers the quantity $R(T_n) - R(T^*)$ of the related ERM-based algorithms, we demonstrate the relationship between the generalization error and consistency performance here:
$$R(T_n) - R(T^*) \le \big(R(T_n) - R_n(T_n)\big) + \big(R_n(T^*) - R(T^*)\big), \tag{1.6}$$
since $R_n(T_n) \le R_n(T^*)$ by the definition of $T_n$. Thus, analyzing the generalization error provides an approach for analyzing the consistency performance, and the consistency performance provides directions for generalization error analysis. We review the generalization error and consistency performance of k-dimensional coding schemes together:
• Nonnegative matrix factorization (NMF). The only known generalization bounds of NMF have been developed by Maurer and Pontil (2010) and Gribonval et al. (2015).

• Dictionary learning. Maurer and Pontil (2010) have developed dimensionality-independent generalization bounds. Vainsencher, Mannor, and Bruckstein (2011) and Gribonval et al. (2015) have studied the dimensionality-dependent generalization bounds.

• Sparse coding. A generalization bound for sparse coding was first derived by Maurer and Pontil (2010) and subsequently extended by Xu and Lafferty (2012), Mehta and Gray (2013), Maurer, Pontil, and Romera-Paredes (2013), and Gribonval et al. (2015). Maurer et al. (2013) derived a faster convergence rate upper bound of the consistency performance in a transfer learning setting.

• K-means clustering and vector quantization. Consistency performances of k-means clustering and vector quantization have mostly been studied for finite-dimensional spaces. Asymptotic and nonasymptotic consistency performances have been considered by Pollard (1982), Chou (1994), Linder, Lugosi, and Zeger (1994), Bartlett, Linder, and Lugosi (1998), Linder (2000), Antos, Györfi, and György (2005), Antos (2005), and Levrard (2013). Biau, Devroye, and Lugosi (2008), Maurer and Pontil (2010), and Levrard (2015) developed dimensionality-independent generalization bounds for k-means clustering.

We are aware that these specific forms of k-dimensional coding schemes have many applications with finite-dimensional data, yet only a few dimensionality-dependent methods have been developed to analyze the generalization bounds for all these coding schemes.

In this letter, we develop a dimensionality-dependent method to analyze the generalization bounds for the framework of k-dimensional coding schemes. Our method is based on Hoeffding’s inequality (Hoeffding, 1963) and the Bennett-type inequalities (Boucheron, Lugosi, & Massart, 2013) and directly bounds the covering number of the loss function class induced by the reconstruction error, which avoids inducing the worst-case upper bound on k of the loss function. Our method yields a generalization bound of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $\lambda_n$ is much bigger than 0.5 when n is small, which delicately describes the nonasymptotic behavior of the learning process. However, when n goes to infinity, $\lambda_n$ approaches 0.5. The obtained dimensionality-dependent generalization bound can be much tighter than the previous ones when the number k of columns of the implementation is larger than the dimensionality m, which can often happen for dictionary learning, sparse coding, k-means clustering, and vector quantization. We therefore obtain state-of-the-art generalization bounds for NMF, dictionary learning, sparse coding, k-means clustering, and vector quantization.

The remainder of the letter is organized as follows. We present our motivation in section 2 and main results in section 3. In section 4, we apply our results to specific coding schemes and empirically compare them with state-of-the-art generalization bounds. We prove our results in section 5 and conclude the letter in section 6.

## 2  Motivation

We first introduce the dimensionality-independent generalization bounds and demonstrate why our dimensionality-dependent bound complements them.

Assume that data points are drawn from a Hilbert space $\mathcal{H}$ with distribution $\rho$. For any $r > 0$, let $\mathcal{P}_r$ denote the set of probability distributions on $\mathcal{H}$ supported on the closed ball of radius r centered at the origin. In other words, $\rho \in \mathcal{P}_r$ means that $\|x\| \le r$ almost surely. Let $\mathcal{T}$ be bounded in the operator norm; that is, for every $T \in \mathcal{T}$, it holds that $\|Tv\| \le c$ for all v with $\|v\| \le 1$. Then we also have that the columns of T are bounded as $\|Te_i\| \le c$, where $\{e_1, \ldots, e_k\}$ is the orthonormal basis of $\mathbb{R}^k$.

The following two theorems are equivalent to the main theorems proved by Maurer and Pontil (2010) but are presented in a different way. They are dimensionality-independent generalization bounds obtained in the framework of the k-dimensional coding schemes. Their proofs exploit the Rademacher complexity technique (Bartlett & Mendelson, 2003), which is suitable for deriving dimensionality-independent bounds (see Biau et al., 2008).

Theorem 1.
Assume that and Y is a closed subset of the unit ball of and that there is such that for all , . Suppose that the reconstruction error functions fT for have a range contained in . For any , with probability at least in the independently observed data , we have
Remark 1.

The dimensionality-independent generalization bound in theorem 1 is valuable because it shows a convergence rate of order $\mathcal{O}(1/\sqrt{n})$.

Theorem 2.
Assume that and and that the reconstruction error functions fT for have a range contained in . For any , with probability at least in the independently observed data , we have
If $\mathcal{H}$ is finite dimensional, the above result can be improved to
Remark 2.

The condition that Y is a closed subset of the unit ball of $\mathbb{R}^k$ can be easily achieved by controlling the upper bound of the columns of T because there is a trade-off between the bounds on the columns of T and the entries of Y.

Remark 3.

We note that theorems 1 and 2 are more complicated than the original results presented in Maurer and Pontil (2010). This is because we have removed the restrictions that were imposed to simplify their results, to reveal the intrinsic relationships between the order of k and the Rademacher complexities (discussed below). The proof methods of theorems 1 and 2 in this letter are exactly the same as those presented by Maurer and Pontil (2010).

We note that if y is in the unit ball of $\mathbb{R}^k$, then
$$\|x - Ty\|^2 \le \big(\|x\| + \|Ty\|\big)^2 \le \Big(\|x\| + \sum_{i=1}^{k}|y_i|\,\|Te_i\|\Big)^2 \le \big(r + \sqrt{k}\,c\big)^2,$$
where r, the upper bound of the data point, can be reduced by normalization. However, k is a fixed integer, whose value is usually large in practice. Thus, $\sqrt{k}\,c$ is the dominant factor in the upper bound of $f_T$. It is evident that $f_T$ has the worst-case upper bound on k of order $\mathcal{O}(k)$; that is, the dependency with regard to k of the upper bound of $f_T$ has the worst-case order $\mathcal{O}(k)$. However, for some special forms of k-dimensional coding schemes, the upper bound of $f_T$ has a very small order about k. Taking NMF as an example, the order about k is zero because
$$f_T(x) = \min_{y \in Y} \|x - Ty\|^2 \le \|x - T \cdot 0\|^2 = \|x\|^2 \le r^2. \tag{2.1}$$
It is evident that the k-dependent term in theorem 1 has the same order as that of the worst-case upper bound on k of $f_T$. It will therefore be loose for some specific k-dimensional coding schemes. Maurer and Pontil (2010) introduced the proof method of theorem 2 to overcome this problem; however, one term in that bound implies that the problem is only partially solved because it represents the worst-case upper bound on k of the expected loss (details can be found in the proof therein). For example, in NMF, that term has a nonzero order about k (discussed in remark 9). The dimensionality-dependent bound in theorem 2 faces the same problem because its proof computes a Rademacher complexity, and the corresponding part of the obtained bound is dimensionality independent and involves the worst-case upper bound on k of the loss function.
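The observation that the NMF loss is bounded independently of k, because y = 0 is always a feasible nonnegative code, is easy to check numerically; the random-search estimator below is a crude illustration of ours, not an NMF solver.

```python
import numpy as np

rng = np.random.default_rng(2)

def nmf_loss_upper(x, T, trials=2000):
    # Crude estimate of min_{y >= 0} ||x - T y||^2 by random nonnegative search;
    # y = 0 is always feasible, so the estimate never exceeds ||x||^2.
    best = float(np.sum(x ** 2))                 # the y = 0 candidate
    for _ in range(trials):
        y = rng.random(T.shape[1])
        best = min(best, float(np.sum((x - T @ y) ** 2)))
    return best

m = 10
x = rng.random(m)
bound = float(np.sum(x ** 2))                    # r^2-style bound: no k appears
for k in (5, 50, 500):                           # growing k never worsens the bound
    T = rng.random((m, k))
    assert nmf_loss_upper(x, T) <= bound + 1e-12
```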

We try to avoid the worst case by employing a covering number method to measure the complexity of the induced loss function class. However, in our setting, the dimensionality m of the data space must be finite.

## 3  Main Results

Before presenting our main results, we introduce the definition of covering number (Zhang, 2002).

Definition 1.
Let $(\mathcal{M}, d)$ be a metric space with metric d. Given observations $x_1^n = \{x_1, \ldots, x_n\}$ and the vectors $\big(f(x_1), \ldots, f(x_n)\big)$ for $f$ in a function class $\mathcal{F}$, the covering number in p-norm, denoted as $\mathcal{N}_p(\mathcal{F}, \epsilon, x_1^n)$, is the minimum number u of a collection of vectors $v_1, \ldots, v_u \in \mathbb{R}^n$ such that, for every $f \in \mathcal{F}$, there exists a $v_j$ with
$$\left(\frac{1}{n}\sum_{i=1}^{n} \big|f(x_i) - v_{j,i}\big|^p\right)^{1/p} \le \epsilon, \tag{3.1}$$
where $v_{j,i}$ is the i-th component of vector $v_j$. We also define $\mathcal{N}_p(\mathcal{F}, \epsilon, n) = \sup_{x_1^n} \mathcal{N}_p(\mathcal{F}, \epsilon, x_1^n)$.
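A greedy sketch can estimate the covering number of definition 1 for a finite family of functions evaluated on a sample; the matrix layout and the function `greedy_cover_size` are our illustrative choices, and greedy covering only upper-bounds the true covering number.

```python
import numpy as np

def greedy_cover_size(V, eps, p=2):
    """Greedy upper estimate of the covering number of the rows of V.

    V[j] holds the values (f_j(x_1), ..., f_j(x_n)) of one function;
    the distance is the empirical p-norm ((1/n) * sum |.|^p)^(1/p).
    """
    uncovered = set(range(V.shape[0]))
    centers = 0
    while uncovered:
        j = next(iter(uncovered))            # pick any uncovered row as a new center
        centers += 1
        d = np.mean(np.abs(V - V[j]) ** p, axis=1) ** (1.0 / p)
        uncovered -= {i for i in uncovered if d[i] <= eps}
    return centers

rng = np.random.default_rng(3)
V = rng.random((100, 20))                    # 100 "functions" evaluated at 20 points
big = greedy_cover_size(V, eps=10.0)         # radius exceeds any distance: one ball
tiny = greedy_cover_size(V, eps=1e-9)        # radius ~0: every distinct row is a center
```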

Let $\mathcal{F} = \{f_T : T \in \mathcal{T}\}$ be the induced loss function class. We can upper-bound the covering number of the induced loss function class of any k-dimensional coding scheme.

Lemma 1.
Let be the loss function class induced by the reconstruction error for a k-dimensional coding scheme. We have

By employing Hoeffding’s inequality (Hoeffding, 1963), we can derive a dimensionality-dependent generalization bound for k-dimensional coding schemes.

Theorem 3
(Main Result 1). Assume that and Y is a closed subset of the unit ball of and that there is such that for all , , and that the functions fT for have a range contained in . For any , with probability at least , we have

Our result is dimensionality-dependent. Compared to the bound in theorem 2, our bound could be tighter when k is larger than the dimensionality m.

Remark 4.
Let us take NMF as an example to show how our method avoids inducing the worst-case upper bound on k of the loss function compared to those of theorems 1 and 2. Regarding NMF,
3.2
If we consider only the orders of m, k, and n, our bound is of order $\mathcal{O}\big(\sqrt{mk\ln(mkn)/n}\big)$, while theorem 1 has order and theorem 2 is of order . Our bound is tighter when k is large relative to m.
Remark 5.

For dictionary learning, sparse coding, k-means clustering, and vector quantization, the number k of the columns of the linear implementation may be larger than the dimensionality m. If k is larger than m, our bound will be much tighter than the dimensionality-independent generalization bound.

Remark 6.
According to the proofs of lemma 1 and theorem 3, our result is based on the estimation of the Lipschitz constant of the loss function with regard to the implementation T. In particular, we proved the Lipschitz property for all T and $T'$ in $\mathcal{T}$, where L is a constant depending on a specific k-dimensional coding scheme. Similar to our idea, Gribonval et al. (2015) also developed dimensionality-dependent generalization bounds for k-dimensional coding schemes. However, their method is different from ours. Their results are essentially based on an analogous Lipschitz property, for all T and $T'$ in $\mathcal{T}$, whose constant is defined through the operator norm of an $m \times k$ matrix. As a result, under some assumptions (see assumptions A1–A4, B1–B3, and C1–C2 therein) and with high probability, they have that
where the constants depend on a specific k-dimensional coding scheme. Note that in most applications, both conditions hold. Their bound could be looser than the derived bound in theorem 3 because of the relationship between the two Lipschitz constants in such cases. Detailed comparisons are presented in section 4.

The result in theorem 3 can be improved by exploiting Bennett-type inequalities. We can make the upper bound have either a smaller constant or a faster convergence rate, as follows.

By employing Bernstein’s inequality, we show that a generalization bound for k-dimensional coding schemes tighter than that in theorem 3 can be derived.

Theorem 4
(Main Result 2). Assume that and Y is a closed subset of the unit ball of and that there is such that for all , , and that the functions fT for have a range contained in . For any , with probability at least , we have
Remark 7.

The upper bound in theorem 4 can be much tighter than that in theorem 3. The dominant term in the upper bound of theorem 4 involves the empirical reconstruction error. Since the empirical reconstruction error is no bigger than, and sometimes much smaller than, 1, the upper bound in theorem 4 can be much tighter than that in theorem 3.

We can restate the result by using the inequality $\sqrt{ab} \le \frac{a}{2\alpha} + \frac{\alpha b}{2}$, which holds for all $a, b \ge 0$ and $\alpha > 0$.

Proposition 1.
Assume that and Y is a closed subset of the unit ball of and that there is such that for all , , and that the functions fT for have a range contained in . For any , any , and any , with probability at least , we have

We have claimed that theorem 4 and proposition 1 can be tighter than theorem 3 by saying that the empirical reconstruction error can be very small. However, sometimes such a term could be large. If it is large (note that the reconstruction error function is bounded), theorem 4 and proposition 1 will be looser than theorem 3.

The following theorem implies that by employing a Bennett-type inequality, the generalization bound can be improved no matter what the value of the empirical reconstruction error is.

Theorem 5
(Main Result 3). Assume that and Y is a closed subset of the unit ball of and that there is such that for all , , and that the functions fT for have a range contained in . For any , with probability at least , it holds for all that
when V satisfies that and is any positive constant.
Remark 8.

Since in theorem 5, we have that if the condition holds. For simplicity, set . If we have that , the upper bound in theorem 5 will be the same as that in theorem 3 except for a faster convergence rate. Thus, the upper bound in theorem 5 can be much tighter than that in theorem 3 in the sense that it converges much faster.

Remark 9.

The generalization bound in theorem 3 is of order $\mathcal{O}\big((1/n)^{1/2}\big)$, while the generalization bound in theorem 5 is of order $\mathcal{O}\big((1/n)^{\lambda_n}\big)$, where $\lambda_n > 0.5$ when n is finite. The generalization bound in theorem 5, derived by employing Bennett’s inequality, converges faster when the sample size n is small, which is often the case in practice and describes the nonasymptotic behavior of the learning process. More empirical discussions can be found in Zhang (2013). However, when the sample size n goes to infinity, $\lambda_n$ will approach 0.5, which means that the upper bounds in theorems 5 and 3 describe the same asymptotic behavior of the learning process.

Remark 10.
Theorem 5 looks complex, since the exponent in the convergence rate itself depends on the sample size in an implicit way. Here we show the superiority of theorem 5 by comparing it with theorem 3. From the proof of theorem 5, we can see that the theorem depends on the following inequality (see equation 5.13):
where . Note that for Hoeffding’s inequality, with any , we also have
Thus, according to Hoeffding’s inequality and the proof method of theorem 5, for all , with probability at least , it holds that
Comparing the above bound with that in theorem 5, we can see that if we interpret theorem 3 with a faster convergence rate, the upper bound there is looser than that in theorem 5 when .

Our main results in theorems 3, 4, and 5 apply to all the k-dimensional coding schemes because the covering number in lemma 1 measures the complexity of the loss function class that includes all the possible loss functions of k-dimensional coding schemes. However, for some specific k-dimensional coding schemes, the complexity of the corresponding induced loss function class can be refined. We discuss the details in the next section.

## 4  Applications

In this section, we apply our proof methods to specific k-dimensional coding schemes. We show that our methods provide state-of-the-art dimensionality-dependent generalization bounds.

### 4.1  Nonnegative Matrix Factorization

NMF factorizes a data matrix $X \in \mathbb{R}_+^{m \times n}$ into two nonnegative matrices $T \in \mathbb{R}_+^{m \times k}$ and $Y \in \mathbb{R}_+^{k \times n}$ such that $X \approx TY$. NMF has been widely exploited since Lee and Seung (1999) provided a powerful psychological and physiological interpretation as a parts-based factorization and an efficient multiplicative update rule for obtaining a local solution. Many fast and robust algorithms then followed (e.g., Gillis & Vavasis, 2014; Guan, Tao, Luo, & Yuan, 2011; Liu & Tao, 2015). In all applications, both the data points and the columns of T are contained in the positive orthant of a finite-dimensional space. In this case, our method for deriving dimensionality-dependent generalization bounds is likely to be superior to the method for obtaining dimensionality-independent results.

Letting $Y \in \mathbb{R}_+^{k \times n}$, NMF can be formulated as
$$\min_{T \ge 0,\, Y \ge 0} \|X - TY\|_F^2,$$
where $\|\cdot\|_F$ is the matrix Frobenius norm.
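The NMF objective can be minimized locally with the multiplicative updates of Lee and Seung (1999); the dimensions, the random initialization, and the small `eps` guard against division by zero are our choices for a minimal sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 20, 30, 5
X = rng.random((m, n))
T = rng.random((m, k)) + 0.1
Y = rng.random((k, n)) + 0.1
eps = 1e-9                               # guards against division by zero

def frob_sq(A):
    return float(np.linalg.norm(A) ** 2)

err_start = frob_sq(X - T @ Y)
for _ in range(100):
    # Lee & Seung (1999) multiplicative updates for min ||X - T Y||_F^2
    Y *= (T.T @ X) / (T.T @ T @ Y + eps)
    T *= (X @ Y.T) / (T @ Y @ Y.T + eps)
err_end = frob_sq(X - T @ Y)
assert err_end < err_start               # the updates do not increase the objective
```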
Because $TY = (TQ^{-1})(QY)$ if Q is an invertible scaling matrix, we can normalize T without changing the optimization problem by choosing $Q = \mathrm{diag}\big(\|Te_1\|, \ldots, \|Te_k\|\big)$.

If we restrict $\|x\| \le r$ and normalize T, the columns of Y will also be upper-bounded by r. This can be seen in the following lemma, which generalizes lemma 19 in Maurer and Pontil (2010):

Lemma 2.

For NMF with normalized T, if $\|x\| \le r$, then every column of Y is upper bounded by r; that is, $\|Y e_i\| \le r$ for all $1 \le i \le n$.
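The normalization argument can be checked directly: rescaling the columns of T and compensating in Y leaves the product, and hence the reconstruction error, unchanged. The dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, n = 6, 4, 10
T = rng.random((m, k)) + 0.1       # strictly positive, so no zero columns
Y = rng.random((k, n))

scale = np.linalg.norm(T, axis=0)  # Q = diag(column norms of T)
T_hat = T / scale                  # normalized implementation: unit-norm columns
Y_hat = Y * scale[:, None]         # compensating scaling of the codes

# The product, and hence the reconstruction error, is unchanged:
assert np.allclose(T @ Y, T_hat @ Y_hat)
assert np.allclose(np.linalg.norm(T_hat, axis=0), np.ones(k))
```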

For a fixed T, Y is determined by a convex problem. Thus, the reconstruction error for NMF is
$$f_T(x) = \min_{y \ge 0} \|x - Ty\|^2, \tag{4.1}$$
and the generalization error of NMF can be analyzed under the framework of the k-dimensional coding schemes.

Using the same proof method as that of lemma 1, we have the following lemma:

Lemma 3.
Let and be the loss function class induced by the reconstruction error of NMF. We have

Then, according to the proof methods of theorems 3, 4, and 5, we have the following dimensionality-dependent generalization bounds for NMF:

Theorem 6.
For NMF, assume that and that is normalized. For any , with probability at least , it holds for all that
Since the value of the empirical reconstruction error is unknown in this letter (it is usually known in an optimization procedure), in the rest of the letter, we will only compare the bound based on theorem 3 with state-of-the-art bounds. Theorem 3 gives the following bound for NMF:
4.2
Under the setting of theorem 6, theorem 2 yields the following bound:
4.3
Gribonval et al.’s (2015) result gives the following bound:
4.4

We then carefully compare the above generalization bounds. For NMF problems, the dimensionality m is usually very large compared to the reduced dimensionality k. We set . The comparisons are illustrated in Figure 1. The figure shows that in most cases, the derived generalization bound is tighter than state-of-the-art bounds. In Figure 1d, the bound in equation 4.3 is tighter than the derived bound in a small range because it is dimensionality independent and m is set to be much larger than the corresponding reduced dimensionality k.

Figure 1:

Comparisons of the generalization bounds of NMF. (a) The convergence of the bound in equation 4.2, where . (b) Comparing the convergence with state-of-the-art generalization bounds, where . (c) Comparing the generalization bound with state-of-the-art generalization bounds in terms of the parameter m, where . (d) Comparing the generalization bound with state-of-the-art generalization bounds in terms of the parameter k, where .


### 4.2  Dictionary Learning

Dictionary learning tries to find a dictionary such that all observed data points can be approximated by linear combinations of atoms in the dictionary. Let the columns of T be the atoms of the dictionary. For an observation , the dictionary learning method will represent x by a linear combination of columns of T as
4.5
Thus, the reconstruction error of dictionary learning is the same as those of k-dimensional coding schemes.

Vainsencher et al. (2011) provided notable dimensionality-dependent generalization bounds for dictionary learning by considering two types of constraints on coefficient selection. For $\ell_0$-norm regularized coefficient selection, where every signal is approximated by a combination of at most p dictionary atoms, the generalization bound (theorem 35 therein) is of order under an approximate orthogonality assumption on the dictionary. For $\ell_1$-norm regularized coefficient selection, the generalization bound (theorem 22 therein) is of order under the requirements that the upper bound of the $\ell_1$-norm of the coefficients is larger than a given threshold and that the signal x is mapped onto the unit sphere. Our result on k-dimensional coding schemes can also be applied to dictionary learning and provides a more general bound, which requires neither x to be on the unit sphere nor the near-orthogonality of the dictionary, and directly applies to all dictionary learning problems.
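For intuition about the at-most-p-atoms constraint, a greedy orthogonal-matching-pursuit-style sketch (our illustration, not Vainsencher et al.'s algorithm) selects atoms one at a time and solves a least-squares problem on the current support:

```python
import numpy as np

rng = np.random.default_rng(6)

def omp(x, T, p):
    """Greedy selection of at most p atoms (an OMP-style sketch)."""
    k = T.shape[1]
    support, residual = [], x.copy()
    coef = np.zeros(0)
    for _ in range(p):
        j = int(np.argmax(np.abs(T.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        # least-squares fit restricted to the selected atoms
        coef, *_ = np.linalg.lstsq(T[:, support], x, rcond=None)
        residual = x - T[:, support] @ coef
    y = np.zeros(k)
    y[support] = coef
    return y

m, k, p = 10, 25, 3
T = rng.standard_normal((m, k))
T /= np.linalg.norm(T, axis=0)          # unit-norm atoms
x = rng.standard_normal(m)
y = omp(x, T, p)
assert np.count_nonzero(y) <= p         # the code is p-sparse
assert np.sum((x - T @ y) ** 2) <= np.sum(x ** 2) + 1e-9   # no worse than y = 0
```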

Theorem 7.
For dictionary learning, assume that , that Y is a closed subset of the unit ball of , and that every atom is bounded by . Then, for any , with probability at least , it holds for all that

The proof of theorem 7 is the same as that of theorem 6.

Remark 11.

If we substitute an upper bound into the bound in Vainsencher et al. (2011), the bound in theorem 22 therein will be of order , which has the same order as the corresponding term in our result. However, our bound in theorem 7 also shows a faster convergence rate.

Remark 12.

The method Vainsencher et al. (2011) used to upper-bound the covering number of the induced loss function class is very different from ours. To upper-bound the covering number of the induced loss function class for dictionary learning, Vainsencher et al. (2011) used the fact that a uniformly L-Lipschitz mapping between metric spaces converts $\epsilon$-covers into $L\epsilon$-covers. They then focused on analyzing the Lipschitz property of the reconstruction error function that maps a dictionary into a reconstruction error, as shown in their lemma 7. Also note that to upper-bound the Lipschitz constant of this mapping, they introduced the approximate orthogonality condition (a bound on the Babel function) on the dictionary.

Remark 13.

Analyzing the Lipschitz properties of the induced loss functions is essential for upper-bounding the generalization error of k-dimensional coding schemes. Different from the method used in Vainsencher et al. (2011), Maurer and Pontil (2010) employed Slepian’s lemma to exploit the Lipschitz property, while in this letter, we propose a novel method, presented in the proof of theorem 3.

The comparisons of the generalization bounds of dictionary learning are similar to that of NMF because NMF can be regarded as dictionary learning in the positive orthant. We therefore omit the comparison. Many algorithms used in applications require sparsity in Y, because sparsity has advantages, such as for computation and storage. We therefore analyze sparsity in section 4.3.

### 4.3  Sparse Coding

Sparse coding requires sparsity in the codebook. We use the hard constraint discussed in Maurer and Pontil (2010), that is, , , and . Thus, we have
4.6

The following generalization bound for sparse coding is also from the work of Maurer and Pontil (2010), derived using the proof method of theorem 2.

Theorem 8.
For sparse coding, assume that . Let , where . Let us also assume that for all , . Then for any , with probability at least , it holds for all that

We now consider the generalization bound of sparse coding using our method. The following lemma is proved in section 5.7.

Lemma 4.
Follow the setting of theorem 8. Let be the loss function class of sparse coding. We have

Then we have the generalization bounds for sparse coding as follows:

Theorem 9.
Follow the setting of theorem 8. For any , with probability at least , it holds for all that
where .

The proof of theorem 9 is the same as that of theorem 6.

Theorem 9 gives the following bound for sparse coding:
4.7
The upper bound for sparse coding derived by Maurer and Pontil (2010) is presented in theorem 8:
4.8
Gribonval et al.’s (2015) result gives the following bound for sparse coding:
4.9

We then compare the above generalization bounds of sparse coding in Figure 2 by setting , and . The comparisons show that the derived generalization bound is tighter than state-of-the-art bounds.

Figure 2:

Comparisons of the generalization bounds of sparse coding. (a) The convergence of the bound in equation 4.7, where . (b) Comparing the convergence with state-of-the-art generalization bounds, where . (c) Comparing the generalization bound with state-of-the-art bounds in terms of the parameter m, where . (d) Comparing the generalization bound with state-of-the-art bounds in terms of the parameter k, where .


### 4.4  Vector Quantization and k-Means Clustering

The k-means clustering (or vector quantization) method aims to find k cluster centers such that observations can be partitioned into k clusters and represented by the k cluster centers with a small reconstruction error. Taking every column of T as a cluster center and setting Y as the standard bases , we see that solving a k-means clustering problem is equal to finding an implementation T. The corresponding reconstruction error is
$$f_T(x) = \min_{1 \le j \le k} \|x - Te_j\|^2. \tag{4.10}$$
So the reconstruction error of k-means clustering and vector quantization is also within the framework of the reconstruction error of k-dimensional coding schemes.
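Equation 4.10 says that with Y restricted to the standard basis, the reconstruction error is the squared distance to the nearest column of T; below is a vectorized computation with a brute-force check (our illustrative sketch, with arbitrary dimensions).

```python
import numpy as np

rng = np.random.default_rng(7)
m, k, n = 3, 4, 50
T = rng.standard_normal((m, k))     # columns are the k cluster centers
X = rng.standard_normal((n, m))     # n data points

# With Y = {e_1, ..., e_k}, T e_j is the j-th column, so
# f_T(x) = min_j ||x - T[:, j]||^2 (equation 4.10).
d2 = ((X[:, :, None] - T[None, :, :]) ** 2).sum(axis=1)  # (n, k) squared distances
losses = d2.min(axis=1)                                  # per-point loss f_T(x_i)
R_n = float(losses.mean())                               # empirical reconstruction error

# brute-force check against explicit basis codes for the first point
brute = min(float(np.sum((X[0] - T @ e) ** 2)) for e in np.eye(k))
assert abs(brute - losses[0]) < 1e-9
```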

The following lemma is essential for proving our dimensionality-dependent generalization bounds:

Lemma 5.
Assume that . Let be the loss function class of k-means clustering and vector quantization. Then
Theorem 10.
For k-means clustering and vector quantization, assume that and that the functions fT for have a range contained in . Then, for any , with probability at least , it holds for all that

The proof of theorem 10 is the same as that of theorem 6.

Theorem 10 gives the following bound for k-means clustering and vector quantization:
4.11
Maurer and Pontil (2010) derived the following bound:
4.12
Gribonval et al. (2015) provided the following bound:
4.13
Remark 14.

The bound in equation 4.12 has order , which is the same as the bound obtained by Biau et al. (2008). The corresponding term in theorem 10 has order . If , our bound can be tighter than that of Maurer and Pontil (2010) and the result in Biau et al. (2008). The generalization bounds derived by Maurer and Pontil (2010) and Biau et al. (2008) also have the advantage that they converge faster. As discussed in Bartlett et al. (1998), Linder et al. (1994), and Devroye, Györfi, and Lugosi (1996), the factor in theorem 10 can be removed by the sophisticated uniform large-deviation inequalities of Alexander (1984) or Talagrand (1994). However, Devroye et al. (1996) proved (in their theorem 12.10) that the fast-convergence upper bound has an astronomically large constant. The corresponding convergence bound is therefore loose. Our generalization bound, which is derived by exploiting Bennett’s inequality, will be tighter if the empirical reconstruction error is small.

We compare the above generalization bounds of k-means clustering and vector quantization in Figure 3 by setting . For k-means clustering and vector quantization problems, the dimensionality m can be independent of the reduced dimensionality k. Figure 3 shows that when k is not very large, the derived bound is tighter than state-of-the-art generalization bounds.

Figure 3:

Comparisons of the generalization bounds of k-means clustering and vector quantization. (a) The convergence of the bound in equation 4.11, where . (b) Comparing the convergence with state-of-the-art generalization bounds, where . (c) Comparing the generalization bound with state-of-the-art bounds in terms of the parameter m, where . (d) Comparing the generalization bound with state-of-the-art bounds in terms of the parameter k, where .

## 5  Proofs

In this section, we prove the main results in section 2 and some of the results presented in section 3.

### 5.1  Concentration Inequalities

In this section, we introduce the concentration inequalities that will be used to prove our assertions.

We first present Hoeffding’s inequality (Hoeffding, 1963), which is widely used for deriving generalization bounds.

Theorem 32
(Hoeffding’s inequality). Let be a sample set of independent random variables such that for some almost surely for all . Then for any and , the following inequality holds:

We will also use Bernstein’s inequality and Bennett’s inequality (Boucheron et al., 2013; Zhang, 2013) to derive generalization bounds.

Theorem 33
(Bernstein's inequality). Let be a sample set of independent random variables such that for some and whose variance is no bigger than V for some almost surely for all . Then for any and , the following inequality holds:
5.1
Theorem 34
(Bennett's inequality). Let be a sample set of independent random variables such that for some and whose variance is no bigger than V for some almost surely for all . Then, for any and , the following inequality holds:
5.2
where for .
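The three inequalities above can be compared numerically. The closed forms below are the standard one-sided versions of each inequality for a sum of n independent, zero-mean variables with range bound M and variance bound V; the particular values of n, M, V, and t are arbitrary illustrative choices:

```python
import math

# Standard one-sided tail bounds for sum X_i with E[X_i] = 0,
# |X_i| <= M, Var(X_i) <= V.

def hoeffding(n, t, M):
    # P(sum X_i >= t) <= exp(-t^2 / (2 n M^2)) for variables in [-M, M]
    return math.exp(-t * t / (2.0 * n * M * M))

def bernstein(n, t, M, V):
    # P(sum X_i >= t) <= exp(-t^2 / (2 (nV + M t / 3)))
    return math.exp(-t * t / (2.0 * (n * V + M * t / 3.0)))

def bennett(n, t, M, V):
    # P(sum X_i >= t) <= exp(-(nV / M^2) * h(M t / (nV))),
    # where h(u) = (1 + u) log(1 + u) - u
    u = M * t / (n * V)
    h = (1.0 + u) * math.log(1.0 + u) - u
    return math.exp(-(n * V / (M * M)) * h)

n, M, V, t = 1000, 1.0, 0.01, 50.0   # small variance: V << M^2
print("Hoeffding :", hoeffding(n, t, M))
print("Bernstein :", bernstein(n, t, M, V))
print("Bennett   :", bennett(n, t, M, V))
```

Bennett's bound is never larger than Bernstein's, and both beat Hoeffding's when the variance is much smaller than the squared range. This is why a Bennett-based generalization bound is tighter when the empirical reconstruction error (and hence the variance proxy) is small.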

### 5.2  Proof of Lemma 7

We bound the covering number of the loss function class by bounding the covering number of the implementation class . Cutting the subspace into small m-dimensional regular solids of width , we obtain
such regular solids. If we pick out the centers of these regular solids and use them to form T, there are
choices, denoted by . This number is therefore an upper bound on the size of a -cover of the implementation class .
We now prove that for every T, there exists a such that
where . The proof is as follows:
5.3
The last inequality holds because of the triangle inequality. We have
5.4
Let denote the loss function class for the algorithms when searching for implementations , and let the metric d be defined by . According to definition 6, for , there is a such that
5.5
Thus,
5.6
Taking logarithms of both sides, we have
5.7
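A toy version of the covering construction in this proof can be run directly: discretize the interval [-r, r] into cells of the chosen width, take the cell centers, and snap each entry of an m x k implementation matrix to the nearest center. The symbols r, eps, m, and k below are illustrative stand-ins for the (elided) quantities in the proof:

```python
import math, random

def make_centers(r, eps):
    """Cell centers of a 1-D grid of width eps covering [-r, r]."""
    n_cells = math.ceil(2 * r / eps)          # cells per coordinate
    return [-r + eps * (i + 0.5) for i in range(n_cells)], n_cells

def snap(T, centers):
    """Replace every entry of T by its nearest grid center."""
    return [[min(centers, key=lambda c: abs(c - x)) for x in row] for row in T]

random.seed(0)
m, k, r, eps = 4, 3, 1.0, 0.1
centers, n_cells = make_centers(r, eps)
T = [[random.uniform(-r, r) for _ in range(k)] for _ in range(m)]
T_hat = snap(T, centers)

# Every entry moves by at most eps/2, so the distance from T to the snapped
# matrix is controlled, while the number of candidate matrices is
# n_cells ** (m * k) -- exponential in m*k, which is exactly the mk factor
# that appears after taking logarithms in the lemma.
max_err = max(abs(x - y) for row, row2 in zip(T, T_hat) for x, y in zip(row, row2))
print("entries move by at most", max_err, "<=", eps / 2)
print("cover size:", n_cells ** (m * k))
```

The log of the cover size is (m k) log(n_cells), which is the dimensionality-dependent mk log(1/eps) behavior used throughout the paper.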

### 5.3  Proof of Theorem 8

We first prove the following theorem, which is useful for proving theorem 8.

Theorem 35.
Let be a set of independent random variables such that for some almost surely for all and . Then for any and , with probability at least , we have
where and .
Proof.
Since is a set of independent random variables, according to Hoeffding’s inequality, for any , with probability at least , we have
Let be a minimal -cover of FT. Then . By the union bound, with probability at least , the following holds:
5.8
It can be easily verified that
5.9
Combining inequalities 5.8 and 5.9, and letting , we have that with probability at least , the following holds:
5.10
This concludes the proof.

Theorem 8 can be proven by combining theorem 35 and lemma 7. We can also prove proposition 14 using the same method as that of theorem 8.
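The structure of the argument above, a pointwise Hoeffding bound, a union bound over an eps-cover, and a 2*eps approximation penalty, can be sketched numerically. The forms below are schematic: the range bound B, the choice eps ~ 1/n, and the cover-size model log N(eps) ~ mk log(2/eps) are illustrative assumptions, not the exact (elided) displays:

```python
import math

def pointwise_hoeffding(n, delta, B=1.0):
    """Deviation for one fixed function with range width B (Hoeffding)."""
    return B * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def uniform_bound(n, delta, log_cover, eps, B=1.0):
    """Union bound over an eps-cover of log-size log_cover:
    replace delta by delta / N(eps), i.e. add log N(eps) inside the sqrt,
    then pay 2*eps for approximating an arbitrary f by a cover element."""
    return 2.0 * eps + B * math.sqrt((math.log(2.0 / delta) + log_cover) / (2.0 * n))

n, delta, m, k = 10_000, 0.05, 10, 5
eps = 1.0 / n                              # the usual choice eps ~ 1/n
log_cover = m * k * math.log(2.0 / eps)    # schematic mk log(2/eps) model
print("pointwise:", pointwise_hoeffding(n, delta))
print("uniform  :", uniform_bound(n, delta, log_cover, eps))
```

The price of uniformity is the extra log-cover term under the square root, which carries the mk dependence, plus the negligible 2*eps term.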

### 5.4  Proof of Theorem 12

According to Bernstein's inequality, we have the following theorem, which is useful for proving theorem 12.

Theorem 36.
Let be a set of independent random variables such that almost surely for all and . Then for any and , with probability at least , we have
Proof.
Since is a set of independent random variables, according to Bernstein’s inequality, for any , with probability at least , we have
5.11
We also have that because . Collecting the terms in , completing the square, and solving for shows that with probability at least , we have
5.12
Straightforward substitution of inequality 5.12 into inequality 5.11 shows that with probability at least , we have
As in the proof of theorem 35, applying the union bound, we then have that with probability at least , the following holds:
This concludes the proof.

Theorem 12 can be proven by combining theorem 36 and lemma 7.

### 5.5  Proof of Theorem 15

The following theorem, derived by exploiting Bennett’s inequality, is essential to prove theorem 15.

Theorem 37.
Let be a set of independent random variables such that almost surely for all and . Then for any and , with probability at least , it holds for all that
when V is no smaller than and there is a positive constant such that .

Theorem 37 can be proven easily by using Bernstein's inequality. However, to exhibit the faster convergence property, we give a new proof of Bernstein's inequality, which requires the following lemma:

Lemma 38.
For and , there exists some and such that the following holds:
Let be i.i.d. variables such that , , and are almost surely for all . Then, for any , with probability at least , we have
Proof.
We prove the first part. We have
It holds that

Thus, there are many pairs of such that the first part of lemma 38 holds.

We then prove Bernstein’s inequality and the second part. According to Bennett’s inequality, we have
5.13
which is Bernstein's inequality.
To prove the second part, let . We have
5.14
5.15
5.16
For any , let
5.17
Then, with probability at least , we have
5.18
Combining equations 5.17 and 5.18, with probability at least , we have
5.19
and
5.20
Combining equations 5.18 and 5.20, with probability at least , we have
5.21
Thus, the second part of lemma 38 holds.

As in the proof of theorem 35, theorem 37 can be proven by using lemma 38 and the union bound.

Theorem 15 can be proven by combining theorem 37 and lemma 7.

### 5.6  Proof of Lemma 19

The proof method is the same as that of Lemma 19 in Maurer and Pontil (2010).

Proof.
Let
5.22
Assume that y is a minimizer of h and . Because T is normalized, . Then
5.23
Let the real-valued function f be defined as
5.24
Then
5.25
So f cannot have a minimum at 1, whence such a y cannot be a minimizer of h. Thus, any minimizer y must be contained in the ball of radius r in the m-dimensional space.

### 5.7  Proof of Lemma 27

As in the proof of lemma 7, we can pick out a set , where , having the property that for every T, there exists a such that with . The details are as follows:
5.26
Using Hölder’s inequality, we have
5.27
Using Hölder’s inequality again, we have inequalities 5.28 and 5.29:
5.28
and
5.29
Combining inequalities 5.28 and 5.29 gives
5.30
Combining inequalities 5.26, 5.27, and 5.30, we have
5.31
According to definition 6, for , there is a such that
5.32
Thus,
5.33
Taking logarithms of both sides, we have
which concludes the proof.
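Hölder's inequality, which the proof above applies twice, states that |⟨u, v⟩| ≤ ||u||_p ||v||_q whenever 1/p + 1/q = 1. A quick numeric check, with arbitrary illustrative vectors and exponents:

```python
import random

def p_norm(v, p):
    """The l_p norm of a real vector v."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

random.seed(1)
p, q = 3.0, 1.5                      # conjugate exponents: 1/3 + 2/3 = 1
u = [random.uniform(-1, 1) for _ in range(20)]
v = [random.uniform(-1, 1) for _ in range(20)]

# Hölder: |<u, v>| <= ||u||_p * ||v||_q
inner = abs(sum(x * y for x, y in zip(u, v)))
print(inner, "<=", p_norm(u, p) * p_norm(v, q))
```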

### 5.8  Proof of Lemma 29

The proof method of lemma 29 is similar to that of lemma 7.

Proof.
For k-means clustering and vector quantization, we can easily prove that . As in the proof of lemmas 7 and 27, we can pick out a set , where , having the property that for every T, there exists a such that with . The proof is as follows:
5.34
Thus,
5.35
Taking logarithms of both sides, we have
5.36
which concludes the proof.

## 6  Conclusion

We have proposed a method to analyze dimensionality-dependent generalization bounds for k-dimensional coding schemes, which are abstract and general descriptions of a set of methods that encode random vectors in Hilbert space . There are several specific forms of k-dimensional coding schemes, including NMF, dictionary learning, sparse coding, k-means clustering, and vector quantization, which have achieved great success in pattern recognition and machine learning.

Our proof approach is based on an upper bound for the covering number of the loss function class induced by the reconstruction error. We explained why the covering number is more suitable for deriving dimensionality-dependent generalization bounds for k-dimensional coding schemes: it avoids the worst-case dependency on the number k of columns of the linear implementation. If k is larger than the dimensionality m, our bound can be much tighter than the dimensionality-independent generalization bound. Moreover, by exploiting Bennett's inequality, we derived a dimensionality-dependent generalization bound of order , where when the sample size n is finite, for k-dimensional coding schemes. Our method therefore provides state-of-the-art dimensionality-dependent generalization bounds for NMF, dictionary learning, sparse coding, k-means clustering, and vector quantization.

## Acknowledgments

T.L. and D.T. were supported in part by Australian Research Council Projects DP-140102164 and FT-130101457. D.X. was partially supported by funding from the Faculty of Engineering and Information Technologies, University of Sydney, under the Faculty Research Cluster Program.

## References

Abbott, L., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11(1), 91–101.

Alexander, K. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Annals of Probability, 12(4), 1041–1067.

Amiri, A., & Haykin, S. (2014). Improved sparse coding under the influence of perceptual attention. Neural Computation, 26(2), 377–420.

Anderberg, M. R. (1973). Cluster analysis for applications. New York.

Antos, A. (2005). Improved minimax bounds on the test and training distortion of empirically designed vector quantizers. IEEE Transactions on Information Theory, 51(11), 4022–4032.

Antos, A., Györfi, L., & György, A. (2005). Improved convergence rates in empirical vector quantizer design. IEEE Transactions on Information Theory, 51(11), 4013–4022.

Bartlett, P. L., Linder, T., & Lugosi, G. (1998). The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5), 1802–1813.

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.

Biau, G., Devroye, L., & Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2), 781–790.

Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. New York: Oxford University Press.

Chen, S. S., Donoho, D. L., & Saunders, M. A. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1), 33–61.

Chou, P. A. (1994). The distortion of vector quantizers trained on n vectors decreases to the optimum at . In Proceedings of the IEEE International Symposium on Information Theory. Piscataway, NJ: IEEE.

Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1), 1–49.

Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer.

Dhillon, I. S., Guan, Y., & Kulis, B. (2007). Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 1944–1957.

Ding, C., He, X., & Simon, H. D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the SIAM International Conference on Data Mining. SIAM.

Févotte, C., Bertin, N., & Durrieu, J.-L. (2009). Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3), 793–830.

Gillis, N., & Vavasis, S. A. (2014). Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4), 698–714.

Gong, M., Zhang, K., Schölkopf, B., Tao, D., & Geiger, P. (2015). Discovering temporal causal relations from subsampled data. In Proceedings of the 32nd International Conference on Machine Learning (pp. 1898–1906). JMLR.

Gray, R. (1984). Vector quantization. IEEE Acoustics, Speech and Signal Processing Magazine, 1(2), 4–29.

Gribonval, R., Jenatton, R., Bach, F., Kleinsteuber, M., & Seibert, M. (2015). Sample complexity of dictionary learning and other matrix factorizations. IEEE Transactions on Information Theory, 61(6), 3469–3486.

Guan, N., Tao, D., Luo, Z., & Yuan, B. (2011). Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent. IEEE Transactions on Image Processing, 20(7), 2030–2048.

Guan, N., Tao, D., Luo, Z., & Yuan, B. (2012). NeNMF: An optimal gradient method for nonnegative matrix factorization. IEEE Transactions on Signal Processing, 60(6), 2882–2898.

Gui, J., Sun, Z., Ji, S., Tao, D., & Tan, T. (2016). Feature selection based on structured sparsity: A comprehensive study. IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2013.2551724

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.

Hunt, J. J., Ibbotson, M., & Goodhill, G. J. (2012). Sparse coding on the spot: Spontaneous retinal waves suffice for orientation selectivity. Neural Computation, 24(9), 2422–2433.

Ivana, T., & Pascal, F. (2011). Dictionary learning: What is the right representation for my signal? IEEE Signal Processing Magazine, 4(2), 27–38.

Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2002). An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 881–892.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755), 788–791.

Levrard, C. (2013). Fast rates for empirical vector quantization. Electronic Journal of Statistics, 7, 1716–1746.

Levrard, C. (2015). Nonasymptotic bounds for vector quantization in Hilbert spaces. Annals of Statistics, 43(2), 592–619.

Linder, T. (2000). On the training distortion of vector quantizers. IEEE Transactions on Information Theory, 46(4), 1617–1623.

Linder, T., Lugosi, G., & Zeger, K. (1994). Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Transactions on Information Theory, 40(6), 1728–1740.

Liu, H., Liu, T., Wu, J., Tao, D., & Fu, Y. (2015). Spectral ensemble clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 715–724). New York: ACM.

Liu, H., Shao, M., Li, S., & Fu, Y. (2016). Infinite ensemble for image clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM.

Liu, H., Wu, J., Tao, D., Zhang, Y., & Fu, Y. (2015). Dias: A disassemble-assemble framework for highly sparse text clustering. In Proceedings of the SIAM International Conference on Data Mining (pp. 766–774). SIAM.

Liu, T., & Tao, D. (2015). On the performance of Manhattan nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2015.2458986

Liu, W., Tao, D., Cheng, J., & Tang, Y. (2014). Multiview Hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding, 118, 50–60.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.

Mairal, J., Bach, F., & Ponce, J. (2012). . IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 791–804.

Maurer, A., & Pontil, M. (2010). K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11), 5839–5846.

Maurer, A., Pontil, M., & Romera-Paredes, B. (2013). Sparse coding for multitask and transfer learning. In Proceedings of the International Conference on Machine Learning. New York: ACM.

Mehta, N., & Gray, A. G. (2013). Sparsity-based generalization bounds for predictive sparse coding. In Proceedings of the International Conference on Machine Learning. New York: ACM.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.

Pehlevan, C., Hu, T., & Chklovskii, D. B. (2015). A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Computation, 27(1), 1461–1495.

Pollard, D. (1982). A central limit theorem for k-means clustering. IEEE Transactions on Information Theory, 10(4), 912–926.

Quiroga, R. Q., , Z., & Ben-Shaul, Y. (2004). Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering. Neural Computation, 16(8), 1661–1687.

Schneider, P., Biehl, M., & Hammer, B. (2009a). Adaptive relevance matrices in learning vector quantization. Neural Computation, 21(12), 3532–3561.

Schneider, P., Biehl, M., & Hammer, B. (2009b). Distance learning in discriminative vector quantization. Neural Computation, 21(10), 2942–2969.

Talagrand, M. (1994). Sharper bounds for gaussian and empirical processes. Annals of Probability, 22(1), 28–76.

Vainsencher, D., Mannor, S., & Bruckstein, A. M. (2011). The sample complexity of dictionary learning. Journal of Machine Learning Research, 12, 3259–3281.

Vapnik, V. (2000). The nature of statistical learning theory. New York: Springer.

Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., & Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 210–227.

Xu, M., & Lafferty, J. D. (2012). Conditional sparse coding and grouped multivariate regression. In Proceedings of the International Conference on Machine Learning. New York: ACM.

Zhang, C. (2013). Bennett type generalization bounds: Large deviation case and faster rate of convergence. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. New York: ACM.

Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2, 527–550.

## Note

1

Even though the faster convergence interpretation in theorem 15 is interesting, its form is complicated, and the upper bound is almost as tight as that of theorem 12. Therefore, we do not discuss its applications to specific k-dimensional coding schemes.