The k-dimensional coding schemes refer to a collection of methods that attempt to represent data using a set of representative k-dimensional vectors; they include nonnegative matrix factorization, dictionary learning, sparse coding, k-means clustering, and vector quantization as special cases. Previous generalization bounds for the reconstruction error of k-dimensional coding schemes are mainly dimensionality-independent. A major advantage of these bounds is that they can be used to analyze the generalization error when data are mapped into an infinite- or high-dimensional feature space. However, many applications use finite-dimensional data features. Can we obtain dimensionality-dependent generalization bounds for k-dimensional coding schemes that are tighter than dimensionality-independent bounds when data are in a finite-dimensional feature space? Yes. In this letter, we address this problem and derive a dimensionality-dependent generalization bound for k-dimensional coding schemes by bounding the covering number of the loss function class induced by the reconstruction error. The bound is of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $m$ is the dimensionality of the features, $k$ is the number of columns in the linear implementation of the coding schemes, $n$ is the sample size, and $\lambda_n > 0.5$ when $n$ is finite and $\lambda_n = 0.5$ when $n$ is infinite. We show that our bound can be tighter than previous results because it avoids inducing the worst-case upper bound on $k$ of the loss function. The proposed generalization bound is also applied to some specific coding schemes to demonstrate that the dimensionality-dependent bound is an indispensable complement to the dimensionality-independent generalization bounds.
1.1 Related Work
Dictionary learning. Maurer and Pontil (2010) have developed dimensionality-independent generalization bounds. Vainsencher, Mannor, and Bruckstein (2011) and Gribonval et al. (2015) have studied the dimensionality-dependent generalization bounds.
Sparse coding. A generalization bound for sparse coding was first derived by Maurer and Pontil (2010) and subsequently extended by Xu and Lafferty (2012), Mehta and Gray (2013), Maurer, Pontil, and Romera-Paredes (2013), and Gribonval et al. (2015). Maurer et al. (2013) derived an upper bound with a faster convergence rate for the consistency performance in a transfer learning setting.
K-means clustering and vector quantization. The consistency performance of k-means clustering and vector quantization has been studied extensively. Asymptotic and nonasymptotic consistency performances have been considered by Pollard (1982), Chou (1994), Linder, Lugosi, and Zeger (1994), Bartlett, Linder, and Lugosi (1998), Linder (2000), Antos, Györfi, and György (2005), Antos (2005), and Levrard (2013). Biau, Devroye, and Lugosi (2008), Maurer and Pontil (2010), and Levrard (2015) developed dimensionality-independent generalization bounds for k-means clustering.
We note that these specific forms of k-dimensional coding schemes have many applications for finite-dimensional data, yet only a few dimensionality-dependent methods have been developed to analyze the generalization bounds of these coding schemes.
In this letter, we develop a dimensionality-dependent method to analyze the generalization bounds for the framework of k-dimensional coding schemes. Our method is based on Hoeffding's inequality (Hoeffding, 1963) and the Bennett-type inequalities (Boucheron, Lugosi, & Massart, 2013) and directly bounds the covering number of the loss function class induced by the reconstruction error, which avoids inducing the worst-case upper bound on k of the loss function. Our method yields a generalization bound of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $\lambda_n$ is much bigger than 0.5 when $n$ is small, which precisely describes the nonasymptotic behavior of the learning process; when $n$ goes to infinity, $\lambda_n$ approaches 0.5. The obtained dimensionality-dependent generalization bound can be much tighter than the previous ones when the number $k$ of columns of the implementation is larger than the dimensionality $m$, which often happens for dictionary learning, sparse coding, k-means clustering, and vector quantization. We therefore obtain state-of-the-art generalization bounds for NMF, dictionary learning, sparse coding, k-means clustering, and vector quantization.
The remainder of the letter is organized as follows. We present our motivation in section 2 and main results in section 3. In section 4, we apply our results to specific coding schemes and empirically compare them with state-of-the-art generalization bounds. We prove our results in section 5 and conclude the letter in section 6.
We first introduce the dimensionality-independent generalization bounds and demonstrate why our dimensionality-dependent bound complements them.
Assume that data points are drawn from a Hilbert space $\mathcal{H}$ according to a distribution $\mu$. For any $r > 0$, let $\mathcal{M}_r$ denote the set of probability distributions on $\mathcal{H}$ supported on the closed ball of radius $r$ centered at the origin. In other words, $\mu \in \mathcal{M}_r$ means that $\Pr_{x \sim \mu}(\|x\| \le r) = 1$. Let the linear implementation $T$ be bounded in the operator norm; that is, there is a constant $c > 0$ such that $\|Tv\| \le c$ for all $v$ with $\|v\| \le 1$. Then we also have that the columns of $T$ are bounded as $\|Te_i\| \le c$, where $\{e_1, \ldots, e_k\}$ is the orthonormal basis of $\mathbb{R}^k$.
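To make the setup concrete, the reconstruction error of a k-dimensional coding scheme can be written as $f_T(x) = \min_{y \in Y} \|x - Ty\|^2$, where the codebook $Y$ determines the specific scheme. The following sketch (an illustration with hypothetical helper names, not part of the letter) evaluates this error for two choices of $Y$: the standard basis vectors, which recover vector quantization / k-means (so $Ty$ selects one column of $T$), and unconstrained codes, which reduce to least squares.

```python
import numpy as np

def reconstruction_error(x, T, codes):
    """Smallest squared distance from x to {T @ y : y in codes}."""
    return min(np.sum((x - T @ y) ** 2) for y in codes)

rng = np.random.default_rng(0)
m, k = 5, 3
T = rng.standard_normal((m, k))   # linear implementation: columns are centers/atoms
x = rng.standard_normal(m)

# Vector quantization / k-means: codes restricted to standard basis vectors,
# so T @ y simply selects one column (center) of T.
basis = [np.eye(k)[i] for i in range(k)]
err_vq = reconstruction_error(x, T, basis)

# Unconstrained codes: the minimizer is the least-squares solution,
# i.e., the projection of x onto the column span of T.
y_star, *_ = np.linalg.lstsq(T, x, rcond=None)
err_ls = np.sum((x - T @ y_star) ** 2)

# Enlarging the code set can only reduce the reconstruction error.
assert 0.0 <= err_ls <= err_vq + 1e-12
```

The nesting of codebooks (basis vectors form a subset of all code vectors) is exactly why vector quantization incurs at least the least-squares error for every sample.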
The following two theorems are equivalent to the main theorems proved by Maurer and Pontil (2010) but are presented in a different way. They are dimensionality-independent generalization bounds obtained in the framework of k-dimensional coding schemes and were derived with the Rademacher complexity technique (Bartlett & Mendelson, 2003), which is well suited to deriving dimensionality-independent bounds (see Biau et al., 2008).
The dimensionality-independent generalization bound in theorem 1 is valuable because it shows a convergence rate of order $\mathcal{O}(1/\sqrt{n})$.
The condition that Y is a closed subset of the unit ball can easily be satisfied by controlling the upper bound on the columns of T, because there is a trade-off between the bounds on the columns of T and those on the entries of the codes.
We note that theorems 1 and 3 are more complicated than the original results presented in Maurer and Pontil (2010). This is because we have removed the normalization restrictions required to simplify their results, in order to reveal the intrinsic relationship between the order of k and the Rademacher complexities (discussed below). The proof methods of theorems 1 and 3 in this letter are exactly the same as those presented by Maurer and Pontil (2010).
We avoid this worst case by employing a covering number argument to measure the complexity of the induced loss function class. However, in our setting, the dimensionality m of the data space must be finite.
3 Main Results
Before presenting our main results, we introduce the definition of covering number (Zhang, 2002).
We can upper-bound the covering number of the loss function class induced by any k-dimensional coding scheme.
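To make the notion of covering number concrete, the following sketch (an illustration, not part of the letter's proofs) builds an explicit $\epsilon$-cover of the unit cube in the sup norm by a regular grid and checks the classical cardinality bound $N(\epsilon, [0,1]^d, \|\cdot\|_\infty) \le \lceil 1/(2\epsilon) \rceil^d$:

```python
import itertools
import math

def linf_cover(d, eps):
    """Grid centers forming an eps-cover of [0, 1]^d in the sup norm."""
    per_axis = math.ceil(1.0 / (2 * eps))                # points needed per axis
    axis = [(2 * i + 1) * eps for i in range(per_axis)]  # centers spaced 2*eps apart
    axis = [min(c, 1.0) for c in axis]                   # clip back into the cube
    return list(itertools.product(axis, repeat=d))

def is_cover(net, points, eps):
    """Check every point is within eps (sup norm) of some net element."""
    return all(
        any(max(abs(a - b) for a, b in zip(p, c)) <= eps + 1e-12 for c in net)
        for p in points
    )

eps, d = 0.1, 2
net = linf_cover(d, eps)
# classical bound on the covering number of the unit cube
assert len(net) <= math.ceil(1 / (2 * eps)) ** d

# spot-check the cover on a coarse grid of test points
test_points = list(itertools.product([i / 10 for i in range(11)], repeat=d))
assert is_cover(net, test_points, eps)
```

The exponential dependence of the net size on the dimension d is the reason covering-number arguments naturally produce dimensionality-dependent bounds.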
By employing Hoeffding’s inequality (Hoeffding, 1963), we can derive a dimensionality-dependent generalization bound for k-dimensional coding schemes.
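Hoeffding's inequality can be checked numerically. The sketch below (an illustration of the standard two-sided form for [0, 1]-valued variables, not the letter's theorem) verifies that the deviation radius $\sqrt{\ln(2/\delta)/(2n)}$ covers the empirical mean with probability at least $1 - \delta$:

```python
import math
import random

def hoeffding_radius(n, delta):
    """Two-sided Hoeffding deviation for i.i.d. variables in [0, 1]:
    P(|mean - mu| >= t) <= 2 exp(-2 n t^2), solved for t at level delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

random.seed(0)
n, delta, trials = 200, 0.05, 2000
mu = 0.5                        # true mean of Uniform[0, 1]
t = hoeffding_radius(n, delta)

covered = sum(
    abs(sum(random.random() for _ in range(n)) / n - mu) <= t
    for _ in range(trials)
)
# the bound should hold in at least a 1 - delta fraction of trials
assert covered / trials >= 1 - delta
```

Because Hoeffding's inequality only uses boundedness, the radius is conservative; this is the slack the Bennett-type refinements later in the letter exploit.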
Our result is dimensionality-dependent. Compared to the bound in theorem 3, our bound can be tighter when the number k of columns is larger than the dimensionality m.
For dictionary learning, sparse coding, k-means clustering, and vector quantization, the number k of columns of the linear implementation may be larger than the dimensionality m. If k is much larger than m, our bound will be much tighter than the dimensionality-independent generalization bound.
The result in theorem 8 can be improved by exploiting Bennett-type inequalities, which make the upper bound have either a smaller constant or a faster convergence rate, as follows.
By employing Bernstein's inequality, we derive a generalization bound for k-dimensional coding schemes that is tighter than that in theorem 8.
The upper bound in theorem 12 can be much tighter than that in theorem 8: its dominant term scales with the empirical reconstruction error, which is no bigger than, and often much smaller than, 1.
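The advantage of a variance-sensitive bound can be seen by comparing the standard two-sided Hoeffding and Bernstein deviation radii for [0, 1]-valued variables (an illustration of the mechanism only, not the exact constants of theorems 8 and 12):

```python
import math

def hoeffding_radius(n, delta):
    # two-sided Hoeffding for variables in [0, 1]
    return math.sqrt(math.log(2 / delta) / (2 * n))

def bernstein_radius(n, delta, var, b=1.0):
    # two-sided Bernstein for variables in [0, b] with variance `var`:
    # P(|mean - mu| >= t) <= 2 exp(-n t^2 / (2 var + 2 b t / 3))
    log_term = math.log(2 / delta)
    return math.sqrt(2 * var * log_term / n) + 2 * b * log_term / (3 * n)

n, delta = 1000, 0.05
# a small variance (e.g., a small empirical reconstruction error) favors Bernstein
assert bernstein_radius(n, delta, var=0.01) < hoeffding_radius(n, delta)
# near the worst-case variance 1/4 for [0, 1]-valued variables, the edge is lost
assert bernstein_radius(n, delta, var=0.25) > hoeffding_radius(n, delta)
```

The second assertion mirrors the caveat below: when the variance term is close to its worst case, the variance-sensitive bound can be looser than the Hoeffding-based one.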
We can represent the result in a simpler form by applying an elementary inequality that holds for all nonnegative reals.
We have claimed that theorem 12 and proposition 14 can be tighter than theorem 8 because the empirical reconstruction error can be very small. However, this term can sometimes be large; when the empirical reconstruction error is close to its upper bound, theorem 12 and proposition 14 will be looser than theorem 8.
The following theorem implies that by employing a Bennett-type inequality, the generalization bound can be improved regardless of the value of the empirical reconstruction error.
In theorem 15, when the stated condition holds, the upper bound is the same as that in theorem 8 except for a faster convergence rate. Thus, the upper bound in theorem 15 can be much tighter than that in theorem 8 in the sense that it converges much faster.
The generalization bound in theorem 8 is of order $\mathcal{O}\big(\sqrt{mk\ln(mkn)/n}\big)$, while the generalization bound in theorem 15 is of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $\lambda_n > 0.5$ when $n$ is finite. The generalization bound in theorem 15, derived by employing Bennett's inequality, converges faster when the sample size $n$ is small, which is often the case in practice and describes the nonasymptotic behavior of the learning process. More empirical discussions can be found in Zhang (2013). However, when the sample size $n$ goes to infinity, $\lambda_n$ will approach 0.5, which means that the upper bounds in theorems 15 and 8 describe the same asymptotic behavior of the learning process.
Our main results in theorems 8, 12, and 15 apply to all the k-dimensional coding schemes because the covering number in lemma 7 measures the complexity of the loss function class that includes all the possible loss functions of k-dimensional coding schemes. However, for some specific k-dimensional coding schemes, the complexity of the corresponding induced loss function class can be refined. We discuss the details in the next section.
In this section, we apply our proof methods to specific k-dimensional coding schemes. We show that our methods provide state-of-the-art dimensionality-dependent generalization bounds.
4.1 Nonnegative Matrix Factorization
NMF factorizes a nonnegative data matrix $X \in \mathbb{R}_+^{m \times n}$ into two nonnegative matrices $T \in \mathbb{R}_+^{m \times k}$ and $Y \in \mathbb{R}_+^{k \times n}$, where $k \le \min(m, n)$, so that $X \approx TY$. NMF has been widely exploited since Lee and Seung (1999) provided a powerful psychological and physiological interpretation of it as a parts-based factorization and an efficient multiplicative update rule for obtaining a local solution. Many fast and robust algorithms then followed (e.g., Gillis & Vavasis, 2014; Guan, Tao, Luo, & Yuan, 2011; Liu & Tao, 2015). In all applications, both the data points and the basis vectors are contained in the positive orthant of a finite-dimensional space. In this case, our method for deriving dimensionality-dependent generalization bounds is likely to be superior to the method for obtaining dimensionality-independent results.
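As a minimal illustration of the multiplicative update rule mentioned above (a sketch for the Frobenius loss, not the letter's analysis or any particular optimized implementation), the following code runs Lee-Seung updates and checks that the factors stay nonnegative while the reconstruction error decreases:

```python
import numpy as np

def nmf_step(X, T, Y, eps=1e-9):
    """One round of Lee-Seung multiplicative updates for X ~ T @ Y."""
    Y = Y * (T.T @ X) / (T.T @ T @ Y + eps)   # update codes
    T = T * (X @ Y.T) / (T @ Y @ Y.T + eps)   # update basis
    return T, Y

rng = np.random.default_rng(0)
m, n, k = 20, 30, 3
X = rng.random((m, k)) @ rng.random((k, n))   # exactly factorable nonnegative data
T = rng.random((m, k))
Y = rng.random((k, n))

errors = [np.linalg.norm(X - T @ Y)]
for _ in range(100):
    T, Y = nmf_step(X, T, Y)
    errors.append(np.linalg.norm(X - T @ Y))

assert (T >= 0).all() and (Y >= 0).all()      # nonnegativity is preserved
assert errors[-1] < errors[0]                 # Frobenius error decreases
```

Because the updates are multiplicative, nonnegative initial factors remain nonnegative throughout, which is the property that keeps the learned basis and codes in the positive orthant.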
For NMF with normalized T, if the data distribution is supported on the closed ball of radius r, then every column of Y is upper-bounded by r; that is, $\|y_i\| \le r$ for all $i$.
Using the same proof method as that of lemma 7, we have the following lemma:
We then carefully compare the above generalization bounds. For NMF problems, the dimensionality m is usually very large compared to the reduced dimensionality k. The comparisons are illustrated in Figure 1, which shows that in most cases the derived generalization bound is tighter than state-of-the-art bounds. In Figure 1d, the bound in equation 1.3 is tighter than the derived bound over a small range because it is dimensionality-independent and m is set much larger than the corresponding reduced dimensionality k.
4.2 Dictionary Learning
Vainsencher et al. (2011) provided notable dimensionality-dependent generalization bounds for dictionary learning by considering two types of constraints on coefficient selection. For the $\ell_0$-constrained coefficient selection, where every signal is approximated by a combination of at most p dictionary atoms, they derived a generalization bound (theorem 35 therein) under an approximate orthogonality assumption on the dictionary. For the $\ell_1$-norm regularized coefficient selection, they derived a generalization bound (theorem 22 therein) under the requirements that the upper bound of the $\ell_1$-norm of the coefficients be sufficiently large and that the signal x be mapped onto the unit sphere. Our result on k-dimensional coding schemes can also be applied to dictionary learning and provides a more general bound, which requires neither x to lie on the unit sphere nor the near-orthogonality condition, and directly applies to all dictionary learning problems.
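For intuition about the p-atom ($\ell_0$-constrained) setting, the following sketch approximates a signal with at most p dictionary atoms via greedy orthogonal matching pursuit. The function names are illustrative, and this is a generic greedy scheme, not the specific algorithm analyzed by Vainsencher et al. (2011):

```python
import numpy as np

def omp(x, D, p):
    """Greedy orthogonal matching pursuit: approximate x by at most p columns of D."""
    residual, support = x.copy(), []
    for _ in range(p):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # re-fit the coefficients on the current support by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    y = np.zeros(D.shape[1])
    y[support] = coef
    return y

rng = np.random.default_rng(0)
m, k, p = 10, 25, 3
D = rng.standard_normal((m, k))
D /= np.linalg.norm(D, axis=0)                # normalize the atoms
x = rng.standard_normal(m)

y = omp(x, D, p)
assert np.count_nonzero(y) <= p               # at most p atoms are used
err_p = np.sum((x - D @ y) ** 2)
err_1 = np.sum((x - D @ omp(x, D, 1)) ** 2)
assert err_p <= err_1 + 1e-12                 # allowing more atoms cannot hurt
```

The final assertion holds because both runs select the same first atom and the p-atom fit re-solves least squares over a superset of the single-atom support.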
The method Vainsencher et al. (2011) used to upper-bound the covering number of the induced loss function class is very different from ours. To upper-bound this covering number for dictionary learning, they used the fact that a uniformly L-Lipschitz mapping between metric spaces converts $\epsilon$-covers into $L\epsilon$-covers. They then focused on analyzing the Lipschitz property of the mapping that sends a dictionary to its reconstruction error, as shown in their lemma 7. Also note that to upper-bound the Lipschitz constant of this mapping, they introduced the approximate orthogonality condition (a bound on the Babel function) on the dictionary.
Analyzing the Lipschitz properties of the induced loss functions is essential for upper-bounding the generalization error of k-dimensional coding schemes. Different from the method used by Vainsencher et al. (2011), Maurer and Pontil (2010) employed Slepian's lemma to exploit the Lipschitz property, while in this letter we propose a novel method, presented in the proof of theorem 8.
The comparison of the generalization bounds for dictionary learning is similar to that for NMF because NMF can be regarded as dictionary learning in the positive orthant; we therefore omit it. Many algorithms used in applications require sparsity in Y because sparsity has advantages for computation and storage. We analyze sparsity in section 4.3.
4.3 Sparse Coding
We now consider the generalization bound of sparse coding using our method. The following lemma is proved in section 5.7.
Then we have the generalization bounds for sparse coding as follows:
We then compare the above generalization bounds of sparse coding in Figure 2. The comparisons show that the derived generalization bound is tighter than state-of-the-art bounds.
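For intuition, the empirical reconstruction error of $\ell_1$-regularized sparse coding with a fixed dictionary can be computed with iterative soft thresholding (ISTA). The sketch below is illustrative only and is not tied to the bounds above; it checks the expected trade-off between sparsity and reconstruction error as the regularization weight grows:

```python
import numpy as np

def ista(x, D, lam, iters=500):
    """Iterative soft thresholding for min_y 0.5 * ||x - D y||^2 + lam * ||y||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    y = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ y - x)
        z = y - grad / L                   # gradient step
        y = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return y

rng = np.random.default_rng(0)
m, k = 15, 40
D = rng.standard_normal((m, k))
D /= np.linalg.norm(D, axis=0)             # normalize the atoms
x = rng.standard_normal(m)

y_sparse = ista(x, D, lam=0.5)
y_dense = ista(x, D, lam=0.001)
# heavier l1 regularization yields sparser codes at a higher reconstruction error
assert np.count_nonzero(np.abs(y_sparse) > 1e-8) <= np.count_nonzero(np.abs(y_dense) > 1e-8)
assert np.sum((x - D @ y_sparse) ** 2) >= np.sum((x - D @ y_dense) ** 2) - 1e-9
```

The reconstruction error appearing in the bounds is the quantity $\|x - Dy\|^2$ computed here at the regularized code.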
4.4 Vector Quantization and k-Means Clustering
The following lemma is essential for proving our dimensionality-dependent generalization bounds:
The bound in equation 4.12 has the same order as the bound obtained by Biau et al. (2008). If k is larger than m, our bound can be tighter than that of Maurer and Pontil (2010) and the result in Biau et al. (2008). The generalization bounds derived by Maurer and Pontil (2010) and Biau et al. (2008) also have the advantage of converging faster. As discussed in Bartlett et al. (1998), Linder et al. (1994), and Devroye, Györfi, and Lugosi (1996), the extra factor in theorem 30 can be removed by the sophisticated uniform large-deviation inequalities of Alexander (1984) or Talagrand (1994). However, Devroye et al. (1996) proved (in their theorem 12.10) that the resulting fast-convergence upper bound has an astronomically large constant; the corresponding bound is therefore loose. Our generalization bound, which is derived by exploiting Bennett's inequality, will be tighter if the empirical reconstruction error is small.
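The empirical reconstruction error that drives the Bennett-type bound can be computed directly for k-means clustering. The sketch below (illustrative, with hypothetical helper names) fits centers with plain Lloyd iterations and compares the training error to a held-out estimate of the expected error:

```python
import numpy as np

def assign(X, centers):
    """Nearest-center labels and squared distances for each row of X."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1), d.min(1)

def lloyd(X, centers, iters=50):
    """Plain Lloyd's algorithm started from the given centers."""
    centers = centers.copy()
    for _ in range(iters):
        labels, _ = assign(X, centers)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

rng = np.random.default_rng(0)
train = rng.standard_normal((200, 2))
test = rng.standard_normal((5000, 2))   # fresh sample: proxy for the expected error

init = train[rng.choice(len(train), size=4, replace=False)]
centers = lloyd(train, init)

train_err = assign(train, centers)[1].mean()
test_err = assign(test, centers)[1].mean()
init_err = assign(train, init)[1].mean()

assert train_err <= init_err            # Lloyd never increases the training error
print(f"train error={train_err:.3f}, held-out error={test_err:.3f}")
```

The gap between the training and held-out errors is the generalization error that the bounds in this section control, and a small training error is exactly the regime in which the Bennett-based bound is tight.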
We compare the above generalization bounds of k-means clustering and vector quantization in Figure 3. For k-means clustering and vector quantization problems, the dimensionality m can be independent of the reduced dimensionality k. Figure 3 shows that when k is not very large, the derived bound is tighter than state-of-the-art generalization bounds.
5.1 Concentration Inequalities
In this section, we introduce the concentration inequalities that will be used to prove our assertions.
We first present Hoeffding’s inequality (Hoeffding, 1963), which is widely used for deriving generalization bounds.
5.2 Proof of Lemma 7
5.3 Proof of Theorem 8
We first prove the following theorem, which is useful to prove theorem 8.
5.4 Proof of Theorem 12
According to Bernstein’s inequality, we have the following theorem, which is useful to prove theorem 12.
5.5 Proof of Theorem 15
The following theorem, derived by exploiting Bennett’s inequality, is essential to prove theorem 15.
Theorem 37 can easily be proved by using Bernstein's inequality. However, to show the faster convergence property, we propose a new proof method, which needs the following lemma:
Thus, there are many pairs satisfying the first part of lemma 38.
5.6 Proof of Lemma 19
5.7 Proof of Lemma 27
5.8 Proof of Lemma 29
We have proposed a method to analyze the dimensionality-dependent generalization bounds for k-dimensional coding schemes, which give an abstract and general description of a set of methods that encode random vectors in a Hilbert space. Several specific forms of k-dimensional coding schemes, including NMF, dictionary learning, sparse coding, k-means clustering, and vector quantization, have achieved great success in pattern recognition and machine learning.
Our proof approach is based on an upper bound for the covering number of the loss function class induced by the reconstruction error. We explained that the covering number is more suitable for deriving dimensionality-dependent generalization bounds for k-dimensional coding schemes because it avoids the worst-case dependency on the number k of columns of the linear implementation. If k is larger than the dimensionality m, our bound can be much tighter than the dimensionality-independent generalization bound. Moreover, by exploiting Bennett's inequality, we derived a dimensionality-dependent generalization bound of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $\lambda_n > 0.5$ when the sample size $n$ is finite, for k-dimensional coding schemes. Our method therefore provides state-of-the-art dimensionality-dependent generalization bounds for NMF, dictionary learning, sparse coding, k-means clustering, and vector quantization.
T.L. and D.T. were supported in part by Australian Research Council Projects DP-140102164 and FT-130101457. D.X. was partially supported by funding from the Faculty of Engineering and Information Technologies, University of Sydney, under the Faculty Research Cluster Program.