Learning an appropriate (dis)similarity function from the available data is a central problem in machine learning, since the success of many machine learning algorithms critically depends on the choice of a similarity function to compare examples. Although many approaches to similarity metric learning have been proposed, there has been little theoretical study of the links between similarity metric learning and the classification performance of the resulting classifier. In this letter, we propose a regularized similarity learning formulation associated with general matrix norms and establish its generalization bounds. We show that the generalization error of the resulting linear classifier can be bounded by the derived generalization bound of similarity learning. This shows that good generalization of the learned similarity function guarantees good classification performance of the resulting linear classifier. Our results extend and improve those obtained by Bellet, Habrard, and Sebban (2012). Because their techniques depend on the notion of uniform stability (Bousquet & Elisseeff, 2002), the bound obtained there holds true only for Frobenius matrix-norm regularization. Our techniques, using the Rademacher complexity (Bartlett & Mendelson, 2002) and a related Khinchin-type inequality, enable us to establish bounds for regularized similarity learning formulations associated with general matrix norms, including the sparse L1-norm and the mixed (2,1)-norm.
The success of many machine learning algorithms depends heavily on how the similarity or distance metric between examples is specified. For instance, the k-nearest neighbor (k-NN) classifier depends on a distance (dissimilarity) function to identify the nearest neighbors for classification. Most information retrieval methods rely on a similarity function to identify the data points most similar to a given query. Kernel methods rely on the kernel function to represent the similarity between examples. Hence, learning an appropriate (dis)similarity function from the available data is a central problem in machine learning, which we refer to as similarity metric learning throughout this letter.
Recently, considerable research effort has been devoted to similarity metric learning, and many methods have been proposed. They can be broadly divided into two main categories. The first category consists of one-stage approaches, which learn the similarity (kernel) function and the classifier together. Multiple kernel learning (Lanckriet, Cristianini, Bartlett, El Ghaoui, & Jordan, 2004; Varma & Babu, 2009) is a notable one-stage approach that aims to learn an optimal kernel combination from a prescribed set of positive semidefinite (PSD) kernels. Another exemplary one-stage approach is indefinite kernel learning, which is motivated by the fact that in many applications, potential kernel matrices may fail to be positive semidefinite. Such cases include hyperbolic tangent kernels (Smola, Óvári, & Williamson, 2001) and the protein sequence similarity measures derived from Smith-Waterman and BLAST scores (Saigo, Vert, Ueda, & Akutsu, 2004). Indefinite kernel learning (Chen, Garcia, Gupta, Rahimi, & Cazzanti, 2009; Ying, Campbell, & Girolami, 2009) aims to learn a PSD kernel matrix from a prescribed indefinite kernel matrix and is mostly restricted to the transductive setting. Other methods (Wu & Zhou, 2005; Wu, 2013) have analyzed regularization networks such as ridge regression and the SVM with a prescribed indefinite kernel, instead of aiming to learn an indefinite kernel function from data. The generalization analysis for such one-stage methods has been studied (see, e.g., Chen et al., 2009; Cortes, Mohri, & Rostamizadeh, 2010a; Ying & Campbell, 2009).
The second category of similarity metric learning consists of two-stage methods, in which learning the similarity function and training the classifier are separate processes. One exemplary two-stage approach, referred to as metric learning (Bar-Hillel, Hertz, Shental, & Weinshall, 2005; Davis, Kulis, Jain, Sra, & Dhillon, 2007; Hoi, Liu, Lyu, & Ma, 2006; Jin, Wang, & Zhou, 2009; Weinberger, Blitzer, & Saul, 2005; Xing, Jordan, Russell, & Ng, 2002; Ying, Huang, & Campbell, 2009), often focuses on learning a Mahalanobis distance metric defined, for any x, x′ ∈ X, by d_M(x, x′) = ((x − x′)^T M (x − x′))^{1/2}. Here, M is a positive semidefinite (PSD) matrix. Another example of such methods (Chechik, Sharma, Shalit, & Bengio, 2010; Maurer, 2008) is bilinear similarity learning, which focuses on learning a similarity function defined, for any x, x′ ∈ X, by K_M(x, x′) = x^T M x′, with M being a PSD matrix. These methods are mainly motivated by the natural intuition that the similarity score between examples in the same class should be larger than that between examples from distinct classes. The k-NN classification using the similarity metric learned by these methods was empirically shown to achieve better accuracy than that using the standard Euclidean distance.
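To make these two parameterizations concrete, the following minimal NumPy sketch (the function names and the toy data are ours, not from the cited works) evaluates a Mahalanobis distance and a bilinear similarity score for a PSD matrix M:

```python
import numpy as np

def mahalanobis_dist(x, x_prime, M):
    """Mahalanobis distance ((x - x')^T M (x - x'))^{1/2} for a PSD matrix M."""
    diff = x - x_prime
    return float(np.sqrt(diff @ M @ diff))

def bilinear_similarity(x, x_prime, M):
    """Bilinear similarity score x^T M x' parameterized by the matrix M."""
    return float(x @ M @ x_prime)

# Toy usage with a random PSD matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
M = B @ B.T                                   # PSD by construction
x, x_prime = rng.standard_normal(3), rng.standard_normal(3)
print(mahalanobis_dist(x, x_prime, M), bilinear_similarity(x, x_prime, M))
```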
Although many two-stage approaches for similarity metric learning have been proposed, in contrast to the one-stage methods, there is relatively little theoretical work on whether similarity-based learning guarantees good generalization of the resulting classification. Generalization bounds were recently established for metric and similarity learning (Cao, Guo, & Ying, 2012; Jin et al., 2009; Maurer, 2008) under different statistical assumptions on the data. However, such bounds do not by themselves explain the empirical success described above: it is not clear whether good generalization bounds for metric and similarity learning (Cao et al., 2012; Jin et al., 2009) lead to good classification by the resulting k-NN classifiers. Recently, Bellet, Habrard, and Sebban (2012) proposed a regularized similarity learning approach, which is mainly motivated by the (ε, γ, τ)-good similarity functions introduced in Balcan and Blum (2006) and Balcan, Blum, and Srebro (2008). In particular, they showed that the proposed similarity learning can theoretically guarantee good generalization for classification. However, because their techniques depend on the notion of uniform stability (Bousquet & Elisseeff, 2002), their generalization bounds hold true only for strongly convex matrix-norm regularization (e.g., the Frobenius norm).
In this letter, we consider a new similarity learning formulation associated with general matrix-norm regularization terms. Its generalization bounds are established for various matrix regularizations, including the Frobenius norm, the sparse L1-norm, and the mixed (2,1)-norm (see the definitions below). The learned similarity matrix is used to design a sparse classification algorithm, and we prove that the generalization error of the resulting linear classifier can be bounded by the derived generalization bound for similarity learning. This implies that the proposed similarity learning with general matrix-norm regularization guarantees good generalization for classification. Our techniques, using the Rademacher complexity (Bartlett & Mendelson, 2002) and a Khinchin-type inequality for Rademacher variables, enable us to derive bounds for general matrix-norm regularization, including the sparse L1-norm and mixed (2,1)-norm regularization.
The remainder of this letter is organized as follows. In section 2, we propose the similarity learning formulations with general matrix-norm regularization terms and state the main theorems. In particular, the results are illustrated with various examples. Related work is discussed in section 3. The generalization bounds for similarity learning are established in section 4. In section 5, we develop a theoretical link between the generalization bounds of the proposed similarity learning method and the generalization error of the linear classifier built from the learned similarity function. Section 6 estimates the Rademacher averages and gives the proofs for the examples in section 2. Section 7 summarizes this letter and points to some possible directions for future research.
2. Regularization Formulation and Main Results
In this section, we introduce the regularized formulation of similarity learning and state our main results. Before doing so, we introduce some notation and present some background material.
Denote, for any n ∈ ℕ, ℕ_n = {1, 2, ..., n}. Let z = {(x_i, y_i) : i ∈ ℕ_n} be a set of training samples, which is drawn identically and independently from a distribution ρ on Z = X × Y. Here, the input space X is a domain in ℝ^d, and Y = {−1, 1} is called the output space. Let S^d denote the set of d × d symmetric matrices. For any A ∈ S^d, we consider K_A(x, x′) = x^T A x′ as a bilinear similarity score parameterized by the symmetric matrix A. The symmetry of the matrix A guarantees the symmetry of the similarity score K_A, that is, K_A(x, x′) = K_A(x′, x) for any x, x′ ∈ X.
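To fix ideas, the following sketch evaluates the bilinear score K_A and a regularized empirical objective of the general type studied in this letter: a hinge-type goodness term averaged over the sample plus a matrix-norm penalty. It is only a schematic stand-in for formulation 2.2; the margin parameter gamma, the regularization weight lam, the normalization, and the choice of norm are placeholder assumptions rather than the letter's exact constants.

```python
import numpy as np

def similarity_score(A, x, x_prime):
    """Bilinear similarity K_A(x, x') = x^T A x' with a symmetric matrix A."""
    return float(x @ A @ x_prime)

def empirical_objective(A, X, y, gamma=1.0, lam=0.1, norm="fro"):
    """Hinge-type empirical goodness of K_A plus a matrix-norm penalty.

    This is a schematic instance of a regularized similarity learning
    objective; formulation 2.2 may normalize the terms differently.
    """
    n = len(y)
    scores = X @ A @ X.T                       # scores[i, j] = K_A(x_i, x_j)
    margins = y * (scores @ y) / (gamma * n)   # average signed similarity per point
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    if norm == "fro":
        penalty = np.linalg.norm(A, "fro")
    elif norm == "l1":
        penalty = np.abs(A).sum()              # sparse entrywise L1-norm
    else:                                      # mixed (2,1)-norm: sum of row norms
        penalty = np.linalg.norm(A, axis=1).sum()
    return hinge + lam * penalty

# Toy usage: a random symmetric matrix and a small synthetic sample.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
print(empirical_objective(np.eye(4), X, y, norm="l1"))
```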
The proof of theorem 1 is given in section 4. Recently, some studies (Cao et al., 2012; Kar & Jain, 2011; Kumar, Niculescu-Mizil, Kavukcuoglu, & Daumé, 2012) have given generalization bounds for similarity (kernel) learning in which the involved empirical term is a U-statistic. Hence, the natural idea in those studies is to exploit properties of U-statistics (Clémençon, Lugosi, & Vayatis, 2008; de la Peña & Giné, 1999) to analyze the related similarity learning formulations. The proof of theorem 1 differs from these earlier approaches (Cao et al., 2012; Kar & Jain, 2011) since the empirical term, equation 2.1, in formulation 2.2 is not a U-statistic (see section 3 for more discussion on this topic).
Now we are in a position to state the relationship between the similarity learning and the generalization error of the linear classifier:
The proof for theorem 2 will be established in section 5.
Theorems 1 and 2 depend critically on two terms: the constant and the Rademacher average appearing there. Below, we list the estimations of these two terms associated with different matrix norms. For any vector x ∈ ℝ^d, denote
Consider the sparse L1-norm defined, for any A ∈ S^d, by ‖A‖_1 = Σ_{k,l} |A_{kl}|. Let A_z and f_z be defined, respectively, by equations 2.2 and 2.6. Then we have the following results:
For any vector x ∈ ℝ^d, let ‖x‖ = (Σ_{k=1}^{d} |x_k|²)^{1/2} be the standard Euclidean norm. For the regularized similarity learning formulation with Frobenius matrix-norm regularization, we have the following result:
Consider the Frobenius matrix norm defined, for any A ∈ S^d, by ‖A‖_F = (Σ_{k,l} |A_{kl}|²)^{1/2}. Let A_z and f_z be defined by equations 2.2 and 2.6, respectively. Then we have the following estimation:
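For reference, the matrix norms discussed in this letter (the sparse entrywise L1-norm, the Frobenius norm, the mixed (2,1)-norm, and the trace norm) can be computed as in the following sketch; the row-wise convention used for the (2,1)-norm is one common choice and may differ from the exact definition used here.

```python
import numpy as np

def l1_norm(A):
    """Sparse entrywise L1-norm: sum of absolute values of all entries."""
    return np.abs(A).sum()

def frobenius_norm(A):
    """Frobenius norm: square root of the sum of squared entries."""
    return np.sqrt((A ** 2).sum())

def mixed_21_norm(A):
    """Mixed (2,1)-norm: sum of the Euclidean norms of the rows.

    Note: some works sum over columns instead; this row-wise convention is
    one common choice and may differ from the definition used in the letter.
    """
    return np.linalg.norm(A, axis=1).sum()

def trace_norm(A):
    """Trace (nuclear) norm: sum of the singular values of A."""
    return np.linalg.svd(A, compute_uv=False).sum()
```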
We end this section with two remarks. First, theorem 2 and the examples show that a good similarity (i.e., a small generalization error for similarity learning) guarantees good classification (i.e., a small generalization error for the resulting linear classifier). Second, the bounds in example 2 are consistent with those in Bellet et al. (2012).
3. Related Work
In this section, we discuss studies on similarity metric learning that are related to our work.
Balcan et al. (2008) developed a theory of (ε, γ, τ)-good similarity functions, defined as follows, which investigates the theoretical relationship between the properties of a similarity function and its performance in linear classification:
A similarity function K is an (ε, γ, τ)-good similarity function in hinge loss for a learning problem P if there exists a random indicator function R(x) defining a probabilistic set of “reasonable points” such that the following conditions hold: first, E_{(x,y)∼P}[(1 − y g(x)/γ)_+] ≤ ε, where g(x) = E_{(x′,y′)∼P}[y′ K(x, x′) | R(x′)] is the expected signed similarity of x to a random reasonable point; second, Pr_{x′}[R(x′)] ≥ τ.
The first condition can be interpreted as “a 1 − ε proportion of points x are on average more similar to random reasonable points of the same class than to random reasonable points of the distinct classes,” and the second condition as “at least a τ proportion of the points should be reasonable.” The following theorem implies that, given an (ε, γ, τ)-good similarity function and enough landmarks, there exists a linear separator with error arbitrarily close to ε.
Let K be an (ε, γ, τ)-good similarity function in hinge loss for a learning problem P. For any ε₁ > 0 and δ > 0, let S = {x′_1, ..., x′_m} be a potentially unlabeled sample of m landmarks drawn from P, with m sufficiently large (depending on γ, ε₁, and δ). Consider the mapping φ(x) = (K(x, x′_1), ..., K(x, x′_m)). Then, with probability at least 1 − δ over the random sample S, the induced distribution φ(P) in ℝ^m has a linear separator of error at most ε + ε₁ at margin γ.
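The construction in this theorem amounts to a simple two-step procedure: map every example to its vector of similarities to the landmarks, and then train a linear separator on that representation. The sketch below is our own illustration (with an identity matrix as a placeholder for a learned similarity matrix and an arbitrary number of landmarks), not the algorithm analyzed above:

```python
import numpy as np

def landmark_features(K, X, landmarks):
    """Map each x to phi(x) = (K(x, x'_1), ..., K(x, x'_m)) over the landmarks."""
    return np.array([[K(x, l) for l in landmarks] for x in X])

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.where(X[:, 0] + 0.1 * rng.standard_normal(200) > 0, 1, -1)

A = np.eye(5)                                  # placeholder for a learned similarity matrix
K = lambda u, v: float(u @ A @ v)              # bilinear similarity K_A

landmarks = X[rng.choice(len(X), size=20, replace=False)]   # unlabeled landmark sample
Phi = landmark_features(K, X, landmarks)
# A linear separator (e.g., an L1-regularized linear SVM, as in Balcan et al., 2008)
# is then trained on the induced representation Phi with the labels y.
```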
The recent work by Bellet et al. (2012) is the most closely related to ours. Specifically, they considered the similarity learning formulation (see equation 2.2) with Frobenius norm regularization. Their generalization bounds for similarity learning were derived using uniform stability arguments (Bousquet & Elisseeff, 2002), which cannot handle, for instance, the L1-norm and (2,1)-norm regularization terms. In addition, their results on the relationship between similarity learning and the classification performance of the learned matrix were quoted from Balcan et al. (2008) and hence require two separate sets of samples to train the classifier.
Kar and Jain (2011, 2012) introduced an extended framework of Balcan and Blum (2006) and Balcan et al. (2008) in the general setting of supervised learning. The authors proposed a general goodness criterion for similarity functions, which can handle general supervised learning tasks and also subsumes the goodness condition of Balcan et al. (2008). There, efficient algorithms were constructed with provable generalization error bounds. The main distinction between their work and ours is that we aim to learn a similarity function, whereas in their work the similarity function is given in advance.
4. Generalization Bounds for Similarity Learning
Proof of theorem 1. Our proof is divided into two steps.
Now we are in a position to estimate the first term in the expectation form on the right-hand side of equation 4.1 by standard symmetrization techniques.
5. Guaranteed Classification via Good Similarity
In this section, we investigate the theoretical relationship between the generalization error of the similarity learning and that of the linear classifier built from the learned similarity metric. In particular, we show that the generalization error of the similarity learning gives an upper bound for the generalization error of the linear classifier, as stated in theorem 2 of section 2.
Now we are in a position to give the proof of theorem 2:
6. Estimating Rademacher Averages
Theorems 1, 2, and 4 critically depend on the estimation of the Rademacher average defined by equation 2.4. In this section, we establish a self-contained proof of this estimation and prove the examples listed in section 2. For notational simplicity, we denote by x_{ik} the kth variable of the ith sample x_i.
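Although this section bounds the Rademacher average analytically, it can be useful, as a sanity check, to estimate an empirical Rademacher average numerically. The sketch below does so by Monte Carlo for a finite set of candidate functions evaluated on the sample (for example, bilinear scores with matrices drawn from the unit ball of a chosen norm); it is a generic illustration and not the specific quantity defined by equation 2.4.

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ].

    values is an (m, n) array holding m candidate functions evaluated at the
    n sample points; the supremum is taken over this finite candidate set.
    """
    rng = np.random.default_rng(seed)
    m, n = values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher variables
        total += np.max(values @ sigma) / n       # sup over the candidate functions
    return total / n_draws
```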
We now turn our attention to the similarity learning formulation, equation 2.2, with the Frobenius norm regularization:
The above generalization bound for the similarity learning formulation, equation 2.2, with Frobenius norm regularization is consistent with the one given in Bellet et al. (2012), where the result holds true under an additional assumption. Next, we provide the estimations for the mixed (2,1)-norm and the trace norm, respectively:
Consider the similarity learning formulation, equation 2.2, with the mixed (2,1)-norm regularization. Then we have the following estimation:
Hence, the estimation, equation 6.7, is optimal up to a constant factor. Furthermore, ignoring further refinements, the above estimations imply that the estimation of the Rademacher average in the case of trace-norm regularization is the same as the estimation, equation 6.5, for Frobenius norm regularization. Consequently, the generalization bounds for similarity learning and the relationship between similarity learning and the linear SVM are the same as those stated in example 2. It is a bit disappointing that there is no improvement when using the trace norm. A possible reason is that the spectral norm and the Frobenius norm of B coincide whenever B takes the rank-one form B = xy^T for some vectors x and y.
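This rank-one observation is easy to verify numerically; the following quick check uses arbitrary vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(6), rng.standard_normal(6)
B = np.outer(x, y)                       # rank-one matrix B = x y^T
spectral = np.linalg.norm(B, 2)          # largest singular value of B
frobenius = np.linalg.norm(B, "fro")
print(np.isclose(spectral, frobenius))   # True: for rank-one B the two norms coincide
```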
We now comment on an alternative way to estimate the Rademacher average. Kakade et al. (2008, 2012) developed elegant techniques for estimating Rademacher averages of linear predictors. In particular, the following theorem was established:
However, for the case of trace-norm regularization, one would expect, using the techniques in Kakade et al. (2008, 2012), that the estimation of the Rademacher average would be the same as that for the sparse L1-norm. The main hurdle to such a result is the estimation of the relevant term by the trace norm of A. Indeed, by the discussion following our estimation, equation 6.7, which directly uses a Khinchin-type inequality, we know that our estimation is optimal. Hence, one cannot expect the estimation in the case of trace-norm regularization to be the same as that for sparse L1-norm regularization in our particular similarity learning formulation, equation 2.2.
We end this section with an open question. It is not clear to us how to establish a generic result for estimating the Rademacher average given by equation 6.8. Such a generic result would be analogous to theorem 5 above, which was established by Kakade et al. (2008, 2012). Its main advantage would be a unified estimation of the Rademacher average for different matrix norms, which could then be instantiated to recover examples 1, 2, and 3.
In this letter, we have considered a regularized similarity learning formulation, equation 2.2. Its generalization bounds were established for various matrix-norm regularization terms, such as the Frobenius norm, the sparse L1-norm, and the mixed (2,1)-norm. We proved that the generalization error of the linear classifier based on the learned similarity function can be bounded by the derived generalization bound for similarity learning. This shows that good generalization of similarity learning (see equation 2.2) with general matrix-norm regularization guarantees good classification generalization of the resulting linear classifier. Our techniques, based on the Rademacher complexity (Bartlett & Mendelson, 2002) and a Khinchin-type inequality for Rademacher variables, allow us to obtain new bounds for similarity learning with general matrix-norm regularization terms.
There are several possible directions for future work. First, we may consider similarity learning algorithms with general loss functions; it is expected that better results could be obtained under some convexity conditions on the loss functions. Second, when considering classification problems, one usually focuses on the excess misclassification error. Hence, in the future, we would like to study the theoretical link between the generalization bounds of the similarity learning and the excess misclassification error of the classifier built from the learned similarity function.
We need the following contraction property of the Rademacher averages, which is essentially implied by theorem 4.12 in Ledoux and Talagrand (1991; see also Bartlett & Mendelson, 2002; Koltchinskii & Panchenko, 2002).
Another important property of the Rademacher average, which is used in the proof of the generalization bounds of the similarity learning, is the following Khinchin-type inequality (see de la Peña & Giné, 1999, theorem 3.2.2):
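For orientation, the scalar prototype of such Khinchin-type inequalities (a standard fact; the vector-valued version invoked from de la Peña & Giné, 1999, theorem 3.2.2, is more general) reads as follows:

```latex
% Classical Khinchin inequality for i.i.d. Rademacher variables sigma_1, ..., sigma_n:
% for every 0 < p < infinity there exist constants A_p, B_p > 0 such that, for all real a_i,
A_p \Big( \sum_{i=1}^{n} a_i^2 \Big)^{1/2}
  \;\le\;
\Big( \mathbb{E}\,\Big| \sum_{i=1}^{n} \sigma_i a_i \Big|^{p} \Big)^{1/p}
  \;\le\;
B_p \Big( \sum_{i=1}^{n} a_i^2 \Big)^{1/2}.
```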
We are grateful to the referees for their invaluable comments and suggestions on this letter. This work was supported by the EPSRC under grant EP/J001384/1.
Z.-C. Guo is now at the Department of Mathematics, Zhejiang University, Hangzhou 310027, China.