Abstract

We propose a general information-theoretic approach to semi-supervised metric learning called SERAPH (SEmi-supervised metRic leArning Paradigm with Hypersparsity) that does not rely on the manifold assumption. Given the probability parameterized by a Mahalanobis distance, we maximize its entropy on labeled data and minimize its entropy on unlabeled data following entropy regularization. For metric learning, entropy regularization improves manifold regularization by considering the dissimilarity information of unlabeled data in the unsupervised part, and hence it allows the supervised and unsupervised parts to be integrated in a natural and meaningful way. Moreover, we regularize SERAPH by trace-norm regularization to encourage low-dimensional projections associated with the distance metric. The nonconvex optimization problem of SERAPH could be solved efficiently and stably by either a gradient projection algorithm or an EM-like iterative algorithm whose M-step is convex. Experiments demonstrate that SERAPH compares favorably with many well-known metric learning methods, and the learned Mahalanobis distance possesses high discriminability even under noisy environments.

1.  Introduction

How to learn a good distance metric for the input data domain is a crucial issue for many distance-based learning algorithms. The goal of metric learning is to find a new metric under which “similar” data are close and “dissimilar” data are far apart (Xing, Ng, Jordan, & Russell, 2003). The great majority of metric learning methods developed in the last decade fall into three types:

  1. Supervised type requiring class labels (Chiaromonte & Cook, 2002; Sugiyama, 2007; Fukumizu, Bach, & Jordan, 2009).1 Two data points with the same label are regarded as similar, and those with different labels are regarded as dissimilar.

  2. Supervised type requiring weak labels, that is, ±1-valued labels that indicate the similarity and dissimilarity of data pairs directly (Xing, Ng, Jordan, & Russell, 2003; Goldberger, Roweis, Hinton, & Salakhutdinov, 2005; Weinberger, Blitzer, & Saul, 2006; Globerson & Roweis, 2006; Torresani & Lee, 2007; Davis, Kulis, Jain, Sra, & Dhillon, 2007). See Figure 1.

  3. Unsupervised type that requires no label information (Roweis & Saul, 2000; Tenenbaum, de Silva, & Langford, 2000; Belkin & Niyogi, 2002). Unlike previous types, the similarity and dissimilarity here are extracted from data instead of being given as supervision.

Figure 1:

Illustration of supervised metric learning based on weak labels. In this figure, there are three classes, each with two labeled data. The goal is to find a new metric under which data in the same class are close and data from different classes are far apart. Note that the original class labels will not be revealed to metric learning algorithms, and we show the projected data here, since the Mahalanobis distance of the original data equals the Euclidean distance of the projected data.


The second type has been extensively studied, since weak labels are much cheaper than class labels when the number of classes is fairly large. That said, supervised metric learning based on weak labels still has a strict limitation. Algorithms of this type need each data point to be involved in at least one weak label; otherwise, they cannot make use of that data point at all, as if it did not exist. This limitation is often problematic for real-world applications and needs to be fixed.

Based on the belief that preserving the geometric structure of all labeled and unlabeled data in an unsupervised manner can be better than strongly relying on the limited labeled data, semi-supervised metric learning has emerged. To the best of our knowledge, all previous semi-supervised methods that extend types 1 and 2 employ off-the-shelf unsupervised techniques in type 3—for example:

  • Principal component analysis (Yang, Jin, Sukthankar, & Liu, 2006; Sugiyama, Idé, Nakajima, & Sese, 2010);

  • Manifold regularization or embedding (Hoi, Liu, & Chang, 2008; Baghshah & Shouraki, 2009; Zha, Mei, Wang, Wang, & Hua, 2009; Liu, Ma, Tao, Liu, & Liu, 2010).

More specifically, they rely on the manifold assumption and implement the following:

  • If two data points are near each other under the original metric, pull them so that they are not far apart under the new metric.

  • If two data points are far from each other under the original metric, do nothing.

In the second case, we should not push the two data points farther apart under the new metric, since they may be connected by the data manifold and should be close together under the new metric even though they were originally far apart. By implementing these two cases, those semi-supervised methods successfully extract the similarity information of unlabeled data.

However, there remain two issues. First, the methods ignore the dissimilarity information of unlabeled data. This can be a huge waste of information, since most unlabeled data pairs would be dissimilar if the number of underlying classes is large and the classes are balanced. To this end, an appealing semi-supervised metric learning method should be able to make use of the dissimilarity information of unlabeled data. Second, the similarity of unlabeled data extracted by those methods is measured by closeness under the original metric, and it is inconsistent with the similarity of labeled data. Recall that metric learning aims at finding a new metric, and weak labels indicating similar but far apart data pairs are in principle the most informative ones. Therefore, under the original metric, closeness is not the reason for the similarity of labeled data, whereas it is the reason for the similarity of unlabeled data. In contrast, similarity and closeness generally imply each other for both labeled and unlabeled data under the new metric. To this end, an appealing method should focus on the new metric when extracting the similarity information of unlabeled data. In fact, the unsupervised parts of the existing methods that rely on the manifold assumption and implement the two cases (if two data points are close and if they are far apart) are inconsistent with their supervised parts in terms of these two issues. Simply putting them together works in practice, but this paradigm is conceptually neither natural nor unified.

In this letter, we propose a general information-theoretic approach to semi-supervised metric learning called SERAPH (SEmi-supervised metRic leArning Paradigm with Hypersparsity) in order to address these issues. It extracts not only the similarity information but also the dissimilarity information of unlabeled data, and to do so, it accesses the new metric rather than the original one. Our idea is to optimize a new Mahalanobis distance metric through optimizing a conditional probability parameterized by that metric. We maximize the entropy of this probability on labeled data pairs and minimize the entropy of this probability on unlabeled data pairs following entropy regularization (Grandvalet & Bengio, 2005), which can achieve the sparsity of the posterior distribution (Graça, Ganchev, Taskar, & Pereira, 2009; Gillenwater, Ganchev, Graça, Pereira, & Taskar, 2011); that is, unlabeled data pairs can be classified with high confidence. Furthermore, we employ mixed-norm regularization (Argyriou, Evgeniou, & Pontil, 2007) to encourage the sparsity of projection matrices associated with the new metric in terms of their singular values (Ying, Huang, & Campbell, 2009), so that the new metric can carry out dimensionality reduction implicitly and adaptively. Unifying the posterior sparsity and the projection sparsity yields what we call the hypersparsity. Thanks to this hypersparsity, the new metric learned by SERAPH possesses high discriminability even under noisy environments.

We make three contributions with this work. First, we formulate supervised metric learning based on weak labels as an instance of the generalized maximum entropy distribution estimation (Dudík & Schapire, 2006). Second, we propose an extension of this estimation to semi-supervised metric learning via entropy regularization (Grandvalet & Bengio, 2005). It considers the dissimilarity information of unlabeled data based on the Mahalanobis distance being learned. Third, we develop two ways to solve the nonconvex optimization problem involved in this extension: a direct gradient projection algorithm and an indirect EM-like iterative algorithm.

The rest of this letter is organized as follows. The SERAPH model is formulated in section 2, and then two algorithms are developed in section 3 to solve the optimization problem involved in the model. In section 4, we discuss three forms of sparsity and two additional justifications of the model. A comparison with related work is made in section 5. Experimental results are reported in section 6. In section 7, we offer two extensions to SERAPH. We give concluding remarks and future work in section 8.

2.  SERAPH, the Model

In this section, we first propose the supervised part of the SERAPH model as a generalized maximum entropy estimation for supervised metric learning based on weak labels and then introduce two additional regularization terms via entropy regularization and trace-norm regularization.

2.1.  Problem Setting.

Suppose that we have a training set that contains n points, each with m features. Let the set of similar data pairs be
formula
and the set of dissimilar data pairs be
formula
With some abuse of terminology, we refer to as the labeled data and
formula
as the unlabeled data. A weak label yi,j is assigned to (xi, xj) such that
formula
We abbreviate , and to , and , respectively, for simplicity. Consider learning a Mahalanobis distance metric for of the form
d_A(x_i, x_j) = sqrt((x_i − x_j)^T A (x_i − x_j))
2.1
where is the transpose operator and is a symmetric positive semi-definite matrix to be learned.2 The probability of labeling with is denoted by which is explicitly parameterized by the matrix A. When the pair comes from , is abbreviated as pAi,j(y).
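As a concrete reference point, here is a minimal Python/NumPy sketch of the Mahalanobis distance in equation 2.1, assuming the standard form with a symmetric positive semi-definite matrix A; the function name and the example values are illustrative only.

    import numpy as np

    def mahalanobis_distance(x_i, x_j, A):
        # d_A(x_i, x_j) = sqrt((x_i - x_j)^T A (x_i - x_j)) for a symmetric PSD matrix A.
        d = x_i - x_j
        return float(np.sqrt(d @ A @ d))

    # With A = I, the Mahalanobis distance reduces to the Euclidean distance.
    x_i, x_j = np.array([1.0, 2.0]), np.array([3.0, 1.0])
    assert np.isclose(mahalanobis_distance(x_i, x_j, np.eye(2)),
                      np.linalg.norm(x_i - x_j))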

2.2.  Basic Model.

We derive a probabilistic model to investigate the conditional probability of given . We use a parametric form of and will focus on this form because it is optimal in the following sense.

The maximum entropy principle (Jaynes, 1957; Berger, Pietra, & Pietra, 1996) suggests choosing in the probability distribution with the maximum entropy out of all probability distributions that match the data moments. Let3
formula
be the entropy of the conditional probability pAi,j(y), and
formula
be a feature function that is convex with respect to A. Then the constrained optimization problem is
formula
2.2
where is a slack variable and is a regularization parameter. After the introduction of , distributions are allowed to match two data moments in a way that is not strictly exact. The penalty in the objective function presumes the gaussian prior of the expected data moment,
formula
from the empirical data moment,
formula
which is consistent with the generalized maximum entropy principle (Dudík & Schapire, 2006). (See section 4.2 for an alternative explanation of optimization 2.2 in the sense of the generalized maximum entropy principle, particularly the need to introduce the slack variable from a theoretical point of view.)
Theorem 1.
The primal solution to optimization 2.2 can be given in terms of the dual solution by
formula
2.3
where
formula
is the partition function, and can be obtained by solving the dual problem
formula
2.4
Let take the form of in equation 2.3. Define the regularized log-likelihood function on labeled data (i.e., on observed weak labels) as
formula
Then for supervised metric learning, the regularized maximum log-likelihood estimation,
formula
and the generalized maximum entropy estimation 2.2 are equivalent.4
When considering the feature function as the building block of the expected and empirical data moments, we propose5
formula
2.5
where is a hyperparameter that serves as the threshold to separate the similar and dissimilar data pairs in and under the new metric . Now the probabilistic model 2.3 becomes
formula
For the optimal solution and reasonable , we hope for two properties:
  • The feature function can indicate the correctness of the observed weak labels, that is,
    formula
  • The probabilistic model can correctly classify the observed weak labels, that is,
    formula

Therefore, there must be .
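As an illustration only, the following sketch instantiates the probabilistic model 2.3 with a single global feature function of the assumed form f(x_i, x_j, y) = y(eta − d_A^2(x_i, x_j)); this matches the description of eta as a threshold separating similar and dissimilar pairs under the new metric, but it is not necessarily the exact expression of equation 2.5. With one feature function and binary y, the model reduces to a logistic function of the thresholded squared distance.

    import numpy as np

    def sq_mahalanobis(x_i, x_j, A):
        d = x_i - x_j
        return float(d @ A @ d)

    def feature(x_i, x_j, y, A, eta):
        # Assumed global feature function: f = y * (eta - d_A^2(x_i, x_j)),
        # where eta thresholds similar (y = +1) against dissimilar (y = -1) pairs.
        return y * (eta - sq_mahalanobis(x_i, x_j, A))

    def p_A(y, x_i, x_j, A, lam, eta):
        # Maximum entropy model 2.3 with one feature function: exp(lam * f(y)) / Z.
        scores = {s: np.exp(lam * feature(x_i, x_j, s, A, eta)) for s in (+1, -1)}
        return scores[y] / (scores[+1] + scores[-1])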

Note that the generalized maximum entropy estimation for supervised metric learning is a general framework; it is not limited to supervised metric learning based on weak labels. Although we use equation 2.5 as our feature function, other feature functions emphasizing different perspectives of the metric information are possible. For instance, a local distance metric feature function,
formula
replaces the global threshold with a local one and focuses on the changes of pairwise distances. In fact, optimization 2.2 can be applied to other problem settings such as multilabel metric learning with a global distance metric feature function,
formula
where the labels y and are binary-valued vectors.

2.3.  Regularization.

In this section, we extend defined above to semi-supervised metric learning via entropy regularization and further regularize it by trace-norm regularization.

Our unsupervised part extracts both the similarity and dissimilarity information of unlabeled data according to the new Mahalanobis distance metric . In order to do so, it follows the minimum entropy principle (Grandvalet & Bengio, 2005), and hence pAi,j(y) should have low entropy (which in turn means low uncertainty) for unlabeled data . Generally, the resulting discriminative probabilistic models prefer peaked distributions on unlabeled data, so that unlabeled data can be classified with high confidence; this amounts to a probabilistic low-density separation. Subsequently, according to Grandvalet and Bengio (2005), our optimization becomes
formula
where is a regularization parameter.
In addition, we would like the learned metric to have a dimensionality-reduction ability, which we encourage by favoring low-rank projection matrices associated with A. This is helpful when dealing with corrupted data or data distributed intrinsically in a low-dimensional subspace. It is known that the trace is a convex relaxation of the rank for positive semi-definite matrices, so we revise our optimization problem into
formula
2.6
where tr(A) is the trace of A and is a regularization parameter.

The optimization problem, equation 2.6, is the final model of SERAPH. We say that it is equipped with hypersparsity when both and are positive and hence both regularization terms are active. The hypersparsity, as well as the posterior and projection sparsity, will be discussed in section 4.1. Moreover, SERAPH possesses standard kernel and manifold extensions, which we explain in sections 7.1 and 7.2, respectively.
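The overall structure of optimization 2.6 can be sketched as follows. This is a hedged outline rather than the exact formula: it assembles the log-likelihood on labeled pairs (the penalty on the dual variable from the regularized log-likelihood is omitted for brevity), the entropy on unlabeled pairs weighted by the entropy-regularization parameter, and the trace penalty; the probabilistic model is passed in as a callable.

    import numpy as np

    def binary_entropy(p):
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

    def seraph_objective(A, labeled, unlabeled, p_pos, gamma, mu):
        # labeled:   list of (x_i, x_j, y) with observed weak labels y in {+1, -1}
        # unlabeled: list of (x_i, x_j) pairs without weak labels
        # p_pos:     callable (x_i, x_j, A) -> p_A(y = +1 | x_i, x_j)
        ll = 0.0                                       # supervised part: log-likelihood
        for x_i, x_j, y in labeled:
            p = np.clip(p_pos(x_i, x_j, A), 1e-12, 1.0 - 1e-12)
            ll += np.log(p if y == +1 else 1.0 - p)
        ent = sum(binary_entropy(p_pos(x_i, x_j, A))   # unsupervised part: entropy
                  for x_i, x_j in unlabeled)           # on unlabeled pairs
        return ll - gamma * ent - mu * np.trace(A)     # objective to be maximized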

3.  SERAPH, the Algorithm

In this section, we reduce optimization 2.6 to a form that is easy to handle, and develop two practical algorithms for solving the reduced optimization.

3.1.  Reduction.

While optimization 2.6 involves a dual variable , we would like to focus on the variable A just as many previous metric learning methods have. Theorem 2 guarantees that we can eliminate from equation 2.6 to get an equivalent but simpler optimization, thanks to the fact that we use a single feature function, equation 2.5, in optimization 2.2.

Theorem 2.
Define the reduced optimization problem as6
formula
3.1
where the reduced probabilistic model is
formula
3.2
Let be a locally optimal solution to equation 2.6. Then, there exist well-defined and , such that is also a locally optimal solution to optimization 3.1, and it satisfies the following:
  • parameterized by is equivalent to parameterized by , that is,
    formula
  • parameterized by and is identical to the original parameterized by , and , that is,
    formula
Remark 1.

After the reduction of theorem 2, has been dropped, and have been modified, but the regularization parameter remains the same, which means that the trade-off between the supervised and unsupervised parts has not been affected.

3.2.  Two Algorithms.

There are several approaches for solving optimization 3.1, for example, gradient projection and expectation maximization (Grandvalet & Bengio, 2006). Neither approach is uniformly better than the other for nonconvex optimizations. Hence, we explore both of them and find that they can solve equation 3.1 efficiently and stably.

Our first solver for equation 3.1 is a direct gradient projection algorithm (Polyak, 1967). The gradient matrix is simply
formula
3.5
The projection of the symmetric matrix resulting from a gradient update back onto the cone of symmetric positive semi-definite matrices consists of eigendecomposing that matrix and reassembling it from its positive eigenvalues and the associated eigenvectors. Although this algorithm is guaranteed to converge, many heuristic tricks are necessary in order to find a reasonable locally optimal solution to equation 3.1, since the unsupervised part is highly nonconvex.
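A minimal sketch of the projection step just described (the standard projection onto the positive semi-definite cone; step-size handling and the heuristic tricks mentioned above are omitted):

    import numpy as np

    def project_psd(M):
        # Project a symmetric matrix onto the cone of symmetric PSD matrices:
        # eigendecompose, keep the nonnegative eigenvalues, and reassemble.
        M = (M + M.T) / 2.0          # symmetrize against numerical noise
        w, V = np.linalg.eigh(M)
        w = np.maximum(w, 0.0)
        return (V * w) @ V.T

    # One gradient-projection step for the maximization problem 3.1 would then be
    #   A = project_psd(A + step_size * gradient(A))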
Our second solver for equation 3.1 is an indirect EM-like iterative algorithm. We first initialize a probability for each pair . The initial solution in our current implementation is , which means that at the start, we treat all unlabeled pairs as dissimilar. Then the M-step and the E-step are executed repeatedly until certain stopping conditions are satisfied. At the tth M-step, we find a new metric A(t) through a surrogate optimization,
formula
3.6
where is generated in the last E-step. Since the feature function is convex with respect to A, the objective function is concave with respect to A, and optimization 3.6 is convex according to Boyd and Vandenberghe (2004). Thus, we could solve optimization 3.6 using the gradient projection method without worrying about local maxima, where the gradient matrix is
formula
3.7
At the tth E-step, we update for each pair as
formula
3.8
where pAi,j(y) is parameterized by A(t) found in the last M-step. Although this algorithm may not converge, it works fairly well in practice. Regardless of how the M-step is designed, the algorithm is insensitive to the step size of the gradient update, and it gives a deterministic solution once the initial solution and the stopping conditions are fixed. In other words, the EM-like iterative algorithm can easily be derandomized by fixing the initial solution and the stopping conditions, which is a nice algorithmic property for nonconvex optimizations.
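The following skeleton summarizes the EM-like solver under the stated initialization; the function and parameter names are illustrative, the M-step solver and the probabilistic model are left abstract, and the E-step shown here is the plain posterior update, of which the actual update 3.8 is a tempered variant (see section 3.3).

    def em_like_seraph(labeled, unlabeled, A0, m_step, p_pos, T_EM=20):
        # m_step: callable (labeled, unlabeled, q, A) -> new A, maximizing the
        #         convex surrogate 3.6 by gradient projection.
        # p_pos:  callable (x_i, x_j, A) -> p_A(y = +1 | x_i, x_j).
        # Initialization: treat all unlabeled pairs as dissimilar, i.e. q(y = -1) = 1.
        q = [{+1: 0.0, -1: 1.0} for _ in unlabeled]
        A = A0
        for t in range(T_EM):
            A = m_step(labeled, unlabeled, q, A)          # M-step (convex in A)
            for idx, (x_i, x_j) in enumerate(unlabeled):  # E-step
                p = p_pos(x_i, x_j, A)
                q[idx] = {+1: p, -1: 1.0 - p}
        return A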

Details of the implementation are in appendix B.

3.3.  Theoretical Analyses.

The gradient projection and EM-like algorithms can solve optimization 3.1 efficiently and stably. Let us determine their asymptotic time complexities. Generally the asymptotic time complexity of both algorithms is O(n^2 m + m^3), where n is the number of data and m is the number of features (recall that the training set contains n points, each with m features). Each iteration of the gradient projection algorithm consumes O(n^2 m + n m^2) for the gradient update and O(m^3) for the projection, which has an asymptotic time complexity O(n^2 m + m^3), since O(n m^2) could never dominate O(n^2 m) and O(m^3) simultaneously. Additionally, it is common to set in advance a maximum number of iterations TGP for such a nonconvex optimization solver, and the overall asymptotic time complexity of the gradient projection algorithm is
formula
For the EM-like iterative algorithm, each iteration of the M-step is the same as in the gradient projection algorithm, and each E-step costs O(n^2), which is negligible compared with the computational complexity of the full M-step. As a consequence, the overall asymptotic time complexity of the EM-like algorithm is
formula
where is the maximum number of iterations of the M-step and TEM is the maximum number of iterations of the EM-like algorithm.

It is obvious that which algorithm is empirically faster depends primarily on which of TGP or is smaller. In fact, the gradient projection method for equation 3.6 is much easier than for equation 3.1 since equation 3.6 is a convex optimization, which means the M-step of the EM-like algorithm itself is much easier than the gradient projection algorithm. Furthermore, it is unnecessary to solve the M-step exactly in such an EM-like algorithm. As a result, is supposed to be significantly smaller than TGP. On the other hand, the temporary A(t) of EM-like iterations makes up a deterministic sequence for fixed initial , and a small TEM is usually enough for finding a reasonable solution. To sum up, we can set to be smaller than TGP in practice and then expect the EM-like algorithm to be faster than the gradient projection algorithm with comparable qualities of the learned distance metrics.

The gradient projection and EM-like algorithms are not only computationally efficient but also computationally stable. The following theorem shows that the gradient matrices of and given in equations 3.5 and 3.7 are uniformly bounded, regardless of the scale of A, that is, the magnitude of tr(A). It also implies that compared with maximizing , maximizing should be stabler even without considering that is a concave function.

Theorem 3.
The objective functions and of optimizations 3.1 and 3.6 are Lipschitz continuous, and the best Lipschitz constants with respect to the Frobenius norm satisfy
formula
3.9
formula
3.10
where is the diameter of , and # measures the cardinality of a set.
Last but not least, we comment on equation 3.8, the E-step of the EM-like algorithm. It has the same idea as the deterministic annealing EM-like algorithm in Grandvalet and Bengio (2006), and it is the analytical solution to
formula
similar to Graça et al. (2009) and Gillenwater et al. (2011), where KL is the Kullback-Leibler divergence. It is easy to see that our E-step is different from the standard E-step if , while for any , it approaches the standard one as . In other words, the EM-like algorithm does not solve optimization 3.1 exactly, but this optimization is indeed the limit of a sequence of optimizations that the algorithm solves at different EM-like iterations. If and t=0, becomes the hard assignments
formula
This is the reason for initializing in our current implementation.

4.  Discussion

We left out a few theoretical arguments when we proposed the SERAPH model in order to keep the presentation as concise and comprehensible as possible. In this section, we discuss the sparsity issue in the sense of metric learning and present two additional justifications for our model.

4.1.  Posterior Sparsity and Projection Sparsity.

Sparse metric learning might have different meanings, since we learn a metric with low-rank linear projections by optimizing a conditional probability, where the optimization variable is actually a square matrix. First, we explain the meaning of our sparsity and claim that we can obtain the posterior sparsity (Graça et al., 2009) by entropy regularization and the projection sparsity (Ying et al., 2009) by trace-norm regularization. The arguments are as follows.

By a “sparse” posterior distribution, we mean that the uncertainty (e.g., the entropy or variance) of pAi,j(y) for is low, such that (xi, xj) can be classified as a similar or dissimilar pair with high confidence. Figure 2 is an illustrative example. Recall that supervised metric learning aims at finding a new distance metric under which data in the same class are close and data from different classes are far apart. In this example, it would result in a metric that ignores the horizontal feature and focuses only on the vertical feature. Nevertheless, the horizontal feature is useful, and taking care of the posterior sparsity would lead to a better metric, as shown in Figures 2e and 2f. As a consequence, we prefer taking the posterior sparsity into account in addition to the goal of supervised metric learning, and then the risk of overfitting weakly labeled data can be significantly reduced.

Figure 2:

Sparse versus nonsparse posterior distributions. Six weak labels were constructed according to the four class labels. The left three panels show the original data and the projected data by metrics learned with and without the posterior sparsity. The right three panels exhibit one-nearest-neighbor classification results based on the Euclidean distance and the two learned metrics.


When considering the posterior sparsity, our optimization via entropy regularization is equivalent to soft posterior regularization (Graça et al., 2009; Gillenwater et al., 2011), that is, we can rewrite as an objective function of a soft posterior regularization. More specifically, let the auxiliary feature function be
formula
Then maximizing is equivalent to
formula
4.1
On the other hand, according to optimization (7) of Graça et al. (2009), the soft posterior regularization objective should take the form
formula
4.2
where are slack variables. Since q is unconstrained, we can optimize q with respect to fixed A and . It is easy to see that q should be pA restricted on , so the KL divergence term is zero and the expectation term is the entropy, which implies the equivalence of optimizations 4.1 and 4.2.
Besides the posterior sparsity, we also hope for the projection sparsity that may guide the new distance metric to a better generalization performance. Figure 3 illustrates its effect, where the horizontal feature is dominant and the vertical feature is uninformative. The underlying technique is known as mixed-norm regularization (Argyriou et al., 2007) or group lasso (Yuan & Lin, 2006). Denote the -norm of a symmetric matrix M as
formula
Similarly to Ying et al. (2009), let be a linear projection, be the symmetric positive semi-definite matrix of the metric induced from P, and Pi and Wi be the ith columns of P and W. If Pi is identically zero, the ith component of x has no contribution to z=Px. Since the column-wise sparsity of W and P is equivalent, we can reach the column-wise sparsity of P by penalizing . Nevertheless, this amounts to feature selection rather than dimensionality reduction. Note that the goal is to select a few of the most representative directions of the input data, which are not restricted to the coordinate axes. The solution is to pick an extra transformation that rotates x before projecting it, where is the set of orthonormal matrices of size m. Consequently, we penalize , project x to z=PVx, and since , we arrive at
formula
4.3
Remember that the final model of SERAPH was given by optimization 2.6 as
formula
The equivalence of optimizations 2.6 and 4.3 is guaranteed by lemma 1 of Ying et al. (2009). By unifying the posterior sparsity and the projection sparsity mentioned above, we obtain a property that we call the hypersparsity.
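Since the trace-norm penalty drives many eigenvalues of A toward zero, a low-dimensional projection can be read off the learned metric. The following sketch (the tolerance and the function name are arbitrary) recovers a projection P with A approximately equal to P^T P from the leading eigenpairs, so that the Euclidean distance of the projected data matches the Mahalanobis distance of the original data up to the truncation.

    import numpy as np

    def projection_from_metric(A, tol=1e-8):
        # Recover P with A ~ P.T @ P from the learned PSD metric A, keeping only
        # the eigenvalues above tol; the number kept is the reduced dimensionality.
        w, V = np.linalg.eigh((A + A.T) / 2.0)
        keep = w > tol
        return np.sqrt(w[keep])[:, None] * V[:, keep].T   # shape: (rank, m)

    # ||P @ x_i - P @ x_j|| then equals d_A(x_i, x_j) up to the truncation, so P
    # acts as an implicit, adaptively learned dimensionality reducer.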
Figure 3:

Sparse versus nonsparse projections. Twenty-eight weak labels were constructed according to the eight class labels. The left three panels show the original data and the projected data by metrics learned with and without the projection sparsity. The right three panels exhibit one-nearest-neighbor classification results based on the Euclidean distance and the two learned metrics.


4.2.  Generalized Maximum Entropy Principle.

The basic model defined in optimization 2.2 contains an inequality constraint instead of an equality constraint, since the regularization term in is indispensable. Otherwise we would have for the optimal solution , which means that the optimization would degenerate, and the learned metric might easily overfit weakly labeled data. This phenomenon is owing to the single-point prior of the expected data moment from the empirical data moment. The regularization term reflects the gaussian prior in the generalized maximum entropy principle (Dudík & Schapire, 2006), while the ordinary maximum entropy principle (Jaynes, 1957; Berger et al., 1996) assumes the single-point prior and applies no regularization on the dual variable.

The potential function underlies the generalized maximum entropy distribution estimation. By the potential function and the slack variable, we could obtain the same dual problem. Let the potential function and its target value uf be
formula
Redefine optimization 2.2 as an equivalent form
formula
where the equivalence is due to Fenchel's duality theorem of Dudík and Schapire (2006) plus the fact that the conjugate of Uf(x) is . Subsequently,
formula
is an optimization problem with two potential functions and under the posterior regularization framework (Graça, Ganchev, & Taskar, 2008; Graça et al., 2009; Bellare, Druck, & McCallum, 2009; Gillenwater et al., 2011), and hence SERAPH can be viewed as a semi-supervised maximum entropy estimation equipped with the additional projection sparsity.

4.3.  Information Maximization Principle.

The final model defined in optimization 2.6 can also be viewed as an information-maximization approach to semi-supervised metric learning based on weak labels. The regularized information-maximization framework (Gomes, Krause, & Perona, 2010) advocates the preference for maximizing the mutual information between data and labels as well as the need to regularize the model parameters.

Let p(y) be the prior distribution,
formula
and be its estimate,
formula
Let be the mutual information between the data pair and the weak label,
formula
and be its estimate, that is, the mutual information between unlabeled data and unobserved weak labels,
formula
Given the supervised part of SERAPH, regularized information maximization would suggest
formula
where we assume the regularization parameter satisfies . Then by decomposing , it could be rewritten as
formula
The entropy term encourages a balanced prior distribution of y under the metric . However, the numbers of similar and dissimilar data pairs (i.e., y=+1 and y=−1) are inherently imbalanced in all metric learning problem settings. Therefore, we simply drop the regularization term and attain optimization 2.6.
Note that this explanation suggests a nice heuristic value of the regularization parameter,
formula
In fact, let be the conditional entropy of the weak label on the data pair
formula
Then can be estimated by
formula
As a result, the conditional entropy as the supervised part and the mutual information as the unsupervised part become equally important in and if setting .
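For reference, the mutual information estimate on unlabeled pairs decomposes into the entropy of the estimated prior minus the mean conditional entropy. The sketch below computes this estimate with the model probabilities supplied externally and with uniform weighting over pairs assumed, which may differ from the exact empirical weighting used in the letter.

    import numpy as np

    def bernoulli_entropy(p):
        p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0 - 1e-12)
        return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

    def mutual_information_estimate(p_pos_unlabeled):
        # p_pos_unlabeled: array of p_A(y = +1 | x_i, x_j) over all unlabeled pairs.
        # I_hat = H(estimated prior of y) - mean conditional entropy H(p_A).
        p = np.asarray(p_pos_unlabeled, dtype=float)
        return float(bernoulli_entropy(p.mean()) - bernoulli_entropy(p).mean())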

5.  Related Work

Xing et al. (2003) initiated research on metric learning based on pairwise similarity and dissimilarity constraints by global distance metric learning (GDM). Several excellent metric learning methods have been developed in the past decade, including neighborhood component analysis (NCA) (Goldberger et al., 2005), large-margin nearest-neighbor classification (LMNN) (Weinberger et al., 2006), and information-theoretic metric learning (ITML) (Davis, Kulis, Jain, Sra, & Dhillon, 2007).

Both ITML and SERAPH are information theoretic, but the ideas and models are quite different. ITML defines a generative gaussian model,
formula
where is the unknown mean value, Z is a normalizing constant, and both can be canceled out in the constrained optimization. Compared with GDM, ITML regularizes the Kullback-Leibler divergence between and pA(x), where A0 is the prior metric, and then transforms this term into a log-det regularization. By specifying , it becomes the maximum entropy estimation of pA(x). Thus, it prefers a distance metric close to the Euclidean distance. The supervised part of SERAPH also follows the maximum entropy principle, but the probabilistic model is discriminative.

A probabilistic GDM was designed intuitively as a baseline method in the experimental part of Yang et al. (2006). It can be viewed as a special case of our supervised part, but the final model of SERAPH is much more general. (For details, see sections 2.2, 7.1 and 7.2.)

Due to the limitation of supervised metric learning when few labeled data are available, semi-supervised models and algorithms that incorporate off-the-shelf unsupervised techniques into existing supervised approaches have been proposed. Local distance metric learning (LDM) (Yang et al., 2006) is the pioneer. Unlike later manifold-based methods, it embeds the unsupervised information by assuming that the eigenvectors of the optimal A are the principal components of all training data. Hoi et al. (2008) borrows the idea of Laplacian eigenmaps (Belkin & Niyogi, 2002) and combines manifold regularization with the min-max principle of GDM. Baghshah and Shouraki (2009) then show that Fisher discriminant analysis can be regularized by locally linear embedding (Roweis & Saul, 2000), and the resulting manifold Fisher discriminant analysis (MFDA) is extremely computationally efficient. Liu et al. (2010) adds element-wise matrix sparsity of A to the method of Hoi et al. (2008). In general, any unsupervised embedding method that preserves the local neighborhood information can be modified into a semi-supervised extension.7

The manifold extension described in section 7.2 is so general that it can be attached to all metric learning methods, whereas our information-theoretic extension can be applied only to probabilistic metric learning methods. Nevertheless, any probabilistic method with an explicit expression of the posterior distribution, such as NCA, LDM, and SERAPH, can have two semi-supervised extensions, while deterministic methods like GDM, LMNN, and MFDA cannot benefit from our semi-supervised extension. ITML uses a generative gaussian model whose parameters are not estimated by the algorithm, so it is nontrivial to apply our extension to it.

Here we leave out sparse metric learning and robust metric learning and instead recommend Huang, Jin, Xu, and Liu (2010) and Huang, Ying, and Campbell (2009) for reviews of sparse and robust metric learning. Moreover, a comprehensive literature survey on metric learning (Bellet, Habrard, & Sebban, 2013) is available online and is a good reference.

6.  Experiments

In this section, we numerically evaluate the performance of metric learning methods.

6.1.  Setup.

In our experiments, we compared SERAPH with six representative metric learning methods (plus the Euclidean distance):

  • Global distance metric learning (GDM; Xing et al., 2003)8

  • Neighborhood component analysis (NCA; Goldberger et al., 2005)9

  • Large-margin nearest-neighbor classification (LMNN; Weinberger et al., 2006)10

  • Information-theoretic metric learning (ITML; Davis et al., 2007)11

  • Local distance metric learning (LDM; Yang et al., 2006)12

  • Manifold Fisher discriminant analysis (MFDA; Baghshah & Shouraki, 2009)13.

GDM, NCA, LMNN, and ITML are supervised methods, while LDM and MFDA are semi-supervised methods. SERAPH, as well as GDM, ITML, and LDM, uses the global metric information; NCA, LMNN, and MFDA benefit from the local metric information.

Table 1 describes the specification of the benchmark data sets in our experiments. The top six data sets (Iris, Wine, Ionosphere, Balance, Breast Cancer, and Diabetes) come from the UCI Machine Learning Repository,14 and USPS and MNIST are available at the homepage of the late Sam Roweis.15 The gray-scale images of handwritten digits in USPS were downsampled to 8 × 8 pixel resolution, resulting in 64-dimensional vectors. Similarly, the gray-scale images in MNIST were downsampled to 14 × 14 pixel resolution, resulting in 196-dimensional vectors. The symbol USPS1−5,20 means 20 training data from each of the first 5 classes, USPS1−10,40 means 40 training data from each of all 10 classes, MNIST1,7 means digits 1 versus 7, and so forth. Note that in the last two tasks, the dimensionality of data is greater than the number of all training data: the number of parameters to be learned in A is m(m+1)/2 = 19,306, whereas the number of training data points ntrain is 100 or 150, and then the number of training data pairs ntrain(ntrain−1)/2 is only 4950 or 11,175.

Table 1:
Specification of Benchmark Data Sets.
                 c    m   ntrain  ntest  nlabel  #similar  #dissimilar  #unlabeled
Iris             3    4    100      38     10     15.10      29.90        4905
Wine             3   13    100      78     10     13.98      31.02        4905
Ionosphere       2   34    100     251     20     97.50      92.50        4760
Balance          3    4    100     465     10     20.38      24.62        4905
Breast cancer    2   30    100     469     10     23.54      21.46        4905
Diabetes         2    8    100     668     10     23.02      21.98        4905
USPS1−5,20       5   64    100    2500     10      5          40          4905
USPS1−5,40       5   64    200    2500     20     30         160        19,710
USPS1−10,20     10   64    200    2500     20     10         180        19,710
USPS1−10,40     10   64    400    2500     40     60         720        79,020
MNIST1,7         2  196    100    1000      4      2           4          4944
MNIST3,5,8       3  196    150    1500      9      9          27        11,139

Note: For each data set, c is the number of classes, m is the number of features, ntrain/ntest is the number of training/test data points, nlabel is the number of class labels used to construct the sets of similar and dissimilar pairs, and the last three columns are the numbers of similar, dissimilar, and unlabeled training data pairs (averaged over the random samplings for the UCI tasks).

All metric learning methods were run repeatedly on 50 random samplings of a given task. For each random sampling, we constructed and , which include the similar and dissimilar data pairs for training, according to the class labels of the first few data points for training: Let yi and yj be the class labels of xi and xj; then

  • (xi, xj) is a similar pair with yi,j=+1 if yi=yj.

  • (xi, xj) is a dissimilar pair with yi,j=−1 if yi≠yj.

The sizes of and were dependent on the specific random sampling of each UCI task but fixed for all samplings of each USPS and MNIST task. We measured the performance of the one-nearest-neighbor classifiers based on the learned metrics and the computation time for learning the metrics, where the “training data” for our classifiers included only the few data points having class labels.
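The protocol above can be summarized by the following sketch, with hypothetical helper functions rather than the evaluation code actually used in the experiments: weak labels are built from the class labels exactly as in the two rules above, and the one-nearest-neighbor classifier uses only the labeled points as its training set.

    import numpy as np

    def build_weak_labels(y_labels):
        # Construct similar (y = +1) and dissimilar (y = -1) pairs from class labels.
        S, D = [], []
        for i in range(len(y_labels)):
            for j in range(i + 1, len(y_labels)):
                (S if y_labels[i] == y_labels[j] else D).append((i, j))
        return S, D

    def one_nn_error(X_label, y_label, X_test, y_test, A):
        # 1-NN misclassification rate under the Mahalanobis metric A; the
        # "training data" of the classifier are only the labeled points.
        errors = 0
        for x, y in zip(X_test, y_test):
            diffs = X_label - x
            d2 = np.einsum('ij,jk,ik->i', diffs, A, diffs)  # squared distances
            errors += int(y_label[int(np.argmin(d2))] != y)
        return errors / len(y_test)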

For SERAPH, we fixed for simplicity. Then four hyperparameter settings were considered:

  • SERAPHnone stands for the setting in which neither regularization term is active.

  • SERAPHpost stands for the setting in which only the entropy (posterior sparsity) regularization is active.

  • SERAPHproj stands for the setting in which only the trace-norm (projection sparsity) regularization is active.

  • SERAPHhyper stands for the setting in which both regularization terms are active, that is, the hypersparsity setting.

There was no cross-validation for each random sampling, because we would like the metrics learned by different methods to be independent of those nearest-neighbor classifiers whose performance had a large deviation given limited supervised information.16 The hyperparameters of other methods (e.g., the number of reduced dimensions, the number of nearest neighbors, as well as the percentage of principal components) were selected as the best candidate value based on another 10 random samplings if no default or heuristic value was provided by the original authors of the codes.

6.2.  Results.

6.2.1.  Artificial Data Sets.

Figures 2 and 3 visualize the effects of the posterior and the projection sparsity regularization on two artificial data sets, respectively. More specifically,

  • Figures 2c, 2d, 3c, and 3d were generated without either sparsity regularization;

  • Figures 2e and 2f were generated with the posterior sparsity regularization;

  • Figures 3e and 3f were generated with the projection sparsity regularization,

where the gradient projection algorithm was used. We can see from Figures 2 and 3 that the sparsity regularization can dramatically improve the generalized maximum entropy estimation.

6.2.2.  Gradient Projection Algorithm versus EM-Like Algorithm.

Before comparing the proposed SERAPH with other metric learning methods, we evaluated the gradient projection algorithm (GP) and the EM-like iterative algorithm (EM). Table 2 shows their performance, where the hyperparameter setting SERAPHhyper was used. By the paired t-test at the 5% significance level, GP and EM each won two times and tied eight times, so GP and EM are basically comparable as two different solvers of the same optimization problem. However, EM was consistently more computationally efficient than GP in our experiments, and the difference in their average computation time was remarkable, which suggests that EM, as an indirect solver, could be a good alternative to the direct solver GP.

Table 2:
Gradient Projection Algorithm (GP) versus EM-Like Algorithm (EM).
               GP   EM   Paired t-Test   Computation Time Ratio
Iris                     Tie             3.00
Wine                     Tie             2.94
Ionosphere               Tie             2.04
Balance                  Tie             1.94
Breast cancer            Tie             1.43
Diabetes                 Tie             1.67
USPS1−5,20               GP win          2.58
USPS1−5,40               Tie             2.77
USPS1−10,20              EM win          2.81
USPS1−10,40              EM win          2.43
MNIST1,7                 GP win          1.23
MNIST3,5,8               Tie             1.99

Notes: Means with standard errors of the nearest-neighbor misclassification rate (in %) are shown, together with results of the paired t-test at the significance level 5%. The computation time ratio means the average computation time of GP over that of EM.

6.2.3.  Benchmark Data Sets.

The experimental results in terms of the nearest-neighbor misclassification rate are reported in Table 3, where the EM-like algorithm was used. GDM was very slow for high-dimensional data and was excluded from the comparison on those data. SERAPH was fairly promising, especially the hypersparsity setting (i.e., and ). It was the best or comparable to the best on all 12 tasks. It often statistically significantly outperformed the other methods except ITML on the six UCI data sets, and it was superior to all other competitors, including SERAPHpost and SERAPHproj, on the four USPS tasks. Furthermore, it successfully improved the accuracy even in the two ill-posed MNIST tasks. To sum up, SERAPH can reduce the risk of overfitting weakly labeled data with the help of unlabeled data, and hence our sparsity regularization would be reasonable and practical.

Table 3:
Means with Standard Errors of the Nearest-Neighbor Misclassification Rate (in %) on UCI, USPS, and MNIST Benchmarks.
              Iris    Wine    Ionosphere    Balance    Breast Cancer    Diabetes
EUCLIDEAN       
GDM       
NCA       
LMNN       
ITML       
LDM       
MFDA       
SERAPHnone       
SERAPHpost       
SERAPHproj       
SERAPHhyper       
 USPS1−5,20 USPS1−5,40 USPS1−10,20 USPS1−10,40 MNIST1,7 MNIST3,5,8 
EUCLIDEAN       
GDM  
NCA       
LMNN       
ITML       
LDM       
MFDA       
SERAPHnone       
SERAPHpost       
SERAPHproj       
SERAPHhyper       

Note: For each data set, the best method and comparable ones based on the unpaired t-test at the significance level 5% are in bold.

In vivid contrast with SERAPH, which exhibited this generalization capability, supervised methods might learn a metric even worse than the Euclidean distance due to overfitting problems, especially NCA, which optimized the expected leave-one-out classification error on a limited amount of labeled data. The powerful LMNN did not behave satisfactorily, since it could hardly find enough neighbors belonging to the same class within the labeled data. ITML worked very well despite the fact that it can access only weakly labeled data, but it became less useful for high-dimensional data. We observed that LDM might fail when the principal components of all training data were not close to the eigenvectors of the optimal matrix being learned, and MFDA might fail if the training data were insufficient to recover the data manifold.

An observation is that the methods using the global metric information usually outperformed those using the local metric information in our experiments since the supervised information was insufficient, which is opposite to the phenomena observed in supervised metric learning problem settings. It indicates that the methods using the local metric information tend to fit the given information too much and suffer from overfitting problems, since the local metric information always focuses on a small quantity of data in a local neighborhood and thus has a relatively large deviation.

6.2.4.  Computational Efficiency.

Figure 4 summarizes the corresponding experimental results in terms of the average computation time (GDM was excluded from the comparison due to its low speed). The computation time was measured in seconds and drawn in a logarithmic scale with 10 as the base. The shortest average computation time was 0.1677 second, by MFDA on Iris, and the longest was 3023 seconds, by LMNN on MNIST3,5,8. Generally, SERAPH (when the EM-like algorithm was used) was the second most computationally efficient method. The most computationally efficient method, MFDA, consists of just two steps: solving a linear system as in locally linear embedding (Roweis & Saul, 2000) and then solving a generalized eigenvalue problem as in Fisher discriminant analysis (Fisher, 1936). Improvements may be expected if we implement SERAPH in Matlab with C/C++ components, as NCA and LMNN do.

Figure 4:

Average computation time of different metric learning methods on UCI, USPS, and MNIST benchmarks. The computation time was measured in seconds and drawn in a logarithmic scale with 10 as the base.


6.2.5.  Sensitivity to Regularization Parameters.

Recall that there was no cross-validation within each random sampling, so it would be helpful to test the sensitivity of SERAPH to the regularization parameters and . Six benchmark data sets were included:

  • Diabetes, Iris, and Ionosphere, on which SERAPHnone, SERAPHpost, and SERAPHproj had the lowest means of the nearest-neighbor misclassification rate in Table 3, respectively;

  • Balance, USPS1−5,20, and MNIST1,7, on which SERAPHhyper had the lowest means of the nearest-neighbor misclassification rate in Table 3.

We considered geometrically progressed candidates for both and ranging from 2^{-3} to 2^{+3} with 2^{0.5} as the factor,
formula
and the actual regularization parameter being used was . The gradient projection algorithm was repeatedly run on 10 random samplings, which were the first 10 random samplings of those 50 random samplings, given all combinations of and . The resulting contours are displayed in Figure 5.
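The candidate grid is easy to reproduce; a short sketch (variable names are arbitrary) is:

    import numpy as np

    # Geometric grid 2^{-3}, 2^{-2.5}, ..., 2^{+3} for both regularization parameters.
    candidates = 2.0 ** np.arange(-3.0, 3.0 + 0.25, 0.5)
    print(len(candidates), candidates)   # 13 values, from 0.125 to 8.0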
Figure 5:

Contours of the mean misclassification rates (in %) of the one-nearest-neighbor classifiers based on the metrics learned by SERAPH given different regularization parameters and . The actual regularization parameter being used was .


We can see that SERAPH worked well in large areas of the contour plots in Figure 5, and we can clearly observe two phenomena.

First, SERAPH was more sensitive to for the low-dimensional tasks (Diabetes, Iris, Ionosphere, and Balance), and the learned metrics became worse when became large. Even for USPS1−5,20, the learned metrics also became worse when became large for small . Note that large implies the strong regularization on the trace norm of A, which upper-bounds the rank of A, and the rank of A ultimately controls the number of parameters to be learned in A. This explains why the contours of MNIST1,7 were different from others, as there were so many parameters in A that large did not make SERAPH significantly overregularized.

Second, SERAPH was also sensitive to for the high-dimensional tasks USPS1−5,20 and MNIST1,7, while this time when became large, the learned metrics became worse for small but better for large . In other words, SERAPH easily got overregularized by emphasizing unlabeled data too much if the number of parameters to be learned in A was improperly large, whereas it hardly got overregularized if the number of parameters was properly small. Additionally, for USPS1−5,20 and MNIST1,7, neither the posterior sparsity nor the projection sparsity worked alone, but they became very powerful after they were integrated into the hypersparsity. A final caveat is that the hypersparsity could never be a panacea for such high-dimensional tasks and we should not employ it too much. The contours of MNIST1,7 have exhibited a typical effect that the learned metrics became worse suddenly and rapidly along the line when became very large.

7.  Extensions

In this section, we explain the kernel and manifold extensions of SERAPH. The technique for kernelizing a metric learning method was originally proposed in Jain, Kulis, and Dhillon (2010), and the technique for manifold regularizing a metric learning method was originally proposed in Hoi et al. (2008).

7.1.  Kernel Extension.

Suppose that we have a kernel function with the feature map such that . Consider learning a Mahalanobis distance metric for of the form
formula
7.1
where is a symmetric positive semi-definite matrix to be learned. However, it is impractical or impossible to learn W directly, since is often very large and possibly infinite. In order to learn W indirectly, we rewrite optimization 3.1 with respect to W:
formula
where is similar to equation 3.2:
formula
Subsequently, according to Jain et al. (2010), any optimal solution will be in the form of , where is the design matrix obtained by applying to the training set and is an optimal solution to
formula
which is actually optimization 3.1 with respect to A but in equation 3.2 is replaced with
formula
7.2
Next let us simplify our notations to remove the feature map from our equations. We introduce the empirical kernel map (Schölkopf & Smola, 2001) defined by
formula
and then equations 7.1 and 7.2 can be expressed by
formula
Moreover, let be the kernel matrix and be the columns of K. Then for any :
formula
All components of SERAPH remain the same after replacing xi with the corresponding ki. The resultant Mahalanobis distance metric will be highly nonlinear with respect to the original input data domain.
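A minimal sketch of this construction, written independently of the original implementation, is given below; it is the standard empirical kernel map.

    import numpy as np

    def empirical_kernel_map(X_train, X, kernel):
        # Map each row x of X to k(x) = (k(x_1, x), ..., k(x_n, x)), where
        # x_1, ..., x_n are the training points; the columns of the kernel
        # matrix K are then exactly the mapped training points.
        return np.array([[kernel(x_t, x) for x_t in X_train] for x in X])

    # With a linear kernel, the map is simply X @ X_train.T:
    X_train = np.random.randn(5, 3)
    K = empirical_kernel_map(X_train, X_train, lambda a, b: float(a @ b))
    assert np.allclose(K, X_train @ X_train.T)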
The experimental results based on the kernel extension are reported in Table 4, where the EM-like algorithm was used; for convenience, the best hyperparameter setting in Table 3 is also listed in Table 4. The four hyperparameter settings were the same as before, but here they were kernelized, with the symbol "ker" in front of them. More specifically, three kernels were involved: the linear kernel was used for Iris, the gaussian kernel for the other UCI data sets, and the sparse variant of the cosine kernel for USPS and MNIST. The linear kernel is
formula
The gaussian kernel is
formula
with a hyperparameter , and was set to the median pairwise distance, that is, the median value of the Euclidean distances between all training data pairs. Note that we just need to compute the empirical kernel map in which the first argument of the kernel k must be from , and hence the sparse variant of the cosine kernel is
formula
7.3
with a hyperparameter , where means that xi is one of the nearest neighbors of x in , and was set to 11 so that 10 nearest neighbors were found for except itself. We can see from Table 4 that SERAPH still performed well—and even better after applying the kernel extension. Among all 12 tasks on the UCI, USPS, and MNIST data sets, the records were improved by the kernel extension in seven tasks, and the improvement was significant under the paired t-test at the 5% significance level. We may roughly compare the experimental results in Table 4 with similar results in Jain et al. (2010)17 and Wang, Do, Woznica, and Kalousis (2011), while we should be aware that there were many fewer training data for SERAPH as well as for the following nearest-neighbor classifiers, and a single kernel was also much weaker than multiple kernels.
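For reference, the three kernels can be sketched as follows. The gaussian bandwidth uses the median heuristic described above, and the exact exponent scaling as well as the neighbor bookkeeping of equation 7.3 are assumptions of this sketch rather than the precise definitions in the letter.

    import numpy as np

    def linear_kernel(a, b):
        return float(a @ b)

    def median_pairwise_distance(X):
        # Median of the Euclidean distances between all training data pairs.
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        return float(np.median(d[np.triu_indices(len(X), k=1)]))

    def gaussian_kernel(a, b, sigma):
        return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

    def sparse_cosine_kernel(x_i, x, X_train, kappa=11):
        # Cosine similarity if x_i is among the kappa nearest neighbors of x in the
        # training set, and 0 otherwise (kappa = 11 leaves 10 neighbors besides x itself).
        order = np.argsort(np.linalg.norm(X_train - x, axis=1))[:kappa]
        if not any(np.array_equal(X_train[t], x_i) for t in order):
            return 0.0
        return float(x_i @ x / (np.linalg.norm(x_i) * np.linalg.norm(x) + 1e-12))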
Table 4:
Means with Standard Errors of the Nearest-Neighbor Misclassification Rate (in %) Based on the Kernel Extension.
              Table 3 Best   ker none   ker post   ker proj   ker hyper
Iris      
Wine      
Ionosphere      
Balance      
Breast Cancer      
Diabetes      
USPS1−5,20      
USPS1−5,40      
USPS1−10,20      
USPS1−10,40      
MNIST1,7      
MNIST3,5,8      