## Abstract

We propose a general information-theoretic approach to semi-supervised metric learning called SERAPH (SEmi-supervised metRic leArning Paradigm with Hypersparsity) that does not rely on the manifold assumption. Given the probability parameterized by a Mahalanobis distance, we maximize its entropy on labeled data and minimize its entropy on unlabeled data following entropy regularization. For metric learning, entropy regularization improves manifold regularization by considering the dissimilarity information of unlabeled data in the unsupervised part, and hence it allows the supervised and unsupervised parts to be integrated in a natural and meaningful way. Moreover, we regularize SERAPH by trace-norm regularization to encourage low-dimensional projections associated with the distance metric. The nonconvex optimization problem of SERAPH can be solved efficiently and stably by either a gradient projection algorithm or an EM-like iterative algorithm whose M-step is convex. Experiments demonstrate that SERAPH compares favorably with many well-known metric learning methods, and the learned Mahalanobis distance possesses high discriminability even in noisy environments.

## 1. Introduction

How to learn a good distance metric for the input data domain is a crucial issue for many distance-based learning algorithms. The goal of metric learning is to find a new metric under which “similar” data are close and “dissimilar” data are far apart (Xing, Ng, Jordan, & Russell, 2003). The great majority of metric learning methods developed in the last decade fall into three types:

Supervised type requiring class labels (Chiaromonte & Cook, 2002; Sugiyama, 2007; Fukumizu, Bach, & Jordan, 2009).

Supervised type requiring weak labels, that is, $\{+1, -1\}$-valued labels that indicate the similarity and dissimilarity of data pairs directly (Xing, Ng, Jordan, & Russell, 2003; Goldberger, Roweis, Hinton, & Salakhutdinov, 2005; Weinberger, Blitzer, & Saul, 2006; Globerson & Roweis, 2006; Torresani & Lee, 2007; Davis, Kulis, Jain, Sra, & Dhillon, 2007). Two data points with the same class label are regarded as similar, and those with different labels are regarded as dissimilar. See Figure 1.

Unsupervised type that requires no label information (Roweis & Saul, 2000; Tenenbaum, de Silva, & Langford, 2000; Belkin & Niyogi, 2002). Unlike previous types, the similarity and dissimilarity here are extracted from data instead of being given as supervision.

The second type has been extensively studied, since weak labels are much cheaper to obtain than class labels when the number of classes is fairly large. That said, supervised metric learning based on weak labels still has a strict limitation: algorithms of this type need each data point to be involved in at least one weak label; otherwise, that data point is invisible to them, as if it did not exist. This limitation is often problematic in real-world applications and needs to be fixed.

Based on the belief that preserving the geometric structure of all labeled and unlabeled data in an unsupervised manner can be better than relying strongly on the limited labeled data, semi-supervised metric learning has emerged. To the best of our knowledge, all previous semi-supervised methods that extend types 1 and 2 employ off-the-shelf unsupervised techniques from type 3. More specifically, they rely on the manifold assumption and implement the following:

- •
If two data points are near each other under the original metric, pull them together so that they are not far apart under the new metric.

- •
If two data points are far from each other under the original metric, do nothing.

In the second case, we should not push the two data points farther apart under the new metric, since they may be connected by the data manifold and should be close together under the new metric even though they were originally far apart. By implementing these two cases, those semi-supervised methods successfully extract the similarity information of unlabeled data.

However, there remain two issues. First, the methods ignore the dissimilarity information of unlabeled data. This can be a huge waste of information, since most unlabeled data pairs would be dissimilar if the number of underlying classes is large and the classes are balanced. To this end, an appealing semi-supervised metric learning method should be able to make use of the dissimilarity information of unlabeled data.

Second, the similarity of unlabeled data extracted by those methods is measured by closeness under the original metric, and it is inconsistent with the similarity of labeled data. Recall that metric learning aims at finding a new metric, and weak labels indicating similar but far apart data pairs are in principle the most informative ones. Therefore, under the original metric, closeness is not the reason for the similarity of labeled data, whereas it is the reason for the similarity of unlabeled data. In contrast, similarity and closeness generally imply each other for both labeled and unlabeled data under the new metric. To this end, an appealing method should focus on the new metric when extracting the similarity information of unlabeled data. In fact, the unsupervised parts of the existing methods that rely on the manifold assumption and implement the two cases (if two data points are close and if they are far apart) are inconsistent with their supervised parts in terms of these two issues. Simply putting them together works in practice, but this paradigm is conceptually neither natural nor unified.

In this letter, we propose a general information-theoretic approach to semi-supervised metric learning called SERAPH (SEmi-supervised metRic leArning Paradigm with Hypersparsity) in order to address these issues. It extracts not only the similarity information but also the dissimilarity information of unlabeled data, and to do so, it accesses the new metric rather than the original one. Our idea is to optimize a new Mahalanobis distance metric through optimizing a conditional probability parameterized by that metric. We maximize the entropy of this probability on labeled data pairs and minimize the entropy of this probability on unlabeled data pairs following entropy regularization (Grandvalet & Bengio, 2005), which can achieve the sparsity of the posterior distribution (Graça, Ganchev, Taskar, & Pereira, 2009; Gillenwater, Ganchev, Graça, Pereira, & Taskar, 2011); that is, unlabeled data pairs can be classified with high confidence. Furthermore, we employ mixed-norm regularization (Argyriou, Evgeniou, & Pontil, 2007) to encourage the sparsity of projection matrices associated with the new metric in terms of their singular values (Ying, Huang, & Campbell, 2009), so that the new metric can carry out dimensionality reduction implicitly and adaptively. Unifying the posterior sparsity and the projection sparsity brings us the hypersparsity. Thanks to this hypersparsity, the new metric learned by SERAPH possesses high discriminability even in noisy environments.

We make three contributions with this work. First, we formulate supervised metric learning based on weak labels as an instance of the generalized maximum entropy distribution estimation (Dudík & Schapire, 2006). Second, we propose an extension of this estimation to semi-supervised metric learning via entropy regularization (Grandvalet & Bengio, 2005). It considers the dissimilarity information of unlabeled data based on the Mahalanobis distance being learned. Third, we develop two ways to solve the nonconvex optimization problem involved in this extension: a direct gradient projection algorithm and an indirect EM-like iterative algorithm.

The rest of this letter is organized as follows. The SERAPH model is formulated in section 2, and then two algorithms are developed in section 3 to solve the optimization problem involved in the model. In section 4, we discuss three forms of sparsity and two additional justifications of the model. A comparison with related work is made in section 5. Experimental results are reported in section 6. In section 7, we offer two extensions to SERAPH. We give concluding remarks and future work in section 8.

## 2. SERAPH, the Model

In this section, we first propose the supervised part of the SERAPH model as a generalized maximum entropy estimation for supervised metric learning based on weak labels and then introduce two additional regularization terms via entropy regularization and trace-norm regularization.

### 2.1. Problem Setting.

Suppose we are given a training set of *n* points, each with *m* features. Let the set of similar data pairs be $\mathcal{S}$ and the set of dissimilar data pairs be $\mathcal{D}$. With some abuse of terminology, we refer to the pairs in $\mathcal{S} \cup \mathcal{D}$ as the labeled data and the remaining pairs, denoted by $\mathcal{U}$, as the unlabeled data. A weak label $y_{i,j} \in \{+1, -1\}$ is assigned to each pair $(x_i, x_j)$ such that $y_{i,j} = +1$ if $(x_i, x_j) \in \mathcal{S}$ and $y_{i,j} = -1$ if $(x_i, x_j) \in \mathcal{D}$. Consider learning a Mahalanobis distance metric of the form

$$d_A(x_i, x_j) = \sqrt{(x_i - x_j)^\top A (x_i - x_j)},$$

where $\top$ is the transpose operator and $A$ is a symmetric positive semi-definite matrix to be learned.

The probability of labeling the pair $(x_i, x_j)$ with $y$ is denoted by $p^A(y \mid x_i, x_j)$, which is explicitly parameterized by the matrix *A*. When the pair comes from $\mathcal{S} \cup \mathcal{D} \cup \mathcal{U}$, this probability is abbreviated as $p^A_{i,j}(y)$ for simplicity.
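As a concrete illustration of the setting above, the squared Mahalanobis distance can be computed as follows. This is a minimal sketch; the function name is ours, not the letter's:

```python
# Minimal sketch (illustrative names): squared Mahalanobis distance
# d_A^2(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j) for symmetric PSD A.
import numpy as np

def mahalanobis_sq(x_i, x_j, A):
    """Squared Mahalanobis distance under the metric matrix A."""
    d = np.asarray(x_i) - np.asarray(x_j)
    return float(d @ A @ d)

# With A = I, it reduces to the squared Euclidean distance.
x_i, x_j = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(mahalanobis_sq(x_i, x_j, np.eye(2)))  # 25.0
```

Note that a rank-deficient *A* simply ignores some directions of the input space, which is the mechanism exploited later by trace-norm regularization.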

### 2.2. Basic Model.

We derive a probabilistic model to investigate the conditional probability of $y$ given $(x_i, x_j)$. We use a parametric form of $p^A_{i,j}(y)$ and will focus on this form because it is optimal in the following sense.

Let $H(p^A_{i,j})$ be the entropy of the conditional probability $p^A_{i,j}(y)$, and let $f$ be a feature function that is convex with respect to *A*. The constrained optimization problem maximizes the total entropy over the labeled pairs subject to a moment-matching constraint, where a slack variable is introduced and penalized in the objective with a regularization parameter. After the introduction of the slack variable, distributions are allowed to match the two data moments in a way that is not strictly exact. The squared penalty in the objective function presumes a gaussian prior on the deviation of the expected data moment from the empirical data moment, which is consistent with the generalized maximum entropy principle (Dudík & Schapire, 2006). (See section 4.2 for an alternative explanation of optimization 2.2 in the sense of the generalized maximum entropy principle, particularly the need to introduce the slack variable from a theoretical point of view.)

Our feature function is built from the squared Mahalanobis distance $d_A^2(x_i, x_j)$ together with a hyperparameter $\eta$ that serves as the threshold to separate the similar and dissimilar data pairs in $\mathcal{S}$ and $\mathcal{D}$ under the new metric. Now the probabilistic model 2.3 becomes

$$p^A_{i,j}(y) = \frac{1}{1 + \exp\big(y\,(d_A^2(x_i, x_j) - \eta)\big)}.$$

For the optimal solution $A^*$ and a reasonable $\eta$, we hope for two properties:

- • $d_{A^*}^2(x_i, x_j) < \eta$ for similar pairs, so that $p^{A^*}_{i,j}(+1) > 1/2$.
- • $d_{A^*}^2(x_i, x_j) > \eta$ for dissimilar pairs, so that $p^{A^*}_{i,j}(-1) > 1/2$.

Since squared distances are nonnegative, there must be $\eta > 0$. Here, $y$ and the associated label variables are binary-valued.
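The thresholding role of the hyperparameter can be sketched with a sigmoid pair probability. This is an illustrative form consistent with the behavior described above (our names, not necessarily the letter's exact equation):

```python
# Illustrative sketch: a sigmoid pair probability in which eta acts as
# the threshold separating similar from dissimilar pairs.
import numpy as np

def pair_prob(y, d_sq, eta):
    """P(weak label y | squared distance d_sq), with y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(y * (d_sq - eta)))

# A pair closer than eta is judged similar with confidence > 1/2,
# and a pair farther than eta is judged dissimilar with confidence > 1/2.
assert pair_prob(+1, 0.1, eta=1.0) > 0.5
assert pair_prob(-1, 5.0, eta=1.0) > 0.5
# The two label probabilities always sum to one.
assert abs(pair_prob(+1, 2.0, 1.0) + pair_prob(-1, 2.0, 1.0) - 1.0) < 1e-12
```

The confidence of the similar label decreases monotonically as the squared distance grows past the threshold, which matches the two desired properties listed above.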

### 2.3. Regularization.

In this section, we extend the basic model defined above to semi-supervised metric learning via entropy regularization and further regularize it by trace-norm regularization.

Following entropy regularization (Grandvalet & Bengio, 2005), $p^A_{i,j}(y)$ should have low entropy (which in turn means low uncertainty) for unlabeled data pairs. Generally, the resultant discriminative probabilistic models prefer peaked distributions on unlabeled data such that unlabeled data can be classified with high confidence, which carries out a probabilistic low-density separation. Subsequently, according to Grandvalet and Bengio (2005), our optimization is augmented with a penalty on the entropies over unlabeled pairs, weighted by a regularization parameter.

In addition, trace-norm regularization encourages a low-rank *A*. It would be helpful in dealing with corrupted data or data distributed intrinsically in a low-dimensional subspace. It is known that the trace is a convex relaxation of the rank for positive semi-definite matrices, so we revise our optimization problem by further penalizing tr(*A*), the trace of *A*, weighted by another regularization parameter.

The optimization problem, equation 2.6, is the final model of SERAPH. We say that it is equipped with hypersparsity when both regularization parameters are positive and hence both regularization terms are active. The hypersparsity, as well as the posterior and projection sparsity, will be discussed in section 4.1. Moreover, SERAPH possesses standard kernel and manifold extensions, which we explain in sections 7.1 and 7.2, respectively.
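As a numerical illustration of the two regularizers just introduced (helper names below are ours): minimizing entropy pushes pair probabilities toward 0 or 1, and for a PSD matrix the trace coincides with the trace norm, the convex surrogate of the rank:

```python
# Sketch of the two regularizers (illustrative names): binary-label
# entropy, minimized on unlabeled pairs, and the trace of a PSD matrix,
# which equals its trace norm (the sum of its singular values).
import numpy as np

def binary_entropy(p):
    """Entropy of a {+1, -1}-valued label with P(y = +1) = p."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

def trace_norm_psd(A):
    """For symmetric PSD A, the sum of singular values is simply tr(A)."""
    return float(np.trace(A))

# Peaked pair probabilities have low entropy (high confidence).
assert binary_entropy(0.99) < binary_entropy(0.5)
# A rank-1 PSD matrix with eigenvalue 3 has trace norm 3.
assert trace_norm_psd(np.diag([3.0, 0.0, 0.0])) == 3.0
```

Penalizing `trace_norm_psd` therefore biases the learned metric toward a few dominant directions, which is the projection sparsity discussed in section 4.1.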

## 3. SERAPH, the Algorithm

In this section, we reduce optimization 2.6 to a form that is easy to handle, and develop two practical algorithms for solving the reduced optimization.

### 3.1. Reduction.

While optimization 2.6 involves a dual variable, we would like to focus on the variable *A*, just as many previous metric learning methods have. Theorem 2 guarantees that we can eliminate the dual variable from equation 2.6 to get an equivalent but simpler optimization, thanks to the fact that we use a single feature function, equation 2.5, in optimization 2.2.

After the reduction of theorem 2, the dual variable has been dropped and the remaining terms have been modified, but the regularization parameter remains the same, which means that the trade-off between the supervised and unsupervised parts has not been affected.

### 3.2. Two Algorithms.

There are several approaches to solving optimization 3.1, for example, gradient projection and expectation maximization (Grandvalet & Bengio, 2006). For nonconvex optimizations, neither approach is uniformly better than the other. Hence, we explore both of them and find that they can solve equation 3.1 efficiently and stably.

At the *t*th M-step, we find a new metric *A*^{(t)} through a surrogate optimization whose target distribution is generated in the last E-step. Since the feature function is convex with respect to *A*, the objective function is concave with respect to *A*, and optimization 3.6 is convex according to Boyd and Vandenberghe (2004). Thus, we can solve optimization 3.6 using the gradient projection method without worrying about local maxima; the gradient matrix is given in equation 3.7. At the *t*th E-step, we update the target distribution for each pair based on $p^A_{i,j}(y)$ parameterized by the *A*^{(t)} found in the last M-step. Although this algorithm may not converge, it works fairly well in practice. No matter how we design the M-step, the algorithm is insensitive to the step size of the gradient update, and it gives a deterministic solution once the initial solution and the stopping conditions are fixed. In other words, the EM-like iterative algorithm can easily be derandomized by the initial solution and the stopping conditions, which is a nice algorithmic property for nonconvex optimizations.

Details of the implementation are in appendix B.
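To make the alternating scheme concrete, here is a toy sketch of an EM-like loop under simplifying assumptions (a logistic surrogate objective and plain gradient ascent, with names of our own). It is not the letter's exact update, but it shows the E-step/M-step structure and the projection onto the PSD cone:

```python
# Toy sketch of an EM-like scheme (illustrative, not the letter's exact
# objective): the E-step fixes soft labels q for unlabeled pairs from the
# current metric, and the M-step takes gradient steps on A, projecting
# back onto the PSD cone by eigenvalue clipping after each step.
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the positive semi-definite cone."""
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T

def sigmoid_prob(d_sq, eta=1.0):
    """P(y = +1 | pair) under a sigmoid with threshold eta."""
    return 1.0 / (1.0 + np.exp(d_sq - eta))

def em_like(X, labeled, y, unlabeled, T_em=5, T_m=20, lr=0.01):
    """labeled/unlabeled: lists of index pairs; y: weak labels in {+1, -1}."""
    m = X.shape[1]
    A = np.eye(m)
    for _ in range(T_em):
        # E-step: posterior of y = +1 for each unlabeled pair under A^(t).
        q = [sigmoid_prob((X[i] - X[j]) @ A @ (X[i] - X[j]))
             for i, j in unlabeled]
        # M-step: ascend a surrogate log-likelihood with q held fixed.
        for _ in range(T_m):
            G = np.zeros((m, m))
            for (i, j), y_ij in zip(labeled, y):
                d = X[i] - X[j]
                t = 1.0 if y_ij == 1 else 0.0       # target P(y = +1)
                G += (sigmoid_prob(d @ A @ d) - t) * np.outer(d, d)
            for (i, j), q_ij in zip(unlabeled, q):
                d = X[i] - X[j]
                G += (sigmoid_prob(d @ A @ d) - q_ij) * np.outer(d, d)
            A = project_psd(A + lr * G)
    return A
```

Fixing the initial solution and the iteration counts makes the whole loop deterministic, mirroring the derandomization property noted above.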

### 3.3. Theoretical Analyses.

The per-iteration time complexity of both algorithms is *O*(*n*^{2}*m*+*m*^{3}), where *n* is the number of data and *m* is the number of features (recall that the training set contains *n* points, each with *m* features). Each iteration of the gradient projection algorithm consumes *O*(*n*^{2}*m*+*nm*^{2}) for the gradient update and *O*(*m*^{3}) for the projection, which amounts to an asymptotic time complexity of *O*(*n*^{2}*m*+*m*^{3}), since *O*(*nm*^{2}) can never dominate *O*(*n*^{2}*m*) and *O*(*m*^{3}) simultaneously. Additionally, it is common to set in advance a maximum number of iterations *T*_{GP} for such a nonconvex optimization solver, and the overall asymptotic time complexity of the gradient projection algorithm is *O*(*T*_{GP}(*n*^{2}*m*+*m*^{3})). For the EM-like iterative algorithm, each iteration of the M-step is the same as an iteration of the gradient projection algorithm, and each E-step costs *O*(*n*^{2}), which is negligible compared with the computational complexity of the full M-step. As a consequence, the overall asymptotic time complexity of the EM-like algorithm is *O*(*T*_{EM}*T*_{M}(*n*^{2}*m*+*m*^{3})), where *T*_{M} is the maximum number of iterations of the M-step and *T*_{EM} is the maximum number of iterations of the EM-like algorithm.

It is obvious that which algorithm is empirically faster depends primarily on which of *T*_{GP} or the product of the two maximum iteration numbers of the EM-like algorithm is smaller. In fact, the gradient projection method for equation 3.6 is much easier than for equation 3.1, since equation 3.6 is a convex optimization, which means the M-step of the EM-like algorithm itself is much easier than the gradient projection algorithm. Furthermore, it is unnecessary to solve the M-step exactly in such an EM-like algorithm. As a result, the maximum number of M-step iterations is supposed to be significantly smaller than *T*_{GP}. On the other hand, the temporary *A*^{(t)} of EM-like iterations makes up a deterministic sequence for a fixed initial solution, and a small *T*_{EM} is usually enough for finding a reasonable solution. To sum up, we can set the maximum number of M-step iterations to be smaller than *T*_{GP} in practice and then expect the EM-like algorithm to be faster than the gradient projection algorithm with comparable quality of the learned distance metrics.

The gradient projection and EM-like algorithms are not only computationally efficient but also computationally stable. The following theorem shows that the gradient matrices given in equations 3.5 and 3.7 are uniformly bounded, regardless of the scale of *A*, that is, the magnitude of tr(*A*). It also implies that maximizing the concave surrogate objective should be more stable than maximizing the original objective, even without taking its concavity into account.
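The boundedness claim can be illustrated numerically under a sigmoid pair probability (our notation, not the letter's exact formulas): each pair contributes $(p - t)\,dd^\top$ to the gradient with $|p - t| \le 1$, so the contribution stays bounded no matter how large tr(*A*) becomes:

```python
# Numeric illustration (sketch, illustrative names): with a sigmoid pair
# probability, each pair's gradient contribution is (p - t) * d d^T with
# |p - t| <= 1, so scaling A up cannot blow up the gradient; the
# probabilities merely saturate.
import numpy as np

def grad_contrib(d, A, t, eta=1.0):
    """One pair's gradient contribution toward target probability t."""
    p = 1.0 / (1.0 + np.exp(d @ A @ d - eta))
    return (p - t) * np.outer(d, d)

d = np.array([1.0, 2.0])
bound = np.linalg.norm(np.outer(d, d))
norms = [np.linalg.norm(grad_contrib(d, s * np.eye(2), 1.0)) for s in (1, 10, 100)]
# Every contribution is bounded by ||d d^T||, regardless of the scale of A.
assert all(n <= bound + 1e-12 for n in norms)
```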

At *t* = 0, the E-step update becomes the hard assignments. This is the reason for our choice of initialization in the current implementation.

## 4. Discussion

We left out a few theoretical arguments when we proposed the SERAPH model in order to keep the presentation as concise and comprehensible as possible. In this section, we discuss the sparsity issue in the sense of metric learning and present two additional justifications for our model.

### 4.1. Posterior Sparsity and Projection Sparsity.

Sparse metric learning might have different meanings, since we learn a metric with low-rank linear projections by optimizing a conditional probability, where the optimization variable is actually a square matrix. First, we explain the meaning of our sparsity and claim that we can obtain the posterior sparsity (Graça et al., 2009) by entropy regularization and the projection sparsity (Ying et al., 2009) by trace-norm regularization. The arguments are as follows.

By a “sparse” posterior distribution, we mean that the uncertainty (e.g., the entropy or variance) of $p^A_{i,j}(y)$ for unlabeled pairs is low, such that $(x_i, x_j)$ can be classified as a similar or dissimilar pair with high confidence. Figure 2 is an illustrative example. Recall that supervised metric learning aims at finding a new distance metric under which data in the same class are close and data from different classes are far apart. It would result in the metric that ignores the horizontal feature and focuses on only the vertical feature. Nevertheless, the horizontal feature is useful, and taking care of the posterior sparsity would lead to a better metric, as shown in Figures 2e and 2f. As a consequence, we prefer taking the posterior sparsity into account in addition to the goal of supervised metric learning, and then the risk of overfitting weakly labeled data can be significantly reduced.

Since *q* is unconstrained, we can optimize *q* with respect to fixed *A* and the remaining variables. It is easy to see that the optimal *q* should be $p^A$ restricted on the unlabeled pairs, so the KL divergence term is zero and the expectation term is the entropy, which implies the equivalence of optimizations 4.1 and 4.2.

Next, we turn to the projection sparsity, and define the trace norm of a matrix *M* as the sum of its singular values. Similarly to Ying et al. (2009), let *P* be a linear projection, *W* be the symmetric positive semi-definite matrix of the metric induced from *P* (i.e., $W = P^\top P$), and *P*_{i} and *W*_{i} be the *i*th columns of *P* and *W*. If *P*_{i} is identically zero, the *i*th component of *x* has no contribution to *z* = *Px*. Since the column-wise sparsity of *W* and *P* is equivalent, we can reach the column-wise sparsity of *P* by penalizing the column-wise norms of *W*. Nevertheless, this is the ability of feature selection rather than dimensionality reduction. Note that the goal is to select a few most representative directions of input data that are not restricted to the coordinate axes. The solution is to pick an extra transformation *V* to rotate *x* before projecting it, where *V* ranges over the set of orthonormal matrices of size *m*. Consequently, we penalize the column-wise norms after rotation, project *x* to *z* = *PVx*, and arrive at optimization 4.3. Remember that the final model of SERAPH was given by optimization 2.6. The equivalence of optimizations 2.6 and 4.3 is guaranteed by lemma 1 of Ying et al. (2009). By unifying the posterior sparsity and the projection sparsity mentioned above, we obtain a property that we call the *hypersparsity*.
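To illustrate why a low-rank *A* amounts to a low-dimensional projection, the sketch below factors a PSD metric as $A \approx P^\top P$ and keeps only the directions with non-negligible eigenvalues (the function name is illustrative):

```python
# Illustrative sketch: recovering the implicit low-dimensional projection
# from a learned PSD metric. If A = P^T P, then d_A(x_i, x_j) equals the
# Euclidean distance between P x_i and P x_j, so the eigenvectors of A
# with non-negligible eigenvalues form the projection.
import numpy as np

def projection_from_metric(A, tol=1e-8):
    """Return P of shape (k, m) with P.T @ P approximately equal to A."""
    w, V = np.linalg.eigh((A + A.T) / 2.0)
    keep = w > tol
    return np.sqrt(w[keep])[:, None] * V[:, keep].T

# A rank-1 metric yields a 1-dimensional projection.
A = np.array([[4.0, 2.0], [2.0, 1.0]])   # = v v^T with v = (2, 1)
P = projection_from_metric(A)
print(P.shape)  # (1, 2)
```

Applying `P` to the data realizes the dimensionality reduction implicitly carried out by the trace-norm-regularized metric: distances under *A* equal Euclidean distances in the projected space.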

### 4.2. Generalized Maximum Entropy Principle.

The basic model defined in optimization 2.2 contains an inequality constraint instead of an equality constraint, since the regularization term on the slack variable is indispensable. Otherwise, the optimal solution would match the empirical data moment exactly, which means that the optimization would degenerate, and the learned metric might easily overfit weakly labeled data. This phenomenon is owing to the single-point prior of the expected data moment at the empirical data moment. The regularization term reflects the gaussian prior in the generalized maximum entropy principle (Dudík & Schapire, 2006), while the ordinary maximum entropy principle (Jaynes, 1957; Berger et al., 1996) assumes the single-point prior and applies no regularization on the dual variable.

Let *u*_{f} be the potential function associated with the feature *f*, and redefine optimization 2.2 as an equivalent form, where the equivalence is due to Fenchel's duality theorem of Dudík and Schapire (2006) plus the fact that *u*_{f} is the conjugate of *U*_{f}(*x*). Subsequently, the result is an optimization problem with two potential functions under the posterior regularization framework (Graça, Ganchev, & Taskar, 2008; Graça et al., 2009; Bellare, Druck, & McCallum, 2009; Gillenwater et al., 2011), and hence SERAPH can be viewed as a semi-supervised maximum entropy estimation equipped with the additional projection sparsity.

### 4.3. Information Maximization Principle.

The final model defined in optimization 2.6 can also be viewed as an information-maximization approach to semi-supervised metric learning based on weak labels. The regularized information-maximization framework (Gomes, Krause, & Perona, 2010) advocates the preference for maximizing the mutual information between data and labels as well as the need to regularize the model parameters.

The mutual information here is measured between data pairs and their weak labels *y* under the metric being learned. However, the numbers of similar and dissimilar data pairs (i.e., *y* = +1 and *y* = −1) are inherently imbalanced in all metric learning problem settings. Therefore, we simply drop the corresponding regularization term and attain optimization 2.6.

## 5. Related Work

Xing et al. (2003) initiated research on metric learning based on pairwise similarity and dissimilarity constraints by global distance metric learning (GDM). Several excellent metric learning methods have been developed in the past decade, including neighborhood component analysis (NCA) (Goldberger et al., 2005), large-margin nearest-neighbor classification (LMNN) (Weinberger et al., 2006), and information-theoretic metric learning (ITML) (Davis, Kulis, Jain, Sra, & Dhillon, 2007).

Here, *Z* is a normalizing constant, and both can be canceled out in the constrained optimization. Compared with GDM, ITML regularizes the Kullback-Leibler divergence between $p^{A_0}(x)$ and $p^A(x)$, where *A*_{0} is the prior metric, and then transforms this term to a log-det regularization. By specifying *A*_{0} as the identity matrix, it becomes the maximum entropy estimation of $p^A(x)$. Thus, it prefers a distance metric close to the Euclidean distance. The supervised part of SERAPH also follows the maximum entropy principle, but the probabilistic model is discriminative.

A probabilistic GDM was designed intuitively as a baseline method in the experimental part of Yang et al. (2006). It can be viewed as a special case of our supervised part, but the final model of SERAPH is much more general. (For details, see sections 2.2, 7.1 and 7.2.)

Due to the limitation of supervised metric learning when few labeled data are available, semi-supervised models and algorithms that incorporate off-the-shelf unsupervised techniques into existing supervised approaches have been proposed. Local distance metric learning (LDM) (Yang et al., 2006) is the pioneer. Unlike later manifold-based methods, it embeds the unsupervised information by assuming that the eigenvectors of the optimal *A* are the principal components of all training data. Hoi et al. (2008) borrow the idea of Laplacian eigenmaps (Belkin & Niyogi, 2002) and combine manifold regularization with the min-max principle of GDM. Baghshah and Shouraki (2009) then show that Fisher discriminant analysis can be regularized by locally linear embedding (Roweis & Saul, 2000), and the resulting manifold Fisher discriminant analysis (MFDA) is extremely computationally efficient. Liu et al. (2010) bring the element-wise matrix sparsity of *A* to Hoi et al. (2008). In general, any unsupervised embedding method that preserves the local neighborhood information can be modified into a semi-supervised extension.^{7}

The manifold extension described in section 7.2 is so general that it can be attached to all metric learning methods, whereas our information-theoretic extension can only be applied to probabilistic metric learning methods. Nevertheless, any probabilistic method with an explicit expression of the posterior distribution such as NCA, LDM, and SERAPH can have two semi-supervised extensions, while deterministic methods like GDM, LMNN, and MFDA cannot benefit from our semi-supervised extension. ITML used a generative gaussian model whose parameters are not estimated by the algorithm, so it is nontrivial to apply our extension to it.

Here we leave out sparse metric learning and robust metric learning and instead recommend Huang, Jin, Xu, and Liu (2010) and Huang, Ying, and Campbell (2009) for reviews of sparse and robust metric learning. Moreover, a comprehensive literature survey on metric learning (Bellet, Habrard, & Sebban, 2013) is available online and is a good reference.

## 6. Experiments

In this section, we numerically evaluate the performance of metric learning methods.

### 6.1. Setup.

In our experiments, we compared SERAPH with six representative metric learning methods (plus the Euclidean distance):

- • Global distance metric learning (GDM; Xing et al., 2003)^{8}
- • Neighborhood component analysis (NCA; Goldberger et al., 2005)^{9}
- • Large-margin nearest-neighbor classification (LMNN; Weinberger et al., 2006)^{10}
- • Information-theoretic metric learning (ITML; Davis et al., 2007)^{11}
- • Local distance metric learning (LDM; Yang et al., 2006)^{12}
- • Manifold Fisher discriminant analysis (MFDA; Baghshah & Shouraki, 2009)^{13}

GDM, NCA, LMNN and ITML are supervised methods, while LDM and MFDA are semi-supervised methods. SERAPH, as well as GDM, ITML, and LDM, use the global metric information; and NCA, LMNN, and MFDA benefit from the local metric information.

Table 1 describes the specification of benchmark data sets in our experiments. The top six data sets (Iris, Wine, Ionosphere, Balance, Breast Cancer, and Diabetes) come from the UCI Machine Learning Repository,^{14} and USPS and MNIST are available at the homepage of the late Sam Roweis.^{15} The gray-scale images of handwritten digits in USPS were downsampled to 8 × 8 pixel resolution, resulting in 64-dimensional vectors. Similarly, the gray-scale images in MNIST were downsampled to 14 × 14 pixel resolution, resulting in 196-dimensional vectors. The symbol USPS_{1−5, 20} means 20 training data from each of the first 5 classes, USPS_{1−10, 40} means 40 training data from each of all 10 classes, MNIST_{1, 7} means digits 1 versus 7, and so forth. Note that in the last two tasks, the dimensionality of data is greater than the number of all training data: the number of parameters to be learned in *A* is *m*(*m*+1)/2 = 19,306, whereas the number of training data points *n*_{train} is 100 or 150, and then the number of training data pairs *n*_{train}(*n*_{train}−1)/2 is only 4950 or 11,175.

Data Set | c | m | n_{train} | n_{test} | n_{label} | n_{S} | n_{D} | n_{U} |
---|---|---|---|---|---|---|---|---|

Iris | 3 | 4 | 100 | 38 | 10 | 15.10 | 29.90 | 4905 |

Wine | 3 | 13 | 100 | 78 | 10 | 13.98 | 31.02 | 4905 |

Ionosphere | 2 | 34 | 100 | 251 | 20 | 97.50 | 92.50 | 4760 |

Balance | 3 | 4 | 100 | 465 | 10 | 20.38 | 24.62 | 4905 |

Breast cancer | 2 | 30 | 100 | 469 | 10 | 23.54 | 21.46 | 4905 |

Diabetes | 2 | 8 | 100 | 668 | 10 | 23.02 | 21.98 | 4905 |

c | m | n_{train} | n_{test} | n_{label} | n_{S} | n_{D} | n_{U} |

USPS_{1−5,20} | 5 | 64 | 100 | 2500 | 10 | 5 | 40 | 4905 |

USPS_{1−5,40} | 5 | 64 | 200 | 2500 | 20 | 30 | 160 | 19,710 |

USPS_{1−10, 20} | 10 | 64 | 200 | 2500 | 20 | 10 | 180 | 19,710 |

USPS_{1−10,40} | 10 | 64 | 400 | 2500 | 40 | 60 | 720 | 79,020 |

MNIST_{1,7} | 2 | 196 | 100 | 1000 | 4 | 2 | 4 | 4944 |

MNIST_{3,5,8} | 3 | 196 | 150 | 1500 | 9 | 9 | 27 | 11,139 |


Note: For each data set, *c* is the number of classes, *m* is the number of features, *n*_{train}/*n*_{test} is the number of training/test data points, and *n*_{label} is the number of class labels used to construct the similar and dissimilar pair sets.

All metric learning methods were run repeatedly on 50 random samplings of a given task. For each random sampling, we constructed the similar and dissimilar pair sets for training according to the class labels of the first few data points for training. Let *y*_{i} and *y*_{j} be the class labels of *x*_{i} and *x*_{j}; then

- • (*x*_{i}, *x*_{j}) is a similar pair with *y*_{i,j} = +1 if *y*_{i} = *y*_{j}.
- • (*x*_{i}, *x*_{j}) is a dissimilar pair with *y*_{i,j} = −1 if *y*_{i} ≠ *y*_{j}.

The sizes of the two pair sets were dependent on the specific random sampling of each UCI task but fixed for all samplings of each USPS and MNIST task. We measured the performance of the one-nearest-neighbor classifiers based on the learned metrics and the computation time for learning the metrics, where the “training data” for our classifiers included only the few data points having class labels.
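The pair-construction rule and the evaluation protocol above can be sketched as follows (illustrative helper names, not the authors' code):

```python
# Illustrative sketch of the protocol: build the similar/dissimilar pair
# sets from the class labels of the labeled points, and evaluate a
# one-nearest-neighbor classifier whose training data are only those
# labeled points, under a squared Mahalanobis distance.
import numpy as np
from itertools import combinations

def make_pair_sets(labels):
    """All pairs among labeled points: same class -> S, different -> D."""
    S, D = [], []
    for i, j in combinations(range(len(labels)), 2):
        (S if labels[i] == labels[j] else D).append((i, j))
    return S, D

def one_nn(X_labeled, y_labeled, X_test, A):
    """Predict each test point by its nearest labeled point under A."""
    preds = []
    for x in X_test:
        diff = X_labeled - x
        dist = np.einsum('ij,jk,ik->i', diff, A, diff)
        preds.append(y_labeled[int(np.argmin(dist))])
    return preds

S, D = make_pair_sets([0, 0, 1])
print(S, D)  # [(0, 1)] [(0, 2), (1, 2)]
```

This mirrors the setting in which only the few labeled points serve as the 1-NN classifier's training data.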

For SERAPH, we fixed the remaining hyperparameters for simplicity. Then four regularization settings were considered:

- • SERAPH_{none} stands for switching off both the entropy and the trace-norm regularization terms.
- • SERAPH_{post} stands for activating only the entropy regularization term.
- • SERAPH_{proj} stands for activating only the trace-norm regularization term.
- • SERAPH_{hyper} stands for activating both regularization terms.

There was no cross-validation for each random sampling, because we would like the metrics learned by different methods to be independent of those nearest-neighbor classifiers whose performance had a large deviation given limited supervised information.^{16} The hyperparameters of other methods (e.g., the number of reduced dimensions, the number of nearest neighbors, as well as the percentage of principal components) were selected as the best candidate value based on another 10 random samplings if no default or heuristic value was provided by the original authors of the codes.

### 6.2. Results.

#### 6.2.1. Artificial Data Sets.

#### 6.2.2. Gradient Projection Algorithm versus EM-Like Algorithm.

Before comparing the proposed SERAPH with other metric learning methods, we evaluated the gradient projection algorithm (GP) and the EM-like iterative algorithm (EM). Table 2 shows their performance under the hyperparameter setting SERAPH_{hyper}. By the paired *t*-test at the 5% significance level, GP and EM each won two times and tied eight times, and therefore GP and EM are basically comparable as two different solvers for the same optimization problem. However, EM was consistently more computationally efficient than GP in our experiments, and the difference in their average computation time was remarkable, which suggests that EM, as an indirect solver, could be a good alternative to the direct solver GP.

| Data Set | GP | EM | Paired *t*-Test | Computation Time Ratio |
|---|---|---|---|---|
| Iris | | | Tie | 3.00 |
| Wine | | | Tie | 2.94 |
| Ionosphere | | | Tie | 2.04 |
| Balance | | | Tie | 1.94 |
| Breast cancer | | | Tie | 1.43 |
| Diabetes | | | Tie | 1.67 |
| USPS_{1−5,20} | | | GP win | 2.58 |
| USPS_{1−5,40} | | | Tie | 2.77 |
| USPS_{1−10,20} | | | EM win | 2.81 |
| USPS_{1−10,40} | | | EM win | 2.43 |
| MNIST_{1,7} | | | GP win | 1.23 |
| MNIST_{3,5,8} | | | Tie | 1.99 |


Notes: Means with standard errors of the nearest-neighbor misclassification rate (in %) are shown, together with the results of the paired *t*-test at the 5% significance level. The computation time ratio is the average computation time of GP divided by that of EM.
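The paired comparison used for Table 2 can be reproduced in spirit as follows. This is a sketch only; the error-rate arrays below are synthetic placeholders, not the reported results, and `compare_solvers` is an illustrative helper name:

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_solvers(err_gp, err_em, alpha=0.05):
    """Paired t-test over matched random samplings.
    Returns 'GP win', 'EM win', or 'Tie' at significance level alpha."""
    t, p = ttest_rel(err_gp, err_em)
    if p >= alpha:
        return 'Tie'
    # Lower misclassification rate wins.
    return 'GP win' if np.mean(err_gp) < np.mean(err_em) else 'EM win'

# Synthetic example: GP's error is systematically 5 points higher,
# so the paired test declares EM the winner.
rng = np.random.default_rng(1)
base = rng.uniform(0.05, 0.15, size=50)      # 50 matched random samplings
noise = rng.normal(0.0, 0.005, size=50)
print(compare_solvers(base + 0.05 + noise, base))  # EM win
```

The pairing matters: because both solvers are evaluated on the same random samplings, the test compares per-sampling differences rather than pooled means.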

#### 6.2.3. Benchmark Data Sets.

The experimental results in terms of the nearest-neighbor misclassification rate are reported in Table 3, where the EM-like algorithm was used. GDM was very slow for high-dimensional data and was excluded from the comparison. SERAPH was fairly promising, especially with the hypersparsity setting (i.e., and ): it was the best method, or tied with the best, over all 12 tasks. It often outperformed the other methods with statistical significance, except ITML on the six UCI data sets, and it was superior to all other competitors, including SERAPH_{post} and SERAPH_{proj}, on the four USPS tasks. Furthermore, it successfully improved its accuracy even on the two ill-posed MNIST tasks. To sum up, SERAPH can reduce the risk of overfitting weakly labeled data with the help of unlabeled data, and hence our sparsity regularization is reasonable and practical.

| Method | Iris | Wine | Ionosphere | Balance | Breast Cancer | Diabetes |
|---|---|---|---|---|---|---|
| EUCLIDEAN | | | | | | |
| GDM | | | | | | |
| NCA | | | | | | |
| LMNN | | | | | | |
| ITML | | | | | | |
| LDM | | | | | | |
| MFDA | | | | | | |
| SERAPH_{none} | | | | | | |
| SERAPH_{post} | | | | | | |
| SERAPH_{proj} | | | | | | |
| SERAPH_{hyper} | | | | | | |

| Method | USPS_{1−5,20} | USPS_{1−5,40} | USPS_{1−10,20} | USPS_{1−10,40} | MNIST_{1,7} | MNIST_{3,5,8} |
|---|---|---|---|---|---|---|
| EUCLIDEAN | | | | | | |
| GDM | - | - | - | - | - | |
| NCA | | | | | | |
| LMNN | | | | | | |
| ITML | | | | | | |
| LDM | | | | | | |
| MFDA | | | | | | |
| SERAPH_{none} | | | | | | |
| SERAPH_{post} | | | | | | |
| SERAPH_{proj} | | | | | | |
| SERAPH_{hyper} | | | | | | |


Note: For each data set, the best method and the methods comparable to it based on the unpaired *t*-test at the 5% significance level are in bold.

In vivid contrast with SERAPH, which exhibited this generalization capability, the supervised methods could learn a metric even worse than the Euclidean distance due to overfitting, especially NCA, which optimized the expected leave-one-out classification error on a limited amount of labeled data. The powerful LMNN did not behave satisfactorily, since its requirement of finding many neighbors belonging to the same class within the labeled data could hardly be fulfilled. ITML worked very well despite the fact that it can access only weakly labeled data, but it became less useful for high-dimensional data. We observed that LDM might fail when the principal components of all training data were not close to the eigenvectors of the optimal matrix being learned, and MFDA might fail if the training data were insufficient to recover the data manifold.

Another observation is that the methods using global metric information usually outperformed those using local metric information in our experiments, since the supervised information was insufficient; this is opposite to what is typically observed in supervised metric learning settings. It indicates that the methods using local metric information tend to fit the given information too closely and suffer from overfitting, since local metric information always focuses on a small quantity of data in a local neighborhood and thus has a relatively large deviation.

#### 6.2.4. Computational Efficiency.

Figure 4 summarizes the corresponding experimental results in terms of the average computation time (GDM was excluded from the comparison due to its low speed). The computation time was measured in seconds and is drawn on a logarithmic scale with base 10. The shortest average computation time was 0.1677 seconds, by MFDA on Iris, and the longest was 3023 seconds, by LMNN on MNIST_{3,5,8}. Generally, SERAPH (when the EM-like algorithm was used) was the second most computationally efficient method. The most computationally efficient method, MFDA, consists of just two steps: solving the linear system of locally linear embedding (Roweis & Saul, 2000) and then solving a generalized eigenvalue problem as in Fisher discriminant analysis (Fisher, 1936). Improvements in the speed of SERAPH may be expected if we program in MATLAB with C/C++, as NCA and LMNN do.

#### 6.2.5. Sensitivity to Regularization Parameters.

Recall that there was no cross-validation within each random sampling, so it would be helpful to test the sensitivity of SERAPH to the regularization parameters and . Six benchmark data sets were included:

- Diabetes, Iris, and Ionosphere, on which SERAPH_{none}, SERAPH_{post}, and SERAPH_{proj}, respectively, had the lowest means of the nearest-neighbor misclassification rate in Table 3.
- Balance, USPS_{1−5,20}, and MNIST_{1,7}, on which SERAPH_{hyper} had the lowest means of the nearest-neighbor misclassification rate in Table 3.

The candidate values ranged from 2^{−3} to 2^{+3} with 2^{0.5} as the multiplicative factor, and the actual regularization parameter being used was . The gradient projection algorithm was repeatedly run, for all combinations of and , on 10 random samplings, namely the first 10 of the 50 random samplings. The resulting contours are displayed in Figure 5.
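The candidate grid described above (13 values from 2^{−3} to 2^{+3}, each a factor of 2^{0.5} apart, scanned over all pairs of the two regularization parameters) can be generated as follows; the names `mu_candidates` and `gamma_candidates` are stand-ins for the paper's two parameters:

```python
import itertools
import numpy as np

# Candidate exponents: -3, -2.5, ..., +3 (13 values, factor 2^0.5 apart).
exponents = np.arange(-3.0, 3.0 + 0.25, 0.5)
candidates = 2.0 ** exponents
mu_candidates, gamma_candidates = candidates, candidates

# All (mu, gamma) combinations scanned for the contour plots.
grid = list(itertools.product(mu_candidates, gamma_candidates))
print(len(candidates), len(grid))  # 13 169
```

Each of the 169 grid points is then evaluated on 10 random samplings, yielding the contour plots of Figure 5.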

We can see that SERAPH worked well in large areas of the contour plots in Figure 5, and we can clearly observe two phenomena.

First, SERAPH was more sensitive to for the low-dimensional tasks (Diabetes, Iris, Ionosphere, and Balance), and the learned metrics became worse when became large. Even for USPS_{1−5,20}, the learned metrics became worse when became large for small . Note that a large value implies strong regularization on the trace norm of *A*, which upper-bounds the rank of *A*, and the rank of *A* ultimately controls the number of parameters to be learned in *A*. This explains why the contours of MNIST_{1,7} differed from the others: there were so many parameters in *A* that a large value did not make SERAPH significantly overregularized.
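The rank-controlling effect of the trace-norm penalty discussed above can be illustrated with the proximal operator of the trace norm, which soft-thresholds singular values; this is a generic illustration under standard assumptions, not SERAPH's actual solver:

```python
import numpy as np

def trace_norm_prox(A, tau):
    """prox of tau * ||A||_* : soft-threshold the singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 2))
A = M @ M.T   # a 5x5 positive semidefinite matrix of rank 2

# A stronger penalty tau zeroes more singular values, hence lower rank.
weak = trace_norm_prox(A, tau=1e-3)
strong = trace_norm_prox(A, tau=1e3)
print(np.linalg.matrix_rank(weak), np.linalg.matrix_rank(strong))
```

This is why a very large trace-norm weight can overregularize a low-dimensional problem: the surviving rank, and thus the number of effective parameters in *A*, shrinks toward zero.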

Second, SERAPH was also sensitive to for the high-dimensional tasks USPS_{1−5,20} and MNIST_{1,7}, but this time, when became large, the learned metrics became worse for small yet better for large . In other words, SERAPH easily became overregularized by emphasizing unlabeled data too much when the number of parameters to be learned in *A* was improperly large, whereas it hardly became overregularized when the number of parameters was properly small. Additionally, for USPS_{1−5,20} and MNIST_{1,7}, neither the posterior sparsity nor the projection sparsity worked alone, but they became very powerful once integrated into the hypersparsity. A final caveat is that the hypersparsity can never be a panacea for such high-dimensional tasks, and we should not rely on it too heavily. The contours of MNIST_{1,7} exhibit a typical effect: the learned metrics became worse suddenly and rapidly along the line when became very large.

## 7. Extensions

In this section, we explain the kernel and manifold extensions of SERAPH. The technique for kernelizing a metric learning method was originally proposed in Jain, Kulis, and Dhillon (2010), and the technique for manifold regularizing a metric learning method was originally proposed in Hoi et al. (2008).

### 7.1. Kernel Extension.

We cannot learn *W* directly, since is often very large and possibly infinite. In order to learn *W* indirectly, we rewrite optimization 3.1 with respect to *W*: where is similar to equation 3.2: Subsequently, according to Jain et al. (2010), any optimal solution will be of the form , where is the design matrix obtained by applying to the training set and is an optimal solution to which is actually optimization 3.1 with respect to *A* but with in equation 3.2 replaced with *K*. Then for any : All components of SERAPH remain the same after replacing *x_i* with the corresponding *k_i*. The resulting Mahalanobis distance metric will be highly nonlinear with respect to the original input data domain.

Note that *k_i* must be from , and hence the sparse variant of the cosine kernel is with a hyperparameter , where means that *x* is one of the nearest neighbors of *x_i* in ; was set to 11 so that 10 nearest neighbors were found for each point besides itself. We can see from Table 4 that SERAPH still performed well, and even better, after applying the kernel extension. Among all 12 tasks on the UCI, USPS, and MNIST data sets, the records were improved by the kernel extension in seven tasks, and the improvement was significant under the paired *t*-test at the 5% significance level. We may roughly compare the experimental results in Table 4 with similar results in Jain et al. (2010)^{17} and Wang, Do, Woznica, and Kalousis (2011), while we should be aware that there were many fewer training data for SERAPH as well as for the subsequent nearest-neighbor classifiers, and that a single kernel is also much weaker than multiple kernels.
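One plausible reading of the sparse cosine kernel described above can be sketched as follows. The symmetrization rule and the neighbor criterion here are assumptions for illustration (the original definition in the paper should be consulted); only the "keep cosine similarity among the 11 most similar points, 10 neighbors besides the point itself" idea is taken from the text:

```python
import numpy as np

def sparse_cosine_kernel(X, k=11):
    """Cosine similarity matrix, kept only where one point is among the
    k most cosine-similar points (including itself) of the other."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn @ Xn.T                        # full cosine similarity
    # For each row, indices of the k most similar points (incl. itself).
    nn = np.argsort(-C, axis=1)[:, :k]
    mask = np.zeros_like(C, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    mask[rows, nn.ravel()] = True
    mask |= mask.T                       # symmetrize the neighborhood graph
    return np.where(mask, C, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
K = sparse_cosine_kernel(X, k=11)       # k=11: 10 neighbors besides self
assert np.allclose(K, K.T)              # symmetric by construction
assert np.count_nonzero(K) < K.size     # genuinely sparsified
```

With such a sparse *K*, each kernelized feature vector *k_i* has few nonzero entries, which keeps the kernel extension tractable for larger training sets.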