Abstract

Distance metric learning has been widely used to obtain the optimal distance function based on the given training data. We focus on a triplet-based loss function, which imposes a penalty such that a pair of instances in the same class is closer than a pair in different classes. However, the number of possible triplets can be quite large even for a small data set, and this considerably increases the computational cost for metric optimization. In this letter, we propose safe triplet screening that identifies triplets that can be safely removed from the optimization problem without losing the optimality. In comparison with existing safe screening studies, triplet screening is particularly significant because of the huge number of possible triplets and the semidefinite constraint in the optimization problem. We demonstrate and verify the effectiveness of our screening rules by using several benchmark data sets.

1  Introduction

Using an appropriate distance function is essential for various machine learning tasks. For example, the performance of a k-nearest neighbor (k-NN) classifier, one of the most standard classification methods, depends crucially on the distance between different input instances. The simple Euclidean distance is usually employed, but it is not necessarily optimal for a given data set and task. Thus, the adaptive optimization of the distance metric based on supervised information is expected to improve the performance of machine learning methods including k-NN.

Distance metric learning (Weinberger & Saul, 2009; Schultz & Joachims, 2004; Davis, Kulis, Jain, Sra, & Dhillon, 2007; Kulis, 2013) is a widely accepted technique for acquiring the optimal metric from observed data. The standard problem setting is to learn the following parameterized Mahalanobis distance,
$$d_M(x_i, x_j) := \sqrt{(x_i - x_j)^\top M (x_i - x_j)},$$
where $x_i$ and $x_j$ are $d$-dimensional feature vectors and $M \in \mathbb{R}^{d \times d}$ is a positive semidefinite matrix. This approach has been applied to tasks such as classification (Weinberger & Saul, 2009), clustering (Xing, Jordan, Russell, & Ng, 2003), and ranking (McFee & Lanckriet, 2010). These studies show that the optimized distance metric improves the prediction performance of each task. Metric optimization has also attracted wide interest, even from researchers engaged in recent deep network studies (Schroff, Kalenichenko, & Philbin, 2015; Hoffer & Ailon, 2015).
The seminal work of distance metric learning (Weinberger & Saul, 2009) presents a triplet-based formulation. A triplet $(i,j,l)$ is defined by the pair $x_i$ and $x_j$, which have the same label (same class), and $x_l$, which has a different label (different class). For a triplet $(i,j,l)$, the desirable metric would satisfy $d_M(x_i,x_j) < d_M(x_i,x_l)$, meaning that the pair in the same class is closer than the pair in different classes. For each of the triplets, Weinberger and Saul (2009) define a loss function that penalizes violations of this constraint,
$$\ell\big(d_M^2(x_i,x_l) - d_M^2(x_i,x_j)\big), \quad \text{for } (i,j,l) \in \mathcal{T},$$
where $\mathcal{T}$ is a set of triplets and $\ell : \mathbb{R} \to \mathbb{R}$ is some loss function (e.g., the standard hinge loss function). In addition to the triplet loss, other approaches, such as pairwise- and virtual point-based loss functions, have been studied. In the pairwise approach, the number of pairs can be much smaller than the number of triplets; Davis et al. (2007) used only $20c^2$ pairs, where $c$ is the number of classes. The virtual point approach (Perrot & Habrard, 2015) converts the metric learning problem into a least squares problem, which minimizes a loss function for $n$ virtual points. We particularly focus on the triplet approach because the relative evaluation $d_M(x_i,x_j) < d_M(x_i,x_l)$ would be more appropriate for many metric learning applications, such as nearest-neighbor classification (Weinberger & Saul, 2009) and similarity search (Jain, Kulis, Dhillon, & Grauman, 2009), in which relative comparison among objects plays an essential role. In fact, a recent comprehensive survey (Li & Tian, 2018) showed that many current state-of-the-art methods are based on triplet loss, which they referred to as relative loss. The quadruplet approach (Law, Thome, & Cord, 2013) can also incorporate higher-order relations, but we mainly focus on the triplet approach because it is much more popular in the community. Our framework can nevertheless accommodate the quadruplet case, as we explain in section 6.3.

However, the set of triplets $\mathcal{T}$ is quite large even for a small data set. For example, in a two-class problem with 100 instances in each class, the number of possible triplets is 1,980,000. Because processing a huge number of triplets is computationally prohibitive, a small subset is often used in practice (Weinberger & Saul, 2009; Shi, Bellet, & Sha, 2014; Capitaine, 2016). Typically, a subset of triplets is selected by using the neighbors of each training instance. For $n$ training instances, Shi et al. (2014) selected only $30n$ triplets, and Weinberger and Saul (2009) selected at most $O(kn^2)$ triplets, where $k$ is a prespecified constant. However, the effect on the final accuracy of these heuristic selections is difficult to know beforehand. Jain, Mason, and Nowak (2017) theoretically analyzed a probabilistic generalization error bound for a random subsampling strategy of triplets. Their analysis revealed the sample complexity of metric learning, but the tightness of the bound is not clear and they did not demonstrate the practical use of determining the required number of triplets. For ordinal data embedding, Jamieson and Nowak (2011) showed a lower bound of required triplets $\Omega(dn \log n)$ to determine the embedding, but the tightness of this bound is also not known. Further, the applicability of the analysis to metric learning was not clarified.
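The triplet count quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `count_triplets` is hypothetical, and it counts ordered same-class pairs (i, j) with i ≠ j combined with every instance l from another class, which matches the definition of the triplet set used later in section 2.1.

```python
from collections import Counter

def count_triplets(labels):
    """Count triplets (i, j, l) with y_i == y_j, i != j, and y_l != y_i."""
    n = len(labels)
    total = 0
    for size in Counter(labels).values():
        # `size` choices for i, (size - 1) for j, and (n - size) for l.
        total += size * (size - 1) * (n - size)
    return total

# Two classes with 100 instances each, as in the example in the text.
labels = [0] * 100 + [1] * 100
print(count_triplets(labels))  # 1980000
```

The count grows cubically in the class sizes, which is why heuristic subsampling or screening becomes necessary even for a few hundred instances.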

Our safe triplet screening enables the identification of triplets that can be safely removed from the optimization problem without losing the optimality of the resulting metric. This means that our approach can accelerate the optimization of time-consuming metric learning with the guarantee of optimality. Figure 1 shows a schematic illustration of safe triplet screening.

Figure 1:

Metric learning with safe triplet screening. The naive optimization needs to minimize the sum of the loss function values for a huge number of triplets $(i,j,l)$. Safe triplet screening identifies a subset of $\mathcal{L}$ (blue points in the illustration on the right) and a subset of $\mathcal{R}$ (green points in the illustration on the right), corresponding to the region of the loss function in which each triplet lies at the optimal $M$. This enables the number of triplets in the optimization problem to be reduced.


Our approach is inspired by the safe feature screening of Lasso (Ghaoui, Viallon, & Rabbani, 2010), in which unnecessary features are identified by the following procedure:

  • Step 1: Construct a bounded region in which the optimal dual solution is guaranteed to exist.

  • Step 2: Given the bound created by step 1, remove features that cannot be selected by Lasso.

This procedure is useful to mitigate the optimization difficulty of Lasso for high-dimensional problems; thus, many papers propose a variety of approaches to create bounded regions for obtaining a tighter bound that increases screening performance (Wang, Zhou, Wonka, & Ye, 2013; Liu, Zhao, Wang, & Ye, 2014; Fercoq, Gramfort, & Salmon, 2015; Xiang, Wang, & Ramadge, 2017). As another direction of research, the screening idea has been applied to other learning methods, including nonsupport vector screening for the support vector machine (Ogawa, Suzuki, & Takeuchi, 2013), subspace screening for nuclear norm regularization (Zhou & Zhao, 2015), and group screening for the group Lasso (Ndiaye, Fercoq, Gramfort, & Salmon, 2016).

Based on the safe feature screening techniques, we build the procedure of our safe triplet screening as follows:

  • Step 1: Construct a bounded region in which the optimal solution $M^\star$ is guaranteed to exist.

  • Step 2: For each triplet $(i,j,l) \in \mathcal{T}$, verify the possible loss function value under the condition created by step 1.

We show that as a result of step 2, we can reduce the size of the metric learning optimization problem, by which the computational cost of the optimization can be drastically reduced. Although a variety of extensions of safe screening have been studied in the machine learning community (Lee & Xing, 2014; Wang, Wonka, & Ye, 2014; Zimmert, de Witt, Kerg, & Kloft, 2015; Zhang et al., 2016; Ogawa et al., 2013; Okumura, Suzuki, & Takeuchi, 2015; Shibagaki, Karasuyama, Hatano, & Takeuchi, 2016; Shibagaki, Suzuki, Karasuyama, & Takeuchi, 2015; Nakagawa, Suzumura, Karasuyama, Tsuda, & Takeuchi, 2016; Takada, Hanada, Yamada, Sakuma, & Takeuchi, 2016; Hanada, Shibagaki, Sakuma, & Takeuchi, 2018), to the best of our knowledge, no studies have considered screening for metric learning. Compared with existing studies, our safe triplet screening is particularly significant due to the huge number of possible triplets and the semidefinite constraint. Our technical contributions are summarized as follows:

  • We derive six spherical regions in which the optimal M must lie and analyze their relationships.

  • We derive three types of screening rules, each of which employs a different approach to the semidefinite constraint.

  • We derive efficient rule evaluation for a special case when M is a diagonal matrix.

  • We build an extension for the regularization path calculation.

We further demonstrate the effectiveness of our approach based on several benchmark data sets with a huge number of triplets.

This letter is organized as follows. In section 2, we define the optimization problem of large-margin metric learning. In section 3, we first derive six bounds containing optimal M for the subsequent screening procedure. Section 4 derives the rules and constructs our safe triplet screening. The computational cost for the rule evaluation is analyzed in section 5. Extensions are discussed in section 6, in which an algorithm specifically designed for the regularization path calculation, and a special case, in which M is a diagonal matrix, are considered. In section 7, we present the evaluation of our approach through numerical experiments. Section 8 concludes.

1.1  Notation

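We denote by $[n]$ the set $\{1, 2, \ldots, n\}$ for any integer $n \in \mathbb{N}$. The inner product of the matrices is denoted by $\langle A, B \rangle := \sum_{ij} A_{ij} B_{ij} = \mathrm{tr}(A^\top B)$. The squared Frobenius norm is represented by $\|A\|_F^2 := \langle A, A \rangle$. The positive semidefinite matrix $M \in \mathbb{R}^{d \times d}$ is denoted by $M \succeq O$ or $M \in \mathbb{R}_+^{d \times d}$. By using the eigenvalue decomposition of matrix $M = V \Lambda V^\top$, matrices $M_+$ and $M_-$ are defined as follows,
$$M = V \underbrace{(\Lambda_+ + \Lambda_-)}_{\Lambda} V^\top = \underbrace{V \Lambda_+ V^\top}_{:= M_+} + \underbrace{V \Lambda_- V^\top}_{:= M_-},$$
where $\Lambda_+$ and $\Lambda_-$ are constructed only by the positive and negative components of the diagonal matrix $\Lambda$. Note that $\langle M_+, M_- \rangle = \mathrm{tr}(V \Lambda_+ V^\top V \Lambda_- V^\top) = \mathrm{tr}(V O V^\top) = 0$, and $M_+$ is a projection of $M$ onto the semidefinite cone, that is, $M_+ = \arg\min_{A \succeq O} \|A - M\|_F^2$.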
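The decomposition into positive and negative parts is used repeatedly in the bounds and rules that follow, so a small numerical sketch may help. The helper name `split_psd` is hypothetical, and the code assumes a symmetric input matrix so that NumPy's `eigh` applies.

```python
import numpy as np

def split_psd(M):
    """Split a symmetric matrix into its positive and negative parts.

    M_plus keeps the nonnegative eigenvalues, M_minus the negative ones,
    so that M = M_plus + M_minus and <M_plus, M_minus> = 0.
    """
    eigvals, V = np.linalg.eigh(M)
    M_plus = (V * np.maximum(eigvals, 0.0)) @ V.T   # V diag(max(lambda, 0)) V^T
    M_minus = (V * np.minimum(eigvals, 0.0)) @ V.T  # V diag(min(lambda, 0)) V^T
    return M_plus, M_minus

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = (A + A.T) / 2.0                 # symmetric and, in general, indefinite
M_plus, M_minus = split_psd(M)
print(np.sum(M_plus * M_minus))     # inner product, numerically close to 0
```

Because `M_plus` is the Frobenius-norm projection onto the semidefinite cone, no other positive semidefinite matrix is closer to `M`.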

2  Preliminary

Let $\{(x_i, y_i) \mid i \in [n]\}$ be $n$ pairs of a $d$-dimensional feature vector $x_i \in \mathbb{R}^d$ and a label $y_i \in \mathcal{Y}$, where $\mathcal{Y}$ is a discrete label space. We consider learning the following Mahalanobis distance,
$$d_M(x_i, x_j) := \sqrt{(x_i - x_j)^\top M (x_i - x_j)}, \tag{2.1}$$
where $M \in \mathbb{R}_+^{d \times d}$ is a positive semidefinite matrix that parameterizes distance. As a general form of the metric learning problem, we consider a regularized triplet loss minimization (RTLM) problem. Our formulation is mainly based on a model originally proposed by Weinberger and Saul (2009), which is reduced to a convex optimization problem with the semidefinite constraint. For later analysis, we derive primal and dual formulations, and to discuss the optimality of the learned metric, we focus on the convex formulation of RTLM in this letter.

2.1  Triplet-Based Loss Function

We define the set of triplets as
$$\mathcal{T} = \{(i,j,l) \mid (i,j) \in \mathcal{S},\ y_i \neq y_l,\ l \in [n]\},$$
where $\mathcal{S} = \{(i,j) \mid y_i = y_j,\ i \neq j,\ (i,j) \in [n] \times [n]\}$. The set $\mathcal{S}$ contains index pairs from the same class, and $\mathcal{T}$ contains triplets of indices consisting of $(i,j) \in \mathcal{S}$ and $l$, which is in a class that differs from that of $i$ and $j$. We refer to the following loss as the triplet loss:
$$\ell\big(d_M^2(x_i,x_l) - d_M^2(x_i,x_j)\big), \quad \text{for } (i,j,l) \in \mathcal{T},$$
where $\ell : \mathbb{R} \to \mathbb{R}$ is some loss function. By substituting equation 2.1 into the triplet loss, this can be written as
$$\ell(\langle M, H_{ijl} \rangle),$$
where $H_{ijl} := (x_i - x_l)(x_i - x_l)^\top - (x_i - x_j)(x_i - x_j)^\top$. For the triplet loss, we consider the hinge function,
$$\ell(x) = \max\{0, 1 - x\}, \tag{2.2}$$
or the smoothed hinge function,
$$\ell(x) = \begin{cases} 0, & x > 1, \\ \frac{1}{2\gamma}(1 - x)^2, & 1 - \gamma \le x \le 1, \\ 1 - x - \frac{\gamma}{2}, & x < 1 - \gamma, \end{cases} \tag{2.3}$$
where $\gamma > 0$ is a parameter. Note that the smoothed hinge includes the hinge function as a special case ($\gamma \to 0$). The triplet loss imposes a penalty if a pair $(i,j) \in \mathcal{S}$ is more distant than the threshold compared with a pair $i$ and $l$, which are in different classes. Both loss functions contain a region in which no penalty is imposed. We refer to this as the zero region. The two loss functions also contain a region in which the penalty increases linearly, which we refer to as the linear region.
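The following sketch illustrates the two loss functions and the matrix form of the triplet margin. The function names are hypothetical; the code assumes feature vectors stored as NumPy arrays and checks numerically that the inner product of $M$ and $H_{ijl}$ equals the difference of squared Mahalanobis distances.

```python
import numpy as np

def hinge(x):
    """Hinge loss, equation 2.2."""
    return np.maximum(0.0, 1.0 - x)

def smoothed_hinge(x, gamma):
    """Smoothed hinge loss, equation 2.3 (requires gamma > 0)."""
    x = np.asarray(x, dtype=float)
    quadratic = (1.0 - x) ** 2 / (2.0 * gamma)
    linear = 1.0 - x - gamma / 2.0
    return np.where(x > 1.0, 0.0, np.where(x >= 1.0 - gamma, quadratic, linear))

def triplet_matrix(xi, xj, xl):
    """H_ijl = (xi - xl)(xi - xl)^T - (xi - xj)(xi - xj)^T."""
    a, b = xi - xl, xi - xj
    return np.outer(a, a) - np.outer(b, b)

rng = np.random.default_rng(1)
xi, xj, xl = rng.standard_normal((3, 4))
A = rng.standard_normal((4, 4))
M = A @ A.T                                   # a positive semidefinite metric
margin = np.sum(M * triplet_matrix(xi, xj, xl))   # <M, H_ijl>
print(margin, smoothed_hinge(margin, gamma=0.05))
```

A margin above 1 falls in the zero region, and a margin below $1 - \gamma$ falls in the linear region.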

2.2  Primal and Dual Formulation of Triplet-Based Distance Metric Learning

Using the standard squared regularization, we consider the following RTLM as a general form of metric learning:
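$$\min_{M \succeq O} P_\lambda(M) := \sum_{ijl} \ell(\langle M, H_{ijl} \rangle) + \frac{\lambda}{2} \|M\|_F^2, \tag{Primal}$$
where $\sum_{ijl}$ denotes $\sum_{(i,j,l) \in \mathcal{T}}$, and $\lambda > 0$ is a regularization parameter. In section 6.3, we discuss the relation of RTLM to existing metric learning methods.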
The dual problem is written as
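$$\max_{0 \le \alpha \le 1,\ \Gamma \succeq O} D_\lambda(\alpha, \Gamma) := -\frac{\gamma}{2} \|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2} \|M_\lambda(\alpha, \Gamma)\|_F^2, \tag{Dual1}$$
where $\alpha \in \mathbb{R}^{|\mathcal{T}|}$, which contains $\alpha_{ijl}$ for $(i,j,l) \in \mathcal{T}$, and $\Gamma \in \mathbb{R}^{d \times d}$ are dual variables, and
$$M_\lambda(\alpha, \Gamma) := \frac{1}{\lambda} \Big[ \sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma \Big]. \tag{2.4}$$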
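A derivation of this dual problem is presented in appendix A. Because the last term $\max_{\Gamma \succeq O} -\frac{1}{2} \|M_\lambda(\alpha, \Gamma)\|_F^2$ is equivalent to the projection onto a semidefinite cone (Boyd & Xiao, 2005; Malick, 2004), the above problem, Dual1, can be simplified as
$$\max_{0 \le \alpha \le 1} D_\lambda(\alpha) := -\frac{\gamma}{2} \|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2} \|M_\lambda(\alpha)\|_F^2, \tag{Dual2}$$
where
$$M_\lambda(\alpha) := \frac{1}{\lambda} \Big[ \sum_{ijl} \alpha_{ijl} H_{ijl} \Big]_+.$$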
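For the optimal $M^\star$, each of the triplets in $\mathcal{T}$ can be categorized into three groups:
$$\begin{aligned} \mathcal{R} &:= \{(i,j,l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star \rangle > 1\}, \\ \mathcal{C} &:= \{(i,j,l) \in \mathcal{T} \mid 1 - \gamma \le \langle H_{ijl}, M^\star \rangle \le 1\}, \\ \mathcal{L} &:= \{(i,j,l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star \rangle < 1 - \gamma\}. \end{aligned} \tag{2.5}$$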
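This indicates that the triplets in $\mathcal{R}$ and those in $\mathcal{L}$ lie in the zero region and the linear region of the loss function, respectively. The well-known KKT conditions provide the following relation between the optimal dual variable and the derivative of the loss function (see appendix A for details):
$$\alpha^\star_{ijl} = -\nabla \ell(\langle M^\star, H_{ijl} \rangle). \tag{2.6}$$
In the case of hinge loss, the derivative is written as
$$\nabla \ell(\langle M^\star, H_{ijl} \rangle) = \begin{cases} 0, & \langle M^\star, H_{ijl} \rangle > 1, \\ -c, & \langle M^\star, H_{ijl} \rangle = 1, \\ -1, & \langle M^\star, H_{ijl} \rangle < 1, \end{cases}$$
where $c \in [0,1]$. In the case of smoothed hinge loss, the derivative is
$$\nabla \ell(\langle M^\star, H_{ijl} \rangle) = \begin{cases} 0, & \langle M^\star, H_{ijl} \rangle > 1, \\ -\frac{1}{\gamma}(1 - \langle M^\star, H_{ijl} \rangle), & 1 - \gamma \le \langle M^\star, H_{ijl} \rangle \le 1, \\ -1, & \langle M^\star, H_{ijl} \rangle < 1 - \gamma. \end{cases}$$
Both cases can be represented as
$$\nabla \ell(\langle M^\star, H_{ijl} \rangle) \begin{cases} = 0, & (i,j,l) \in \mathcal{R}, \\ \in -[0,1], & (i,j,l) \in \mathcal{C}, \\ = -1, & (i,j,l) \in \mathcal{L}. \end{cases} \tag{2.7}$$
From equations 2.7 and 2.6, we obtain the following rules for the optimal dual variable:
$$(i,j,l) \in \mathcal{R} \Rightarrow \alpha^\star_{ijl} = 0, \qquad (i,j,l) \in \mathcal{C} \Rightarrow \alpha^\star_{ijl} \in [0,1], \qquad (i,j,l) \in \mathcal{L} \Rightarrow \alpha^\star_{ijl} = 1. \tag{2.8}$$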
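As a small illustration of equations 2.5 to 2.8, the sketch below classifies triplets by their margins and maps margins to dual variables for the smoothed hinge loss. The function names are hypothetical, and `margins` is assumed to hold the values $\langle M, H_{ijl} \rangle$ for some matrix $M$; the groups coincide with $\mathcal{R}$, $\mathcal{C}$, and $\mathcal{L}$ only when $M$ is the optimal solution.

```python
import numpy as np

def classify_triplets(margins, gamma):
    """Boolean masks (in_R, in_C, in_L) following equation 2.5."""
    in_R = margins > 1.0              # zero region of the loss
    in_L = margins < 1.0 - gamma      # linear region of the loss
    in_C = ~(in_R | in_L)             # remaining (curved) region
    return in_R, in_C, in_L

def dual_from_margins(margins, gamma):
    """alpha = -(derivative of the smoothed hinge), equation 2.6, gamma > 0."""
    return np.clip((1.0 - margins) / gamma, 0.0, 1.0)

margins = np.array([1.5, 0.9, 0.2])
in_R, in_C, in_L = classify_triplets(margins, gamma=0.2)
alpha = dual_from_margins(margins, gamma=0.2)
print(in_R, in_C, in_L)
print(alpha)   # approximately [0, 0.5, 1]
```

The clipping reproduces equation 2.8: the dual variable is 0 in the zero region, 1 in the linear region, and lies in between otherwise.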

The nonlinear semidefinite programming problem of RTLM can be solved by gradient methods including the primal-based (Weinberger & Saul, 2009) and dual-based approaches (Shen, Kim, Liu, Wang, & Van Den Hengel, 2014). However, the amount of computation may be prohibitive because of the large number of triplets. The naive calculation of the objective function requires $O(d^2 |\mathcal{T}|)$ computations for both the primal and the dual cases.

2.3  Reduced-Size Optimization Problem

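Assume that we have identified a subset of triplets $(i,j,l) \in \mathcal{L} \cup \mathcal{R}$ before solving the optimization problem. Let $\hat{\mathcal{L}} \subseteq \mathcal{L}$ and $\hat{\mathcal{R}} \subseteq \mathcal{R}$ be the identified subsets of $\mathcal{L}$ and $\mathcal{R}$. Then, based on this prior knowledge, the optimization problem, Primal, can be transformed into the following reduced-size problem:
$$\min_{M \succeq O} \tilde{P}_\lambda(M) := \sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell(\langle M, H_{ijl} \rangle) + \frac{\lambda}{2} \|M\|_F^2 + \Big(1 - \frac{\gamma}{2}\Big) |\hat{\mathcal{L}}| - \Big\langle M, \sum_{(i,j,l) \in \hat{\mathcal{L}}} H_{ijl} \Big\rangle, \tag{2.9}$$
where $\tilde{\mathcal{T}} := \mathcal{T} - \hat{\mathcal{L}} - \hat{\mathcal{R}}$. This problem differs from the original, Primal, as follows:

  • The loss term for $\hat{\mathcal{R}}$ is removed because it does not produce any penalty at the optimal solution.

  • The loss term for $\hat{\mathcal{L}}$ is fixed at the linear part of the loss function, by which the sum over triplets can be calculated beforehand (the last two terms).

The dual problem of this reduced-size problem can be written as
$$\max_{0 \le \alpha \le 1} \tilde{D}_\lambda(\alpha) := -\frac{\gamma}{2} \|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2} \|M_\lambda(\alpha)\|_F^2, \quad \text{s.t. } \alpha_{\hat{\mathcal{L}}} = \mathbf{1},\ \alpha_{\hat{\mathcal{R}}} = \mathbf{0}, \tag{2.10}$$
which is the same optimization problem as Dual2 except that $\alpha_{\hat{\mathcal{L}}}$ and $\alpha_{\hat{\mathcal{R}}}$ are fixed. Because of this constraint, the number of free variables in this dual problem is $|\tilde{\mathcal{T}}|$. An important property of a reduced-size problem is that it retains the same optimal solution as the original problem: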
Lemma 1.

The primal-dual problem pair, equations 2.9 and 2.10, and the original problem pair, Primal and Dual2, have the same optimal primal and dual solutions.

The proof of this lemma is shown in appendix B, along with the derivation of the reduced-size dual, equation 2.10. Therefore, if large subsets $\hat{\mathcal{L}}$ and $\hat{\mathcal{R}}$ could be detected beforehand (i.e., $|\tilde{\mathcal{T}}| \ll |\mathcal{T}|$), the metric learning optimization would be accelerated dramatically.
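The algebra behind the reduced-size primal can be illustrated numerically. The sketch below is built on assumptions: the triplet matrices are synthetic, all names are hypothetical, and the masks `in_L` and `in_R` are taken from the margins at the evaluation point $M$ itself, purely so that the identity between the two objectives at that point can be checked. In practice the masks would come from the safe screening rules, not from the current margins.

```python
import numpy as np

def smoothed_hinge(x, gamma):
    return np.where(x > 1.0, 0.0,
                    np.where(x >= 1.0 - gamma, (1.0 - x) ** 2 / (2.0 * gamma),
                             1.0 - x - gamma / 2.0))

def full_objective(M, H, lam, gamma):
    margins = np.einsum("tab,ab->t", H, M)
    return smoothed_hinge(margins, gamma).sum() + 0.5 * lam * np.sum(M * M)

def reduced_objective(M, H, lam, gamma, in_L, in_R):
    """Reduced-size primal: only the unscreened triplets are evaluated."""
    keep = ~(in_L | in_R)
    margins = np.einsum("tab,ab->t", H[keep], M)
    H_L_sum = H[in_L].sum(axis=0)                 # can be cached once
    constant = (1.0 - gamma / 2.0) * in_L.sum()   # (1 - gamma/2) |L_hat|
    return (smoothed_hinge(margins, gamma).sum() + 0.5 * lam * np.sum(M * M)
            + constant - np.sum(M * H_L_sum))

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 50, 3))            # synthetic difference vectors
H = np.einsum("ta,tb->tab", a, a) - np.einsum("ta,tb->tab", b, b)
M, lam, gamma = 0.5 * np.eye(3), 1.0, 0.1

margins = np.einsum("tab,ab->t", H, M)
in_L, in_R = margins < 1.0 - gamma, margins > 1.0
print(full_objective(M, H, lam, gamma),
      reduced_objective(M, H, lam, gamma, in_L, in_R))
```

After the one-time sum over the screened linear-region triplets, each evaluation touches only the remaining triplets, which is where the computational saving comes from.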

3  Spherical Bound

As we will see, our safe triplet screening is derived by using a spherical region that contains the optimal M. In this section, we show that six variants of the regions are created by three types of different approaches. Note that the proofs for all the theorems appear in the appendixes.

3.1  Gradient Bound

We first introduce a hypersphere, which we name gradient bound (GB), because the center and radius of the hypersphere are represented by the subgradient of the objective function:

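Theorem 1 (GB). Given any feasible solution $M \succeq O$, the optimal solution $M^\star$ for $\lambda$ exists in the following hypersphere:
$$\big\| M^\star - Q^{\mathrm{GB}}(M) \big\|_F^2 \le \Big\| \frac{1}{2\lambda} \nabla P_\lambda(M) \Big\|_F^2,$$
where $Q^{\mathrm{GB}}(M) := M - \frac{1}{2\lambda} \nabla P_\lambda(M)$.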

The proof is in appendix C. This theorem is an extension of the sphere for SVM (Shibagaki et al., 2015), which can be treated as a simple unconstrained problem.

3.2  Projected Gradient Bound

Even when we substitute the optimal $M^\star$ into the reference solution $M$, the radius of the GB is not guaranteed to be 0. By projecting the center of the GB onto the feasible region (i.e., the semidefinite cone), another GB-based hypersphere can be derived, which has a radius converging to 0 at the optimum. We refer to this extension as the projected gradient bound (PGB); a schematic illustration is shown in Figure 2a. In Figure 2a, the center of the GB, $Q^{\mathrm{GB}}$ (an abbreviation of $Q^{\mathrm{GB}}(M)$), is projected onto the semidefinite cone, and the projection $Q^{\mathrm{GB}}_+$ becomes the center of the PGB. The sphere of the PGB is given in theorem 2.

Figure 2:

Illustrations of spherical bounds.

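Theorem 2 (PGB). Given any feasible solution $M \succeq O$, the optimal solution $M^\star$ for $\lambda$ exists in the following hypersphere:
$$\big\| M^\star - [Q^{\mathrm{GB}}(M)]_+ \big\|_F^2 \le \Big\| \frac{1}{2\lambda} \nabla P_\lambda(M) \Big\|_F^2 - \big\| [Q^{\mathrm{GB}}(M)]_- \big\|_F^2.$$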

The proof is in appendix D. The PGB contains the projections onto the positive and the negative semidefinite cones in the center and the radius, respectively. These projections require the eigenvalue decomposition of $M - \frac{1}{2\lambda} \nabla P_\lambda(M)$. This decomposition, however, only needs to be performed once to evaluate the screening rules of all the triplets. In the standard optimization procedures of RTLM, including Weinberger and Saul (2009), the eigenvalue decomposition of the $d \times d$ matrix is calculated in every iterative cycle, and thus, the computational complexity is not increased by the PGB.

The following theorem shows a superior convergence property of PGB compared to GB:

Theorem 3. There exists a subgradient $\nabla P_\lambda(M^\star)$ such that the radius of the PGB is 0.

For the hinge loss, which is not differentiable at the kink, the optimal dual variables provide subgradients that set the radius equal to 0. This theorem is an immediate consequence of the proof in appendix I, which is the proof for the relation between PGB and the other bound derived in section 3.4.

From Figure 2a, we see that the half space $\langle -Q^{\mathrm{GB}}_-, X \rangle \ge 0$, where $Q^{\mathrm{GB}}_- = Q^{\mathrm{GB}} - Q^{\mathrm{GB}}_+$, can be used as a linear relaxation of the semidefinite constraint for the linear constraint rule in section 4.3. Interestingly, the GB with this linear constraint is tighter than the PGB. This is proved in appendix D, which gives the proof of the PGB.
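The centers and radii of the GB and the PGB can be computed from a single gradient evaluation and one eigenvalue decomposition. The sketch below assumes the smoothed hinge loss with $\gamma > 0$ (so that the gradient is unique), uses synthetic triplet matrices, and introduces hypothetical names throughout; it is an illustration of theorems 1 and 2, not the authors' implementation.

```python
import numpy as np

def psd_split(A):
    """Positive and negative parts of a symmetric matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T, (V * np.minimum(w, 0.0)) @ V.T

def primal_gradient(M, H, lam, gamma):
    """Gradient of P_lambda(M) for the smoothed hinge loss."""
    margins = np.einsum("tab,ab->t", H, M)
    alpha = np.clip((1.0 - margins) / gamma, 0.0, 1.0)   # minus the loss derivative
    return lam * M - np.einsum("t,tab->ab", alpha, H)

def gradient_bounds(M, H, lam, gamma):
    """Return ((Q_GB, r_GB), (Q_PGB, r_PGB)) for a feasible reference M."""
    G = primal_gradient(M, H, lam, gamma) / (2.0 * lam)
    Q = M - G                                   # center of the GB
    Q_plus, Q_minus = psd_split(Q)              # center of the PGB and its residual
    r_gb = np.linalg.norm(G)
    r_pgb = np.sqrt(max(r_gb ** 2 - np.linalg.norm(Q_minus) ** 2, 0.0))
    return (Q, r_gb), (Q_plus, r_pgb)

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 10, 3))          # synthetic difference vectors
H = np.einsum("ta,tb->tab", a, a) - np.einsum("ta,tb->tab", b, b)
lam, gamma = 20.0, 0.5
M_ref = np.eye(3)                               # any feasible reference solution

(Q_gb, r_gb), (Q_pgb, r_pgb) = gradient_bounds(M_ref, H, lam, gamma)
print(r_gb, r_pgb)                              # the PGB radius is never larger
```

The residual `Q_minus` returned by `psd_split` is also the matrix that defines the half space mentioned above for the linear constraint rule.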

3.3  Duality Gap Bound

In this section, we describe the duality gap bound (DGB) in which the radius is represented by the duality gap:

Theorem 4 (DGB). Let $M$ be a feasible solution of the primal problem and $\alpha$ and $\Gamma$ be feasible solutions of the dual problem. Then the optimal solution of the primal problem $M^\star$ exists in the following hypersphere:
$$\| M^\star - M \|_F^2 \le 2 \big( P_\lambda(M) - D_\lambda(\alpha, \Gamma) \big) / \lambda.$$

The proof is in appendix E. Because the radius is proportional to the square root of the duality gap, the radius of the DGB converges to 0 at the optimal solution (see Figure 2b). The DGB, unlike the previous bounds, requires a dual feasible solution. This means that when a primal-based optimization algorithm is employed, we need to create a dual feasible solution from the primal feasible solution. A simple way to create a dual feasible solution is to substitute the current $M$ into $M^\star$ of equation 2.6. When a dual-based optimization algorithm is employed, a primal feasible solution can be created by equation 2.4.
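A minimal sketch of the DGB radius for a primal reference solution follows. It assumes the smoothed hinge loss, builds the dual variable from the current $M$ through equation 2.6, and evaluates the dual objective in the Dual2 form (with $\Gamma$ eliminated by the projection). The synthetic data and all function names are assumptions made for illustration.

```python
import numpy as np

def psd_part(A):
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

def margins(M, H):
    return np.einsum("tab,ab->t", H, M)

def primal_objective(M, H, lam, gamma):
    x = margins(M, H)
    loss = np.where(x > 1.0, 0.0,
                    np.where(x >= 1.0 - gamma, (1.0 - x) ** 2 / (2.0 * gamma),
                             1.0 - x - gamma / 2.0))
    return loss.sum() + 0.5 * lam * np.sum(M * M)

def dual_objective(alpha, H, lam, gamma):
    M_alpha = psd_part(np.einsum("t,tab->ab", alpha, H)) / lam   # M_lambda(alpha)
    return (-0.5 * gamma * np.dot(alpha, alpha) + alpha.sum()
            - 0.5 * lam * np.sum(M_alpha * M_alpha))

def dual_from_primal(M, H, gamma):
    """Dual feasible alpha obtained by plugging M into equation 2.6."""
    return np.clip((1.0 - margins(M, H)) / gamma, 0.0, 1.0)

def dgb_radius(M, H, lam, gamma):
    alpha = dual_from_primal(M, H, gamma)
    gap = primal_objective(M, H, lam, gamma) - dual_objective(alpha, H, lam, gamma)
    return np.sqrt(2.0 * max(gap, 0.0) / lam)    # guard against round-off

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 10, 3))           # synthetic difference vectors
H = np.einsum("ta,tb->tab", a, a) - np.einsum("ta,tb->tab", b, b)
lam, gamma = 20.0, 0.5
print(dgb_radius(np.eye(3), H, lam, gamma))
```

The center of the DGB is the reference solution itself, so the radius shrinks as the solver approaches the optimum and the duality gap closes.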

For the DGB, we can derive a tighter bound, the constrained duality gap bound (CDGB), with an additional constraint. However, except for a special case (dynamic screening with a dual solver), additional transformation of the reference solution is necessary, which can deteriorate the duality gap. See appendix F for further details.

3.4  Regularization Path Bound

In Wang et al. (2014), a hypersphere is proposed specifically for the regularization path, in which the optimization problem should be solved for a sequence of $\lambda$s. Suppose that the problem for $\lambda_0$ has already been solved and it is necessary to solve the problem for $\lambda_1$. Then the same approach as Wang et al. (2014) is applicable to our RTLM, which derives a bound depending on the optimal solution for $\lambda_0$ as a reference solution:

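Theorem 5 (RPB). Let $M_0^\star$ be the optimal solution for $\lambda_0$. Then the optimal solution $M_1^\star$ for $\lambda_1$ exists in the following hypersphere:
$$\Big\| M_1^\star - \frac{\lambda_0 + \lambda_1}{2\lambda_1} M_0^\star \Big\|_F^2 \le \Big\| \frac{\lambda_0 - \lambda_1}{2\lambda_1} M_0^\star \Big\|_F^2.$$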

The proof is in appendix G. We refer to this bound as the regularization path bound (RPB).

The RPB requires the theoretically optimal solution $M_0^\star$, which cannot be obtained exactly by numerical computation. Furthermore, because the reference solution is fixed on $M_0^\star$, the RPB can be evaluated only once for a specific pair of $\lambda_0$ and $\lambda_1$ even if the optimal $M_0^\star$ is available. The other bounds can be evaluated multiple times during the optimization by regarding the current approximate solution as a reference solution.

3.5  Relaxed Regularization Path Bound

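To use the RPB in practice, we modify this bound in such a way that the approximate solution can be used as a reference solution. Assume that an approximate solution $M_0$ satisfies
$$\| M_0^\star - M_0 \|_F \le \epsilon,$$
where $\epsilon \ge 0$ is a constant. Given $M_0$, which satisfies the above condition, we obtain the relaxed regularization path bound (RRPB):
Theorem 6 (RRPB). Let $M_0$ be an approximate solution for $\lambda_0$, which satisfies $\| M_0^\star - M_0 \|_F \le \epsilon$. The optimal solution $M_1^\star$ for $\lambda_1$ exists in the following hypersphere:
$$\Big\| M_1^\star - \frac{\lambda_0 + \lambda_1}{2\lambda_1} M_0 \Big\|_F^2 \le \Big( \frac{|\lambda_0 - \lambda_1|}{2\lambda_1} \| M_0 \|_F + \frac{|\lambda_0 - \lambda_1| + \lambda_0 + \lambda_1}{2\lambda_1} \epsilon \Big)^2. \tag{3.1}$$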

The proof is in appendix H. The intuition behind the RRPB is shown in Figure 2d, in which the approximation error for the center of the RPB is depicted. In the theorem, the RRPB also considers the error in the radius, although it is not illustrated in the figure for simplicity. To the best of our knowledge, this approach has not been introduced in other existing screening studies.

For example, $\epsilon$ can be set from theorem 4 (DGB) as follows:
$$\epsilon = \sqrt{2 \big( P_{\lambda_0}(M_0) - D_{\lambda_0}(\alpha_0, \Gamma_0) \big) / \lambda_0}. \tag{3.2}$$
When the optimization for $\lambda_0$ terminates, the solution $M_0$ should be accurate in terms of some stopping criterion such as the duality gap. Then $\epsilon$ is expected to be quite small, and the RRPB can provide a tight bound for $\lambda_1$, which is close to the ideal (but not computable) RPB. As a special case, by setting $\lambda_1 = \lambda_0$, the RRPB can be applied to perform the screening of $\lambda_1$ using any approximate solution $M$ having $\| M_1^\star - M \|_F \le \epsilon$, and then the RRPB is equivalent to the DGB.
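The RRPB center and radius are simple closed-form expressions, which the sketch below evaluates. The function names are hypothetical, and the duality gap passed to `epsilon_from_gap` is assumed to be supplied by the solver for $\lambda_0$. The two special cases discussed in the text can be read off directly: $\epsilon = 0$ recovers the RPB, and $\lambda_1 = \lambda_0$ gives a sphere of radius $\epsilon$ centered at the reference solution.

```python
import numpy as np

def epsilon_from_gap(duality_gap, lam0):
    """Equation 3.2: bound on ||M0_star - M0||_F from the duality gap at lam0."""
    return np.sqrt(2.0 * max(duality_gap, 0.0) / lam0)

def rrpb_sphere(M0, lam0, lam1, eps):
    """Center and radius of the RRPB (equation 3.1) for the problem at lam1."""
    center = (lam0 + lam1) / (2.0 * lam1) * M0
    diff = abs(lam0 - lam1)
    radius = (diff / (2.0 * lam1) * np.linalg.norm(M0)      # RPB part
              + (diff + lam0 + lam1) / (2.0 * lam1) * eps)  # approximation error
    return center, radius

M0 = np.eye(2)                      # approximate solution for lam0 (toy value)
eps = epsilon_from_gap(duality_gap=1e-6, lam0=2.0)
center, radius = rrpb_sphere(M0, lam0=2.0, lam1=1.0, eps=eps)
print(radius)
```

Because the error term is scaled by a factor that grows as $\lambda_1$ moves away from $\lambda_0$, a tight stopping criterion at $\lambda_0$ pays off most when the steps along the regularization path are large.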

3.6  Analytical Relation between Bounds

The following theorem describes the relation between PGB and RPB:

Theorem 7 (Relation between PGB and RPB). Suppose that the optimal solution $M_0^\star$ for $\lambda_0$ is substituted into the reference solution $M$ of the PGB. Then there exists a subgradient $\nabla P_{\lambda_1}(M_0^\star)$ by which the PGB and RPB provide the same center and radius for $M_1^\star$.

The proof is presented in appendix I. The following theorem describes the relation between the DGB and RPB:

Theorem 8 (Relation between DGB and RPB). Suppose that the optimal solutions $M_0^\star$, $\alpha_0^\star$, and $\Gamma_0^\star$ for $\lambda_0$ are substituted into the reference solutions $M$, $\alpha$, and $\Gamma$ of the DGB. Then the radii of the DGB and RPB for $\lambda_1$ satisfy $r_{\mathrm{DGB}} = 2 r_{\mathrm{RPB}}$, and the hypersphere of the RPB is included in the hypersphere of the DGB.

The proof is in appendix J. Figure 2c illustrates the relation between the DGB and RPB, which shows the theoretical advantage of the RPB for the regularization path setting.

Using the analytical results obtained thus far, we summarize relative relations between the bounds as follows. First, we consider the case in which the reference solution is optimal for $\lambda_0$ in the regularization path calculation. We see $r_{\mathrm{GB}} \ge r_{\mathrm{PGB}}$ from Figure 2a, and from theorems 7 and 8, we see $\mathrm{DGB} \supseteq \mathrm{PGB} = \mathrm{RPB} = \mathrm{RRPB}$. When the reference solution is an approximate solution in the regularization path calculation, we see only $r_{\mathrm{GB}} \ge r_{\mathrm{PGB}}$. For dynamic screening, in which the reference solution is always an approximate solution, we see $r_{\mathrm{GB}} \ge r_{\mathrm{PGB}}$, and we also see $\mathrm{RRPB} = \mathrm{DGB}$ when $\epsilon$ is determined by the DGB as written in equation 3.2.

Other properties of the bounds are summarized in Table 1. Although DGB and RRPB (RPB + DGB) have the same properties, our empirical evaluation in section 7.2 shows that RRPB often outperforms DGB in the regularization path calculation. (Note that although CDGB also has the same properties as the above two methods, we omit it in the empirical evaluation because of its practical limitation, as we see in section 3.3.)

Table 1:
Comparison of Sphere Bounds.

Bound               Radius Convergence   Dynamic Screening   Reference Solution   Exact Optimality of Reference
GB                  Can be > 0           Applicable          Primal               Not necessary
PGB                 = 0 (a)              Applicable          Primal               Not necessary
DGB                 = 0                  Applicable          Primal/dual          Not necessary
CDGB                = 0                  Applicable          Primal/dual          Not necessary
RPB                 NA                   Not applicable      Primal               Necessary
RRPB (RPB + DGB)    = 0                  Applicable          Primal/dual          Not necessary

Note: The radius convergence indicates a radius when the reference solution is the optimal solution.

(a) For the hinge loss ($\gamma = 0$) case, a subgradient is required to be selected appropriately for achieving this convergence.

4  Safe Rules for Triplets

Our safe triplet screening can reduce the number of triplets by identifying a part of $\mathcal{L}$ and $\mathcal{R}$ before solving the optimization problem based on the following procedure:

  • Step 1: Identify the spherical region in which the optimal solution $M^\star$ lies, based on the current feasible solution we refer to as the reference solution.

  • Step 2: For each triplet $(i,j,l) \in \mathcal{T}$, verify the possibility of $(i,j,l) \in \mathcal{L}$ or $(i,j,l) \in \mathcal{R}$ under the condition that $M^\star$ is in the region.

In section 3, we showed that there exist a variety of approaches to creating the spherical region for step 1. In this section, we describe the procedure of step 2 given the sphere region.

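Letting $\mathcal{B}$ be a region that contains $M^\star$, the following screening rule can be derived from equation 2.5:
$$\max_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle < 1 - \gamma \ \Rightarrow\ (i,j,l) \in \mathcal{L}, \tag{R1}$$
$$\min_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle > 1 \ \Rightarrow\ (i,j,l) \in \mathcal{R}. \tag{R2}$$
Based on these rules, $\hat{\mathcal{L}} \subseteq \mathcal{L}$ and $\hat{\mathcal{R}} \subseteq \mathcal{R}$ are constructed as
$$\hat{\mathcal{L}} = \Big\{ (i,j,l) \,\Big|\, \max_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle < 1 - \gamma \Big\}, \qquad \hat{\mathcal{R}} = \Big\{ (i,j,l) \,\Big|\, \min_{X \in \mathcal{B}} \langle X, H_{ijl} \rangle > 1 \Big\}.$$
We present an efficient approach to evaluating these rules. Because equation R1 can be evaluated in the same way as R2, we are concerned only with equation R2 henceforth.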

4.1  Spherical Rule

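Suppose that the optimal $M^\star$ lies in a hypersphere defined by a center $Q \in \mathbb{R}^{d \times d}$ and a radius $r \in \mathbb{R}_+$. To evaluate the condition of equation R2, we consider the following minimization problem, equation P1:
$$\min_{X} \langle X, H_{ijl} \rangle \quad \text{s.t. } \| X - Q \|_F^2 \le r^2. \tag{P1}$$
Letting $Y := X - Q$, this problem is transformed into
$$\min_{Y} \langle Y, H_{ijl} \rangle + \langle Q, H_{ijl} \rangle \quad \text{s.t. } \| Y \|_F^2 \le r^2.$$
Because $\langle Q, H_{ijl} \rangle$ is a constant, this optimization problem entails minimizing the inner product $\langle Y, H_{ijl} \rangle$ under the norm constraint. The optimal $Y^\star$ of this optimization problem is easily derived as
$$Y^\star = -r H_{ijl} / \| H_{ijl} \|_F,$$
and then the minimum value of P1 is $\langle H_{ijl}, Q \rangle - r \| H_{ijl} \|_F$. Figure 3 shows a schematic illustration. This derives the following spherical rule:
$$\langle H_{ijl}, Q \rangle - r \| H_{ijl} \|_F > 1 \ \Rightarrow\ (i,j,l) \in \mathcal{R}. \tag{4.1}$$
This condition can be easily evaluated for a given $Q$ and $r$.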
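The spherical rule can be sketched in a few lines. The function names below are hypothetical; the maximum over the sphere, used for rule R1, is obtained by the same argument with the sign of the radius term flipped.

```python
import numpy as np

def sphere_extremes(H, Q, r):
    """Minimum and maximum of <X, H> over the ball ||X - Q||_F <= r."""
    inner = np.sum(H * Q)
    spread = r * np.linalg.norm(H)      # Frobenius norm of H times the radius
    return inner - spread, inner + spread

def spherical_rule(H, Q, r, gamma):
    """Return 'R', 'L', or None (not screened) for one triplet matrix H."""
    lowest, highest = sphere_extremes(H, Q, r)
    if lowest > 1.0:                    # rule R2, equation 4.1
        return "R"
    if highest < 1.0 - gamma:           # rule R1
        return "L"
    return None

H = np.diag([1.0, 0.0])
print(spherical_rule(H, 3.0 * np.eye(2), r=0.5, gamma=0.05))      # R
print(spherical_rule(H, np.zeros((2, 2)), r=0.1, gamma=0.05))     # L
print(spherical_rule(H, np.diag([1.0, 0.0]), r=0.5, gamma=0.05))  # None
```

Each evaluation needs only one inner product once the norms of the triplet matrices are cached, so the rule is cheap enough to apply to every triplet.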
Figure 3:

Spherical rule defined by equation P1. The yellow sphere indicates the region in which the optimal $M^\star$ must exist. The terms “max” and “min” indicate the points at which the maximum and minimum values of the inner product $\langle X, H_{ijl} \rangle$ are attained. If $\langle X, H_{ijl} \rangle > 1$ holds at the “min” point, condition R2 is guaranteed to be satisfied.

4.2  Spherical Rule with a Semidefinite Constraint

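The spherical rule does not utilize the positive semidefiniteness of $M^\star$; therefore, a stronger rule can be constructed by incorporating a semidefinite constraint into equation P1:
$$\min_{X} \langle X, H_{ijl} \rangle \quad \text{s.t. } \| X - Q \|_F^2 \le r^2,\ X \succeq O. \tag{P2}$$
Although the analytical solution is not available, equation P2 can be solved efficiently by transforming it into the semidefinite least squares (SDLS) problem (Malick, 2004).
Let $\mathcal{B}_{\mathrm{PSD}} := \{ X \mid \| X - Q \|_F^2 \le r^2,\ X \succeq O \}$ be the feasible region of the optimization problem P2. To present the connection between SDLS and equation P2, we first assume that there exists a feasible solution $X_0$ for equation P2 that satisfies $\langle X_0, H_{ijl} \rangle > 1$:
$$\exists X_0 \ \text{such that}\ \langle X_0, H_{ijl} \rangle > 1 \ \text{and}\ X_0 \in \mathcal{B}_{\mathrm{PSD}}. \tag{4.2}$$
Instead of equation P2, we consider the following SDLS problem:
$$\min_{X \in \mathbb{R}^{d \times d}} \| X - Q \|_F^2 \quad \text{s.t. } \langle X, H_{ijl} \rangle = 1,\ X \succeq O. \tag{SDLS}$$
If the optimal value of this problem is greater than $r^2$ (i.e., $\| X^\star - Q \|_F^2 > r^2$ for the SDLS solution $X^\star$), there is no intersection between $\mathcal{B}_{\mathrm{PSD}}$ and the hyperplane defined by $\langle X, H_{ijl} \rangle = 1$:
$$\{ X \mid \langle X, H_{ijl} \rangle = 1,\ X \in \mathcal{B}_{\mathrm{PSD}} \} = \emptyset. \tag{4.3}$$
From assumption 4.2, we have
$$\{ X \mid \langle X, H_{ijl} \rangle > 1,\ X \in \mathcal{B}_{\mathrm{PSD}} \} \neq \emptyset. \tag{4.4}$$
As $\mathcal{B}_{\mathrm{PSD}}$ is a convex set, based on the two conditions 4.3 and 4.4, we derive
$$\{ X \mid \langle X, H_{ijl} \rangle \le 1,\ X \in \mathcal{B}_{\mathrm{PSD}} \} = \emptyset,$$
which indicates
$$\min_{X \in \mathcal{B}_{\mathrm{PSD}}} \langle X, H_{ijl} \rangle > 1.$$
Thus, the condition of equation R2 is satisfied.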

Based on the connection shown above, the rule evaluation, equation R2, with the semidefinite constraint is summarized as follows:

  1. Select an arbitrary feasible solution $X_0 \in \mathcal{B}_{\mathrm{PSD}}$. If $\langle X_0, H_{ijl} \rangle \le 1$, we immediately see that the condition of equation R2 is not satisfied for the triplet $(i,j,l)$. Otherwise, go to the next step. Note that in this case, assumption 4.2 is confirmed because $\langle X_0, H_{ijl} \rangle > 1$.

  2. Solve SDLS. If the optimal value satisfies $\| X^\star - Q \|_F^2 > r^2$, the triplet $(i,j,l)$ is guaranteed to be in $\mathcal{R}$.

For calculating the second step, we derive the following dual problem of equation SDLS based on Malick (2004):
$$\max_{y} D_{\mathrm{SDLS}}(y) := -\big\| [Q + y H_{ijl}]_+ \big\|_F^2 + 2Cy + \| Q \|_F^2,$$
where $y \in \mathbb{R}$ is a dual variable, and $C = 1$ for equation R2 and $C = 1 - \gamma$ for equation R1. Unlike the primal problem, the dual version is an unconstrained problem that has only one variable, $y$, and thus, standard gradient-based algorithms rapidly converge. We refer to the quasi-Newton optimization for this problem as the SDLS dual ascent method. During dual ascent, we can terminate the iteration before convergence if $D_{\mathrm{SDLS}}(y)$ becomes larger than $r^2$ because the value of the dual problem does not exceed the value of the primal problem (weak duality).

Although the computation of $[Q + yH_{ijl}]_+$ requires an eigenvalue decomposition, this computational requirement can be alleviated when the center $Q$ of the hypersphere is positive semidefinite. By definition, $H_{ijl}$ has at most one negative eigenvalue, and hence $Q + yH_{ijl}$ also has at most one negative eigenvalue. Let $\lambda_{\min}$ be the negative (minimum) eigenvalue of $Q + yH_{ijl}$, and $q_{\min}$ be the corresponding eigenvector. The projection $[Q + yH_{ijl}]_+$ can be expressed as $[Q + yH_{ijl}]_+ = (Q + yH_{ijl}) - \lambda_{\min} q_{\min} q_{\min}^\top$. Computation of the minimum eigenvalue and eigenvector is much easier than the full eigenvalue decomposition (Lehoucq & Sorensen, 1996).
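The rank-one form of the projection can be checked numerically with a short sketch. For brevity the example below obtains the minimum eigenpair from `np.linalg.eigh`, which still performs a full decomposition; in practice this call would be replaced by an iterative minimum-eigenpair solver. The function name `psd_projection_rank_one` is hypothetical.

```python
import numpy as np

def psd_projection_rank_one(A):
    """[A]_+ for a symmetric A that has at most one negative eigenvalue."""
    A = (A + A.T) / 2.0
    # Stand-in for a minimum-eigenpair solver: eigh sorts eigenvalues ascending.
    w, V = np.linalg.eigh(A)
    lam_min, q_min = w[0], V[:, 0]
    if lam_min >= 0.0:
        return A  # already PSD, nothing to remove
    # Subtract the single negative component lam_min * q q^T.
    return A - lam_min * np.outer(q_min, q_min)
```

The identity relies on the assumption stated in the text: if the matrix had two or more negative eigenvalues, only the smallest one would be removed and the result would not be the projection.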

As a special case, when M is a diagonal matrix, the semidefinite constraint is reduced to the nonnegative constraint, and analytical calculation of rule P2 is possible (see section 6.2).

4.3  Spherical Rule with Linear Constraint

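Here, we reduce the computational complexity by relaxing the semidefinite constraint into a linear constraint. Suppose that a region defined by the linear inequality $\{X \in \mathbb{R}^{d \times d} \mid \langle P, X\rangle \ge 0\}$ contains the semidefinite cone, $\mathbb{R}_+^{d \times d} \subseteq \{X \in \mathbb{R}^{d \times d} \mid \langle P, X\rangle \ge 0\}$, for which we describe the determination of $P \in \mathbb{R}^{d \times d}$ later. Using this relaxed constraint, the minimization in condition R2 becomes
$$\min_{X} \langle X, H_{ijl}\rangle \quad \text{s.t.}\quad \|X - Q\|_F^2 \le r^2,\ \langle P, X\rangle \ge 0. \tag{P3}$$
This problem can be solved analytically by considering the KKT conditions as follows (see appendix K).
Theorem 9
(Analytical Solution of Equation P3). The optimal value $\langle H_{ijl}, X^*\rangle$ of equation P3 is as follows:
$$\langle H_{ijl}, X^*\rangle = \begin{cases} 0, & \text{if } H_{ijl} = aP, \\ \langle H_{ijl}, Q\rangle - r\|H_{ijl}\|_F, & \text{if } \left\langle P, Q - r\dfrac{H_{ijl}}{\|H_{ijl}\|_F}\right\rangle \ge 0, \\ \left\langle H_{ijl}, \dfrac{\beta P - H_{ijl}}{\alpha} + Q\right\rangle, & \text{otherwise}, \end{cases}$$
where $a$ is a constant and
$$\alpha = \sqrt{\frac{\|P\|_F^2 \|H_{ijl}\|_F^2 - \langle P, H_{ijl}\rangle^2}{r^2\|P\|_F^2 - \langle P, Q\rangle^2}}, \qquad \beta = \frac{\langle P, H_{ijl}\rangle - \alpha\langle P, Q\rangle}{\|P\|_F^2}.$$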
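The three cases of theorem 9 translate into a few inner products. The sketch below is illustrative only and makes several assumptions that are not part of the theorem statement: $H_{ijl} \neq O$, the ball is required to meet the half-space (otherwise a `ValueError` is raised), the constant $a$ in the first case is taken to be positive, and the second case is tested first so that a ball lying entirely inside the half-space is handled by the plain spherical value. The function name `linear_relaxed_min` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def linear_relaxed_min(H, Q, r, P, tol=1e-12):
    """Minimum of <X, H> over {||X - Q||_F <= r, <P, X> >= 0} (theorem 9)."""
    inner = lambda A, B: float(np.sum(A * B))
    h2, p2 = inner(H, H), inner(P, P)
    ph, pq = inner(P, H), inner(P, Q)

    # Second case: the minimizer over the ball alone satisfies <P, X> >= 0.
    X_ball = Q - r * H / np.sqrt(h2)
    if inner(P, X_ball) >= 0.0:
        return inner(H, Q) - r * np.sqrt(h2)

    # From here on the linear constraint is active; the ball must cut the hyperplane.
    if r * r * p2 - pq * pq <= tol:
        raise ValueError("ball does not properly intersect the half-space")

    # First case: H is a positive multiple of P, so <X, H> = a <P, X> >= 0.
    if p2 * h2 - ph * ph <= tol * p2 * h2:
        return 0.0

    # Third case: closed form from the KKT conditions.
    alpha = np.sqrt((p2 * h2 - ph * ph) / (r * r * p2 - pq * pq))
    beta = (ph - alpha * pq) / p2
    X_opt = (beta * P - H) / alpha + Q
    return inner(H, X_opt)
```

A triplet would then be assigned to $\mathcal{R}^*$ whenever the returned value exceeds 1, exactly as for the unconstrained spherical rule.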
A simple way to obtain $P$ is to utilize the projection onto the semidefinite cone. Let $A \in \mathbb{R}^{d \times d}$ be a matrix external to the semidefinite cone as illustrated in Figure 4. In the figure, $A_+$ is the projection of $A$ onto the semidefinite cone. For example, when the projected gradient for the primal problem (Weinberger & Saul, 2009) is used as an optimizer, $A$ can be an update of the gradient descent $A = M - \eta \nabla P_\lambda(M)$ with some step size $\eta > 0$. Because $M - \eta \nabla P_\lambda(M)$ is projected onto the semidefinite cone at every iterative step of the optimization, no additional calculation is required to obtain $A$ and $A_+$. Defining $A_- := A - A_+$, for any $X \succeq O$, we obtain
$$\langle A_+ - A, X - A_+\rangle \ge 0 \ \Leftrightarrow\ \langle -A_-, X\rangle \ge 0.$$
The inequality on the left has its origins in the property of a supporting hyperplane (Boyd & Vandenberghe, 2004), and for the inequality on the right, we use $\langle A_+, A_-\rangle = 0$. By setting $P = -A_-$, we obtain a linear approximation of the semidefinite constraint, which is a superset of the original semidefinite cone.
Figure 4:

Linear relaxation of semidefinite constraint. From the projection of $A$ to $A_+$, the supporting hyperplane $\langle -A_-, X\rangle = 0$ is constructed, and the half-space $\{X \mid \langle -A_-, X\rangle \ge 0\}$ contains the semidefinite cone $X \succeq O$.
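The construction of $P = -A_-$ from a matrix $A$ outside the cone can be sketched as follows. The example recomputes the eigendecomposition for clarity, whereas the text points out that $A$ and $A_+$ are already available from the projected gradient step; the function name `relaxation_normal` is hypothetical.

```python
import numpy as np

def relaxation_normal(A):
    """Return P = -A_- = A_+ - A, the normal of the relaxed linear constraint."""
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    A_plus = (V * np.maximum(w, 0.0)) @ V.T  # projection onto the PSD cone
    return A_plus - A                        # equals -(A - A_+), itself PSD
```

Since $P$ built this way is positive semidefinite, $\langle P, X\rangle \ge 0$ holds for every $X \succeq O$, which is the superset property used by problem P3.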

A necessary condition for performing our screening is that a loss function needs to have at least one linear region or a zero region. For example, the logistic loss cannot be used for screening because it has neither a linear nor a zero region.

5  Computations

Algorithm 1 shows the detailed procedure of our safe screening with simple fixed step-size gradient descent. (Note that any other optimization algorithm can be combined with our screening procedure.) In the algorithm, once every freq iterations of the gradient descent, the screening rules are evaluated by using the current solution $M$ as the reference solution. As the quality of the approximate solution $M$ improves, more triplets can be removed from $\mathcal{T}$. Thus, the quality of the initial solution affects the efficiency. In the case of the regularization path calculation, in which RTLM is solved for a sequence of values of $\lambda$, a reasonable initial solution is the approximate solution for the previous $\lambda$. We discuss a further extension specific to the regularization path calculation in section 6.1.

Considering the computational cost of the screening procedure of algorithm 1, the rule evaluation (step 2) described in section 4 is often dominant, because the rule needs to be evaluated for each one of the triplets. The sphere, constructed in step 1, can be fixed during the screening procedure as long as the reference solution is fixed.

To evaluate the spherical rule, equation 4.1, given the center $Q$ and the radius $r$, the inner product $\langle H_{ijl}, Q\rangle$ and the norm $\|H_{ijl}\|_F$ need to be evaluated. The inner product $\langle H_{ijl}, Q\rangle$ can be calculated in $O(d^2)$ operations because it is expanded as a sum of quadratic forms: $\langle H_{ijl}, Q\rangle = (x_i - x_l)^\top Q (x_i - x_l) - (x_i - x_j)^\top Q (x_i - x_j)$. Further, we can reuse this term from the objective function $P_\lambda(M)$ calculation in the case of the DGB, RPB, and RRPB. The norm $\|H_{ijl}\|_F$ can be calculated in $O(d)$ operations, and this is constant throughout the optimization process. Thus, for the DGB, RPB, or RRPB, it is possible to reduce the additional computational cost of the spherical rule for $(i,j,l)$ to $O(1)$ by calculating $\|H_{ijl}\|_F$ beforehand. The computational cost of the spherical rule with the semidefinite constraint (see section 4.2) is that of the SDLS algorithm. The SDLS algorithm needs $O(d^3)$ because of the eigenvalue decomposition in every iterative cycle, which may considerably increase the computational cost. The computational cost of the spherical rule with the linear constraint (see section 4.3) is $O(d^2)$.
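The cost accounting above can be made concrete with a small sketch that never forms $H_{ijl}$ explicitly. The $O(d)$ norm uses the identity $\|H_{ijl}\|_F^2 = \|a\|^4 + \|b\|^4 - 2(a^\top b)^2$ with $a = x_i - x_l$ and $b = x_i - x_j$, which follows from expanding the trace of $H_{ijl}^2$. The $\mathcal{R}$-side test is the condition $\langle H_{ijl}, Q\rangle - r\|H_{ijl}\|_F > 1$ quoted in section 6.1; the $\mathcal{L}$-side test with threshold $1 - \gamma$ is written by symmetry and should be read as an assumption, since equation 4.1 is not restated here. The function names are hypothetical.

```python
import numpy as np

def triplet_terms(x_i, x_j, x_l, Q):
    """Return (<H_ijl, Q>, ||H_ijl||_F) without building the d x d matrix H_ijl."""
    a = x_i - x_l
    b = x_i - x_j
    inner = a @ Q @ a - b @ Q @ b  # two quadratic forms, O(d^2)
    sq = (a @ a) ** 2 + (b @ b) ** 2 - 2.0 * (a @ b) ** 2
    h_norm = np.sqrt(max(sq, 0.0))  # O(d); clipped against rounding error
    return inner, h_norm

def spherical_rule(inner, h_norm, r, gamma):
    """Return 'R', 'L', or None (not screened) for a sphere of radius r."""
    if inner - r * h_norm > 1.0:          # minimum over the sphere exceeds 1
        return "R"
    if inner + r * h_norm < 1.0 - gamma:  # maximum over the sphere (assumed L-side test)
        return "L"
    return None
```

Because `h_norm` does not depend on $Q$ or $r$, it can be computed once per triplet and cached across all screening passes.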

6  Extensions

6.1  Range-Based Extension of Triplet Screening

The screening rules presented in section 4 relate to the problem of a fixed $\lambda$. In this section, we regard a screening rule as a function of $\lambda$ to derive a range of $\lambda$ in which the screening rule is guaranteed to be satisfied. This is particularly useful for calculating the regularization path for which we need to optimize the metric for a sequence of values of $\lambda$. If a screening rule is satisfied for a triplet $(i,j,l)$ in a range $(\lambda_a, \lambda_b)$, we can fix the triplet $(i,j,l)$ in $\hat{\mathcal{L}}$ or $\hat{\mathcal{R}}$ as long as $\lambda$ is in $(\lambda_a, \lambda_b)$, without computing the screening rules.

6.1.1  Deriving the Range

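Let
$$Q = A + B\frac{1}{\lambda} \tag{6.1}$$
be the general form of the center of a hypersphere for some constant matrices $A \in \mathbb{R}^{d \times d}$ and $B \in \mathbb{R}^{d \times d}$ and
$$r^2 = a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2} \tag{6.2}$$
be the general form of the radius for some constants $a \in \mathbb{R}$, $b \in \mathbb{R}$, and $c \in \mathbb{R}$. The GB, DGB, RPB, and RRPB can be in this form (details are provided in appendix L, section L.1). Note that in the RRPB, equation 3.1, $\lambda_1$ is regarded as $\lambda$ in the general form and $\lambda_0$ is a constant. The condition of the spherical rule $\langle H_{ijl}, Q\rangle - r\|H_{ijl}\|_F > 1$ in equation 4.1 can be rewritten as
$$\left(\langle H_{ijl}, Q\rangle - 1\right)^2 > r^2\|H_{ijl}\|_F^2$$
with the assumption
$$\langle H_{ijl}, Q\rangle - 1 > 0.$$
Because $\langle H_{ijl}, Q\rangle = \langle H_{ijl}, A\rangle + \langle H_{ijl}, B\rangle\frac{1}{\lambda}$, these two inequalities can be transformed into quadratic and linear functions of $\lambda$, respectively. The range of $\lambda$ that satisfies the two inequalities simultaneously represents the range of $\lambda$ in which a triplet $(i,j,l)$ must be in $\mathcal{R}^*$. The following theorem shows the range for the case of RRPB given a reference solution $M_0$, which is an approximate solution for $\lambda_0$:
Theorem 10
(Range-Based Extension of RRPB). Assuming $\langle H_{ijl}, M_0\rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F > 0$ and $\|M_0^* - M_0\|_F \le \epsilon$, a triplet $(i,j,l)$ is guaranteed to be in $\mathcal{R}^*$ for the following range of $\lambda$:
$$\lambda \in (\lambda_a, \lambda_b),$$
where
$$\lambda_a = \frac{\lambda_0\left(\|M_0\|_F\|H_{ijl}\|_F - \langle H_{ijl}, M_0\rangle + 2\epsilon\|H_{ijl}\|_F\right)}{\langle H_{ijl}, M_0\rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F}, \qquad \lambda_b = \frac{\lambda_0\left(\|M_0\|_F\|H_{ijl}\|_F + \langle H_{ijl}, M_0\rangle\right)}{\|H_{ijl}\|_F\|M_0\|_F - \langle H_{ijl}, M_0\rangle + 2 + 2\epsilon\|H_{ijl}\|_F}.$$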

Refer to section L.2 for the proof. The computational procedure for range-based screening is shown in algorithm 2.
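As a hedged illustration of theorem 10, the range endpoints can be evaluated directly from $\langle H_{ijl}, M_0\rangle$, $\|H_{ijl}\|_F$, $\|M_0\|_F$, $\lambda_0$, and $\epsilon$. The sketch below is not algorithm 2; the function name `rrpb_range` is hypothetical, and returning `None` when the assumption of the theorem fails is a choice made for this example.

```python
import numpy as np

def rrpb_range(H, M0, lam0, eps):
    """Return (lam_a, lam_b) from theorem 10, or None if its assumption fails."""
    inner = float(np.sum(H * M0))
    h_norm = np.linalg.norm(H)   # Frobenius norm for 2-D arrays
    m_norm = np.linalg.norm(M0)
    denom_a = inner - 2.0 + h_norm * m_norm
    if denom_a <= 0.0:           # assumption of the theorem is violated
        return None
    lam_a = lam0 * (m_norm * h_norm - inner + 2.0 * eps * h_norm) / denom_a
    lam_b = lam0 * (m_norm * h_norm + inner) / (
        h_norm * m_norm - inner + 2.0 + 2.0 * eps * h_norm
    )
    return lam_a, lam_b          # the open interval is empty if lam_a >= lam_b
```

A quick check of the formulas: $\lambda_a < \lambda_0 < \lambda_b$ holds exactly when $\langle H_{ijl}, M_0\rangle - \epsilon\|H_{ijl}\|_F > 1$, that is, when the triplet is already screened at $\lambda_0$ itself.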

6.1.2  Consideration for Range Extension with Other Bounds

As shown in equation 3.1, the RRPB is based on the optimality $\epsilon$ for the current $\lambda_0$, and does not depend on the optimality for $\lambda_1$, which is regarded as $\lambda$ in the general form of equations 6.1 and 6.2. Because of this property, the RRPB is particularly suitable for range-based screening among the spheres we derived thus far. To calculate $\epsilon$, equation 3.2 for the RRPB, the duality gap $P_{\lambda_0}(M_0) - D_{\lambda_0}(\alpha_0, \Gamma_0)$ is required. Instead of the original $P_{\lambda_0}(M_0) - D_{\lambda_0}(\alpha_0, \Gamma_0)$, we can use problems with a reduced size, $\tilde{P}_{\lambda_0}(M_0) - \tilde{D}_{\lambda_0}(\alpha_0, \Gamma_0)$, for efficient computation, where $\tilde{D}_{\lambda_0}$ is the dual objective in which $\alpha_i = 0$ for $i \in \hat{\mathcal{R}}$ and $\alpha_i = 1$ for $i \in \hat{\mathcal{L}}$ are fixed. Because the reduced-size problem shares exactly the same optimal solution with the original problems, this gap also provides a valid bound. As a result, we can avoid computing the sum over all triplets in $\mathcal{T}$ (e.g., to calculate the loss term in the original primal) to evaluate a bound.

Figure 5:

(a) Suppose that $M^*$ is the optimal solution for $\lambda_0$, and some iterative optimization algorithm obtains $M_{\mathrm{prev}}$ in the middle of the optimization process. The circle around $M_{\mathrm{prev}}$ represents the DGB, which contains $M^*$. Then the screening rule can eliminate the triplet $(i,j,l)$ because $\langle M, H_{ijl}\rangle > 1$ holds for any point $M$ in the circle. Now suppose that $M_0$ is the approximate solution we obtain after the optimization algorithm terminates with some small tolerance of the duality gap. The circle with the dashed line represents the region in which the duality gap is less than the tolerance. Although $M_0$ satisfies the termination condition, the inequality $\langle M, H_{ijl}\rangle > 1$ does not hold for $M_0$. In this case, we cannot ignore this triplet $(i,j,l)$ to evaluate the duality gap for different $\lambda \neq \lambda_0$ because it causes a nonzero penalty. (b) An enlarged bound. Because of the inequality of DGB $\|M^* - M_0\|_F \le \sqrt{2\epsilon/\lambda}$, this enlarged region contains any approximate solutions with the duality gap $\epsilon$.

In the other bounds, the loss term in the primal objective needs to be carefully considered. Suppose that we have an approximate solution $M_0$ for $\lambda_0$ as a reference solution. To regard a bound as a function of $\lambda$ in the GB and PGB, it is necessary to consider the gradient for $\lambda$ (i.e., $\nabla P_\lambda(M_0)$), and the DGB requires the objective value for $\lambda$ (i.e., $P_\lambda(M_0)$). These two terms may not be correctly calculated if we replace them with the reduced-size primal created for $\lambda_0$. Figure 5a illustrates an example of this problem in the case of DGB. To safely replace $P_\lambda(M_0)$ with the reduced-size primal $\tilde{P}_\lambda(M_0)$ for these cases, the following conditions need to hold:
$$\langle M_0, H_{ijl}\rangle < 1 - \gamma \ \text{ for } (i,j,l) \in \hat{\mathcal{L}}, \qquad \langle M_0, H_{ijl}\rangle > 1 \ \text{ for } (i,j,l) \in \hat{\mathcal{R}}.$$
If the reference solution $M_0$ is exactly optimal for $\lambda_0$, these conditions hold. However, in practice, this cannot be true because of numerical errors, and furthermore, the optimization algorithm is usually terminated with some tolerance.
This problem can be avoided by enlarging the radius of spherical bounds such that the bound contains the approximate solution. Assuming that $M_0$ is an approximate solution with the duality gap $\epsilon$, then from the DGB, we see that the distance between $M_0$ and the optimal solution $M^*$ satisfies
$$\|M^* - M_0\|_F \le \sqrt{\frac{2\epsilon}{\lambda}}.$$
This inequality indicates that by enlarging the radius of the hypersphere by $\sqrt{2\epsilon/\lambda}$, we can guarantee that the bound includes any approximate solutions (Figure 5b shows an illustration). Using the radius $r$ introduced in section 6.1, we obtain the enlarged radius $R$ as follows:
$$R = \underbrace{\sqrt{a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}}}_{r} + \sqrt{\frac{2\epsilon}{\lambda}}. \tag{6.3}$$
The reduced-size problems created by this enlarged radius can be safely used to evaluate the duality gap for any λ. However, this enlarged radius no longer has the general form of the radius 6.2 we assumed to derive the range. Although it is possible to derive a range even for the enlarged radius R, the calculation becomes quite complicated, and thus we do not pursue this direction in this study. (Appendix M shows the computational procedure.) Further, an increase in the radius may decrease the screening rate.

6.2  Screening with Diagonal Constraint

When the matrix M is constrained to be a diagonal matrix, metric learning is reduced to feature weighting in which the Mahalanobis distance, equation 2.1, simply adapts a weight of each feature without combining different dimensions. Although correlation in different dimensions is not considered, this simpler formulation is useful to avoid a large computational cost for high-dimensional data mainly because of the following two reasons:

  • The number of variables in the optimization decreases from $d^2$ to $d$.

  • The semidefinite constraint for a diagonal matrix is reduced to the nonnegative constraint of diagonal elements.

Both properties are also beneficial for efficient screening rule evaluation; in particular, the second property makes the screening rule with the semidefinite constraint easier to evaluate.

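The minimization problem of the spherical rule with the semidefinite constraint, equation P2, is simplified as
$$\min_{x \in \mathbb{R}^d} x^\top h_{ijl} \quad \text{s.t.}\quad \|x - q\|_2^2 \le r^2,\ x \ge 0, \tag{P4}$$
where $h_{ijl} := \mathrm{diag}(H_{ijl})$. Let
$$L(x, \alpha, \beta) := x^\top h_{ijl} - \alpha\left(r^2 - \|x - q\|_2^2\right) - \beta^\top x$$
be the Lagrange function of equation P4, where $\alpha \ge 0$ and $\beta \ge 0$ are dual variables. The KKT conditions are written as
$$\partial L / \partial x = h_{ijl} + 2\alpha(x - q) - \beta = 0, \tag{6.4a}$$
$$\alpha\left(r^2 - \|x - q\|_2^2\right) = 0, \quad \beta_k x_k = 0, \tag{6.4b}$$
$$\alpha \ge 0, \quad r^2 - \|x - q\|_2^2 \ge 0, \quad \beta \ge 0, \quad x \ge 0. \tag{6.4c}$$
We derive the analytical representation of the optimal solution for cases of $\alpha > 0$ and $\alpha = 0$, respectively. For $\alpha > 0$, the following theorem is obtained.
Theorem 11.
If the optimal dual variable satisfies $\alpha > 0$, the optimal $x$ and $\beta$ of equation P4 can be written as
$$x_k = \begin{cases} q_k - \dfrac{h_{ijl,k}}{2\alpha}, & \text{if } h_{ijl,k} - 2\alpha q_k \le 0, \\ 0, & \text{otherwise}, \end{cases} \tag{6.5}$$
and
$$\beta = h_{ijl} + 2\alpha(x - q). \tag{6.6}$$
Then $x$ also satisfies
$$\|x - q\|_2^2 = r^2. \tag{6.7}$$

For $\alpha = 0$, the following theorem is obtained.

Theorem 12.
If the optimal dual variable satisfies $\alpha = 0$, the optimal $x$ and $\beta$ of equation P4 can be written as
$$x_k = \begin{cases} 0, & \text{if } h_{ijl,k} > 0, \\ \max\{q_k, 0\}, & \text{otherwise}, \end{cases} \tag{6.8}$$
and
$$\beta = h_{ijl}. \tag{6.9}$$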

The proofs for theorems 11 and 12 are in sections N.1 and N.2, respectively.

Based on the theorems, the optimal solution of equation P4 can be calculated analytically. The detail of the procedure is shown in section N.3, which requires $O(d^2)$ computations. Although this procedure obtains the solution by using the fixed steps of analytical calculations, for larger values of $d$, iterative optimization algorithms can be faster. For example, we can apply the SDLS dual ascent to problem P4 in which each iterative step takes $O(d)$.
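One simple iterative alternative, given here only as an illustrative sketch, is to combine theorems 11 and 12 with a bisection on the scalar $\alpha$. This is neither the analytical procedure of section N.3 nor the SDLS dual ascent; it is a swapped-in technique that relies on the observation that, for a center $q \ge 0$ (an assumption of this sketch), $\|x(\alpha) - q\|_2$ is nonincreasing in $\alpha$ when $x(\alpha)$ is given by equation 6.5. The function name `diagonal_rule_min` is hypothetical.

```python
import numpy as np

def diagonal_rule_min(h, q, r, iters=200):
    """Minimum of x^T h over {||x - q||_2 <= r, x >= 0}, assuming q >= 0."""
    # alpha = 0 (theorem 12): requires beta = h >= 0 and a feasible x.
    if np.all(h >= 0.0):
        x = np.where(h > 0.0, 0.0, np.maximum(q, 0.0))
        if np.sum((x - q) ** 2) <= r ** 2:
            return float(x @ h)

    # alpha > 0 (theorem 11): x(alpha) = max(q - h / (2 alpha), 0) and ||x - q|| = r.
    def x_of(alpha):
        return np.maximum(q - h / (2.0 * alpha), 0.0)

    def excess(alpha):
        return np.sum((x_of(alpha) - q) ** 2) - r ** 2

    hi = np.linalg.norm(h) / (2.0 * r)  # here ||x - q|| <= r, so excess(hi) <= 0
    lo = hi
    while excess(lo) < 0.0:             # shrink alpha until the sphere is exceeded
        lo /= 2.0
    for _ in range(iters):              # bisection on the monotone function excess
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return float(x_of(hi) @ h)          # x_of(hi) is feasible by construction
```

Each evaluation of `excess` costs $O(d)$, so the total cost is $O(d)$ times the number of bisection steps; the triplet is screened into $\mathcal{R}^*$ when the returned value exceeds 1.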

6.3  Applicability to More General Formulation

Throughout the letter, we analyze screening theorems based on the optimization problem defined by equation Primal. RTLM is the Frobenius norm-regularized triplet loss minimization, which has been shown to be an effective formulation of metric learning (Schultz & Joachims, 2004; Shen et al., 2014). Further, with slight modifications, our screening framework can accommodate a wider range of metric learning methods. Here we redefine the optimization problem as follows:
$$\min_{M \succeq O} \sum_i \ell(\langle M, C_i\rangle) + \frac{\lambda}{2}\|M\|_F^2, \tag{6.10}$$
where $C_i \in \mathbb{R}^{d \times d}$ is a constant matrix. All our sphere bounds (GB, PGB, DGB, RPB, and RRPB) still hold for this general representation if the loss function $\ell: \mathbb{R} \to \mathbb{R}$ is convex subdifferentiable. The rules (spherical rule, sphere with semidefinite constraint, and sphere with linear constraint) can also be constructed if the loss function has the form of the hinge type loss function, equations 2.2 and 2.3, which includes the standard hinge, smoothed hinge, and squared hinge loss functions.
We can incorporate an additional linear term into the objective function 6.10. Defining a pseudo-loss function $\tilde{\ell}(x) = -x$, we write the primal problem with a linear term as
$$\min_{M \succeq O} P_\lambda(M) := \sum_{ijl} \ell(\langle M, H_{ijl}\rangle) + \tilde{\ell}(\langle M, C\rangle) + \frac{\lambda}{2}\|M\|_F^2, \tag{6.11}$$
which can be seen as a special case of equation 6.10 because $\tilde{\ell}$ is convex subdifferentiable. Suppose that $\eta_{ij} \in \{0, 1\}$ indicates whether $x_j$ is a target neighbor of $x_i$, which is a neighbor of $x_i$ having the same label. When we define $C := -\sum_{ij} \eta_{ij}(x_i - x_j)(x_i - x_j)^\top$ and employ the hinge loss, equation 2.2, this formulation is the well-known LMNN (Weinberger & Saul, 2009) with the additional Frobenius norm regularization. Another interpretation of this linear term is the trace norm regularization (Kulis, 2013). For the pseudo-loss term $\tilde{\ell}$, the derivative is $\nabla\tilde{\ell}(x) = -1$, and the conjugate is $\tilde{\ell}^*(-a) = 0$ if $-a = -1$; otherwise, $\infty$, where $a$ is a dual variable. Then, by using the derivation of the dual in appendix A, the dual problem is modified as
$$\max_{0 \le \alpha \le 1,\ a = 1,\ \Gamma \succeq O} D_\lambda(\alpha, \Gamma) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha, a, \Gamma)\|_F^2,$$
where
$$M_\lambda(\alpha, a, \Gamma) := \frac{1}{\lambda}\left[\sum_{ijl} \alpha_{ijl} H_{ijl} + aC + \Gamma\right].$$
Because equation 6.11 is a special case of equation 6.10, all spheres can be derived, and we can construct the same screening rules for $\alpha_{ijl}$ for $(i,j,l) \in \mathcal{T}$. The only difference is that the dual variable $a$ is not associated with any screening rule because it is fixed to 1 by the dual constraint.
Regarding the loss term, pairwise- and quadruplet-loss functions can also be incorporated into our framework. The pairwise approach considers a set of pairs in the same class $\mathcal{S}$ and a set of pairs in the different classes $\mathcal{D}$. Davis et al. (2007) considered constraints with threshold parameters $U$ and $L$: $d_M^2(x_i, x_j) \le U$ for $(i,j) \in \mathcal{S}$ and $d_M^2(x_i, x_l) \ge L$ for $(i,l) \in \mathcal{D}$. Let $\ell_t(x) = [t - x]_+$ be a hinge loss function with threshold $t$. By using $\ell_t$, the above two constraints can be relaxed to soft constraints that result in
$$\min_{M \succeq O} \sum_{(i,j) \in \mathcal{S}} \ell_{-U}(\langle M, -C_{ij}\rangle) + \sum_{(i,l) \in \mathcal{D}} \ell_{L}(\langle M, C_{il}\rangle) + \frac{\lambda}{2}\|M\|_F^2.$$
Because of the threshold parameters, the second term of the dual problem, Dual2, changes from $\alpha^\top \mathbf{1}$ to $\alpha^\top t$, where $t := [L, \ldots, L, -U, \ldots, -U]^\top \in \mathbb{R}^{|\mathcal{D}| + |\mathcal{S}|}$. Our bounds still hold because $\ell_t$ is convex subdifferentiable, and screening rules are formulated as evaluating whether the inner product $\langle M, C_{il}\rangle$ (or $\langle M, -C_{ij}\rangle$) is larger or smaller than the threshold $t$.
Law et al. (2013) proposed a loss function based on a quadruplet of instances. The basic idea is to compare pairs of dissimilarity $d_M^2(x_i, x_j)$ and $d_M^2(x_k, x_l)$. For example, when $(k,l)$ should be more dissimilar than $(i,j)$, the loss is defined as $\ell(d_M^2(x_k, x_l) - d_M^2(x_i, x_j))$. They define the following optimization problem,
$$\min_{M \succeq O} \sum_{(i,j,k,l) \in \mathcal{Q}} \ell(\langle M, C_{ijkl}\rangle) + \frac{\lambda}{2}\|M\|_F^2,$$
where $C_{ijkl} = (x_k - x_l)(x_k - x_l)^\top - (x_i - x_j)(x_i - x_j)^\top$ and $\mathcal{Q}$ is a set of quadruplets. This is also a special case of equation 6.10.

Note that pairwise-, triplet- and quadruplet-loss functions can be used simultaneously, and safe screening can be applied to remove any of those loss terms.

7  Experiment

We evaluate the performance of safe triplet screening using the benchmark data sets listed in Table 2, which are from LIBSVM (Chang & Lin, 2011) and the Keras data set (Chollet et al., 2015). We create a set of triplets by following the approach by Shen et al. (2014), in which $k$ neighbors in the same class $x_j$ and $k$ neighbors in a different class $x_l$ are sampled for each $x_i$. We employed the regularization path setting in which RTLM is optimized for a sequence $\lambda_0, \lambda_1, \ldots, \lambda_T$. To determine $\lambda_0 = \lambda_{\max}$, from a sufficiently large $\lambda$ in which $\mathcal{R}^*$ is empty, we gradually reduced $\lambda$ by multiplying it by 0.9 and started the regularization path calculation from $\lambda$ in which $\mathcal{R}^*$ is not empty. To generate the next value of $\lambda$, we used $\lambda_t = 0.9\lambda_{t-1}$, and the path terminated when the following condition is satisfied:
$$\frac{\mathrm{loss}(\lambda_{t-1}) - \mathrm{loss}(\lambda_t)}{\mathrm{loss}(\lambda_{t-1})} \times \frac{\lambda_{t-1}}{\lambda_{t-1} - \lambda_t} < 0.01,$$
where $\mathrm{loss}(\lambda_t)$ is the loss function value at $\lambda_t$. We randomly selected 90% of the instances of each data set five times, and the average is shown as the experimental result. As a base optimizer, we employed the projected gradient descent of the primal problem, and the iteration terminated when the duality gap became less than $10^{-6}$. For the loss function $\ell$, we used the smoothed hinge loss of $\gamma = 0.05$. (We also provide results for the hinge loss in section 7.4.1.) We performed safe triplet screening after every 10 iterative cycles of the gradient descent. We refer to the first screening for a specific $\lambda_t$, in which the solution for the previous $\lambda_{t-1}$ is used as the reference solution, as regularization path screening. The screening performed during the optimization process (after regularization path screening) is termed dynamic screening. We performed both of these screening procedures for all experiments. As a baseline, we refer to the RTLM optimization without screening as naive optimization. We initialized with $M = O$ at $\lambda_0$ because $M = O$ is the optimal solution when $\lambda \to \infty$. When the regularization coefficient changes, $M$ starts from the previous solution $\hat{M}$ (warm start). The step size of the gradient descent was determined by
$$\frac{1}{2}\left(\frac{\langle \Delta M, \Delta G\rangle}{\langle \Delta G, \Delta G\rangle} + \frac{\langle \Delta M, \Delta M\rangle}{\langle \Delta M, \Delta G\rangle}\right),$$
where $\Delta M = M_t - M_{t-1}$ and $\Delta G = \nabla P_\lambda(M_t) - \nabla P_\lambda(M_{t-1})$ (Barzilai & Borwein, 1988). In SDLS dual ascent, we used the conjugate gradient method (Yang, 1993) to find the minimum eigenvalue.
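Two ingredients of this protocol, the averaged Barzilai-Borwein step size and the path termination test, are simple enough to sketch directly. The code below is a minimal illustration and not the experimental implementation; the function names `bb_step` and `path_should_stop` are hypothetical, and no safeguard is included for the degenerate case $\langle \Delta M, \Delta G\rangle = 0$.

```python
import numpy as np

def bb_step(M_prev, M_curr, G_prev, G_curr):
    """Average of the two Barzilai-Borwein step sizes for matrix iterates."""
    dM = M_curr - M_prev
    dG = G_curr - G_prev
    mg = np.sum(dM * dG)  # <dM, dG>; assumed nonzero in this sketch
    return 0.5 * (mg / np.sum(dG * dG) + np.sum(dM * dM) / mg)

def path_should_stop(loss_prev, loss_curr, lam_prev, lam_curr, tol=0.01):
    """Termination test: relative loss decrease scaled by the relative step in lambda."""
    rel_drop = (loss_prev - loss_curr) / loss_prev
    return rel_drop * lam_prev / (lam_prev - lam_curr) < tol
```

For a quadratic objective whose gradient is $cM$, both Barzilai-Borwein quotients equal $1/c$, so `bb_step` returns the exact inverse curvature; this gives a quick sanity check of the formula.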
Table 2:
Summary of Data Sets.
#dimension  #sample  #classes  k  #triplet  λmax  λmin
Iris 150  546,668 1.3e+2.3e+
Wine 13 178  910,224 2.0e+5.1e+
Segment 19 2310 20 832,000 2.5e+4.2e+
Satimage 36 4435 15 898,200 1.0e+8.8e+
Phishing 68 11,055 487,550 5.0e+2.0e-
SensIT Vehicle 100 78,823 638,469 1.0e+2.9e+
a9a 16a 32,561 732,625 1.2e+3.1e+
Mnist 32a 60,000 10 1,350,025 7.0e+9.6e-
Cifar10 200a 50,000 10 180,004 2.0e+3.3e+
Rcv1.multiclass 200b 15,564 53 126,018 3.0e+6.0e-

Note: #triplet and λmin are the average value for subsampled random trials.

aThe dimension was reduced by AutoEncoder.

bThe dimension was reduced by PCA.

7.1  Comparing Rules

We first validate the screening performance (screening rate and CPU time) of each screening rule introduced in section 4 by using algorithm 2 without the range-based screening process. Here, the screening rate is defined by $\#\text{screened triplets} \,/\, |\{(i,j,l) \in \mathcal{T} \mid \langle H_{ijl}, \hat{M}\rangle > 1 \text{ or } \langle H_{ijl}, \hat{M}\rangle < 1 - \gamma\}|$, where $\hat{M}$ is the solution after convergence.
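The screening rate defined above can be computed from the inner products $\langle H_{ijl}, \hat{M}\rangle$ at the converged solution. The following sketch only illustrates the definition; the function name `screening_rate` and its argument layout are hypothetical.

```python
import numpy as np

def screening_rate(n_screened, inner_products, gamma):
    """Screened triplets divided by the number of triplets that are screenable at M_hat."""
    inner_products = np.asarray(inner_products, dtype=float)
    # A triplet is screenable if it lies in the zero region (> 1)
    # or in the linear region (< 1 - gamma) of the loss at the solution.
    screenable = np.sum((inner_products > 1.0) | (inner_products < 1.0 - gamma))
    return n_screened / screenable
```

Triplets whose inner product falls in $[1 - \gamma, 1]$ can never be screened, which is why they are excluded from the denominator.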

7.1.1  GB-Based Rules

Here we use the GB and PGB as spheres, and we observe the effect of the semidefinite constraint in the rules. As a representative result, Figure 6a compares the performance of the rules by using segment data.

Figure 6:

Comparison of screening rules on the segment data set. For both panels a and b, the plots are arranged as follows. (Top left) Performance of regularization path screening. (Bottom left) Ratio of CPU time compared with the naive optimization for each $\lambda$. (Right) Enlargement of the upper left plot for the range $-\log_{10}(\lambda) \in [-1, -0.6]$.

First, except for the GB, the rules maintain a high screening rate for the entire regularization path, as shown in the top left plot. Note that this rate is only for regularization path screening, meaning that dynamic screening can further increase the screening rate during the optimization, as discussed in section 7.1.2. The bottom left plot of the same figure shows that PGB and GB+Linear are the most efficient and achieved CPU times approximately 2 to 10 times faster than the naive optimization. The screening rate of the GB was severely reduced along the latter half of the regularization path. As illustrated in Figure 2a, the center of the GB can be external to the semidefinite cone, so the sphere of the GB contains a larger proportion of the region violating the constraint $M \succeq O$ than the spheres whose centers lie inside the semidefinite cone. This causes performance deterioration particularly for smaller values of $\lambda$, because the minimum of the loss term is usually outside the semidefinite cone.

The screening rates of GB+Linear and GB+Semidefinite are slightly higher than that of the PGB (the plot on the right), which can be seen from their geometrical relation illustrated in Figure 2a. GB+Semidefinite achieved the highest screening rate, but eigenvalue decomposition is necessary to repeatedly perform the calculation in SDLS, which resulted in the CPU time increasing along the latter half of the path. Although PGB+Semidefinite is also tighter than PGB, the CPU time increased from approximately $-\log_{10}(\lambda) \approx -4$ to $-3$. Because the center of PGB is positive semidefinite, only the minimum eigenvalue is required (see section 4.2), but it can still increase the CPU time.

Among the screening methods compared here, our empirical analysis suggests that the use of the spherical rule with PGB, in which a semidefinite constraint is implicitly incorporated in the projection process, is the most cost-effective. We did not observe that the other approach to considering the semidefinite (or relaxed linear) constraint in the rule substantially outperforms PGB in terms of CPU time despite its high screening rate. We observed the same tendency for DGB. The screening rate did not change markedly even if the semidefinite constraint was explicitly considered.

7.1.2  DGB-Based Rules

Next, by using the DGB, we compared the performance of the three rules presented in section 4. Figure 6b shows the results, which are similar to those obtained for the GB, shown in Figure 6a. The semidefinite and the linear constraint slightly improve the rate. However, the large computational cost for screening with the semidefinite constraint caused the overall CPU time to increase. Therefore, although the linear constraint is much easier to evaluate, the CPU time was almost the same as that required for the DGB because of the slight improvement in the screening rate.

7.2  Comparing Bounds

Here we compare the screening performance (screening rate and CPU time) of each bound introduced in section 3 by using algorithm 2 without the range-based screening process. We do not use RPB because it needs the strictly optimal previous solution.

Based on the results in the previous section, we employed the spherical rule. The result obtained for the phishing data set is shown in Figure 7. The screening rate of the GB (top right) again decreased from the middle of the horizontal axis compared with the other spheres. The other spheres also have lower screening rates for small values of $\lambda$. As mentioned in section 6.1, the radii of GB, DGB, RPB, and RRPB have the form $r^2 = a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}$, meaning that if $\lambda \to 0$, then $r \to \infty$. In the case of the PGB, although the dependency on $\lambda$ cannot be written explicitly, the same tendency was observed. We see that the PGB and RRPB have similar results as suggested by theorem 8, and the screening rate of the DGB is lower than that of the RRPB, as suggested by theorem 9. A comparison of the PGB and RRPB indicated that the former achieved a higher screening rate, but the latter is more efficient with less CPU time, as shown in the plot at the bottom right, because the PGB requires a matrix inner product calculation for each triplet. Bounds other than the GB are more than twice as fast as the naive calculation for most values of $\lambda$.

Figure 7:

Comparison of spherical bounds on the phishing data set. The heat maps on the left show the dynamic screening rate. The vertical axes of these heat maps represent the number of iterative steps required for the optimization divided by 10 to perform the screening. (Top right) Rate of regularization path screening. (Bottom right) Ratio of CPU time compared with naive optimization.

A comparison of the dynamic screening rate (the three plots on the left in Figure 7) of PGB and RRPB shows that the rate of PGB is higher. In terms of the regularization path screening (top right), RRPB and PGB have similar screening rates, but PGB has a higher dynamic screening rate. Along the latter half of the regularization path, the number of gradient descent iterations increases; consequently, the dynamic screening significantly affects the CPU time, and the PGB becomes faster despite the additional computation it requires to compute the inner product.

We further evaluate the performance of the range-based extension described in section 6.1. Figure 8 shows the rate of the range-based screening for the segment data set. The figure shows that a wide range of $\lambda$ can be screened, particularly for small values of $\lambda$. Although the range is narrower for large values of $\lambda$ than for small values, a high screening rate is observed when $\lambda$ approaches $\lambda_0$. A significant advantage of this approach is that for those triplets screened by using the specified range, we no longer need to evaluate the screening rule as long as $\lambda$ is within the range.

Figure 8:

Screening rate of range-based screening on the segment data set. The color indicates the screening rate for $\lambda$ on the vertical axis based on the reference solution using $\lambda_0$ on the horizontal axis. The accuracy of the reference solution is $10^{-4}$ and $10^{-6}$ for the plots on the left and right, respectively.

The total CPU time for the regularization path is shown in Figure 9. In addition to GB, PGB, DGB, and RRPB, we further evaluate the performance when PGB and RRPB are used simultaneously. The use of two rules can improve the screening rate; however, additional computations are required to evaluate the rule. In the figure, for four out of six data sets, the PGB+RRPB combination requires the least CPU time.

Figure 9:

Total CPU time of regularization path (seconds). The term RRPB+PGB indicates that the spherical rules are performed with these two spheres. “Baseline” indicates the computational time without screening. The bars show the total time and the proportion of the optimization algorithm and the screening calculation.

7.3  Evaluating the Practical Efficiency

We next considered a computationally more expensive setting to evaluate the effectiveness of the safe screening approach in a practical situation. To investigate the regularization path more precisely, we set a finer grid of regularization parameters defined as $\lambda_t = 0.99\lambda_{t-1}$. We also incorporated the well-known active set heuristics to conduct our experiments on larger data sets. Note that because of the above differences, the computational time shown here cannot be directly compared with the results in sections 7.1 and 7.2. The active set method uses only a subset of triplets of which the loss is greater than 0 as the active set. The gradient is calculated by using only the active set, and the overall optimality is confirmed when the iteration converges. We employed the active set update strategy shown by Weinberger and Saul (2009), in which the active set is updated once every ten iterative cycles.

Table 3 compares the CPU time for the entire regularization path. Based on the results in the previous section, we employed RRPB and RRPB+PGB (evaluating rules based on both spheres) for triplet screening. Further, the range-based screening described in section 6.1 is also performed using RRPB, for which we evaluate the range at the beginning of the optimization for each λ, as shown in algorithm 2. Our safe triplet screening accelerates the optimization process by up to about 10 times compared with the simple active set method. The results for higher-dimensional data sets with a diagonal M are presented in section 7.4.2.

Table 3:
Evaluation of the Total CPU Time (Seconds) with the Active Set Method.

Method \ Data Set      phishing    SensIT      a9a      mnist      cifar10     rcv
ActiveSet              7,989.5     16,352.1    758.7    3,788.1    11,085.7    94,996.3
ActiveSet+RRPB         2,126.2     3,555.6     70.1     871.1      1,431.3     43,174.9
ActiveSet+RRPB+PGB     2,133.2     3,046.9     72.1     897.9      1,279.7     38,231.1

Note: Results in bold indicate the fastest method.

7.4  Empirical Evaluation of Three Special Cases

Here we evaluate three special cases of our formulation: nonsmoothed hinge loss, the Mahalanobis distance with a diagonal matrix, and dynamic screening for a certain value of λ.

7.4.1  Nonsmoothed Hinge Loss

In the previous experiments, we used the smoothed hinge loss function (γ=0.05). However, the nonsmoothed hinge loss function (γ=0) is also widely used. Figure 10 shows the screening result of the PGB spherical rule for the segment data. Here, the loss function of RTLM is the hinge loss function, and the other settings are the same as those of the experiments in the main text. The results show that PGB achieved a high screening rate and that the CPU time substantially improved.

Figure 10:

Performance evaluation of PGB for the hinge loss setting. The heat map on the left shows the dynamic screening rate with the vertical axis showing the number of iterative cycles for optimization divided by 10 at which screening is performed. (Top right) Rate of regularization path screening. (Bottom right) Ratio of CPU time compared with the naive optimization.

7.4.2  Learning with Higher-Dimensional Data Using Diagonal Matrix

Here we evaluate the screening performance when the matrix M is confined to being a diagonal matrix. Using the same setting as in section 7.3, Table 4 shows a comparison with the ActiveSet method. We used RRPB and RRPB+PGB, both of which largely reduced the CPU time. For the Gisette data set, which has the largest dimension (5,000), the plain active set method did not terminate even after 250,000 s.

Table 4:
Total Time (Seconds) of the Regularization Path for Diagonal M.

Method \ Data Set      USPS        Madelon     Colon-Cancer    Gisette
ActiveSet              2,485.5     7,005.8     3,149.8         –
ActiveSet+RRPB         326.7       593.4       632.2           133,870.0
ActiveSet+RRPB+PGB     336.6       562.4       628.2           127,123.8
#dimension             256         500         2,000           5,000
#samples               7,291       2,000       62              6,000
#triplet               656,200     720,400     38,696          1,215,225
k                      10          20                          15
λmax                   1.0e+7      2.0e+14     5.0e+7          4.5e+8
λmin                   1.9e+3      4.7e+11     7.0e+3          2.1e+3

Notes: The results in bold indicate the fastest method. The Gisette data set did not produce results by ActiveSet because of the time limitation.

7.4.3  Dynamic Screening for Fixed λ

Here, we evaluate the performance of dynamic screening for a fixed λ. For λ, we used λmin in Table 2 for which the screening rate was relatively low in our results thus far (e.g., see Figure 6a). Figure 11 compares the computational time of the naive approach without screening and with the dynamic screening shown in algorithm 1. The plots in Figure 11a show that dynamic screening accelerates the learning process. The plots in Figure 11b show the performance of the active set strategy, indicating that the combination of dynamic screening and the active set strategy is effective for further acceleration.

Figure 11:

Evaluation of the computational time for dynamic screening. The computational time required for dynamic screening (a) without the active set and (b) with the active set. “Baseline” indicates the results obtained for the naive method without screening and the active set strategy. The bars show the total time and the proportion of the optimization algorithm and the screening calculation.

7.5  Effect of Number of Triplets on Prediction Accuracy

Finally, we examine the relation between the number of triplets contained in T and the prediction accuracy of the classification. We employed the nearest-neighbor (NN) classifier to measure the prediction performance of the learned metric. The data set was randomly divided into training data (60%), validation data (20%), and test data (20%). The regularization parameter λ was varied from $10^5$ to 0.1 and was chosen by minimizing the validation error. The experiment was performed 10 times by randomly partitioning the data set in different ways.

The results are shown in Figure 12, which summarizes the CPU time and test error rate for different settings of the number of triplets. The horizontal axes in all four plots, a to d, represent the number of neighbors k used to define the original triplet set T as described at the beginning of section 7. Figure 12a shows the CPU time to calculate the entire regularization path with and without screening. Here “Without Screening” indicates the ActiveSet approach, and “With Screening” indicates the ActiveSet+RRPB approach. These results show that the learning time increases as k increases, and safe triplet screening shows larger decreases in the CPU time for larger values of k. Figures 12b to 12d show the test error rates calculated by 10 NN, 20 NN, and 30 NN classifiers, respectively. In Figure 12b, the 10 NN test error is minimized at k=6, with screening requiring less than approximately 2,000 seconds, whereas the naive approach (Without Screening) can calculate only approximately k=4 in the same computational time. In Figure 12c, the 20 NN test error is minimized at k=12, with screening requiring approximately 4,000 seconds, whereas the naive approach can calculate only approximately k=8. In Figure 12d, the 30 NN test error is minimized at k=15, with screening requiring approximately 5,000 seconds, whereas the naive approach can calculate only approximately k=9. These results indicate that the number of neighbors, k, significantly affects the prediction accuracy, and a sufficiently large k is often necessary to achieve the best prediction performance.

Figure 12:

CPU time (seconds) and test error rate on the phishing data set.

8  Conclusion

We introduced safe triplet screening for large-margin metric learning. Three screening rules and six spherical bounds were derived, and the relations among them were analyzed. We further proposed a range-based extension for the regularization path calculation. Our screening technique for metric learning is particularly significant compared with other screening studies because of the large number of triplets and the semidefinite constraint. Our numerical experiments verified the effectiveness of safe triplet screening using several benchmark data sets.

Appendix A:  Dual Formulation

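To derive the dual problem, we first rewrite the primal problem as
$$\min_{M,t}\ \sum_{ijl} \ell(t_{ijl}) + \lambda R(M) \quad \text{s.t.}\quad M \succeq O,\ t_{ijl} = \langle M, H_{ijl}\rangle,$$
where $t$ is a $|\mathcal{T}|$-dimensional vector that contains all $t_{ijl}$ for $(i,j,l) \in \mathcal{T}$ and
$$R(M) = \frac{1}{2}\|M\|_F^2. \tag{A.1}$$
The Lagrange function is
$$L(M,t,\alpha,\Gamma) := \sum_{ijl} \ell(t_{ijl}) + \lambda R(M) + \sum_{ijl} \alpha_{ijl}\bigl(t_{ijl} - \langle M, H_{ijl}\rangle\bigr) - \langle M, \Gamma\rangle,$$
where $\alpha \in \mathbb{R}^{|\mathcal{T}|}$ and $\Gamma \in \mathbb{R}_+^{d\times d}$ are Lagrange multipliers. Let
$$\ell^*(-\alpha_{ijl}) := \sup_{t_{ijl}} \{(-\alpha_{ijl})\, t_{ijl} - \ell(t_{ijl})\}, \tag{A.2}$$
$$R^*(M_\lambda(\alpha,\Gamma)) := \sup_{M} \{\langle M_\lambda(\alpha,\Gamma), M\rangle - R(M)\}, \tag{A.3}$$
be convex conjugate functions (Boyd & Vandenberghe, 2004) of $\ell$ and $R$, where
$$M_\lambda(\alpha,\Gamma) := \frac{1}{\lambda}\Bigl[\sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma\Bigr]. \tag{A.4}$$
Then the dual function is written as
$$\begin{aligned} D_\lambda(\alpha,\Gamma) &:= \inf_{M,t} L(M,t,\alpha,\Gamma) \\ &= -\sum_{ijl} \sup_{t_{ijl}} \{(-\alpha_{ijl})\, t_{ijl} - \ell(t_{ijl})\} - \lambda \sup_{M} \{\langle M_\lambda(\alpha,\Gamma), M\rangle - R(M)\} \\ &= -\sum_{ijl} \ell^*(-\alpha_{ijl}) - \lambda R^*(M_\lambda(\alpha,\Gamma)). \end{aligned}$$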
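From the Karush-Kuhn-Tucker (KKT) condition, we obtain
$$\nabla_M L = \lambda \nabla R(M) - \lambda M_\lambda(\alpha,\Gamma) = O, \tag{A.5a}$$
$$\nabla_{t_{ijl}} L = \nabla \ell(t_{ijl}) + \alpha_{ijl} = 0, \tag{A.5b}$$
$$\Gamma \succeq O,\quad M \succeq O,\quad \langle M, H_{ijl}\rangle = t_{ijl},\quad \langle M, \Gamma\rangle = 0, \tag{A.5c}$$
where, in the case of hinge loss,
$$\nabla \ell(x) = \begin{cases} 0, & x > 1, \\ -c, & x = 1, \\ -1, & x < 1, \end{cases}$$
where $c \in [0,1]$, and in the case of smoothed hinge loss,
$$\nabla \ell(x) = \begin{cases} 0, & x > 1, \\ -\frac{1}{\gamma}(1-x), & 1-\gamma \le x \le 1, \\ -1, & x < 1-\gamma. \end{cases}$$
From these two equations and equation A.5b, we see that $0 \le \alpha \le 1$. Substituting equation A.5b into equation A.2 and considering the above constraint, the conjugate of the loss function can be transformed into
$$\ell^*(-\alpha_{ijl}) = \frac{\gamma}{2}\alpha_{ijl}^2 - \alpha_{ijl}.$$
Note that this equation holds for the cases of both hinge loss (by setting $\gamma=0$) and smoothed hinge loss ($\gamma>0$). Substituting equation A.5a into A.3, the conjugate of the regularization term $R$ is written as
$$R^*(M_\lambda(\alpha,\Gamma)) = R(M_\lambda(\alpha,\Gamma)) = \frac{1}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2.$$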
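Therefore, the dual problem is
$$\max_{0 \le \alpha \le 1,\ \Gamma \succeq O}\ D_\lambda(\alpha,\Gamma) = -\sum_{ijl} \ell^*(-\alpha_{ijl}) - \frac{\lambda}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2. \tag{Dual1}$$
Because the second term, $\max_{\Gamma \succeq O} -\frac{1}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2$, is equivalent to the projection onto a semidefinite cone (Boyd & Xiao, 2005; Malick, 2004), the above problem (Dual1) can be simplified as
$$\max_{0 \le \alpha \le 1}\ D_\lambda(\alpha) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha)\|_F^2, \tag{Dual2}$$
where
$$M_\lambda(\alpha) := \frac{1}{\lambda}\Bigl[\sum_{ijl} \alpha_{ijl} H_{ijl}\Bigr]_+ .$$
For the optimal $M^\star$, each triplet in $\mathcal{T}$ can be categorized into the following three groups:
$$\begin{aligned} \mathcal{L}^\star &:= \{(i,j,l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star\rangle < 1-\gamma\}, \\ \mathcal{C}^\star &:= \{(i,j,l) \in \mathcal{T} \mid 1-\gamma \le \langle H_{ijl}, M^\star\rangle \le 1\}, \\ \mathcal{R}^\star &:= \{(i,j,l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star\rangle > 1\}. \end{aligned} \tag{A.6}$$
Based on equations A.5b and A.5c, it becomes clear that $\alpha^\star_{ijl} = -\nabla \ell(\langle M^\star, H_{ijl}\rangle)$, by which the following rules are obtained:
$$(i,j,l) \in \mathcal{L}^\star \Rightarrow \alpha^\star_{ijl} = 1,\qquad (i,j,l) \in \mathcal{C}^\star \Rightarrow \alpha^\star_{ijl} \in [0,1],\qquad (i,j,l) \in \mathcal{R}^\star \Rightarrow \alpha^\star_{ijl} = 0. \tag{A.7}$$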
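
The projection step that turns Dual1 into Dual2 can be made concrete with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' code: the function names `psd_project`, `primal_from_dual`, and `dual_objective` are hypothetical, and the matrices H_ijl are assumed to be supplied as a list of symmetric arrays. The projection clips the negative eigenvalues of a symmetric matrix to zero, and the dual objective follows the Dual2 expression above.

```python
import numpy as np

def psd_project(A):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    A = (A + A.T) / 2.0                      # guard against round-off asymmetry
    w, V = np.linalg.eigh(A)                 # eigenvalues w, eigenvectors in columns of V
    return (V * np.maximum(w, 0.0)) @ V.T    # keep only the nonnegative eigenvalues

def primal_from_dual(alpha, Hs, lam):
    """M_lambda(alpha) = (1/lambda) [sum_ijl alpha_ijl H_ijl]_+ ."""
    S = sum(a * H for a, H in zip(alpha, Hs))
    return psd_project(S) / lam

def dual_objective(alpha, Hs, lam, gamma):
    """Dual2: -(gamma/2)||alpha||^2 + alpha^T 1 - (lambda/2)||M_lambda(alpha)||_F^2."""
    alpha = np.asarray(alpha, dtype=float)
    M = primal_from_dual(alpha, Hs, lam)
    return -0.5 * gamma * alpha @ alpha + alpha.sum() - 0.5 * lam * np.sum(M * M)
```

The eigendecomposition costs O(d^3) per evaluation, which is one reason the semidefinite constraint makes screening for this problem more demanding than for vector-valued models.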

Appendix B:  Proof of Lemma 1

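The reduced-size problem can be represented by
$$\min_{M,t}\ \sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell(t_{ijl}) + \sum_{(i,j,l) \in \hat{\mathcal{L}}} \Bigl(1 - \frac{\gamma}{2} - t_{ijl}\Bigr) + \frac{\lambda}{2}\|M\|_F^2 \quad \text{s.t.}\quad t_{ijl} = \langle M, H_{ijl}\rangle,\ (i,j,l) \in \mathcal{T},\quad M \succeq O.$$
Then the Lagrangian is
$$\tilde{L}(M,t,\alpha,\Gamma) = \sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell(t_{ijl}) + \sum_{(i,j,l) \in \hat{\mathcal{L}}} \Bigl(1 - \frac{\gamma}{2} - t_{ijl}\Bigr) + \frac{\lambda}{2}\|M\|_F^2 + \sum_{(i,j,l) \in \mathcal{T}} \alpha_{ijl}\bigl(t_{ijl} - \langle M, H_{ijl}\rangle\bigr) - \langle M, \Gamma\rangle. \tag{B.1}$$
The dual function is written as
$$\begin{aligned} \tilde{D}_\lambda(\alpha,\Gamma) := \inf_{M,t} \tilde{L}(M,t,\alpha,\Gamma) = &-\sum_{(i,j,l) \in \tilde{\mathcal{T}}} \sup_{t_{ijl}} \{(-\alpha_{ijl})\, t_{ijl} - \ell(t_{ijl})\} - \sum_{(i,j,l) \in \hat{\mathcal{R}}} \sup_{t_{ijl}} \{(-\alpha_{ijl})\, t_{ijl}\} \\ &- \sum_{(i,j,l) \in \hat{\mathcal{L}}} \sup_{t_{ijl}} \{(1-\alpha_{ijl})\, t_{ijl}\} + \Bigl(1 - \frac{\gamma}{2}\Bigr)|\hat{\mathcal{L}}| - \lambda \sup_{M} \{\langle M_\lambda(\alpha,\Gamma), M\rangle - R(M)\}, \end{aligned}$$
where $R(M)$ and $M_\lambda(\alpha,\Gamma)$ are defined by equations A.1 and A.4, respectively. Based on the second and third terms of the previous equation, we see
$$\alpha_{ijl} = 0,\quad (i,j,l) \in \hat{\mathcal{R}}, \tag{B.2}$$
$$\alpha_{ijl} = 1,\quad (i,j,l) \in \hat{\mathcal{L}}, \tag{B.3}$$
which prevent $\tilde{D}_\lambda$ from approaching $-\infty$. Then constraints B.2 and B.3 enable us to further transform the dual objective into
$$\begin{aligned} \tilde{D}_\lambda(\alpha,\Gamma) &= -\sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell^*(-\alpha_{ijl}) + \sum_{(i,j,l) \in \hat{\mathcal{L}}} \Bigl(1 - \frac{\gamma}{2}\Bigr) - \frac{\lambda}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2 \\ &= -\frac{\gamma}{2}\|\alpha_{\tilde{\mathcal{T}}}\|_2^2 + \alpha_{\tilde{\mathcal{T}}}^\top \mathbf{1} + \Bigl(1 - \frac{\gamma}{2}\Bigr)|\hat{\mathcal{L}}| - \frac{\lambda}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2 \\ &= -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2. \end{aligned}$$
Thus, the dual problem is written as
$$\max_{0 \le \alpha \le 1,\ \Gamma \succeq O}\ D_\lambda(\alpha,\Gamma) = -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2 \quad \text{s.t.}\quad \alpha_{\hat{\mathcal{L}}} = \mathbf{1},\ \alpha_{\hat{\mathcal{R}}} = \mathbf{0}. \tag{B.4}$$
This is the same optimization problem as Dual1 except that $\alpha_{\hat{\mathcal{L}}}$ and $\alpha_{\hat{\mathcal{R}}}$ are fixed at their optimal values in Dual1. Hence, problems B.4 and Dual1 have the same optimal solution. Given the optimal dual variables $\alpha^\star$ and $\Gamma^\star$, the optimal primal $M^\star$ can be derived by
$$M^\star = \frac{1}{\lambda}\Bigl[\sum_{(i,j,l) \in \mathcal{T}} \alpha^\star_{ijl} H_{ijl} + \Gamma^\star\Bigr], \tag{B.5}$$
which follows from $\nabla_M \tilde{L} = O$. Because equation B.5 is exactly the same transformation as equation 2.4, the same optimal primal $M^\star$ must be obtained.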

Appendix C:  Proof of Theorem 2 (GB)

The following theorem is a well-known optimality condition for the general convex optimization problem:

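Theorem 14
(Optimality Condition of Convex Optimization, Bertsekas, 1999). In the minimization problem $\min_{x \in \mathcal{F}} f(x)$, where the feasible region $\mathcal{F}$ and the function $f(x)$ are convex, the necessary and sufficient condition that $x^\star$ is the optimal solution is
$$\exists\, \nabla f(x^\star) \in \partial f(x^\star)\ :\ \nabla f(x^\star)^\top (x^\star - x) \le 0,\quad \forall x \in \mathcal{F},$$
where $\partial f(x^\star)$ represents the set of subgradients at $x^\star$.
From theorem 14, the following holds for the optimal solution $M^\star$:
$$\langle \nabla P_\lambda(M^\star), M^\star - M\rangle \le 0,\quad \forall M \succeq O. \tag{C.1}$$
Let $\Xi_{ijl}(M)$ be the subgradient of the loss function $\ell(\langle M, H_{ijl}\rangle)$ at $M$. Then $\nabla P_\lambda(M)$ is written as
$$\nabla P_\lambda(M) = \sum_{ijl} \Xi_{ijl}(M) + \lambda M. \tag{C.2}$$
From the convexity of the (smoothed) hinge loss function $\ell(\langle M, H_{ijl}\rangle)$, we obtain
$$\begin{aligned} \ell(\langle M^\star, H_{ijl}\rangle) &\ge \ell(\langle M, H_{ijl}\rangle) + \langle \Xi_{ijl}(M), M^\star - M\rangle, \\ \ell(\langle M, H_{ijl}\rangle) &\ge \ell(\langle M^\star, H_{ijl}\rangle) + \langle \Xi_{ijl}(M^\star), M - M^\star\rangle, \end{aligned}$$
for any subgradient. The addition of these two inequalities shows that
$$\langle \Xi_{ijl}(M), M^\star - M\rangle \le \langle \Xi_{ijl}(M^\star), M^\star - M\rangle. \tag{C.3}$$
Combining equations C.1 to C.3 results in
$$\Bigl\langle \sum_{ijl} \Xi_{ijl}(M) + \lambda M^\star,\ M^\star - M\Bigr\rangle \le 0 \iff \langle \nabla P_\lambda(M) - \lambda M + \lambda M^\star,\ M^\star - M\rangle \le 0.$$
By transforming this inequality based on completing the square, we obtain GB.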

Appendix D:  Proof of Theorem 3 (PGB)

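Let $Q^{GB}$ be the center of the GB hypersphere and $r_{GB}$ be the radius. The optimal solution exists in the following set:
$$\{X \mid \|X - Q^{GB}\|_F^2 \le r_{GB}^2,\ X \succeq O\}. \tag{D.1}$$
By transforming the sphere of GB, we obtain
$$\|X - Q^{GB}\|_F^2 = \|X - (Q^{GB}_+ + Q^{GB}_-)\|_F^2 = \|X - Q^{GB}_+\|_F^2 + 2\langle X, -Q^{GB}_-\rangle + 2\langle Q^{GB}_+, Q^{GB}_-\rangle + \|Q^{GB}_-\|_F^2.$$
Because $X \succeq O$ and $-Q^{GB}_- \succeq O$, we see $\langle X, -Q^{GB}_-\rangle \ge 0$. Furthermore, using $\langle Q^{GB}_+, Q^{GB}_-\rangle = 0$, we obtain the following sphere:
$$r_{GB}^2 \ge \|X - Q^{GB}\|_F^2 \ge \|X - Q^{GB}_+\|_F^2 + \|Q^{GB}_-\|_F^2 \quad\therefore\quad \|X - Q^{GB}_+\|_F^2 \le r_{GB}^2 - \|Q^{GB}_-\|_F^2.$$
Letting $Q^{PGB} := Q^{GB}_+$ and $r_{PGB}^2 := r_{GB}^2 - \|Q^{GB}_-\|_F^2$, PGB is obtained. Note that by considering $\langle X, -Q^{GB}_-\rangle \ge 0$ instead of $X \succeq O$ in equation D.1, we can immediately see that GB with the linear constraint $\langle X, -Q^{GB}_-\rangle \ge 0$ is tighter than PGB.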
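
As a numerical illustration of this construction, the following NumPy sketch derives the PGB center and squared radius from a given GB sphere. It is a sketch under stated assumptions rather than the authors' implementation: the helper names `psd_project` and `pgb_from_gb` are hypothetical, and the GB center is assumed to be a symmetric matrix.

```python
import numpy as np

def psd_project(A):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    A = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

def pgb_from_gb(Q_gb, r_gb):
    """Return the PGB center Q_+ and squared radius r_GB^2 - ||Q_-||_F^2."""
    Q_plus = psd_project(Q_gb)        # positive semidefinite part of the GB center
    Q_minus = Q_gb - Q_plus           # negative semidefinite remainder
    r2_pgb = r_gb ** 2 - np.sum(Q_minus * Q_minus)
    return Q_plus, r2_pgb
```

For example, with the GB center diag(2, -1) and radius 2, the PGB center is diag(2, 0) and the squared radius shrinks from 4 to 3; the point diag(1, 0.5), which is positive semidefinite and lies in the GB sphere, also lies in the smaller PGB sphere.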

Appendix E:  Proof of Theorem 5 (DGB)

In general, a function $f(x)$ is an $m$-strongly convex function if $f(x) - \frac{m}{2}\|x\|_2^2$ is convex. Because the objective function $P_\lambda(M)$ is a $\lambda$-strongly convex function, we obtain
$$P_\lambda(M) \ge P_\lambda(M^\star) + \langle \nabla P_\lambda(M^\star), M - M^\star\rangle + \frac{\lambda}{2}\|M - M^\star\|_F^2.$$
From the optimality condition, equation C.1, the second term on the right-hand side is greater than or equal to 0, and from the weak duality, $P_\lambda(M^\star) \ge D_\lambda(\alpha,\Gamma)$. Therefore, we obtain theorem 5.

Appendix F:  Constrained Duality Gap Bound (CDGB)

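For the DGB, we show that if the primal and dual reference solutions satisfy equation 2.4, the radius can be $\sqrt{2}$ times smaller. We extend the dual-based screening of SVM (Zimmert et al., 2015) for RTLM.

Theorem 15
(CDGB). Let $\alpha$ and $\Gamma$ be the feasible solutions of the dual problem. Then the optimal solution of the primal problem $M^\star$ exists in the following hypersphere:
$$\|M^\star - M_\lambda(\alpha,\Gamma)\|_F^2 \le G_{D_\lambda}(\alpha,\Gamma)/\lambda.$$
Proof.
Let $G_{D_\lambda}(\alpha,\Gamma) := P_\lambda(M_\lambda(\alpha,\Gamma)) - D_\lambda(\alpha,\Gamma)$ be the duality gap as a function of the dual feasible solutions $\alpha$ and $\Gamma$. The following equation is the duality gap as a function of the primal feasible solution $M$ in which the dual solutions are optimized:
$$G_{P_\lambda}(M) := \min_{0 \le \alpha \le 1,\ \Gamma \succeq O,\ M_\lambda(\alpha,\Gamma) = M} G_{D_\lambda}(\alpha,\Gamma) = P_\lambda(M) - \max_{0 \le \alpha \le 1,\ \Gamma \succeq O,\ M_\lambda(\alpha,\Gamma) = M} D_\lambda(\alpha,\Gamma).$$
From the definition, we obtain
$$G_{D_\lambda}(\alpha,\Gamma) \ge G_{P_\lambda}(M_\lambda(\alpha,\Gamma)). \tag{F.1}$$
From the strong convexity of $G_{P_\lambda}$ shown in section F.1, the following holds for any $M_\lambda(\alpha,\Gamma)$ and $M^\star \succeq O$:
$$G_{P_\lambda}(M_\lambda(\alpha,\Gamma)) \ge G_{P_\lambda}(M^\star) + \langle \nabla G_{P_\lambda}(M^\star), M_\lambda(\alpha,\Gamma) - M^\star\rangle + \lambda\|M_\lambda(\alpha,\Gamma) - M^\star\|_F^2.$$
We assume that $M^\star$ is the optimal solution of the primal problem. Then, because $M^\star$ is also a solution to the convex optimization problem $\min_{M \succeq O} G_{P_\lambda}(M)$, it becomes clear that $\langle \nabla G_{P_\lambda}(M^\star), M_\lambda(\alpha,\Gamma) - M^\star\rangle \ge 0$ from theorem 14. Considering $G_{P_\lambda}(M^\star) = 0$ and $G_{D_\lambda}(\alpha,\Gamma) \ge G_{P_\lambda}(M_\lambda(\alpha,\Gamma))$, both of which are from the definition, we obtain
$$G_{D_\lambda}(\alpha,\Gamma) \ge G_{P_\lambda}(M_\lambda(\alpha,\Gamma)) \ge \lambda\|M_\lambda(\alpha,\Gamma) - M^\star\|_F^2.$$
Dividing by $\lambda$, CDGB is derived.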

We name this bound the constrained duality gap bound (CDGB), of which the radius converges to 0 at the optimal solution, because the CDGB also has a radius proportional to the square root of the duality gap. For primal-based optimizers, additional calculation is necessary for Pλ(Mλ(α,Γ)), whereas dual-based optimizers calculate this term in the optimization process.

F.1  Proof of Strong Convexity of GPλ

We first define an m-strongly convex function as follows:

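Definition 16

($m$-Strongly Convex Function). When $f(x) - \frac{m}{2}\|x\|_2^2$ is a convex function, $f(x)$ is an $m$-strongly convex function.

According to definition 16, to show that $G_{P_\lambda}$ is strongly convex, we need to show that the terms other than $\lambda\|M\|_F^2$ are convex:
$$G_{P_\lambda}(M) = \underbrace{\sum_{ijl} \ell(\langle M, H_{ijl}\rangle)}_{\text{convex}} + \lambda\|M\|_F^2 + \underbrace{\min_{0 \le \alpha \le 1,\ \Gamma \succeq O,\ M_\lambda(\alpha,\Gamma) = M}\ \underbrace{\sum_{ijl} \ell^*(-\alpha_{ijl})}_{=:\, g(\alpha)}}_{=:\, f(M)}.$$
Because the loss $\ell$ is convex, we need to show that $f(M)$ is convex. This can be shown as follows. First,
$$f(M) = \min_{0 \le \alpha \le 1,\ \Gamma \succeq O,\ \frac{1}{\lambda}\left[\sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma\right] = M} g(\alpha) = \min_{0 \le \alpha \le 1,\ \frac{1}{\lambda}\sum_{ijl} \alpha_{ijl} H_{ijl} \preceq M} g(\alpha).$$
Consider a point $M_2 = tM_0 + (1-t)M_1$ $(t \in [0,1])$, which internally divides two points $M_0$ and $M_1$. Let
$$\alpha_i^\star := \arg\min_{0 \le \alpha \le 1,\ \frac{1}{\lambda}\sum_{ijl} \alpha_{ijl} H_{ijl} \preceq M_i} g(\alpha),$$
which means that $\alpha_i^\star$ is the minimizer of this problem for a given $M_i$ $(i \in \{0,1,2\})$, and from the definition, we see $f(M_i) = g(\alpha_i^\star)$. Further, let $\alpha_2 = t\alpha_0^\star + (1-t)\alpha_1^\star$. Then, $0 \le \alpha_2 \le 1$ and $\frac{1}{\lambda}\sum_{ijl} \alpha_{2,ijl} H_{ijl} \preceq M_2$. Because $g$ is convex owing to the convexity of $\ell^*$, we have
$$t f(M_0) + (1-t) f(M_1) = t g(\alpha_0^\star) + (1-t) g(\alpha_1^\star) \ge g(\underbrace{t\alpha_0^\star + (1-t)\alpha_1^\star}_{\alpha_2}) \ge g(\alpha_2^\star) = f(\underbrace{tM_0 + (1-t)M_1}_{M_2}).$$
Hence, $f(M)$ is convex and $G_{P_\lambda}$ is a strongly convex function.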

Appendix G:  Proof of Theorem 6 (RPB)

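The optimality condition, theorem 14, in the dual problem, Dual1, for $\lambda_0$ and $\lambda_1$ determines that
$$\begin{aligned} \nabla_\alpha D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star)^\top (\alpha_1^\star - \alpha_0^\star) + \langle \nabla_\Gamma D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star), \Gamma_1^\star - \Gamma_0^\star\rangle &\le 0, \\ \nabla_\alpha D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star)^\top (\alpha_0^\star - \alpha_1^\star) + \langle \nabla_\Gamma D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star), \Gamma_0^\star - \Gamma_1^\star\rangle &\le 0. \end{aligned}$$
By adding these two inequalities, we obtain
$$\bigl[\nabla_\alpha D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_\alpha D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star)\bigr]^\top (\alpha_1^\star - \alpha_0^\star) + \langle \nabla_\Gamma D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_\Gamma D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star), \Gamma_1^\star - \Gamma_0^\star\rangle \le 0.$$
Next, we consider the following differences of gradients:
$$\begin{aligned} \nabla_{\alpha_{ijl}} D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_{\alpha_{ijl}} D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star) &= -\gamma(\alpha^\star_{0ijl} - \alpha^\star_{1ijl}) - \langle H_{ijl}, M_0^\star - M_1^\star\rangle, \\ \nabla_\Gamma D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_\Gamma D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star) &= -(M_0^\star - M_1^\star). \end{aligned}$$
Defining $H_t^\star := \sum_{ijl} \alpha^\star_{tijl} H_{ijl}$, $M_t^\star$ is rewritten as $M_t^\star = \frac{1}{\lambda_t}[H_t^\star + \Gamma_t^\star]$. Then
$$\begin{aligned} &\gamma\|\alpha_1^\star - \alpha_0^\star\|_2^2 - \langle H_1^\star - H_0^\star, M_0^\star - M_1^\star\rangle - \langle M_0^\star - M_1^\star, \Gamma_1^\star - \Gamma_0^\star\rangle \le 0 \\ \iff\ &\gamma\|\alpha_1^\star - \alpha_0^\star\|_2^2 - \langle \lambda_1 M_1^\star - \lambda_0 M_0^\star, M_0^\star - M_1^\star\rangle \le 0 \\ \Rightarrow\ &-\langle \lambda_1 M_1^\star - \lambda_0 M_0^\star, M_0^\star - M_1^\star\rangle \le 0. \end{aligned}$$
Completing the square in this inequality yields the RPB.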

Appendix H:  Proof of Theorem 7 (RRPB)

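Considering a hypersphere that expands the RPB radius by $\frac{\lambda_0+\lambda_1}{2\lambda_1}\varepsilon$ and replaces the RPB center with $\frac{\lambda_0+\lambda_1}{2\lambda_1}M_0$, we obtain
$$\Bigl\|M_1^\star - \frac{\lambda_0+\lambda_1}{2\lambda_1}M_0\Bigr\|_F \le \frac{|\lambda_0-\lambda_1|}{2\lambda_1}\|M_0^\star\|_F + \frac{\lambda_0+\lambda_1}{2\lambda_1}\varepsilon.$$
Because $\varepsilon$ is defined by $\|M_0^\star - M_0\|_F \le \varepsilon$, this sphere covers any RPB created by $M_0^\star$, which satisfies $\|M_0^\star - M_0\|_F \le \varepsilon$ (see Figure 2d for a geometrical illustration). Using the reverse triangle inequality,
$$\|M_0^\star\|_F - \|M_0\|_F \le \|M_0^\star - M_0\|_F \le \varepsilon,$$
we obtain
$$\Bigl\|M_1^\star - \frac{\lambda_0+\lambda_1}{2\lambda_1}M_0\Bigr\|_F \le \frac{|\lambda_0-\lambda_1|}{2\lambda_1}\bigl(\|M_0\|_F + \varepsilon\bigr) + \frac{\lambda_0+\lambda_1}{2\lambda_1}\varepsilon.$$
By rearranging this, RRPB is obtained.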
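
The bound derived above is straightforward to evaluate numerically. The following Python sketch, with the hypothetical function name `rrpb_sphere`, computes the RRPB center and radius from an approximate solution M0 whose distance to the optimum is at most ε; it is an illustration of the final inequality, not the authors' code.

```python
import numpy as np

def rrpb_sphere(M0, eps, lam0, lam1):
    """Center and radius of RRPB for lam1, given an approximate solution M0 for lam0.

    Assumes ||M0_opt - M0||_F <= eps, where M0_opt is the exact solution for lam0.
    """
    scale = (lam0 + lam1) / (2.0 * lam1)
    center = scale * M0
    radius = (abs(lam0 - lam1) / (2.0 * lam1) * (np.linalg.norm(M0) + eps)
              + scale * eps)
    return center, radius
```

As a check of the covering argument, take M0 = diag(1, 0), an exact solution diag(1, 0.1) (so ε = 0.1), λ0 = 2, and λ1 = 1. The RPB built from the exact solution has radius of about 0.502 and its center is 0.15 away from the RRPB center, so it fits inside the RRPB sphere of radius 0.7.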

Appendix I:  Proof of Theorem 8 (Relationship between PGB and RPB)

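When the dual variable is used as the subgradient of the (smoothed) hinge loss at the optimal solution $M_0^\star$ of $\lambda_0$ (from equation A.7, the optimal dual variable provides a valid subgradient), the gradient of the objective function in the case of $\lambda_1$ is written as
$$\nabla P_{\lambda_1}(M_0^\star) = -H_0^\star + \lambda_1 M_0^\star,$$
where
$$H_0^\star := -\sum_{ijl} \nabla \ell(\langle M_0^\star, H_{ijl}\rangle) H_{ijl} = \sum_{ijl} \alpha^\star_{0ijl} H_{ijl}.$$
Because $\lambda_0 M_0^\star = H^\star_{0+}$,
$$\nabla P_{\lambda_1}(M_0^\star) = -(H^\star_{0+} + H^\star_{0-}) + \lambda_1 M_0^\star = (\lambda_1 - \lambda_0) M_0^\star - H^\star_{0-}.$$
Then the center and radius of GB are
$$\begin{aligned} Q^{GB} &= M_0^\star - \frac{1}{2\lambda_1}\nabla P_{\lambda_1}(M_0^\star) = \frac{(\lambda_0+\lambda_1) M_0^\star + H^\star_{0-}}{2\lambda_1}, \\ r_{GB}^2 &= \frac{\|(\lambda_1-\lambda_0) M_0^\star - H^\star_{0-}\|_F^2}{4\lambda_1^2} = \frac{\|(\lambda_1-\lambda_0) M_0^\star\|_F^2 - 2\langle (\lambda_1-\lambda_0) M_0^\star, H^\star_{0-}\rangle + \|H^\star_{0-}\|_F^2}{4\lambda_1^2} = \frac{\|(\lambda_0-\lambda_1) M_0^\star\|_F^2 + \|H^\star_{0-}\|_F^2}{4\lambda_1^2}. \end{aligned}$$
Here, the last equality for $r_{GB}^2$ uses the fact that $M_0^\star$ and $H^\star_{0-}$ are orthogonal. Using $Q^{GB}$ and $r_{GB}^2$, the center and radius of PGB are found to be
$$Q^{PGB} = Q^{GB}_+ = \frac{(\lambda_0+\lambda_1) M_0^\star}{2\lambda_1},\qquad Q^{GB}_- = \frac{H^\star_{0-}}{2\lambda_1},\qquad r_{PGB}^2 = r_{GB}^2 - \|Q^{GB}_-\|_F^2 = \frac{\|(\lambda_0-\lambda_1) M_0^\star\|_F^2}{4\lambda_1^2}.$$
Therefore, PGB coincides with RPB.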

Appendix J:  Proof of Theorem 9 (Relationship between DGB and RPB)

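At the optimal solution $M_0^\star$, $\alpha_0^\star$, and $\Gamma_0^\star$ of $\lambda_0$, we obtain the following equation from $P_{\lambda_0}(M_0^\star) = D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star)$ and $M_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) = M_0^\star$:
$$\sum_{ijl} \ell(\langle M_0^\star, H_{ijl}\rangle) + \sum_{ijl} \ell^*(-\alpha^\star_{0ijl}) = -\lambda_0\|M_0^\star\|_F^2.$$
We also see $M_{\lambda_1}(\alpha_0^\star,\Gamma_0^\star) = \frac{\lambda_0}{\lambda_1} M_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) = \frac{\lambda_0}{\lambda_1} M_0^\star$. Using these results, the value of the duality gap for $\lambda_1$ is
$$P_{\lambda_1}(M_0^\star) - D_{\lambda_1}(\alpha_0^\star,\Gamma_0^\star) = \frac{(\lambda_0-\lambda_1)^2}{2\lambda_1}\|M_0^\star\|_F^2.$$
Therefore, the radius of DGB, $r_{DGB}$, and the radius of RPB, $r_{RPB}$, satisfy the following relationship:
$$r_{DGB}^2 = \frac{2\bigl(P_{\lambda_1}(M_0^\star) - D_{\lambda_1}(\alpha_0^\star,\Gamma_0^\star)\bigr)}{\lambda_1} = \frac{(\lambda_0-\lambda_1)^2}{\lambda_1^2}\|M_0^\star\|_F^2 = 4 r_{RPB}^2.$$
Furthermore, the centers of these hyperspheres are
$$Q^{DGB} = M_0^\star,\qquad Q^{RPB} = \frac{\lambda_0+\lambda_1}{2\lambda_1} M_0^\star,$$
and the distance between the centers is
$$\|Q^{DGB} - Q^{RPB}\|_F = \frac{|\lambda_0-\lambda_1|}{2\lambda_1}\|M_0^\star\|_F = r_{RPB}.$$
Thus, the DGB includes the RPB as illustrated in Figure 2c.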
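
The inclusion can be confirmed numerically from the closed forms above. The following sketch uses the hypothetical function name `dgb_and_rpb` and an arbitrary positive semidefinite matrix as a stand-in for the exact solution at λ0; it only evaluates the formulas of this appendix and is not part of the original experiments.

```python
import numpy as np

def dgb_and_rpb(M0_opt, lam0, lam1):
    """Centers and radii of DGB and RPB built from the exact solution for lam0."""
    norm_M0 = np.linalg.norm(M0_opt)                  # Frobenius norm
    q_dgb = M0_opt
    r_dgb = abs(lam0 - lam1) / lam1 * norm_M0
    q_rpb = (lam0 + lam1) / (2.0 * lam1) * M0_opt
    r_rpb = abs(lam0 - lam1) / (2.0 * lam1) * norm_M0
    return (q_dgb, r_dgb), (q_rpb, r_rpb)
```

Because the distance between the centers equals the RPB radius and the DGB radius is twice the RPB radius, the RPB sphere touches the DGB sphere from the inside.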

Appendix K:  Proof of Theorem 10

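The Lagrange function is defined as
$$L(X,\alpha,\beta) := \langle X, H_{ijl}\rangle - \alpha\,\frac{1}{2}\bigl(r^2 - \|X - Q\|_F^2\bigr) - \beta\langle P, X\rangle.$$
From the KKT condition, we obtain
$$\partial L/\partial X = H_{ijl} + \alpha(X - Q) - \beta P = O, \tag{K.1a}$$
$$\alpha \ge 0,\quad \beta \ge 0,\quad \|X - Q\|_F^2 \le r^2,\quad \langle P, X\rangle \ge 0, \tag{K.1b}$$
$$\alpha\bigl(r^2 - \|X - Q\|_F^2\bigr) = 0,\quad \beta\langle P, X\rangle = 0. \tag{K.1c}$$
If $\alpha = 0$, then $H_{ijl} = \beta P$ from equation K.1a, and the value of the objective function becomes $\langle X, H_{ijl}\rangle = \beta\langle X, P\rangle = 0$ from equation K.1c. Let us consider the case of $\alpha \ne 0$. From equation K.1c, it becomes clear that $\|X - Q\|_F^2 = r^2$. If $\beta = 0$, the linear constraint is not an active constraint (i.e., $\langle P, X\rangle > 0$ at the optimum); hence, it is the same as problem P1, which can be analytically solved. If this solution satisfies the linear constraint $\langle P, X\rangle \ge 0$, it becomes the optimal solution. Next, we consider the case of $\beta \ne 0$. From equations K.1a and K.1c, $\alpha$ and $\beta$ are obtained as
$$\alpha = \pm\sqrt{\frac{\|P\|_F^2\|H_{ijl}\|_F^2 - \langle P, H_{ijl}\rangle^2}{r^2\|P\|_F^2 - \langle P, Q\rangle^2}},\qquad \beta = \frac{\langle P, H_{ijl}\rangle - \alpha\langle P, Q\rangle}{\|P\|_F^2}.$$
Of the two solutions for $\alpha$, $\alpha > 0$ gives the minimum value from equation K.1b.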

Appendix L:  Range-Based Extension

L.1  Generalized Form of GB, DGB, RPB, and RRPB

L.1.1  GB

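The gradient is written as
$$\nabla P_\lambda(M) = \Xi + \lambda M.$$
Then, the squared norm of this gradient is
$$\|\nabla P_\lambda(M)\|_F^2 = \|\Xi\|_F^2 + 2\lambda\langle \Xi, M\rangle + \lambda^2\|M\|_F^2.$$
By substituting this into the center and the radius of GB, we obtain
$$\begin{aligned} r_{GB}^2 &= \frac{1}{4\lambda^2}\|\nabla P_\lambda(M)\|_F^2 = \frac{1}{4\lambda^2}\bigl(\|\Xi\|_F^2 + 2\lambda\langle \Xi, M\rangle + \lambda^2\|M\|_F^2\bigr) = \frac{1}{4}\|M\|_F^2 + \frac{1}{2\lambda}\langle \Xi, M\rangle + \frac{1}{4\lambda^2}\|\Xi\|_F^2, \\ Q^{GB} &= M - \frac{1}{2\lambda}(\Xi + \lambda M) = \frac{1}{2}M - \frac{1}{2\lambda}\Xi. \end{aligned}$$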

L.1.2  DGB

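The duality gap is written as
$$\mathrm{gap} = \sum_{ijl}\bigl(\ell(\langle M, H_{ijl}\rangle) + \ell^*(-\alpha_{ijl})\bigr) + \frac{\lambda}{2}\|M\|_F^2 + \frac{1}{2\lambda}\Bigl\|\sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma\Bigr\|_F^2.$$
Then the center and radius of DGB are
$$\begin{aligned} r_{DGB}^2 &= \frac{2\,\mathrm{gap}}{\lambda} = \frac{2}{\lambda}\Bigl(\sum_{ijl}\bigl(\ell(\langle M, H_{ijl}\rangle) + \ell^*(-\alpha_{ijl})\bigr) + \frac{\lambda}{2}\|M\|_F^2 + \frac{1}{2\lambda}\Bigl\|\sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma\Bigr\|_F^2\Bigr) \\ &= \|M\|_F^2 + \frac{2}{\lambda}\sum_{ijl}\bigl(\ell(\langle M, H_{ijl}\rangle) + \ell^*(-\alpha_{ijl})\bigr) + \frac{1}{\lambda^2}\Bigl\|\sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma\Bigr\|_F^2, \\ Q^{DGB} &= M. \end{aligned}$$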

L.1.3  RPB

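With respect to RPB, we regard $\lambda_1$ as the target $\lambda$ for which we consider the range. From the definition, we see
$$Q^{RPB} = \frac{\lambda_0+\lambda}{2\lambda}M_0^\star = \frac{1}{2}M_0^\star + \frac{\lambda_0}{2\lambda}M_0^\star,\qquad r_{RPB} = \frac{\lambda_0-\lambda}{2\lambda}\|M_0^\star\|_F = -\frac{1}{2}\|M_0^\star\|_F + \frac{\lambda_0}{2\lambda}\|M_0^\star\|_F.$$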

L.1.4  RRPB

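Here again, we regard $\lambda_1$ as the target $\lambda$ for which we consider the range. First, we assume $\lambda \le \lambda_0$. Then we have
$$Q^{RRPB} = \frac{\lambda_0+\lambda}{2\lambda}M_0 = \frac{1}{2}M_0 + \frac{\lambda_0}{2\lambda}M_0,\qquad r_{RRPB} = \frac{\lambda_0-\lambda}{2\lambda}\|M_0\|_F + \frac{\lambda_0}{\lambda}\varepsilon = -\frac{1}{2}\|M_0\|_F + \frac{1}{\lambda}\Bigl(\frac{\lambda_0}{2}\|M_0\|_F + \lambda_0\varepsilon\Bigr).$$
In the case of $\lambda \ge \lambda_0$, we have
$$Q^{RRPB} = \frac{1}{2}M_0 + \frac{\lambda_0}{2\lambda}M_0,\qquad r_{RRPB} = \frac{\lambda-\lambda_0}{2\lambda}\|M_0\|_F + \varepsilon = \varepsilon + \frac{1}{2}\|M_0\|_F - \frac{\lambda_0}{2\lambda}\|M_0\|_F.$$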

L.2  Proof of Theorem 11 (Range-Based Extension of RRPB)

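In RRPB, we replace $\lambda_1$ with $\lambda$ and assume $\lambda \le \lambda_0$. Then,
$$Q^{RRPB} = \frac{\lambda_0+\lambda}{2\lambda}M_0,\qquad r_{RRPB} = \frac{\lambda_0-\lambda}{2\lambda}\|M_0\|_F + \frac{\lambda_0}{\lambda}\varepsilon.$$
From the spherical rule, equation 4.1, we obtain
$$\begin{aligned} &\frac{\lambda+\lambda_0}{2\lambda}\langle H_{ijl}, M_0\rangle - \Bigl(\frac{\lambda_0-\lambda}{2\lambda}\|M_0\|_F + \frac{\lambda_0}{\lambda}\varepsilon\Bigr)\|H_{ijl}\|_F > 1 \\ \iff\ &\lambda\,\underbrace{\bigl(\langle H_{ijl}, M_0\rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F\bigr)}_{>0\ \text{is required}} > \lambda_0\bigl(\underbrace{\|M_0\|_F\|H_{ijl}\|_F - \langle H_{ijl}, M_0\rangle}_{\ge 0} + 2\varepsilon\|H_{ijl}\|_F\bigr). \end{aligned}$$
The Cauchy-Schwarz inequality determines that the right-hand side is equal to or greater than 0; therefore, the left-hand side must be greater than 0:
$$\lambda_0 \ge \lambda > \frac{\lambda_0\bigl(\|M_0\|_F\|H_{ijl}\|_F - \langle H_{ijl}, M_0\rangle + 2\varepsilon\|H_{ijl}\|_F\bigr)}{\langle H_{ijl}, M_0\rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F}.$$
In the case of $\lambda \ge \lambda_0$,
$$Q^{RRPB} = \frac{\lambda_0+\lambda}{2\lambda}M_0,\qquad r_{RRPB} = \frac{\lambda-\lambda_0}{2\lambda}\|M_0\|_F + \varepsilon.$$
From spherical rule 4.1,
$$\begin{aligned} &\frac{\lambda+\lambda_0}{2\lambda}\langle H_{ijl}, M_0\rangle - \Bigl(\frac{\lambda-\lambda_0}{2\lambda}\|M_0\|_F + \varepsilon\Bigr)\|H_{ijl}\|_F > 1 \\ \iff\ &\lambda\bigl(\underbrace{\|H_{ijl}\|_F\|M_0\|_F - \langle H_{ijl}, M_0\rangle}_{\ge 0} + 2 + 2\varepsilon\|H_{ijl}\|_F\bigr) < \lambda_0\bigl(\|M_0\|_F\|H_{ijl}\|_F + \langle H_{ijl}, M_0\rangle\bigr). \end{aligned}$$
Similarly, the Cauchy-Schwarz inequality determines that the coefficient of $\lambda$ on the left-hand side is greater than 0:
$$\lambda_0 \le \lambda < \frac{\lambda_0\bigl(\|M_0\|_F\|H_{ijl}\|_F + \langle H_{ijl}, M_0\rangle\bigr)}{\|H_{ijl}\|_F\|M_0\|_F - \langle H_{ijl}, M_0\rangle + 2 + 2\varepsilon\|H_{ijl}\|_F}.$$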
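
The two interval endpoints can be computed once per triplet and then reused for every λ on the path. The Python sketch below illustrates this under stated assumptions: the function names `rrpb_rule_holds` and `rrpb_range` are hypothetical, the rule evaluated is the spherical rule with the RRPB center and radius of this appendix, and the inputs in the usage note are invented for illustration.

```python
import numpy as np

def rrpb_rule_holds(H, M0, eps, lam0, lam):
    """Directly evaluate <H, Q> - r ||H||_F > 1 with the RRPB sphere for lam."""
    ip = np.sum(H * M0)
    n_h, n_m = np.linalg.norm(H), np.linalg.norm(M0)
    center_ip = (lam0 + lam) / (2.0 * lam) * ip
    radius = (abs(lam0 - lam) / (2.0 * lam) * n_m
              + (abs(lam0 - lam) + lam0 + lam) / (2.0 * lam) * eps)
    return center_ip - radius * n_h > 1.0

def rrpb_range(H, M0, eps, lam0):
    """Return (lower, upper): the rule holds on (lower, lam0] and on [lam0, upper).

    An endpoint is None when the corresponding interval is empty.
    """
    ip = np.sum(H * M0)
    n_h, n_m = np.linalg.norm(H), np.linalg.norm(M0)
    lower, upper = None, None
    denom_lo = ip - 2.0 + n_h * n_m            # must be positive for lam <= lam0
    if denom_lo > 0:
        cand = lam0 * (n_m * n_h - ip + 2.0 * eps * n_h) / denom_lo
        if cand < lam0:
            lower = cand
    denom_hi = n_h * n_m - ip + 2.0 + 2.0 * eps * n_h   # always positive
    cand = lam0 * (n_m * n_h + ip) / denom_hi
    if cand > lam0:
        upper = cand
    return lower, upper
```

With H = diag(1, 0), M0 = diag(3, 0), ε = 0.1, and λ0 = 1, the sketch gives the range (0.05, 6/2.2), and the direct rule evaluation agrees on both sides of each endpoint.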

Appendix M:  Calculation for Range-Based Extension Other Than RPB and RRPB

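The spherical rule is written as
$$\langle H_{ijl}, Q\rangle - R\|H_{ijl}\|_F > 1 \ \Rightarrow\ (i,j,l) \in \mathcal{R}^\star.$$
This inequality is equivalent to the following two inequalities:
$$\bigl(\langle H_{ijl}, Q\rangle - 1\bigr)^2 > R^2\|H_{ijl}\|_F^2,\qquad \langle H_{ijl}, Q\rangle > 1.$$
By using equations 6.1 and 6.3, these inequalities can be transformed into
$$\Bigl(\Bigl\langle H_{ijl}, A + B\frac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2 > \Bigl(\sqrt{a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}} + \sqrt{\frac{2\varepsilon}{\lambda}}\Bigr)^2\|H_{ijl}\|_F^2, \tag{M.1}$$
$$\Bigl\langle H_{ijl}, A + B\frac{1}{\lambda}\Bigr\rangle > 1. \tag{M.2}$$
Note that the definitions of $a$, $b$, $c$, $A$, and $B$ for each bound are shown in section L.1. Because inequality M.2 can be written as a linear inequality of $\lambda$, we can easily obtain the range of $\lambda$ that satisfies the inequality. On the other hand, inequality M.1 is equivalent to
$$\begin{aligned} &\Bigl(\Bigl\langle H_{ijl}, A + B\frac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2 > \Bigl(a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2} + \frac{2\varepsilon}{\lambda} + 2\sqrt{a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}}\sqrt{\frac{2\varepsilon}{\lambda}}\Bigr)\|H_{ijl}\|_F^2 \\ \iff\ &\Bigl(\Bigl\langle H_{ijl}, A + B\frac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2 - \Bigl(a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2} + \frac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^2 > 2\sqrt{\Bigl(a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}\Bigr)\frac{2\varepsilon}{\lambda}}\,\|H_{ijl}\|_F^2. \end{aligned}$$
The last inequality can be transformed into the following two inequalities:
$$\Bigl[\Bigl(\Bigl\langle H_{ijl}, A + B\frac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2 - \Bigl(a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2} + \frac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^2\Bigr]^2 > 4\Bigl(a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}\Bigr)\frac{2\varepsilon}{\lambda}\|H_{ijl}\|_F^4, \tag{M.3}$$
$$\Bigl(\Bigl\langle H_{ijl}, A + B\frac{1}{\lambda}\Bigr\rangle - 1\Bigr)^2 > \Bigl(a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2} + \frac{2\varepsilon}{\lambda}\Bigr)\|H_{ijl}\|_F^2. \tag{M.4}$$
Inequality M.4 is a quadratic inequality in $1/\lambda$, for which we can obtain the range of $\lambda$ that satisfies the inequality. Although inequality M.3 is a fourth-order inequality, the range of $\lambda$ can be calculated by using a fourth-order equation solver. Then we obtain the range of $\lambda$ as the intersection of the ranges derived from inequalities M.2 to M.4.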

Appendix N:  Spherical Rule with Semidefinite Constraint for Diagonal Case

N.1  Proof of Theorem 12

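Rearranging equation 6.4a, we obtain
$$\beta_k = 2\alpha x_k + (h_{ijl,k} - 2\alpha q_k).$$
When we assume $h_{ijl,k} - 2\alpha q_k > 0$, we see
$$h_{ijl,k} - 2\alpha q_k > 0 \ \Rightarrow\ \beta_k = \underbrace{2\alpha x_k}_{\ge 0} + \underbrace{(h_{ijl,k} - 2\alpha q_k)}_{>0} > 0 \ \Rightarrow\ x_k = 0.$$
The last implication is derived from the complementary condition $\beta_k x_k = 0$. When we assume $h_{ijl,k} - 2\alpha q_k \le 0$, we have
$$h_{ijl,k} - 2\alpha q_k \le 0 \ \Rightarrow\ 2\alpha x_k - \beta_k = -(h_{ijl,k} - 2\alpha q_k) \ge 0 \ \Rightarrow\ \beta_k = 0 \ \Rightarrow\ x_k = q_k - \frac{h_{ijl,k}}{2\alpha}.$$
The third relation, $\beta_k = 0$, is derived from $x_k \ge 0$, $\beta_k \ge 0$, and the complementary condition $\beta_k x_k = 0$, and in the last relation, the assumption $\alpha > 0$ is used. Using the above two derivations, we obtain equation 6.5. Further, from the complementary condition $\alpha(r^2 - \|x - q\|_2^2) = 0$, it is clear that $\|x - q\|_2^2 = r^2$ because of the assumption $\alpha > 0$.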

N.2  Proof of Theorem 13

Because we assume that $\alpha = 0$, we obtain
$$\beta = h_{ijl}$$
by using the KKT condition, equation 6.4a. Note that this implicitly indicates that $h_{ijl} \ge 0$ should be satisfied because of the nonnegativity of $\beta$. The complementary condition $x_k\beta_k = 0$ determines that
$$x_k = 0 \quad \text{if}\ h_{ijl,k} > 0. \tag{N.1}$$
To satisfy all the KKT conditions, equation 6.4, we need to set the other $x_k$ in such a way that $\|x - q\|_2^2 \le r^2$ and $x \ge 0$ are satisfied. Note that the other conditions in equation 6.4 are satisfied for any $x$ because of the assumption $\alpha = 0$. By setting
$$x_k = \max\{q_k, 0\} \quad \text{for}\ h_{ijl,k} = 0, \tag{N.2}$$
$\|x - q\|_2^2$ is minimized under the conditions $x \ge 0$ and equation N.1, and thus the condition $\|x - q\|_2^2 \le r^2$ should be satisfied when the optimal $\alpha$ is 0.

N.3  Analytical Procedure of Rule Evaluation for the Diagonal Case

We first verify the case $\alpha = 0$. If the solution given by equations 6.8 and 6.9 satisfies all the KKT conditions, equation 6.4, then the solution is optimal. Otherwise, we consider the case $\alpha > 0$. Let $S := \{k \mid x_k > 0\}$ be the support set of $x$, where $x_k$ is defined by equation 6.5. When $S$ is regarded as a function of $\alpha$, an element of $S$ can change at an $\alpha$ that satisfies
$$h_{ijl,k} - 2\alpha q_k = 0$$
for some $k \in [d]$. Let $\alpha_1 \le \alpha_2 \le \cdots \le \alpha_{d'}$ for $d' \le d$ be a sequence of those change points that can be found by sorting
$$\Bigl\{\frac{h_{ijl,k}}{2q_k}\ \Bigm|\ \frac{h_{ijl,k}}{2q_k} > 0,\ q_k \ne 0,\ k \in [d]\Bigr\}. \tag{N.3}$$
For notational convenience, we define $\alpha_0 := 0$. Based on the definition, $S$ is fixed for any $\alpha$ in an interval $(\alpha_k, \alpha_{k+1})$, to which we refer as $S_k$. This means that the support set of the optimal $x$ should be one of $S_k$ for $k = 1, \ldots, d'$. Algorithm 3 shows an analytical procedure for calculating the optimal $x$, which verifies the optimality of each one of $S_k$ after considering the case of $\alpha = 0$. For each iterative cycle in algorithm 3, $O(d)$ computation is required, and thus the solution can be found in $O(d^2)$.
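
The change points and the piecewise-constant support set can be illustrated with a short Python sketch. It is not algorithm 3 itself, only the sorting step and the support evaluation that the algorithm relies on. The function names are hypothetical, and the expression used for x, namely x_k = max(q_k - h_k/(2α), 0) for α > 0, is an assumption that combines the two cases derived in section N.1.

```python
import numpy as np

def change_points(h, q):
    """Sorted positive values h_k / (2 q_k) over coordinates with q_k != 0 (eq. N.3)."""
    h, q = np.asarray(h, dtype=float), np.asarray(q, dtype=float)
    mask = q != 0
    ratios = h[mask] / (2.0 * q[mask])
    return np.sort(ratios[ratios > 0])

def support_set(h, q, alpha):
    """Support {k : x_k > 0} for a given alpha > 0, using x_k = max(q_k - h_k/(2 alpha), 0)."""
    h, q = np.asarray(h, dtype=float), np.asarray(q, dtype=float)
    x = np.maximum(q - h / (2.0 * alpha), 0.0)
    return set(np.flatnonzero(x > 0).tolist())
```

For h = (2, -1, 4, 1) and q = (1, 1, 1, -1), the change points are 1 and 2, and the support grows from {1} to {0, 1} to {0, 1, 2} as α crosses them (indices are zero-based), staying fixed inside each interval.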

Acknowledgments

This work was financially supported by grants from the Japanese Ministry of Education, Culture, Sports, Science and Technology awarded to I.T. (16H06538, 17H00758) and M.K. (16H06538, 17H04694); from the Japan Science and Technology Agency (JST) CREST awarded to I.T. (JPMJCR1302, JPMJCR1502) and PRESTO awarded to M.K. (JPMJPR15N2); from the Materials Research by Information Integration Initiative (MI2I) project of the Support Program for Starting Up Innovation Hub from the JST awarded to I.T. and M.K.; and from the RIKEN Center for Advanced Intelligence Project awarded to I.T.

References

Barzilai
,
J.
, &
Borwein
,
J. M.
(
1988
).
Two-point step size gradient methods
.
IMA Journal of Numerical Analysis
,
8
(
1
),
141
148
.
Bertsekas
,
D. P.
(
1999
).
Nonlinear programming
.
Belmont
:
Athena Scientific
.
Boyd
,
S.
, &
Vandenberghe
,
L.
(
2004
).
Convex optimization
.
Cambridge
:
Cambridge University Press
.
Boyd
,
S.
, &
Xiao
,
L.
(
2005
).
Least-squares covariance matrix adjustment
.
SIAM Journal on Matrix Analysis and Applications
,
27
(
2
),
532
546
.
Capitaine
,
H. L.
(
2016
).
Constraint selection in metric learning
.
arXiv:1612.04853
.
Chang
,
C.-C.
, &
Lin
,
C.-J.
(
2011
).
Libsvm: A library for support vector machines
.
ACM Transactions on Intelligent Systems and Technology
,
2
(
3
),
27
.
Chollet
,
F.
et al
, et al. (
2015
).
Keras
. https://github.com/keras-team/keras.
Davis
,
J. V.
,
Kulis
,
B.
,
Jain
,
P.
,
Sra
,
S.
, &
Dhillon
,
I. S.
(
2007
).
Information-theoretic metric learning
. In
Proceedings of the 24th International Conference on Machine Learning
(pp.
209
216
).
New York
:
ACM
.
Fercoq
,
O.
,
Gramfort
,
A.
, &
Salmon
,
J.
(
2015
).
Mind the duality gap: Safer rules for the lasso
. In
Proceedings of the 32nd International Conference on Machine Learning
, (pp.
333
342
).
Ghaoui
,
L. E.
,
Viallon
,
V.
, &
Rabbani
,
T.
(
2010
).
Safe feature elimination for the lasso and sparse supervised learning problems
.
arXiv:1009.4219
.
Hanada
,
H.
,
Shibagaki
,
A.
,
Sakuma
,
J.
, &
Takeuchi
,
I.
(
2018
).
Efficiently monitoring small data modification effect for large-scale learning in changing environment
. In
Proceedings of the 32nd AAAI Conference on Artificial Intelligence
(pp.
1314
1321
).
Palo Alto, CA
:
AAAI Press
.
Hoffer
,
E.
, &
Ailon
,
N.
(
2015
).
Deep metric learning using triplet network
. In
Proceedings of the International Workshop on Similarity-Based Pattern Recognition
(pp.
84
92
).
Berlin
:
Springer
.
Jain
,
L.
,
Mason
,
B.
, &
Nowak
,
R.
(
2017
). Learning low-dimensional metrics. In
I.
Guyon
,
U. V.
Luxburg
,
S.
Bengio
,
H.
Wallach
,
R.
Fergus
,
S.
Vishwanathan
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
,
30
(pp.
4139
4147
).
Red Hook, NY
:
Curran
.
Jain
,
P.
,
Kulis
,
B.
,
Dhillon
,
I. S.
, &
Grauman
,
K.
(
2009
). Online metric learning and fast similarity search. In
D.
Koller
,
D.
Schuurmans
,
Y.
Bengio
, &
L.
Bottou
(Eds.),
Advances in neural information processing systems
,
21
(pp.
761
768
).
Cambridge, MA
:
MIT Press
.
Jamieson
,
K. G.
, &
Nowak
,
R. D.
(
2011
).
Low-dimensional embedding using adaptively selected ordinal data
. In
Proceedings of the 2011 49th Annual Allerton Conference on Communication, Control, and Computing
(pp.
1077
1084
).
Piscataway, NJ
:
IEEE
.
Kulis
,
B.
(
2013
).
Metric learning: A survey
.
Boston
:
Now Publishers
.
Law
,
M. T.
,
Thome
,
N.
, &
Cord
,
M.
(
2013
).
Quadruplet-wise image similarity learning
. In
Proceedings of the IEEE International Conference on Computer Vision
(pp.
249
256
).
Piscataway, NJ
:
IEEE
.
Lee
,
S.
, &
Xing
,
E. P.
(
2014
).
Screening rules for overlapping group lasso
.
arXiv:1410.6880
.
Lehoucq
,
R. B.
, &
Sorensen
,
D. C.
(
1996
).
Deflation techniques for an implicitly restarted Arnoldi iteration
.
SIAM Journal on Matrix Analysis and Applications
,
17
(
4
),
789
821
.
Li
,
D.
, &
Tian
,
Y.
(
2018
).
Survey and experimental study on metric learning methods
.
Neural Networks
,
105
,
447
462
.
Liu
,
J.
,
Zhao
,
Z.
,
Wang
,
J.
, &
Ye
,
J.
(
2014
).
Safe screening with variational inequalities and its application to lasso
. In
Proceedings of the International Conference on Machine Learning
(pp.
289
297
).
Malick
,
J.
(
2004
).
A dual approach to semidefinite least-squares problems
.
SIAM Journal on Matrix Analysis and Applications
,
26
(
1
),
272
284
.
McFee
,
B.
, &
Lanckriet
,
G. R.
(
2010
).
Metric learning to rank
. In
Proceedings of the 27th International Conference on Machine Learning
(pp.
775
782
).
Madison, WI
:
Omnipress
.
Nakagawa
,
K.
,
Suzumura
,
S.
,
Karasuyama
,
M.
,
Tsuda
,
K.
, &
Takeuchi
,
I.
(
2016
).
Safe pattern pruning: An efficient approach for predictive pattern mining
. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(pp.
1785
1794
).
New York
:
ACM
.
Ndiaye
,
E.
,
Fercoq
,
O.
,
Gramfort
,
A.
, &
Salmon
,
J.
(
2016
). Gap safe screening rules for sparse-group lasso. In
D. D.
Lee
,
M.
Sugiyama
,
U. V.
Luxburg
,
I.
Guyon
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
,
29
(pp.
388
396
).
Red Hook, NY
:
Curran
.
Ogawa
,
K.
,
Suzuki
,
Y.
, &
Takeuchi
,
I.
(
2013
).
Safe screening of non-support vectors in pathwise SVM computation
. In
Proceedings of the 30th International Conference on Machine Learning
(pp.
1382
1390
).
Okumura
,
S.
,
Suzuki
,
Y.
, &
Takeuchi
,
I.
(
2015
).
Quick sensitivity analysis for incremental data modification and its application to leave-one-out CV in linear classification problems
. In
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(pp.
885
894
).
New York
:
ACM
.
Perrot, M., & Habrard, A. (2015). Regressive virtual metric learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1810–1818). Red Hook, NY: Curran.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815–823). Piscataway, NJ: IEEE.
Schultz, M., & Joachims, T. (2004). Learning a distance metric from relative comparisons. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 41–48). Cambridge, MA: MIT Press.
Shen, C., Kim, J., Liu, F., Wang, L., & Van Den Hengel, A. (2014). Efficient dual approach to distance metric learning. IEEE Transactions on Neural Networks and Learning Systems, 25(2), 394–406.
Shi, Y., Bellet, A., & Sha, F. (2014). Sparse compositional metric learning. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
Shibagaki, A., Karasuyama, M., Hatano, K., & Takeuchi, I. (2016). Simultaneous safe screening of features and samples in doubly sparse modeling. In Proceedings of the 33rd International Conference on Machine Learning (pp. 1577–1586).
Shibagaki, A., Suzuki, Y., Karasuyama, M., & Takeuchi, I. (2015). Regularization path of cross-validation error lower bounds. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1675–1683). Red Hook, NY: Curran.
Takada, T., Hanada, H., Yamada, Y., Sakuma, J., & Takeuchi, I. (2016). Secure approximation guarantee for cryptographically private empirical risk minimization. In Proceedings of the 8th Asian Conference on Machine Learning (pp. 126–141).
Wang, J., Wonka, P., & Ye, J. (2014). Scaling SVM and least absolute deviations via exact data reduction. In Proceedings of the International Conference on Machine Learning (pp. 523–531).
Wang, J., Zhou, J., Wonka, P., & Ye, J. (2013). Lasso screening rules via dual polytope projection. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 1070–1078). Red Hook, NY: Curran.
Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, 207–244.
Xiang, Z. J., Wang, Y., & Ramadge, P. J. (2017). Screening tests for lasso problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5), 1008–1027.
Xing, E. P., Jordan, M. I., Russell, S. J., & Ng, A. Y. (2003). Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 521–528). Cambridge, MA: MIT Press.
Yang, H. (1993). Conjugate gradient methods for the Rayleigh quotient minimization of generalized eigenvalue problems. Computing, 51(1), 79–94.
Zhang, W., Hong, B., Liu, W., Ye, J., Cai, D., He, X., & Wang, J. (2016). Scaling up sparse support vector machines by simultaneous feature and sample reduction. arXiv:1607.06996.
Zhou, Q., & Zhao, Q. (2015). Safe subspace screening for nuclear norm regularized least squares problems. In Proceedings of the International Conference on Machine Learning (pp. 1103–1112).
Zimmert, J., de Witt, C. S., Kerg, G., & Kloft, M. (2015). Safe screening for support vector machines. In NIPS 2015 workshop on optimization in machine learning. Red Hook, NY: Curran.