Distance metric learning has been widely used to obtain the optimal distance function based on the given training data. We focus on a triplet-based loss function, which imposes a penalty such that a pair of instances in the same class is closer than a pair in different classes. However, the number of possible triplets can be quite large even for a small data set, and this considerably increases the computational cost for metric optimization. In this letter, we propose safe triplet screening that identifies triplets that can be safely removed from the optimization problem without losing the optimality. In comparison with existing safe screening studies, triplet screening is particularly significant because of the huge number of possible triplets and the semidefinite constraint in the optimization problem. We demonstrate and verify the effectiveness of our screening rules by using several benchmark data sets.

Using an appropriate distance function is essential for various machine learning tasks. For example, the performance of a $k$-nearest neighbor ($k$-NN) classifier, one of the most standard classification methods, depends crucially on the distance between different input instances. The simple Euclidean distance is usually employed, but it is not necessarily optimal for a given data set and task. Thus, the adaptive optimization of the distance metric based on supervised information is expected to improve the performance of machine learning methods including $k$-NN.

Distance metric learning (Weinberger & Saul, 2009; Schultz & Joachims, 2004; Davis, Kulis, Jain, Sra, & Dhillon, 2007; Kulis, 2013) is a widely accepted technique for acquiring the optimal metric from observed data. The standard problem setting is to learn the following parameterized Mahalanobis distance,
$d_M(x_i,x_j):=\sqrt{(x_i-x_j)^\top M(x_i-x_j)},$
where $x_i$ and $x_j$ are $d$-dimensional feature vectors and $M\in\mathbb{R}^{d\times d}$ is a positive semidefinite matrix. This approach has been applied to tasks such as classification (Weinberger & Saul, 2009), clustering (Xing, Jordan, Russell, & Ng, 2003), and ranking (McFee & Lanckriet, 2010). These studies show that the optimized distance metric improves the prediction performance of each task. Metric optimization has also attracted wide interest, even from researchers engaged in recent deep network studies (Schroff, Kalenichenko, & Philbin, 2015; Hoffer & Ailon, 2015).
The seminal work of distance metric learning (Weinberger & Saul, 2009) presents a triplet-based formulation. A triplet $(i,j,l)$ is defined by the pair $x_i$ and $x_j$, which have the same label (same class), and $x_l$, which has a different label (different class). For a triplet $(i,j,l)$, the desirable metric would satisfy $d_M(x_i,x_j) < d_M(x_i,x_l)$, meaning that the pair in the same class is closer than the pair in different classes. For each of the triplets, Weinberger and Saul (2009) define a loss function that penalizes violations of this constraint,
$\ell\left(d_M^2(x_i,x_l)-d_M^2(x_i,x_j)\right),\quad\text{for }(i,j,l)\in T,$
where $T$ is a set of triplets and $\ell:\mathbb{R}\to\mathbb{R}$ is some loss function (e.g., the standard hinge loss function). In addition to the triplet loss, other approaches, such as pairwise- and virtual point-based loss functions, have been studied. In the pairwise approach, the number of pairs can be much smaller than the number of triplets; Davis et al. (2007) used only $20c^2$ pairs, where $c$ is the number of classes. The virtual point approach (Perrot & Habrard, 2015) converts the metric learning problem into a least squares problem, which minimizes a loss function for $n$ virtual points. We particularly focus on the triplet approach because the relative evaluation $d_M(x_i,x_j) < d_M(x_i,x_l)$ would be more appropriate for many metric learning applications, such as nearest-neighbor classification (Weinberger & Saul, 2009) and similarity search (Jain, Kulis, Dhillon, & Grauman, 2009), in which relative comparison among objects plays an essential role. In fact, a recent comprehensive survey (Li & Tian, 2018) showed that many current state-of-the-art methods are based on triplet loss, which they referred to as relative loss. The quadruplet approach (Law, Thome, & Cord, 2013) can also incorporate higher-order relations, but we mainly focus on the triplet approach because it is much more popular in the community. Our framework can nevertheless accommodate the quadruplet case, as we explain in section 6.3.

However, the set of triplets $T$ is quite large even for a small data set. For example, in a two-class problem with 100 instances in each class, the number of possible triplets is 1,980,000. Because processing a huge number of triplets is computationally prohibitive, a small subset is often used in practice (Weinberger & Saul, 2009; Shi, Bellet, & Sha, 2014; Capitaine, 2016). Typically, a subset of triplets is selected by using the neighbors of each training instance. For $n$ training instances, Shi et al. (2014) selected only $30n$ triplets, and Weinberger and Saul (2009) selected at most $O(kn^2)$ triplets, where $k$ is a prespecified constant. However, the effect of these heuristic selections on the final accuracy is difficult to know beforehand. Jain, Mason, and Nowak (2017) theoretically analyzed a probabilistic generalization error bound for a random subsampling strategy of triplets. Their analysis revealed the sample complexity of metric learning, but the tightness of the bound is not clear, and they did not demonstrate how to determine the required number of triplets in practice. For ordinal data embedding, Jamieson and Nowak (2011) showed a lower bound of $\Omega(dn\log n)$ on the number of triplets required to determine the embedding, but the tightness of this bound is also not known. Further, the applicability of the analysis to metric learning was not clarified.
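The triplet count quoted above can be checked with a short sketch. It assumes triplets are ordered index tuples $(i,j,l)$ in which $i\neq j$ share a label and $l$ has a different label (the definition used later in section 2.1); the function name `count_triplets` is ours, introduced only for illustration.

```python
from collections import Counter

def count_triplets(labels):
    """Number of ordered triplets (i, j, l) with y_i == y_j, i != j, y_l != y_i."""
    n = len(labels)
    # A class of size c contributes c * (c - 1) ordered same-class pairs,
    # and each pair can be combined with any of the n - c other-class instances.
    return sum(c * (c - 1) * (n - c) for c in Counter(labels).values())

labels = [0] * 100 + [1] * 100      # two classes, 100 instances each
print(count_triplets(labels))       # 1980000
```

The count grows cubically in the class sizes, which is why even a data set of 200 instances already yields about two million triplets.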

Our safe triplet screening enables the identification of triplets that can be safely removed from the optimization problem without losing the optimality of the resulting metric. This means that our approach can accelerate the optimization of time-consuming metric learning with the guarantee of optimality. Figure 1 shows a schematic illustration of safe triplet screening.

Figure 1:

Metric learning with safe triplet screening. The naive optimization needs to minimize the sum of the loss function values for a huge number of triplets $(i,j,l)$. Safe triplet screening identifies subsets of $L^\star$ (blue points in the illustration on the right) and $R^\star$ (green points in the illustration on the right), which correspond to the regions of the loss function in which each triplet lies at the optimal $M^\star$. This reduces the number of triplets in the optimization problem.

Our approach is inspired by the safe feature screening of Lasso (Ghaoui, Viallon, & Rabbani, 2010), in which unnecessary features are identified by the following procedure:

• Step 1: Construct a bounded region in which the optimal dual solution is guaranteed to exist.

• Step 2: Given the bound created by step 1, remove features that cannot be selected by Lasso.

This procedure is useful to mitigate the optimization difficulty of Lasso for high-dimensional problems; thus, many papers propose a variety of approaches to create bounded regions for obtaining a tighter bound that increases screening performance (Wang, Zhou, Wonka, & Ye, 2013; Liu, Zhao, Wang, & Ye, 2014; Fercoq, Gramfort, & Salmon, 2015; Xiang, Wang, & Ramadge, 2017). As another direction of research, the screening idea was applied to other learning methods, including support vector machine nonsupport vector screening (Ogawa, Suzuki, & Takeuchi, 2013), nuclear norm regularization subspace screening (Zhou & Zhao, 2015), and group Lasso group screening (Ndiaye, Fercoq, Gramfort, & Salmon, 2016).

Based on the safe feature screening techniques, we build the procedure of our safe triplet screening as follows:

• Step 1: Construct a bounded region in which the optimal solution $M★$ is guaranteed to exist.

• Step 2: For each triplet $(i,j,l)∈T$, verify the possible loss function value under the condition created by step 1.

We show that as a result of step 2, we can reduce the size of the metric learning optimization problem, by which the computational cost of the optimization can be drastically reduced. Although a variety of extensions of safe screening have been studied in the machine learning community (Lee & Xing, 2014; Wang, Wonka, & Ye, 2014; Zimmert, de Witt, Kerg, & Kloft, 2015; Zhang et al., 2016; Ogawa et al., 2013; Okumura, Suzuki, & Takeuchi, 2015; Shibagaki, Karasuyama, Hatano, & Takeuchi, 2016; Shibagaki, Suzuki, Karasuyama, & Takeuchi, 2015; Nakagawa, Suzumura, Karasuyama, Tsuda, & Takeuchi, 2016; Takada, Hanada, Yamada, Sakuma, & Takeuchi, 2016; Hanada, Shibagaki, Sakuma, & Takeuchi, 2018), to the best of our knowledge, no studies have considered screening for metric learning. Compared with existing studies, our safe triplet screening is particularly significant due to the huge number of possible triplets and the semidefinite constraint. Our technical contributions are summarized as follows:

• We derive six spherical regions in which the optimal $M★$ must lie and analyze their relationships.

• We derive three types of screening rules, each of which employs a different approach to the semidefinite constraint.

• We derive efficient rule evaluation for a special case when $M$ is a diagonal matrix.

• We build an extension for the regularization path calculation.

We further demonstrate the effectiveness of our approach based on several benchmark data sets with a huge number of triplets.

This letter is organized as follows. In section 2, we define the optimization problem of large-margin metric learning. In section 3, we first derive six bounds containing optimal $M★$ for the subsequent screening procedure. Section 4 derives the rules and constructs our safe triplet screening. The computational cost for the rule evaluation is analyzed in section 5. Extensions are discussed in section 6, in which an algorithm specifically designed for the regularization path calculation, and a special case, in which $M$ is a diagonal matrix, are considered. In section 7, we present the evaluation of our approach through numerical experiments. Section 8 concludes.

1.1  Notation

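We denote by $[n]$ the set $\{1,2,\ldots,n\}$ for any integer $n\in\mathbb{N}$. The inner product of matrices is denoted by $\langle A,B\rangle:=\sum_{ij}A_{ij}B_{ij}=\mathrm{tr}(A^\top B)$. The squared Frobenius norm is represented by $\|A\|_F^2:=\langle A,A\rangle$. A positive semidefinite matrix $M\in\mathbb{R}^{d\times d}$ is denoted by $M\succeq O$ or $M\in\mathbb{R}_+^{d\times d}$. By using the eigenvalue decomposition of matrix $M=V\Lambda V^\top$, matrices $M_+$ and $M_-$ are defined as follows,
$M=V\underbrace{(\Lambda_++\Lambda_-)}_{\Lambda}V^\top=\underbrace{V\Lambda_+V^\top}_{:=M_+}+\underbrace{V\Lambda_-V^\top}_{:=M_-},$
where $\Lambda_+$ and $\Lambda_-$ are constructed only by the positive and negative components of the diagonal matrix $\Lambda$. Note that $\langle M_+,M_-\rangle=\mathrm{tr}(V\Lambda_+V^\top V\Lambda_-V^\top)=\mathrm{tr}(VOV^\top)=0$, and $M_+$ is the projection of $M$ onto the semidefinite cone, that is, $M_+=\arg\min_{A\succeq O}\|A-M\|_F^2$.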
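As a minimal NumPy sketch of this decomposition (the helper name `split_psd` is ours, not part of the letter), the two parts can be obtained from a single symmetric eigendecomposition:

```python
import numpy as np

def split_psd(M):
    """Split a symmetric matrix into M_plus (PSD part) and M_minus (NSD part)."""
    eigvals, V = np.linalg.eigh(M)                     # M = V diag(eigvals) V^T
    M_plus = (V * np.maximum(eigvals, 0.0)) @ V.T      # keep positive eigenvalues
    M_minus = (V * np.minimum(eigvals, 0.0)) @ V.T     # keep negative eigenvalues
    return M_plus, M_minus

M = np.array([[1.0, 2.0], [2.0, 1.0]])                 # eigenvalues 3 and -1
M_plus, M_minus = split_psd(M)
print(M_plus)                  # approximately [[1.5, 1.5], [1.5, 1.5]]
print(np.sum(M_plus * M_minus))  # inner product <M_plus, M_minus>, approximately 0
```

Here `M_plus` is the Frobenius-norm projection of `M` onto the semidefinite cone, and the two parts are orthogonal with respect to the matrix inner product.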
Let $\{(x_i,y_i)\mid i\in[n]\}$ be $n$ pairs of a $d$-dimensional feature vector $x_i\in\mathbb{R}^d$ and a label $y_i\in Y$, where $Y$ is a discrete label space. We consider learning the following Mahalanobis distance,
$d_M(x_i,x_j):=\sqrt{(x_i-x_j)^\top M(x_i-x_j)},$
(2.1)
where $M\in\mathbb{R}_+^{d\times d}$ is a positive semidefinite matrix that parameterizes distance. As a general form of the metric learning problem, we consider a regularized triplet loss minimization (RTLM) problem. Our formulation is mainly based on a model originally proposed by Weinberger and Saul (2009), which is reduced to a convex optimization problem with the semidefinite constraint. For later analysis, we derive primal and dual formulations, and to discuss the optimality of the learned metric, we focus on the convex formulation of RTLM in this letter.

2.1  Triplet-Based Loss Function

We define the set of triplets of instances as
$T=\{(i,j,l)\mid(i,j)\in S,\ y_i\neq y_l,\ l\in[n]\},$
where $S=\{(i,j)\mid y_i=y_j,\ i\neq j,\ (i,j)\in[n]\times[n]\}$. The set $S$ contains index pairs from the same class, and $T$ contains triplets of indices consisting of $(i,j)\in S$ and $l$, which is in a class that differs from that of $i$ and $j$. We refer to the following loss as the triplet loss:
$\ell\left(d_M^2(x_i,x_l)-d_M^2(x_i,x_j)\right),\quad\text{for }(i,j,l)\in T,$
where $\ell:\mathbb{R}\to\mathbb{R}$ is some loss function. By substituting equation 2.1 into the triplet loss, this can be written as
$\ell(\langle M,H_{ijl}\rangle),$
where $H_{ijl}:=(x_i-x_l)(x_i-x_l)^\top-(x_i-x_j)(x_i-x_j)^\top$. For the triplet loss, we consider the hinge function,
$\ell(x)=\max\{0,1-x\},$
(2.2)
or the smoothed hinge function,
$\ell(x)=\begin{cases}0,&x>1,\\ \frac{1}{2\gamma}(1-x)^2,&1-\gamma\le x\le 1,\\ 1-x-\frac{\gamma}{2},&x<1-\gamma,\end{cases}$
(2.3)
where $\gamma>0$ is a parameter. Note that the smoothed hinge includes the hinge function as a special case ($\gamma\to 0$). The triplet loss imposes a penalty if the same-class pair $(i,j)\in S$ is not closer, by at least the threshold, than the pair $i$ and $l$, which are in different classes. Both loss functions contain a region in which no penalty is imposed. We refer to this as the zero region. The two loss functions also contain a region in which the penalty increases linearly, which we refer to as the linear region.
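The two loss functions in equations 2.2 and 2.3 can be written down directly. The following NumPy sketch uses function names of our own choosing and is meant only to make the zero, quadratic, and linear regions concrete.

```python
import numpy as np

def hinge(x):
    """Hinge loss of equation 2.2."""
    return np.maximum(0.0, 1.0 - np.asarray(x, dtype=float))

def smoothed_hinge(x, gamma):
    """Smoothed hinge loss of equation 2.3 (gamma > 0)."""
    x = np.asarray(x, dtype=float)
    quadratic = (1.0 - x) ** 2 / (2.0 * gamma)   # used for 1 - gamma <= x <= 1
    linear = 1.0 - x - gamma / 2.0               # used for x < 1 - gamma (linear region)
    return np.where(x > 1.0, 0.0, np.where(x >= 1.0 - gamma, quadratic, linear))

xs = np.array([2.0, 0.75, 0.5, 0.0])
print(smoothed_hinge(xs, gamma=0.5))   # [0.     0.0625 0.25   0.75  ]
```

The quadratic and linear pieces agree at $x=1-\gamma$ (both equal $\gamma/2$), and as $\gamma$ shrinks the smoothed hinge approaches the hinge.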

2.2  Primal and Dual Formulation of Triplet-Based Distance Metric Learning

Using the standard squared regularization, we consider the following RTLM as a general form of metric learning:
$\min_{M\succeq O}P_\lambda(M):=\sum_{ijl}\ell(\langle M,H_{ijl}\rangle)+\frac{\lambda}{2}\|M\|_F^2,$
(Primal)
where $\sum_{ijl}$ denotes $\sum_{(i,j,l)\in T}$, and $\lambda>0$ is a regularization parameter. In section 6.3, we discuss the relation of RTLM to existing metric learning methods.
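A naive evaluation of the primal objective can be sketched as follows, assuming the smoothed hinge loss; the helper names `triplet_matrix` and `primal_objective` and the toy data are ours. Each triplet costs $O(d^2)$ here because $H_{ijl}$ is formed explicitly.

```python
import numpy as np

def triplet_matrix(X, i, j, l):
    """H_ijl = (x_i - x_l)(x_i - x_l)^T - (x_i - x_j)(x_i - x_j)^T."""
    a, b = X[i] - X[l], X[i] - X[j]
    return np.outer(a, a) - np.outer(b, b)

def smoothed_hinge(x, gamma):
    return np.where(x > 1.0, 0.0,
                    np.where(x >= 1.0 - gamma, (1.0 - x) ** 2 / (2.0 * gamma),
                             1.0 - x - gamma / 2.0))

def primal_objective(M, X, triplets, lam, gamma):
    """P_lambda(M) = sum of triplet losses + (lambda / 2) * ||M||_F^2."""
    margins = np.array([np.sum(M * triplet_matrix(X, i, j, l)) for i, j, l in triplets])
    return smoothed_hinge(margins, gamma).sum() + 0.5 * lam * np.sum(M * M)

# Toy data (hypothetical): instances 0 and 1 share a class, instance 2 does not.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
triplets = [(0, 1, 2), (1, 0, 2)]
print(primal_objective(0.1 * np.eye(2), X, triplets, lam=1.0, gamma=0.5))  # about 0.81
```

With $M=0.1I$ both margins $\langle M,H_{ijl}\rangle$ (0.3 and 0.4) fall in the linear region, which gives a loss of 0.8 plus a regularization term of 0.01.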
The dual problem is written as
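$\max_{0\le\alpha\le 1,\ \Gamma\succeq O}D_\lambda(\alpha,\Gamma):=-\frac{\gamma}{2}\|\alpha\|_2^2+\alpha^\top\mathbf{1}-\frac{\lambda}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2,$
(Dual1)
where $\alpha\in\mathbb{R}^{|T|}$, which contains $\alpha_{ijl}$ for $(i,j,l)\in T$, and $\Gamma\in\mathbb{R}^{d\times d}$ are dual variables, and
$M_\lambda(\alpha,\Gamma):=\frac{1}{\lambda}\left[\sum_{ijl}\alpha_{ijl}H_{ijl}+\Gamma\right].$
(2.4)
A derivation of this dual problem is presented in appendix A. Because the last term $\max_{\Gamma\succeq O}-\frac{1}{2}\|M_\lambda(\alpha,\Gamma)\|_F^2$ is equivalent to the projection onto a semidefinite cone (Boyd & Xiao, 2005; Malick, 2004), the above problem, Dual1, can be simplified as
$\max_{0\le\alpha\le 1}D_\lambda(\alpha):=-\frac{\gamma}{2}\|\alpha\|_2^2+\alpha^\top\mathbf{1}-\frac{\lambda}{2}\|M_\lambda(\alpha)\|_F^2,$
(Dual2)
where
$M_\lambda(\alpha):=\frac{1}{\lambda}\left[\sum_{ijl}\alpha_{ijl}H_{ijl}\right]_+.$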
For the optimal $M★$, each of the triplets in $T$ can be categorized into three groups:
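$\begin{aligned}R^\star&:=\{(i,j,l)\in T\mid\langle H_{ijl},M^\star\rangle>1\},\\ C^\star&:=\{(i,j,l)\in T\mid 1-\gamma\le\langle H_{ijl},M^\star\rangle\le 1\},\\ L^\star&:=\{(i,j,l)\in T\mid\langle H_{ijl},M^\star\rangle<1-\gamma\}.\end{aligned}$
(2.5)
This indicates that the triplets in $R^\star$ and those in $L^\star$ are in the zero region and the linear region of the loss function, respectively. The well-known KKT conditions provide the following relation between the optimal dual variable and the derivative of the loss function (see appendix A for details):
$\alpha_{ijl}^\star=-\nabla\ell(\langle M^\star,H_{ijl}\rangle).$
(2.6)
In the case of hinge loss, the derivative is written as
$\nabla\ell(\langle M^\star,H_{ijl}\rangle)=\begin{cases}0,&\langle M^\star,H_{ijl}\rangle>1,\\ -c,&\langle M^\star,H_{ijl}\rangle=1,\\ -1,&\langle M^\star,H_{ijl}\rangle<1,\end{cases}$
where $c$ is any value in $[0,1]$. In the case of smoothed hinge loss, the derivative is
$\nabla\ell(\langle M^\star,H_{ijl}\rangle)=\begin{cases}0,&\langle M^\star,H_{ijl}\rangle>1,\\ -\frac{1}{\gamma}(1-\langle M^\star,H_{ijl}\rangle),&1-\gamma\le\langle M^\star,H_{ijl}\rangle\le 1,\\ -1,&\langle M^\star,H_{ijl}\rangle<1-\gamma.\end{cases}$
Both cases can be represented as
$\nabla\ell(\langle M^\star,H_{ijl}\rangle)\begin{cases}=0,&(i,j,l)\in R^\star,\\ \in-[0,1],&(i,j,l)\in C^\star,\\ =-1,&(i,j,l)\in L^\star.\end{cases}$
(2.7)
From equations 2.7 and 2.6, we obtain the following rules for the optimal dual variable:
$\begin{aligned}(i,j,l)\in R^\star&\Rightarrow\alpha_{ijl}^\star=0,\\ (i,j,l)\in C^\star&\Rightarrow\alpha_{ijl}^\star\in[0,1],\\ (i,j,l)\in L^\star&\Rightarrow\alpha_{ijl}^\star=1.\end{aligned}$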
(2.8)

The nonlinear semidefinite programming problem of RTLM can be solved by gradient methods, including primal-based (Weinberger & Saul, 2009) and dual-based approaches (Shen, Kim, Liu, Wang, & Van Den Hengel, 2014). However, the amount of computation may be prohibitive because of the large number of triplets. The naive calculation of the objective function requires $O(d^2|T|)$ computations for both the primal and the dual cases.

2.3  Reduced-Size Optimization Problem

Assume that we have identified a subset of the triplets in $L^\star\cup R^\star$ before solving the optimization problem. Let $\hat{L}\subseteq L^\star$ and $\hat{R}\subseteq R^\star$ be the identified subsets of $L^\star$ and $R^\star$. Then, based on this prior knowledge, the optimization problem, Primal, can be transformed into the following reduced-size problem:
$\tilde{P}_\lambda(M)=\sum_{(i,j,l)\in\tilde{T}}\ell(\langle M,H_{ijl}\rangle)+\frac{\lambda}{2}\|M\|_F^2+\left(1-\frac{\gamma}{2}\right)|\hat{L}|-\left\langle M,\sum_{(i,j,l)\in\hat{L}}H_{ijl}\right\rangle,$
(2.9)
where $\tilde{T}:=T-\hat{L}-\hat{R}$. This problem differs from the original, Primal, as follows:
• The loss term for $\hat{R}$ is removed because it does not produce any penalty at the optimal solution.

• The loss term for $\hat{L}$ is fixed at the linear part of the loss function, so the sum over these triplets can be calculated beforehand (the last two terms).

The dual problem of this reduced-size problem can be written as
$\max_{0\le\alpha\le 1}\tilde{D}_\lambda(\alpha):=-\frac{\gamma}{2}\|\alpha\|_2^2+\alpha^\top\mathbf{1}-\frac{\lambda}{2}\|M_\lambda(\alpha)\|_F^2,\quad\text{s.t. }\alpha_{\hat{L}}=\mathbf{1},\ \alpha_{\hat{R}}=\mathbf{0},$
(2.10)
which is the same optimization problem as Dual2 except that $\alpha_{\hat{L}}$ and $\alpha_{\hat{R}}$ are fixed. Because of this constraint, the number of free variables in this dual problem is $|\tilde{T}|$. An important property of a reduced-size problem is that it retains the same optimal solution as the original problem:
Lemma 1.

The primal-dual problem pair, equations 2.9 and 2.10, and the original problem pair, Primal and Dual2, have the same optimal primal and dual solutions.

The proof of this lemma is shown in appendix B, along with the derivation of the reduced-size dual, equation 2.10. Therefore, if large subsets $\hat{L}$ and $\hat{R}$ could be detected beforehand (i.e., $|\tilde{T}|\ll|T|$), the metric learning optimization would be accelerated dramatically.

As we will see, our safe triplet screening is derived by using a spherical region that contains the optimal $M★$. In this section, we show that six variants of the regions are created by three types of different approaches. Note that the proofs for all the theorems appear in the appendixes.

We first introduce a hypersphere, which we name gradient bound (GB), because the center and radius of the hypersphere are represented by the subgradient of the objective function:

Theorem 1
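(GB). Given any feasible solution $M\succeq O$, the optimal solution $M^\star$ for $\lambda$ exists in the following hypersphere:
$\left\|M^\star-Q^{GB}(M)\right\|_F^2\le\left\|\frac{1}{2\lambda}\nabla P_\lambda(M)\right\|_F^2,$
where $Q^{GB}(M):=M-\frac{1}{2\lambda}\nabla P_\lambda(M)$.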

The proof is in appendix C. This theorem is an extension of the sphere for SVM (Shibagaki et al., 2015), which can be treated as a simple unconstrained problem.

Even when we substitute the optimal $M^\star$ into the reference solution $M$, the radius of the GB is not guaranteed to be 0. By projecting the center of GB onto the feasible region (i.e., a semidefinite cone), another GB-based hypersphere can be derived, which has a radius converging to 0 at the optimal solution. We refer to this extension as projected gradient bound (PGB); a schematic illustration is shown in Figure 2a. In Figure 2a, the center of the GB, $Q^{GB}$ (the abbreviation of $Q^{GB}(M)$), is projected onto the semidefinite cone, which becomes the center of the PGB, $Q_+^{GB}$. The sphere of the PGB is given by the following theorem.

Figure 2:

Illustrations of spherical bounds.

Theorem 2
(PGB). Given any feasible solution $M\succeq O$, the optimal solution $M^\star$ for $\lambda$ exists in the following hypersphere:
$\left\|M^\star-\left[Q^{GB}(M)\right]_+\right\|_F^2\le\left\|\frac{1}{2\lambda}\nabla P_\lambda(M)\right\|_F^2-\left\|\left[Q^{GB}(M)\right]_-\right\|_F^2.$

The proof is in appendix D. PGB contains the projections onto the positive and the negative semidefinite cone in the center and the radius, respectively. These projections require the eigenvalue decomposition of $M-\frac{1}{2\lambda}\nabla P_\lambda(M)$. This decomposition, however, only needs to be performed once to evaluate the screening rules of all the triplets. In the standard optimization procedures of RTLM, including Weinberger and Saul (2009), the eigenvalue decomposition of the $d\times d$ matrix is calculated in every iterative cycle, and thus, the computational complexity is not increased by PGB.
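The GB and PGB can both be computed from one gradient and one eigendecomposition. The sketch below assumes the smoothed hinge loss, so that the gradient is unique; the function names and the four-point toy data set are hypothetical and serve only as an illustration.

```python
import numpy as np

def loss_grad(x, gamma):
    """Derivative of the smoothed hinge loss (equation 2.3)."""
    return np.where(x > 1.0, 0.0,
                    np.where(x >= 1.0 - gamma, -(1.0 - x) / gamma, -1.0))

def primal_gradient(M, Hs, lam, gamma):
    """Gradient of P_lambda at M; Hs stacks all H_ijl with shape (|T|, d, d)."""
    margins = np.einsum("tij,ij->t", Hs, M)
    return np.einsum("t,tij->ij", loss_grad(margins, gamma), Hs) + lam * M

def split_psd(A):
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T, (V * np.minimum(w, 0.0)) @ V.T

def gb_and_pgb(M, Hs, lam, gamma):
    """Return (center, radius) of the GB and of the PGB for a feasible reference M."""
    G = primal_gradient(M, Hs, lam, gamma)
    Q = M - G / (2.0 * lam)                       # GB center
    r_gb = np.linalg.norm(G) / (2.0 * lam)        # GB radius (Frobenius norm)
    Q_plus, Q_minus = split_psd(Q)                # one eigendecomposition for all triplets
    r_pgb = np.sqrt(max(r_gb ** 2 - np.linalg.norm(Q_minus) ** 2, 0.0))
    return (Q, r_gb), (Q_plus, r_pgb)

# Toy data: two classes that differ only in the second coordinate.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = [0, 0, 1, 1]
triplets = [(i, j, l) for i in range(4) for j in range(4) for l in range(4)
            if i != j and y[i] == y[j] and y[l] != y[i]]
Hs = np.array([np.outer(X[i] - X[l], X[i] - X[l]) - np.outer(X[i] - X[j], X[i] - X[j])
               for i, j, l in triplets])

(Q, r_gb), (Q_plus, r_pgb) = gb_and_pgb(np.eye(2), Hs, lam=1.0, gamma=0.5)
print(r_gb, r_pgb)   # the PGB radius is never larger than the GB radius
```

The PGB radius is obtained by subtracting $\|[Q^{GB}(M)]_-\|_F^2$ from the squared GB radius, so it can only shrink; the `max(..., 0.0)` guard is our addition against rounding error.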

The following theorem shows a superior convergence property of PGB compared to GB:

Theorem 3.

There exists a subgradient $\nabla P_\lambda(M^\star)$ such that the radius of PGB is 0.

For the hinge loss, which is not differentiable at the kink, the optimal dual variables provide subgradients that set the radius equal to 0. This theorem is an immediate consequence of the proof in appendix I, which is the proof for the relation between PGB and the other bound derived in section 3.4.

From Figure 2a, we see that the half space $\langle -Q_-^{GB},X\rangle\ge 0$, where $Q_-^{GB}=Q^{GB}-Q_+^{GB}$, can be used as a linear relaxation of the semidefinite constraint for the linear constraint rule in section 4.3. Interestingly, the GB with this linear constraint is tighter than the PGB. This is proved in appendix D, which gives the proof of the PGB.

3.3  Duality Gap Bound

In this section, we describe the duality gap bound (DGB) in which the radius is represented by the duality gap:

Theorem 4
(DGB). Let $M$ be a feasible solution of the primal problem and $\alpha$ and $\Gamma$ be feasible solutions of the dual problem. Then the optimal solution of the primal problem $M^\star$ exists in the following hypersphere:
$\|M^\star-M\|_F^2\le 2(P_\lambda(M)-D_\lambda(\alpha,\Gamma))/\lambda.$

The proof is in appendix E. Because the radius is proportional to the square root of the duality gap, the radius of the DGB converges to 0 at the optimal solution (see Figure 2b). The DGB, unlike the previous bounds, requires a dual feasible solution. This means that when a primal-based optimization algorithm is employed, we need to create a dual feasible solution from the primal feasible solution. A simple way to create a dual feasible solution is to substitute the current $M$ for $M^\star$ in equation 2.6. When a dual-based optimization algorithm is employed, a primal feasible solution can be created by equation 2.4.
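A sketch of the DGB for a primal-based solver follows. It assumes the smoothed hinge loss, creates $\alpha$ from the current $M$ through equation 2.6, and evaluates the dual in its simplified form Dual2 (which corresponds to choosing $\Gamma=-[\sum_{ijl}\alpha_{ijl}H_{ijl}]_-$ in Dual1). All function names and the toy data are ours.

```python
import numpy as np

def smoothed_hinge(x, gamma):
    return np.where(x > 1.0, 0.0,
                    np.where(x >= 1.0 - gamma, (1.0 - x) ** 2 / (2.0 * gamma),
                             1.0 - x - gamma / 2.0))

def loss_grad(x, gamma):
    return np.where(x > 1.0, 0.0,
                    np.where(x >= 1.0 - gamma, -(1.0 - x) / gamma, -1.0))

def psd_part(A):
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

def dgb(M, Hs, lam, gamma):
    """Return (center, radius, duality gap) of the DGB for a primal feasible M."""
    margins = np.einsum("tij,ij->t", Hs, M)
    primal = smoothed_hinge(margins, gamma).sum() + 0.5 * lam * np.sum(M * M)
    alpha = -loss_grad(margins, gamma)            # equation 2.6 with M in place of M*
    M_alpha = psd_part(np.einsum("t,tij->ij", alpha, Hs)) / lam   # M_lambda(alpha)
    dual = (-0.5 * gamma * alpha @ alpha + alpha.sum()
            - 0.5 * lam * np.sum(M_alpha * M_alpha))
    gap = max(primal - dual, 0.0)                 # guard against rounding error
    return M, np.sqrt(2.0 * gap / lam), gap

# Toy data: two classes that differ only in the second coordinate.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = [0, 0, 1, 1]
triplets = [(i, j, l) for i in range(4) for j in range(4) for l in range(4)
            if i != j and y[i] == y[j] and y[l] != y[i]]
Hs = np.array([np.outer(X[i] - X[l], X[i] - X[l]) - np.outer(X[i] - X[j], X[i] - X[j])
               for i, j, l in triplets])

center, radius, gap = dgb(np.eye(2), Hs, lam=1.0, gamma=0.5)
print(gap, radius)   # the radius shrinks as the reference solution improves
```

Because $-\nabla\ell$ always lies in $[0,1]$, the $\alpha$ built this way is dual feasible for any reference $M$.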

For the DGB, we can derive a tighter bound, the constrained duality gap bound (CDGB), with an additional constraint. However, except for a special case (dynamic screening with a dual solver), additional transformation of the reference solution is necessary, which can deteriorate the duality gap. See appendix F for further details.

3.4  Regularization Path Bound

In Wang et al. (2014), a hypersphere is proposed specifically for the regularization path, in which the optimization problem should be solved for a sequence of values of $\lambda$. Suppose that the problem for $\lambda_0$ has already been optimized and it is necessary to optimize the problem for $\lambda_1$. Then the same approach as Wang et al. (2014) is applicable to our RTLM, which derives a bound depending on the optimal solution for $\lambda_0$ as a reference solution:

Theorem 5
(RPB). Let $M_0^\star$ be the optimal solution for $\lambda_0$. Then the optimal solution $M_1^\star$ for $\lambda_1$ exists in the following hypersphere:
$\left\|M_1^\star-\frac{\lambda_0+\lambda_1}{2\lambda_1}M_0^\star\right\|_F^2\le\left\|\frac{\lambda_0-\lambda_1}{2\lambda_1}M_0^\star\right\|_F^2.$

The proof is in appendix G. We refer to this bound as the regularization path bound (RPB).

The RPB requires the theoretically optimal solution $M_0^\star$, which cannot be obtained exactly by numerical optimization. Furthermore, because the reference solution is fixed on $M_0^\star$, screening with the RPB can be performed only once for a specific pair of $\lambda_0$ and $\lambda_1$ even if the optimal $M_0^\star$ is available. Screening with the other bounds can be performed multiple times during the optimization by regarding the current approximate solution as a reference solution.

3.5  Relaxed Regularization Path Bound

To use the RPB in practice, we modify this bound in such a way that the approximate solution can be used as a reference solution. Assume that $M_0$ satisfies
$\|M_0^\star-M_0\|_F\le\varepsilon,$
where $\varepsilon\ge 0$ is a constant. Given $M_0$, which satisfies the above condition, we obtain the relaxed regularization path bound (RRPB):
Theorem 6
(RRPB). Let $M_0$ be an approximate solution for $\lambda_0$, which satisfies $\|M_0^\star-M_0\|_F\le\varepsilon$. The optimal solution $M_1^\star$ for $\lambda_1$ exists in the following hypersphere:
$\left\|M_1^\star-\frac{\lambda_0+\lambda_1}{2\lambda_1}M_0\right\|_F^2\le\left(\frac{|\lambda_0-\lambda_1|}{2\lambda_1}\|M_0\|_F+\frac{|\lambda_0-\lambda_1|+\lambda_0+\lambda_1}{2\lambda_1}\varepsilon\right)^2.$
(3.1)

The proof is in appendix H. The intuition behind the RRPB is shown in Figure 2d, in which the approximation error for the center of the RPB is depicted. In the theorem, the RRPB also considers the error in the radius, although it is not illustrated in the figure for simplicity. To the best of our knowledge, this approach has not been introduced in other existing screening studies.

For example, $\varepsilon$ can be set from theorem 4 (DGB) as follows:
$\varepsilon=\sqrt{2(P_{\lambda_0}(M_0)-D_{\lambda_0}(\alpha_0,\Gamma_0))/\lambda_0}.$
(3.2)
When the optimization for $\lambda_0$ terminates, the solution $M_0$ should be accurate in terms of some stopping criterion such as the duality gap. Then $\varepsilon$ is expected to be quite small, and the RRPB can provide a tight bound for $\lambda_1$, which is close to the ideal (but not computable) RPB. As a special case, by setting $\lambda_1=\lambda_0$, the RRPB can be applied to perform the screening of $\lambda_1$ using any approximate solution $M$ having $\|M_1^\star-M\|_F\le\varepsilon$, and then the RRPB is equivalent to the DGB.
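The centers and radii of the RPB and RRPB are closed-form expressions in $\lambda_0$, $\lambda_1$, the reference solution, and $\varepsilon$. A minimal sketch (function names and numerical values are ours) makes the two special cases mentioned above easy to check:

```python
import numpy as np

def rpb(M0_star, lam0, lam1):
    """Center and radius of the RPB (theorem 5); needs the exact optimum for lam0."""
    center = (lam0 + lam1) / (2.0 * lam1) * M0_star
    radius = abs(lam0 - lam1) / (2.0 * lam1) * np.linalg.norm(M0_star)
    return center, radius

def rrpb(M0, lam0, lam1, eps):
    """Center and radius of the RRPB (theorem 6) for an approximate M0
    with ||M0_star - M0||_F <= eps."""
    center = (lam0 + lam1) / (2.0 * lam1) * M0
    radius = (abs(lam0 - lam1) / (2.0 * lam1) * np.linalg.norm(M0)
              + (abs(lam0 - lam1) + lam0 + lam1) / (2.0 * lam1) * eps)
    return center, radius

M0 = np.diag([3.0, 4.0])                       # ||M0||_F = 5
print(rpb(M0, lam0=1.0, lam1=0.5)[1])          # 2.5
print(rrpb(M0, lam0=1.0, lam1=0.5, eps=0.1)[1])  # 2.5 + 2 * 0.1, about 2.7
print(rrpb(M0, lam0=1.0, lam1=1.0, eps=0.1)[1])  # 0.1: with lam1 = lam0 the radius is eps
```

With $\varepsilon=0$ the RRPB reduces to the RPB, and with $\lambda_1=\lambda_0$ its center is $M_0$ and its radius is $\varepsilon$, which is the DGB when $\varepsilon$ is set by equation 3.2.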

3.6  Analytical Relation between Bounds

The following theorem describes the relation between PGB and RPB:

Theorem 7

(Relation between PGB and RPB). Suppose that the optimal solution $M_0^\star$ for $\lambda_0$ is substituted into the reference solution $M$ of PGB. Then there exists a subgradient $\nabla P_{\lambda_1}(M_0^\star)$ by which the PGB and RPB provide the same center and radius for $M_1^\star$.

The proof is presented in appendix I. The following theorem describes the relation between the DGB and RPB:

Theorem 8

(Relation between DGB and RPB). Suppose that the optimal solutions $M_0^\star$, $\alpha_0^\star$, and $\Gamma_0^\star$ for $\lambda_0$ are substituted into the reference solutions $M$, $\alpha$, and $\Gamma$ of the DGB. Then the radii of the DGB and RPB for $\lambda_1$ satisfy $r_{DGB}=2r_{RPB}$, and the hypersphere of RPB is included in the hypersphere of DGB.

The proof is in appendix J. Figure 2c illustrates the relation between the DGB and RPB, which shows the theoretical advantage of the RPB for the regularization path setting.

Using the analytical results obtained thus far, we summarize relative relations between the bounds as follows. First, we consider the case in which the reference solution is optimal for $\lambda_0$ in the regularization path calculation. We see $r_{GB}\ge r_{PGB}$ from Figure 2a, and from theorems 7 and 8, we see $\mathrm{DGB}\supseteq\mathrm{PGB}=\mathrm{RPB}=\mathrm{RRPB}$. When the reference solution is an approximate solution in the regularization path calculation, we see only $r_{GB}\ge r_{PGB}$. For dynamic screening in which the reference solution is always an approximate solution, we see $r_{GB}\ge r_{PGB}$, and we also see $\mathrm{RRPB}=\mathrm{DGB}$ when $\varepsilon$ is determined by DGB as written in equation 3.2.

Other properties of the bounds are summarized in Table 1. Although DGB and RRPB (RPB + DGB) have the same properties, our empirical evaluation in section 7.2 shows that RRPB often outperforms DGB in the regularization path calculation. (Note that although CDGB also has the same properties as the above two methods, we omit it in the empirical evaluation because of its practical limitation, as we see in section 3.3.)

Table 1: Comparison of Sphere Bounds.

| Bound | Radius Convergence | Dynamic Screening | Reference Solution | Exact Optimality of Reference |
|---|---|---|---|---|
| GB | Can be $>0$ | Applicable | Primal | Not necessary |
| PGB | $=0^a$ | Applicable | Primal | Not necessary |
| DGB | $=0$ | Applicable | Primal/dual | Not necessary |
| CDGB | $=0$ | Applicable | Primal/dual | Not necessary |
| RPB | NA | Not applicable | Primal | Necessary |
| RRPB (RPB + DGB) | $=0$ | Applicable | Primal/dual | Not necessary |

Note: The radius convergence indicates the radius when the reference solution is the optimal solution.

$^a$ For the hinge loss ($\gamma=0$) case, a subgradient is required to be selected appropriately for achieving this convergence.

Our safe triplet screening can reduce the number of triplets by identifying a part of $L★$ and $R★$ before solving the optimization problem based on the following procedure:

• Step 1: Identify the spherical region in which the optimal solution $M★$ lies, based on the current feasible solution we refer to as the reference solution.

• Step 2: For each triplet $(i,j,l)∈T$, verify the possibility of $(i,j,l)∈L★$ or $(i,j,l)∈R★$ under the condition that $M★$ is in the region.

In section 3, we showed that there exist a variety of approaches to creating the spherical region for step 1. In this section, we describe the procedure of step 2 given the sphere region.

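Letting $B$ be a region that contains $M^\star$, the following screening rule can be derived from equation 2.5:
$\max_{X\in B}\langle X,H_{ijl}\rangle<1-\gamma\ \Rightarrow\ (i,j,l)\in L^\star,$
(R1)
$\min_{X\in B}\langle X,H_{ijl}\rangle>1\ \Rightarrow\ (i,j,l)\in R^\star.$
(R2)
Based on these rules, $\hat{L}\subseteq L^\star$ and $\hat{R}\subseteq R^\star$ are constructed as
$\hat{L}=\left\{(i,j,l)\ \middle|\ \max_{X\in B}\langle X,H_{ijl}\rangle<1-\gamma\right\},\quad\hat{R}=\left\{(i,j,l)\ \middle|\ \min_{X\in B}\langle X,H_{ijl}\rangle>1\right\}.$
We present an efficient approach to evaluating these rules. Because equation R1 can be evaluated in the same way as R2, we are concerned only with equation R2 henceforth.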

4.1  Spherical Rule

Suppose that the optimal $M^\star$ lies in a hypersphere defined by a center $Q\in\mathbb{R}^{d\times d}$ and a radius $r\in\mathbb{R}_+$. To evaluate the condition of equation R2, we consider the following minimization problem, equation P1:
$\min_X\langle X,H_{ijl}\rangle\quad\text{s.t. }\|X-Q\|_F^2\le r^2.$
(P1)
Letting $Y:=X-Q$, this problem is transformed into
$\min_Y\langle Y,H_{ijl}\rangle+\langle Q,H_{ijl}\rangle\quad\text{s.t. }\|Y\|_F^2\le r^2.$
Because $\langle Q,H_{ijl}\rangle$ is a constant, this optimization problem entails minimizing the inner product $\langle Y,H_{ijl}\rangle$ under the norm constraint. The optimal $Y^\star$ of this optimization problem is easily derived as
$Y^\star=-rH_{ijl}/\|H_{ijl}\|_F,$
and then the minimum value of P1 is $\langle H_{ijl},Q\rangle-r\|H_{ijl}\|_F$. Figure 3 shows a schematic illustration. This derives the following spherical rule:
$\langle H_{ijl},Q\rangle-r\|H_{ijl}\|_F>1\ \Rightarrow\ (i,j,l)\in R^\star.$
(4.1)
This condition can be easily evaluated for a given $Q$ and $r$.
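The spherical rule can be evaluated for all triplets at once. The sketch below also includes the counterpart for rule R1, which uses the maximum $\langle H_{ijl},Q\rangle+r\|H_{ijl}\|_F$ over the same sphere; the function name and the three example matrices are ours.

```python
import numpy as np

def spherical_rule(Hs, Q, r, gamma):
    """Boolean masks (in_L, in_R) of triplets screened by the spherical rule.

    Hs stacks the matrices H_ijl with shape (|T|, d, d); (Q, r) is any sphere
    that is guaranteed to contain the optimal M*.
    """
    inner = np.einsum("tij,ij->t", Hs, Q)          # <H_ijl, Q>
    norms = np.linalg.norm(Hs, axis=(1, 2))        # Frobenius norm of each H_ijl
    in_R = inner - r * norms > 1.0                 # rule R2: minimum over the sphere > 1
    in_L = inner + r * norms < 1.0 - gamma         # rule R1: maximum over the sphere < 1 - gamma
    return in_L, in_R

Hs = np.array([np.diag([3.0, 0.0]), np.diag([-1.0, 0.0]), np.diag([1.0, 0.0])])
in_L, in_R = spherical_rule(Hs, Q=np.eye(2), r=0.2, gamma=0.5)
print(in_L)   # [False  True False]
print(in_R)   # [ True False False]
```

The third triplet is left unscreened because neither inequality holds, so it stays in the reduced problem.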
Figure 3:

Spherical rule defined by equation P1. The yellow sphere indicates the region in which the optimal $M^\star$ must exist. The terms “$\max$” and “$\min$” indicate the points at which the maximum and minimum values of the inner product $\langle X,H_{ijl}\rangle$ are attained. If $\langle X^\star,H_{ijl}\rangle>1$ holds, condition R2 is guaranteed to be satisfied.


4.2  Spherical Rule with a Semidefinite Constraint

The spherical rule does not utilize the positive semidefiniteness of $M^\star$; therefore, a stronger rule can be constructed by incorporating a semidefinite constraint into equation P1:
$\min_X\langle X,H_{ijl}\rangle\quad\text{s.t. }\|X-Q\|_F^2\le r^2,\ X\succeq O.$
(P2)
Although the analytical solution is not available, equation P2 can be solved efficiently by transforming it into the semidefinite least squares (SDLS) problem (Malick, 2004).
Let $B_{PSD}:=\{X\mid\|X-Q\|_F^2\le r^2,\ X\succeq O\}$ be the feasible region of the optimization problem P2. To present the connection between SDLS and equation P2, we first assume that there exists a feasible solution $X_0$ for equation P2 that satisfies $\langle X_0,H_{ijl}\rangle>1$:
$\exists X_0\ \text{such that}\ \langle X_0,H_{ijl}\rangle>1\ \text{and}\ X_0\in B_{PSD}.$
(4.2)
Instead of equation P2, we consider the following SDLS problem:
$\min_{X\in\mathbb{R}^{d\times d}}\|X-Q\|_F^2\quad\text{s.t. }\langle X,H_{ijl}\rangle=1,\ X\succeq O.$
(SDLS)
If the optimal value of this problem is greater than $r^2$ (i.e., $\|X^\star-Q\|_F^2>r^2$), there is no intersection between $B_{PSD}$ and the hyperplane defined by $\langle X,H_{ijl}\rangle=1$:
$\{X\mid\langle X,H_{ijl}\rangle=1,\ X\in B_{PSD}\}=\emptyset.$
(4.3)
From assumption 4.2, we have
$\{X\mid\langle X,H_{ijl}\rangle>1,\ X\in B_{PSD}\}\neq\emptyset.$
(4.4)
As $B_{PSD}$ is a convex set, based on the two conditions 4.3 and 4.4, we derive
$\{X\mid\langle X,H_{ijl}\rangle\le 1,\ X\in B_{PSD}\}=\emptyset,$
which indicates
$\min_{X\in B_{PSD}}\langle X,H_{ijl}\rangle>1.$
Thus, the condition of equation R2 is satisfied.

Based on the connection shown above, the rule evaluation, equation R2, with the semidefinite constraint is summarized as follows:

1. Select an arbitrary feasible solution $X_0\in B_{PSD}$. If $\langle X_0,H_{ijl}\rangle\le 1$, we immediately see that the condition of equation R2 is not satisfied for the triplet $(i,j,l)$. Otherwise, go to the next step. (Note that in this case, assumption 4.2 is confirmed because $\langle X_0,H_{ijl}\rangle>1$.)

2. Solve SDLS. If the optimal value satisfies $\|X^\star-Q\|_F^2>r^2$, the triplet $(i,j,l)$ is guaranteed to be in $R^\star$.

For calculating the second step, we derive the following dual problem of equation SDLS based on Malick (2004):
$\max_{y} D_{\mathrm{SDLS}}(y) := -\|[Q+yH_{ijl}]_+\|_F^2 + 2Cy + \|Q\|_F^2,$
where $y \in \mathbb{R}$ is a dual variable, and $C=1$ for equation R2 and $C=1-\gamma$ for equation R1. Unlike the primal problem, the dual version is an unconstrained problem that has only one variable, $y$, and thus, standard gradient-based algorithms rapidly converge. We refer to the quasi-Newton optimization for this problem as the SDLS dual ascent method. During dual ascent, we can terminate the iteration before convergence if $D_{\mathrm{SDLS}}(y)$ becomes larger than $r^2$ because the value of the dual problem does not exceed the value of the primal problem (weak duality).
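The two-step check with early termination of the dual ascent can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the one-variable concave dual is maximized by bisection on its derivative instead of the quasi-Newton method named above, the projection $[\cdot]_+$ uses a full eigendecomposition, and the caller is assumed to supply an $X_0$ that really lies in the feasible region of P2.

```python
import numpy as np

def psd_part(S):
    """Projection [S]_+ of a symmetric matrix onto the semidefinite cone."""
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

def sdls_dual(y, Q, H, C=1.0):
    """Value and derivative of D_SDLS(y) = -||[Q+yH]_+||_F^2 + 2Cy + ||Q||_F^2."""
    P = psd_part(Q + y * H)
    value = -np.sum(P * P) + 2.0 * C * y + np.sum(Q * Q)
    grad = 2.0 * C - 2.0 * np.sum(P * H)
    return value, grad

def sdls_dual_exceeds(Q, H, r2, C=1.0, max_iter=100, tol=1e-10):
    """Return True as soon as some dual value exceeds r2 (weak duality)."""
    lo, hi = -1.0, 1.0
    for _ in range(60):            # grow the bracket to the left
        value, grad = sdls_dual(lo, Q, H, C)
        if value > r2:
            return True
        if grad >= 0.0:
            break
        lo *= 2.0
    for _ in range(60):            # grow the bracket to the right
        value, grad = sdls_dual(hi, Q, H, C)
        if value > r2:
            return True
        if grad <= 0.0:
            break
        hi *= 2.0
    for _ in range(max_iter):      # bisection on the derivative (D is concave)
        mid = 0.5 * (lo + hi)
        value, grad = sdls_dual(mid, Q, H, C)
        if value > r2:
            return True
        if grad > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return False

def screen_with_semidefinite_rule(Q, H, r2, X0):
    """Step 1: check <X0, H> > 1. Step 2: test the SDLS optimal value against r2."""
    if np.sum(X0 * H) <= 1.0:
        return False
    return sdls_dual_exceeds(Q, H, r2, C=1.0)
```

Returning `False` only means that the rule could not certify the triplet; it never removes a triplet on its own.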

Although the computation of $[Q+yH_{ijl}]_+$ requires an eigenvalue decomposition, this computational requirement can be alleviated when the center $Q$ of the hypersphere is positive semidefinite. By definition, $H_{ijl}$ has at most one negative eigenvalue, and thus $Q+yH_{ijl}$ also has at most one negative eigenvalue. Let $\lambda_{\min}$ be the negative (minimum) eigenvalue of $Q+yH_{ijl}$, and $q_{\min}$ be the corresponding eigenvector. The projection $[Q+yH_{ijl}]_+$ can be expressed as $[Q+yH_{ijl}]_+ = (Q+yH_{ijl}) - \lambda_{\min} q_{\min} q_{\min}^\top$. Computation of the minimum eigenvalue and eigenvector is much easier than the full eigenvalue decomposition (Lehoucq & Sorensen, 1996).

As a special case, when $M$ is a diagonal matrix, the semidefinite constraint is reduced to the nonnegative constraint, and analytical calculation of rule P2 is possible (see section 6.2).

4.3  Spherical Rule with Linear Constraint

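Here, we reduce the computational complexity by considering the relaxation of the semidefinite constraint into a linear constraint. Suppose that a region defined by the linear inequality $\{X \in \mathbb{R}^{d\times d} \mid \langle P, X\rangle \ge 0\}$ contains a semidefinite cone, $\mathbb{R}_+^{d\times d} \subseteq \{X \in \mathbb{R}^{d\times d} \mid \langle P, X\rangle \ge 0\}$, for which we describe the determination of $P \in \mathbb{R}^{d\times d}$ later. Using this relaxed constraint, condition R2 is
$\min_{X} \langle X, H_{ijl}\rangle \quad \text{s.t.} \quad \|X-Q\|_F^2 \le r^2,\ \langle P, X\rangle \ge 0.$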
(P3)
This problem can be solved analytically by considering the KKT conditions as follows (see appendix K).
Theorem 9
(Analytical Solution of Equation P3). The optimal solution of equation P3 is as follows:
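$\langle H_{ijl}, X^\star\rangle = \begin{cases} 0, & \text{if } H_{ijl} = aP, \\ \langle H_{ijl}, Q\rangle - r\|H_{ijl}\|_F, & \text{if } \left\langle P,\ Q - r\frac{H_{ijl}}{\|H_{ijl}\|_F}\right\rangle \ge 0, \\ \left\langle H_{ijl},\ \frac{\beta P - H_{ijl}}{\alpha} + Q\right\rangle, & \text{otherwise}, \end{cases}$
where $a$ is a constant and
$\alpha = \sqrt{\frac{\|P\|_F^2\|H_{ijl}\|_F^2 - \langle P, H_{ijl}\rangle^2}{r^2\|P\|_F^2 - \langle P, Q\rangle^2}}, \quad \beta = \frac{\langle P, H_{ijl}\rangle - \alpha\langle P, Q\rangle}{\|P\|_F^2}.$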
A simple way to obtain $P$ is to utilize the projection onto the semidefinite cone. Let $A \in \mathbb{R}^{d\times d}$ be a matrix external to the semidefinite cone as illustrated in Figure 4. In the figure, $A_+$ is the projection of $A$ onto the semidefinite cone. For example, when the projected gradient for the primal problem (Weinberger & Saul, 2009) is used as an optimizer, $A$ can be an update of the gradient descent $A = M - \eta\nabla P_\lambda(M)$ with some step size $\eta > 0$. Because $M - \eta\nabla P_\lambda(M)$ is projected onto the semidefinite cone at every iterative step of the optimization, no additional calculation is required to obtain $A$ and $A_+$. Defining $A_- := A - A_+$, for any $X \succeq O$, we obtain
$\langle A_+ - A,\ X - A_+\rangle \ge 0 \ \Leftrightarrow\ \langle -A_-, X\rangle \ge 0.$
The inequality on the left has its origins in the property of a supporting hyperplane (Boyd & Vandenberghe, 2004), and for the inequality on the right, we use $\langle A_+, A_-\rangle = 0$. By setting $P = -A_-$, we obtain a linear approximation of the semidefinite constraint, which is a superset of the original semidefinite cone.
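The construction of $P$ can be checked numerically with a short NumPy sketch. The helper name `split_psd` and the random test matrices are assumptions made for illustration; in the setting described above, $A$ would instead be the gradient-descent update that the optimizer already projects at every step.

```python
import numpy as np

def split_psd(A):
    """Return (A_plus, A_minus) with A = A_plus + A_minus, A_plus the PSD projection."""
    w, V = np.linalg.eigh(A)
    A_plus = (V * np.clip(w, 0.0, None)) @ V.T
    A_minus = A - A_plus
    return A_plus, A_minus

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = 0.5 * (B + B.T)                  # arbitrary symmetric matrix (illustrative)
A_plus, A_minus = split_psd(A)
P = -A_minus                         # normal vector of the relaxed half-space

G = rng.standard_normal((4, 4))
X = G @ G.T                          # an arbitrary positive semidefinite matrix
inner_PX = float(np.sum(P * X))      # <P, X>, nonnegative for every PSD X
orthogonality = float(np.sum(A_plus * A_minus))  # <A_plus, A_minus>, zero up to rounding
```

Because $P$ is obtained from a projection that the optimizer performs anyway, evaluating the relaxed constraint adds only one matrix inner product per check.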
Figure 4:

Linear relaxation of semidefinite constraint. From the projection of $A$ to $A_+$, the supporting hyperplane $\langle -A_-, X\rangle = 0$ is constructed, and the half-space $\{X \mid \langle -A_-, X\rangle \ge 0\}$ contains the semidefinite cone $X \succeq O$.


A necessary condition for performing our screening is that a loss function needs to have at least one linear region or a zero region. For example, the logistic loss cannot be used for screening because it has neither a linear nor a zero region.

Algorithm 1 shows the detailed procedure of our safe screening with simple fixed step-size gradient descent. (Note that any other optimization algorithm can be combined with our screening procedure.) In the algorithm, every freq iterations of the gradient descent, the screening rules are evaluated by using the current solution $M$ as the reference solution. As the quality of the approximate solution $M$ improves, a larger number of triplets can be removed from $T$. Thus, the quality of the initial solution affects the efficiency. In the case of the regularization path calculation, in which RTLM is solved for a sequence of $λ$s, a reasonable initial solution is the approximate solution to the previous $λ$. We discuss a further extension specific to the regularization path calculation in section 6.1.

Considering the computational cost of the screening procedure of algorithm 1, the rule evaluation (step 2) described in section 4 is often dominant, because the rule needs to be evaluated for each one of the triplets. The sphere, constructed in step 1, can be fixed during the screening procedure as long as the reference solution is fixed.

To evaluate the spherical rule, equation 4.1, given the center $Q$ and the radius $r$, the inner product $\langle H_{ijl}, Q\rangle$ and the norm $\|H_{ijl}\|_F$ need to be evaluated. The inner product $\langle H_{ijl}, Q\rangle$ can be calculated in $O(d^2)$ operations because it is expanded as a sum of quadratic forms: $\langle H_{ijl}, Q\rangle = (x_i-x_l)^\top Q(x_i-x_l) - (x_i-x_j)^\top Q(x_i-x_j)$. Further, we can reuse this term from the objective function $P_\lambda(M)$ calculation in the case of the DGB, RPB, and RRPB. The norm $\|H_{ijl}\|_F$ can be calculated in $O(d)$ operations, and this is constant throughout the optimization process. Thus, for the DGB, RPB, or RRPB, it is possible to reduce the additional computational cost of the spherical rule for $(i,j,l)$ to $O(1)$ by calculating $\|H_{ijl}\|_F$ beforehand. The computational cost of the spherical rule with the semidefinite constraint (see section 4.2) is that of the SDLS algorithm. The SDLS algorithm needs $O(d^3)$ because of the eigenvalue decomposition in every iterative cycle, which may considerably increase the computational cost. The computational cost of the spherical rule with the linear constraint (see section 4.3) is $O(d^2)$.
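The cost argument above can be illustrated with a small NumPy sketch. The function names are hypothetical. The $O(d)$ norm uses the identity $\|uu^\top - vv^\top\|_F^2 = \|u\|^4 + \|v\|^4 - 2(u^\top v)^2$ with $u = x_i - x_l$ and $v = x_i - x_j$, which is one way to obtain the stated complexity and is an assumption about how it might be done rather than a procedure given in the text.

```python
import numpy as np

def triplet_terms(x_i, x_j, x_l, Q):
    """Return (<H_ijl, Q>, ||H_ijl||_F) without forming the d x d matrix H_ijl."""
    u = x_i - x_l
    v = x_i - x_j
    inner = u @ Q @ u - v @ Q @ v                                   # O(d^2)
    norm_sq = (u @ u) ** 2 + (v @ v) ** 2 - 2.0 * (u @ v) ** 2      # O(d)
    return float(inner), float(np.sqrt(max(norm_sq, 0.0)))

def spherical_rule(inner, h_norm, r):
    """Spherical rule for R2: <H_ijl, Q> - r * ||H_ijl||_F > 1."""
    return inner - r * h_norm > 1.0
```

Since the norm does not depend on $Q$ or $r$, it can be computed once per triplet and cached for the whole optimization.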

6.1  Range-Based Extension of Triplet Screening

The screening rules presented in section 4 relate to the problem of a fixed $\lambda$. In this section, we regard a screening rule as a function of $\lambda$ to derive a range of $\lambda$s in which the screening rule is guaranteed to be satisfied. This is particularly useful for calculating the regularization path for which we need to optimize the metric for a sequence of $\lambda$s. If a screening rule is satisfied for a triplet $(i,j,l)$ in a range $(\lambda_a, \lambda_b)$, we can fix the triplet $(i,j,l)$ in $\hat{L}$ or $\hat{R}$ as long as $\lambda$ is in $(\lambda_a, \lambda_b)$, without computing the screening rules.

6.1.1  Deriving the Range

Let
$Q = A + B\frac{1}{\lambda}$
(6.1)
be the general form of the center of a hypersphere for some constant matrices $A \in \mathbb{R}^{d\times d}$ and $B \in \mathbb{R}^{d\times d}$ and
$r^2 = a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}$
(6.2)
be the general form of the radius for some constants $a \in \mathbb{R}$, $b \in \mathbb{R}$, and $c \in \mathbb{R}$. The GB, DGB, RPB, and RRPB can be in this form (details are provided in appendix L, section L.1). Note that in the RRPB, equation 3.1, $\lambda_1$ is regarded as $\lambda$ in the general form and $\lambda_0$ is a constant. The condition of the spherical rule $\langle H_{ijl}, Q\rangle - r\|H_{ijl}\|_F > 1$ in equation 4.1 can be rewritten as
$(\langle H_{ijl}, Q\rangle - 1)^2 > r^2\|H_{ijl}\|_F^2$
with the assumption
$\langle H_{ijl}, Q\rangle - 1 > 0.$
Because $\langle H_{ijl}, Q\rangle = \langle H_{ijl}, A\rangle + \langle H_{ijl}, B\rangle\frac{1}{\lambda}$, these two inequalities can be transformed into quadratic and linear functions of $\lambda$, respectively. The range of $\lambda$ that satisfies the two inequalities simultaneously represents the range of $\lambda$ in which a triplet $(i,j,l)$ must be in $R^\star$. The following theorem shows the range for the case of RRPB given a reference solution $M_0$, which is an approximate solution for $\lambda_0$:
Theorem 10
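(Range-Based Extension of RRPB). Assuming $\langle H_{ijl}, M_0\rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F > 0$ and $\|M_0^\star - M_0\|_F \le \epsilon$, a triplet $(i,j,l)$ is guaranteed to be in $R^\star$ for the following range of $\lambda$:
$\lambda \in (\lambda_a, \lambda_b),$
where
$\lambda_a = \lambda_0\frac{\|M_0\|_F\|H_{ijl}\|_F - \langle H_{ijl}, M_0\rangle + 2\epsilon\|H_{ijl}\|_F}{\langle H_{ijl}, M_0\rangle - 2 + \|H_{ijl}\|_F\|M_0\|_F}, \quad \lambda_b = \lambda_0\frac{\|M_0\|_F\|H_{ijl}\|_F + \langle H_{ijl}, M_0\rangle}{\|H_{ijl}\|_F\|M_0\|_F - \langle H_{ijl}, M_0\rangle + 2 + 2\epsilon\|H_{ijl}\|_F}.$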

Refer to section L.2 for the proof. The computational procedure for range-based screening is shown in algorithm 2.
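The range in theorem 10 is cheap to evaluate once $\langle H_{ijl}, M_0\rangle$ and the two Frobenius norms are known. The following NumPy sketch simply transcribes the two fractions; the function name `rrpb_range` and the convention of returning `None` when the assumption fails or the interval is empty are illustrative choices, not part of algorithm 2.

```python
import numpy as np

def rrpb_range(H, M0, lam0, eps):
    """Return (lam_a, lam_b) from theorem 10, or None if no range is certified."""
    hm = float(np.sum(H * M0))        # <H_ijl, M0>
    h = float(np.linalg.norm(H))      # ||H_ijl||_F
    m = float(np.linalg.norm(M0))     # ||M0||_F
    denom_a = hm - 2.0 + h * m
    if denom_a <= 0.0:                # assumption of the theorem is violated
        return None
    lam_a = lam0 * (m * h - hm + 2.0 * eps * h) / denom_a
    # The denominator below is positive by the Cauchy-Schwarz inequality.
    lam_b = lam0 * (m * h + hm) / (h * m - hm + 2.0 + 2.0 * eps * h)
    if lam_a >= lam_b:                # empty interval: nothing can be fixed
        return None
    return lam_a, lam_b
```

With $\epsilon = 0$, the interval contains $\lambda_0$ exactly when $\langle H_{ijl}, M_0\rangle > 1$, which matches the condition of R2 evaluated at the reference solution.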

6.1.2  Consideration for Range Extension with Other Bounds

As shown in equation 3.1, the RRPB is based on the optimality $\epsilon$ for the current $\lambda_0$, and does not depend on the optimality for $\lambda_1$, which is regarded as $\lambda$ in the general form of equations 6.1 and 6.2. Because of this property, the RRPB is particularly suitable to range-based screening among the spheres we derived thus far. To calculate $\epsilon$, equation 3.2 for the RRPB, the duality gap $P_{\lambda_0}(M_0) - D_{\lambda_0}(\alpha_0, \Gamma_0)$ is required. Instead of the original $P_{\lambda_0}(M_0) - D_{\lambda_0}(\alpha_0, \Gamma_0)$, we can use problems with a reduced size, $\tilde{P}_{\lambda_0}(M_0) - \tilde{D}_{\lambda_0}(\alpha_0, \Gamma_0)$, for efficient computation, where $\tilde{D}_{\lambda_0}$ is the dual objective in which $\alpha_i = 0$ for $i \in \hat{R}$ and $\alpha_i = 1$ for $i \in \hat{L}$ are fixed. Because the reduced-size problem shares exactly the same optimal solution with the original problems, this gap also provides a valid bound. As a result, we can avoid computing the sum over all triplets in $T$ (e.g., to calculate the loss term in the original primal) to evaluate a bound.

Figure 5:

(a) Suppose that $M^\star$ is the optimal solution for $\lambda_0$, and some iterative optimization algorithm obtains $M_{\mathrm{prev}}$ in the middle of the optimization process. The circle around $M_{\mathrm{prev}}$ represents the DGB, which contains $M^\star$. Then the screening rule can eliminate the triplet $(i,j,l)$ because $\langle M, H_{ijl}\rangle > 1$ holds for any points in the circle. Now suppose that $M_0$ is the approximate solution we obtain after the optimization algorithm terminates with some small tolerance of the duality gap. The circle with the dashed line represents the region in which the duality gap is less than the tolerance. Although $M_0$ satisfies the termination condition, the inequality $\langle M, H_{ijl}\rangle > 1$ does not hold for $M_0$. In this case, we cannot ignore this triplet $(i,j,l)$ to evaluate the duality gap for different $\lambda \ne \lambda_0$ because it causes a nonzero penalty. (b) An enlarged bound. Because of the inequality of DGB $\|M^\star - M_0\|_F \le \sqrt{2\epsilon/\lambda}$, this enlarged region contains any approximate solutions with the duality gap $\le \epsilon$.

In the other bounds, the loss term in the primal objective needs to be carefully considered. Suppose that we have an approximate solution $M0$ for $λ0$ as a reference solution. To regard a bound as a function of $λ$ in the GB and PGB, it is necessary to consider the gradient for $λ$ (i.e., $∇Pλ(M0)$), and the DGB requires the objective value for $λ$ (i.e., $Pλ(M0)$). These two terms may not be correctly calculated if we replace them with the reduced-size primal created for $λ0$. Figure 5a illustrates an example of this problem in the case of DGB. To safely replace $Pλ(M0)$ with the reduced-size primal $P˜λ(M0)$ for these cases, the following conditions need to hold:
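$\langle M_0, H_{ijl}\rangle < 1-\gamma, \ \text{for } \forall (i,j,l) \in \hat{L}, \qquad \langle M_0, H_{ijl}\rangle > 1, \ \text{for } \forall (i,j,l) \in \hat{R}.$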
If the reference solution $M0$ is exactly optimal for $λ0$, these conditions hold. However, in practice, this cannot be true because of numerical errors, and furthermore, the optimization algorithm is usually terminated with some tolerance.
This problem can be avoided by enlarging the radius of spherical bounds such that the bound contains the approximate solution. Assuming that $M0$ is an approximate solution with the duality gap $ε$, then from the DGB, we see that the distance between $M0$ and the optimal solution $M★$ satisfies
$\|M^\star - M_0\|_F \le \sqrt{\frac{2\epsilon}{\lambda}}.$
This inequality indicates that by enlarging the radius of the hypersphere by $\sqrt{2\epsilon/\lambda}$, we can guarantee that the bound includes any approximate solutions (Figure 5b shows an illustration). Using the radius $r$ introduced in section 6.1, we obtain the enlarged radius $R$ as follows:
$R = \underbrace{\sqrt{a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}}}_{r} + \sqrt{\frac{2\epsilon}{\lambda}}.$
(6.3)
The reduced-size problems created by this enlarged radius can be safely used to evaluate the duality gap for any $λ$. However, this enlarged radius no longer has the general form of the radius 6.2 we assumed to derive the range. Although it is possible to derive a range even for the enlarged radius $R$, the calculation becomes quite complicated, and thus we do not pursue this direction in this study. (Appendix M shows the computational procedure.) Further, an increase in the radius may decrease the screening rate.

6.2  Screening with Diagonal Constraint

When the matrix $M$ is constrained to be a diagonal matrix, metric learning is reduced to feature weighting in which the Mahalanobis distance, equation 2.1, simply adapts a weight of each feature without combining different dimensions. Although correlation in different dimensions is not considered, this simpler formulation is useful to avoid a large computational cost for high-dimensional data mainly because of the following two reasons:

• The number of variables in the optimization decreases from $d^2$ to $d$.

• The semidefinite constraint for a diagonal matrix is reduced to the nonnegative constraint of diagonal elements.

Both properties are also beneficial for efficient screening rule evaluation; in particular, the second property makes the screening rule with the semidefinite constraint easier to evaluate.

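The minimization problem of the spherical rule with the semidefinite constraint, equation P2, is simplified as
$\min_{x \in \mathbb{R}^d} x^\top h_{ijl} \quad \text{s.t.} \quad \|x-q\|_2^2 \le r^2,\ x \ge 0,$
(P4)
where $h_{ijl} := \mathrm{diag}(H_{ijl})$. Let
$L(x,\alpha,\beta) := x^\top h_{ijl} - \alpha(r^2 - \|x-q\|_2^2) - \beta^\top x,$
be the Lagrange function of equation P4, where $\alpha \ge 0$ and $\beta \ge 0$ are dual variables. The KKT conditions are written as
$\partial L/\partial x = h_{ijl} + 2\alpha(x-q) - \beta = 0,$
(6.4a)
$\alpha(r^2 - \|x-q\|_2^2) = 0,\ \beta_k x_k = 0,$
(6.4b)
$\alpha \ge 0,\ r^2 - \|x-q\|_2^2 \ge 0,\ \beta \ge 0,\ x \ge 0.$
(6.4c)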
We derive the analytical representation of the optimal solution for cases of $α>0$ and $α=0$, respectively. For $α>0$, the following theorem is obtained.
Theorem 11.
If the optimal dual variable satisfies $α>0$, the optimal $x$ and $β$ of equation P4 can be written as
$x_k = \begin{cases} q_k - h_{ijl,k}/(2\alpha), & \text{if } h_{ijl,k} - 2\alpha q_k \le 0, \\ 0, & \text{otherwise}, \end{cases}$
(6.5)
and
$β=hijl+2α(x-q).$
(6.6)
Then $x$ also satisfies
$\|x-q\|_2^2 = r^2.$
(6.7)

For $α=0$, the following theorem is obtained.

Theorem 12.
If the optimal dual variable satisfies $α=0$, the optimal $x$ and $β$ of equation P4 can be written as
$x_k = \begin{cases} 0, & \text{if } h_{ijl,k} > 0, \\ \max\{q_k, 0\}, & \text{otherwise}, \end{cases}$
(6.8)
and
$β=hijl.$
(6.9)

The proofs for theorems 11 and 12 are in sections N.1 and N.2, respectively.

Based on the theorems, the optimal solution of equation P4 can be calculated analytically. The details of the procedure are shown in section N.3, which requires $O(d^2)$ computations. Although this procedure obtains the solution by using the fixed steps of analytical calculations, for larger values of $d$, iterative optimization algorithms can be faster. For example, we can apply the SDLS dual ascent to problem P4 in which each iterative step takes $O(d)$.
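As an illustration of how theorems 11 and 12 determine the solution of P4, the sketch below searches for the multiplier numerically. It is not the analytical procedure of section N.3 nor the SDLS dual ascent: it uses plain bisection on $t = 1/(2\alpha)$, relying on the fact that $x(t) = \max\{q - t\,h_{ijl}, 0\}$ (equation 6.5) moves monotonically away from $q$ as $t$ grows. The function name and iteration counts are assumptions.

```python
import numpy as np

def solve_p4(h, q, r, n_bisect=100):
    """Minimize x.h subject to ||x - q||_2 <= r and x >= 0; returns (x, value)."""
    r2 = r * r

    def x_of(t):                       # equation 6.5 with t = 1 / (2 * alpha)
        return np.maximum(q - t * h, 0.0)

    def dist2(t):
        return float(np.sum((x_of(t) - q) ** 2))

    if dist2(0.0) > r2:
        raise ValueError("no nonnegative point lies inside the sphere")

    t_hi = 1.0
    for _ in range(200):               # look for a t at which the sphere is active
        if dist2(t_hi) >= r2:
            break
        t_hi *= 2.0
    else:
        # Sphere constraint never becomes active: alpha = 0 (theorem 12).
        x = np.where(h > 0, 0.0, np.maximum(q, 0.0))
        return x, float(x @ h)

    t_lo = 0.0
    for _ in range(n_bisect):          # enforce ||x - q||^2 = r^2 (equation 6.7)
        mid = 0.5 * (t_lo + t_hi)
        if dist2(mid) < r2:
            t_lo = mid
        else:
            t_hi = mid
    x = x_of(t_lo)                     # t_lo keeps x inside the sphere
    return x, float(x @ h)
```

Each evaluation of `dist2` costs $O(d)$, so the total cost is $O(d)$ times the number of bisection steps. The triplet can then be tested by comparing the returned value with 1, as in condition R2.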

6.3  Applicability to More General Formulation

Throughout the letter, we analyze screening theorems based on the optimization problem defined by Primal. RTLM is the Frobenius norm-regularized triplet loss minimization, which has been shown to be an effective formulation of metric learning (Schultz & Joachims, 2004; Shen et al., 2014). Further, with slight modifications, our screening framework can accommodate a wider range of metric learning methods. Here we redefine the optimization problem as follows:
$\min_{M \succeq O} \sum_i \ell(\langle M, C_i\rangle) + \frac{\lambda}{2}\|M\|_F^2,$
(6.10)
where $Ci∈Rd×d$ is a constant matrix. All our sphere bounds (GB, PGB, DGB, RPB, and RRPB) still hold for this general representation if the loss function $ℓ:R→R$ is convex subdifferentiable. The rules (spherical rule, sphere with semidefinite constraint, and sphere with linear constraint) can also be constructed if the loss function has the form of the hinge type loss function, equations 2.2 and 2.3, by which standard hinge-, smoothed hinge–, and squared hinge–loss functions are included.
We can incorporate an additional linear term into the objective function 6.10. Defining a pseudo-loss function $ℓ˜(x)=-x$, we write the primal problem with a linear term as
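$\min_{M \succeq O} P_\lambda(M) := \sum_{ijl} \ell(\langle M, H_{ijl}\rangle) + \tilde{\ell}(\langle M, C\rangle) + \frac{\lambda}{2}\|M\|_F^2,$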
(6.11)
which can be seen as a special case of equation 6.10 because $ℓ˜$ is convex subdifferentiable. Suppose that $ηij∈{0,1}$ indicates whether $xj$ is a target neighbor of $xi$, which is a neighbor of $xi$ having the same label. When we define $C:=-∑ijηij(xi-xj)(xi-xj)⊤$ and employ the hinge loss, equation 2.2, this formulation is the well-known LMNN (Weinberger & Saul, 2009) with the additional Frobenius norm regularization. Another interpretation of this linear term is the trace norm regularization (Kulis, 2013). For the pseudo-loss term $ℓ˜$, the derivative is $∇ℓ˜(x)=-1$, and the conjugate is $ℓ˜*(-a)=0$ if $-a=-1$; otherwise, $∞$, where $a$ is a dual variable. Then, by using the derivation of the dual in appendix A, the dual problem is modified as
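$\max_{0 \le \alpha \le 1,\ a=1,\ \Gamma \succeq O} D_\lambda(\alpha, \Gamma) := -\frac{\gamma}{2}\|\alpha\|_2^2 + \alpha^\top 1 - \frac{\lambda}{2}\|M_\lambda(\alpha, a, \Gamma)\|_F^2,$
where
$M_\lambda(\alpha, a, \Gamma) := \frac{1}{\lambda}\Big[\sum_{ijl} \alpha_{ijl} H_{ijl} + aC + \Gamma\Big].$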
Because equation 6.11 is a special case of equation 6.10, all spheres can be derived, and we can construct the same screening rules for $αijl$ for $(i,j,l)∈T$. The only difference is that the dual variable $a$ is not associated with any screening rule because it is fixed to 1 by the dual constraint.
Regarding the loss term, pairwise- and quadruplet-loss functions can also be incorporated into our framework. The pairwise approach considers a set of pairs in the same class $S$ and a set of pairs in the different classes $D$. Davis et al. (2007) considered constraints with threshold parameters $U$ and $L$: $d_M^2(x_i, x_j) \le U$ for $(i,j) \in S$ and $d_M^2(x_i, x_l) \ge L$ for $(i,l) \in D$. Let $\ell_t(x) = [t-x]_+$ be a hinge loss function with threshold $t$. By using $\ell_t$, the above two constraints can be relaxed to soft constraints that result in
$\min_{M \succeq O} \sum_{(i,j) \in S} \ell_{-U}(\langle M, -C_{ij}\rangle) + \sum_{(i,l) \in D} \ell_{L}(\langle M, C_{il}\rangle) + \frac{\lambda}{2}\|M\|_F^2.$
Because of the threshold parameters, the second term of the dual problem, Dual2, changes from $α⊤1$ to $α⊤t$, where $t:=[L,⋯,L,-U,⋯,-U]⊤∈R|D|+|S|$. Our bounds still hold because $ℓt$ is convex subdifferentiable, and screening rules are formulated as evaluating whether the inner product $〈M,Cil〉$ (or $〈M,-Cij〉$) is larger or smaller than the threshold $t$.
Law et al. (2013) proposed a loss function based on a quadruplet of instances. The basic idea is to compare pairs of dissimilarity $dM2(xi,xj)$ and $dM2(xk,xl)$. For example, when $(k,l)$ should be more dissimilar than $(i,j)$, the loss is defined as $ℓ(dM2(xk,xl)-dM2(xi,xj))$. They define the following optimization problem,
$\min_{M \succeq O} \sum_{(i,j,k,l) \in Q} \ell(\langle M, C_{ijkl}\rangle) + \frac{\lambda}{2}\|M\|_F^2,$
where $C_{ijkl} = (x_k-x_l)(x_k-x_l)^\top - (x_i-x_j)(x_i-x_j)^\top$ and $Q$ is a set of quadruplets. This is also a special case of equation 6.10.

Note that pairwise-, triplet- and quadruplet-loss functions can be used simultaneously, and safe screening can be applied to remove any of those loss terms.

We evaluate the performance of safe triplet screening using the benchmark data sets listed in Table 2, which are from LIBSVM (Chang & Lin, 2011) and the Keras data set (Chollet et al., 2015). We create a set of triplets by following the approach by Shen et al. (2014), in which $k$ neighborhoods in the same class $xj$ and $k$ neighborhoods in a different class $xl$ are sampled for each $xi$. We employed the regularization path setting in which RTLM is optimized for a sequence of $λ0,λ1,…,λT$. To determine $λ0=λmax$, from a sufficiently large $λ$ in which $R$ is empty, we gradually reduced $λ$ by multiplying it by 0.9 and started the regularization path calculation from $λ$ in which $R$ is not empty. To generate the next value of $λ$, we used $λt=0.9λt-1$, and the path terminated when the following condition is satisfied:
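$\frac{\mathrm{loss}(\lambda_{t-1}) - \mathrm{loss}(\lambda_t)}{\mathrm{loss}(\lambda_{t-1})} \times \frac{\lambda_{t-1}}{\lambda_{t-1} - \lambda_t} < 0.01,$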
where $\mathrm{loss}(\lambda_t)$ is the loss function value at $\lambda_t$. We randomly selected 90% of the instances of each data set five times, and the average is shown as the experimental result. As a base optimizer, we employed the projected gradient descent of the primal problem, and the iteration terminated when the duality gap became less than $10^{-6}$. For the loss function $\ell$, we used the smoothed hinge loss with $\gamma = 0.05$. (We also provide results for the hinge loss in section 7.4.1.) We performed safe triplet screening after every 10 iterative cycles of the gradient descent. We refer to the first screening for a specific $\lambda_t$, in which the solution of the previous $\lambda_{t-1}$ is used as the reference solution, as regularization path screening. The screening performed during the optimization process (after regularization path screening) is termed dynamic screening. We performed both of these screening procedures for all experiments. As a baseline, we refer to the RTLM optimization without screening as naive optimization. We initialized with $M = O$ at $\lambda_0$ because $M = O$ is the optimal solution when $\lambda \to \infty$. When the regularization coefficient changes, $M$ starts from the previous solution $\hat{M}$ (warm start). The step size of the gradient descent was determined by
$\frac{1}{2}\left(\frac{\langle \Delta M, \Delta G\rangle}{\langle \Delta G, \Delta G\rangle} + \frac{\langle \Delta M, \Delta M\rangle}{\langle \Delta M, \Delta G\rangle}\right),$
where $ΔM=Mt-Mt-1,ΔG=∇Pλ(Mt)-∇Pλ(Mt-1)$ (Barzilai & Borwein, 1988). In SDLS dual ascent, we used the conjugate gradient method (Yang, 1993) to find the minimum eigenvalue.
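The step-size rule above averages the two Barzilai-Borwein step sizes. A minimal NumPy sketch follows; the function name is hypothetical, and the sketch assumes $\langle \Delta M, \Delta G\rangle \ne 0$ and $\Delta G \ne O$, since no safeguard for those cases is described in the text.

```python
import numpy as np

def bb_step_size(M_t, M_prev, G_t, G_prev):
    """Average of the two Barzilai-Borwein step sizes for matrix iterates."""
    dM = M_t - M_prev                  # Delta M
    dG = G_t - G_prev                  # Delta G (difference of gradients)
    mg = np.sum(dM * dG)               # <Delta M, Delta G>
    return 0.5 * (mg / np.sum(dG * dG) + np.sum(dM * dM) / mg)
```

For a quadratic objective whose gradient is $2M$, both terms equal $1/2$, so the rule returns the exact inverse curvature.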
Table 2:
Summary of Data Sets.
#dimension #sample #classes $k$ #triplet $\lambda_{\max}$ $\lambda_{\min}$
Iris 150 $∞$ 546,668 1.3e$+$2.3e$+$
Wine 13 178 $∞$ 910,224 2.0e$+$5.1e$+$
Segment 19 2310 20 832,000 2.5e$+$4.2e$+$
Satimage 36 4435 15 898,200 1.0e$+$8.8e$+$
Phishing 68 11,055 487,550 5.0e$+$2.0e$-$
SensIT Vehicle 100 78,823 638,469 1.0e$+$2.9e$+$
a9a 16$a$ 32,561 732,625 1.2e$+$3.1e$+$
Mnist 32$a$ 60,000 10 1,350,025 7.0e$+$9.6e$-$
Cifar10 200$a$ 50,000 10 180,004 2.0e$+$3.3e$+$
Rcv1.multiclass 200$b$ 15,564 53 126,018 3.0e$+$6.0e$-$

Note: #triplet and $λmin$ are the average value for subsampled random trials.

$a$The dimension was reduced by AutoEncoder.

$b$The dimension was reduced by PCA.

7.1  Comparing Rules

We first validate the screening performance (screening rate and CPU time) of each screening rule introduced in section 4 by using algorithm 2 without the range-based screening process. Here, the screening rate is defined by #screened triplets/$|\{(i,j,l) \in T \mid \langle H_{ijl}, \hat{M}\rangle > 1 \ \text{or} \ \langle H_{ijl}, \hat{M}\rangle < 1-\gamma\}|$, where $\hat{M}$ is the solution after convergence.

7.1.1  GB-Based Rules

Here we use the GB and PGB as spheres, and we observe the effect of the semidefinite constraint in the rules. As a representative result, Figure 6a compares the performance of the rules by using segment data.

Figure 6:

Comparison of screening rules on the segment data set. For both panels a and b, the plots are aligned as follows. (Top left) Performance of regularization path screening. (Bottom left) Ratio of CPU time compared with the naive optimization for each $λ$. (Right) Enlargement of the upper left plot for the range $-log10(λ)∈[-1,-0.6]$.


First, except for the GB, the rules maintain a high screening rate for the entire regularization path, as shown in the top left plot. Note that this rate is only for regularization path screening, meaning that dynamic screening can further increase the screening rate during the optimization, as discussed in section 7.1.2. The bottom left plot of the same figure shows that PGB and GB+Linear are the most efficient and were approximately 2 to 10 times faster than the naive optimization. The screening rate of the GB was severely reduced along the latter half of the regularization path. As illustrated in Figure 2a, the center of the GB can be external to the semidefinite cone, so the sphere of the GB contains a larger proportion of the region violating the constraint $M \succeq O$, compared with the spheres with their center inside the semidefinite cone. This causes performance deterioration particularly for smaller values of $\lambda$, because the minimum of the loss term is usually outside the semidefinite cone.

The screening rates of GB+Linear and GB+Semidefinite are slightly higher than that of the PGB (the plot on the right), which can be seen from their geometrical relation illustrated in Figure 2a. GB+Semidefinite achieved the highest screening rate, but eigenvalue decomposition is necessary to repeatedly perform the calculation in SDLS, which resulted in the CPU time increasing along the latter half of the path. Although PGB+Semidefinite is also tighter than PGB, the CPU time increased from approximately $-log10(λ)≈-4$ to $-3$. Because the center of PGB is positive semidefinite, only the minimum eigenvalue is required (see section 4.2), but it can still increase the CPU time.

Among the screening methods compared here, our empirical analysis suggests that the use of the spherical rule with PGB, in which a semidefinite constraint is implicitly incorporated in the projection process, is the most cost-effective. We did not observe that the other approach to considering the semidefinite (or relaxed linear) constraint in the rule substantially outperforms PGB in terms of CPU time despite its high screening rate. We observed the same tendency for DGB. The screening rate did not change markedly even if the semidefinite constraint was explicitly considered.

7.1.2  DGB-Based Rules

Next, by using the DGB, we compared the performance of the three rules presented in section 4. Figure 6b shows the results, which are similar to those obtained for the GB, shown in Figure 6a. The semidefinite and the linear constraint slightly improve the rate. However, the large computational cost for screening with the semidefinite constraint caused the overall CPU time to increase. Although the linear constraint is much easier to evaluate, its CPU time was almost the same as that required for the DGB because the improvement in the screening rate was only slight.

7.2  Comparing Bounds

Here we compare the screening performance (screening rate and CPU time) of each bound introduced in section 3 by using algorithm 2 without the range-based screening process. We do not use RPB because it needs the strictly optimal previous solution.

Based on the results in the previous section, we employed the spherical rule. The result obtained for the phishing data set is shown in Figure 7. The screening rate of the GB (top right) again decreased from the middle of the horizontal axis compared with the other spheres. The other spheres also have lower screening rates for small values of $\lambda$. As mentioned in section 6.1, the radii of GB, DGB, RPB, and RRPB have the form $r^2 = a + b\frac{1}{\lambda} + c\frac{1}{\lambda^2}$, meaning that if $\lambda \to 0$, then $r \to \infty$. In the case of the PGB, although the dependency on $\lambda$ cannot be written explicitly, the same tendency was observed. We see that the PGB and RRPB have similar results as suggested by theorem 7, and the screening rate of the DGB is lower than that of the RRPB, as suggested by theorem 8. A comparison of the PGB and RRPB indicated that the former achieved a higher screening rate, but the latter is more efficient with less CPU time, as shown in the plot at the bottom right, because the PGB requires a matrix inner product calculation for each triplet. Bounds other than the GB are more than twice as fast as the naive calculation for most values of $\lambda$.

Figure 7:

Comparison of spherical bounds on the phishing data set. The heat maps on the left show the dynamic screening rate. The vertical axes of these heat maps represent the number of iterative steps required for the optimization divided by 10 to perform the screening. (Top right) Rate of regularization path screening. (Bottom right) Ratio of CPU time compared with naive optimization.


A comparison of the dynamic screening rate (the three plots on the left in Figure 7) of PGB and RRPB shows that the rate of PGB is higher. In terms of the regularization path screening (top right), RRPB and PGB have similar screening rates, but PGB has a higher dynamic screening rate. Along the latter half of the regularization path, the number of gradient descent iterations increases; consequently, the dynamic screening significantly affects the CPU time, and the PGB becomes faster despite the additional computation it requires to compute the inner product.

We further evaluate the performance of the range-based extension described in section 6.1. Figure 8 shows the rate of the range-based screening for the segment data set. The figure shows that a wide range of $\lambda$ can be screened, particularly for small values of $\lambda$. Although the range is smaller for large values of $\lambda$ than for small values, a high screening rate is observed when $\lambda$ approaches $\lambda_0$. A significant advantage of this approach is that for those triplets screened by using the specified range, we no longer need to evaluate the screening rule as long as $\lambda$ is within the range.

Figure 8:

Screening rate of range-based screening on the segment data set. The color indicates the screening rate for $\lambda$ on the vertical axis based on the reference solution using $\lambda_0$ on the horizontal axis. The accuracy of the reference solution is $10^{-4}$ and $10^{-6}$ for the plots on the left and right, respectively.


The total CPU time for the regularization path is shown in Figure 9. In addition to GB, PGB, DGB, and RRPB, we further evaluate the performance when PGB and RRPB are used simultaneously. The use of two rules can improve the screening rate; however, additional computations are required to evaluate the rule. In the figure, for four out of six data sets, the PGB+RRPB combination requires the least CPU time.

Figure 9:

Total CPU time of regularization path (seconds). The term RRPB+PGB indicates that the spherical rules are performed with these two spheres. “Baseline” indicates the computational time without screening. The bars show the total time and the proportion of the optimization algorithm and the screening calculation.


7.3  Evaluating the Practical Efficiency

We next considered a computationally more expensive setting to evaluate the effectiveness of the safe screening approach in a practical situation. To investigate the regularization path more precisely, we set a finer grid of regularization parameters defined as $\lambda_t = 0.99 \lambda_{t-1}$. We also incorporated the well-known active set heuristics to conduct our experiments on larger data sets. Note that because of the above differences, the computational time shown here cannot be directly compared with the results in sections 7.1 and 7.2. The active set method uses only the subset of triplets whose loss is greater than 0 as the active set. The gradient is calculated by using only the active set, and the overall optimality is confirmed when the iteration converges. We employed the active set update strategy shown by Weinberger and Saul (2009), in which the active set is updated once every ten iterative cycles.

Table 3 compares the CPU time for the entire regularization path. Based on the results in the previous section, we employed RRPB and RRPB+PGB (evaluating rules based on both spheres) for triplet screening. Further, the range-based screening described in section 6.1 is also performed using RRPB, for which we evaluate the range at the beginning of the optimization for each $\lambda$, as shown in algorithm 2. Our safe triplet screening accelerates the optimization process by up to about 10 times compared to the simple active set method (e.g., 758.7 seconds versus 70.1 seconds on a9a). The results for higher-dimensional data sets with a diagonal $M$ are presented in section 7.4.2.

Table 3:
Evaluation of the Total CPU Time (Seconds) with the Active Set Method.
Method\Data Set phishing SensIT a9a mnist cifar10 rcv
ActiveSet 7989.5 16,352.1 758.7 3788.1 11,085.7 94,996.3
ActiveSet+RRPB 2126.2 3555.6 70.1 871.1 1431.3 43,174.9
ActiveSet+RRPB+PGB 2133.2 3046.9 72.1 897.9 1279.7 38,231.1

Note: Results in bold indicate the fastest method.

7.4  Empirical Evaluation of Three Special Cases

Here we evaluate three special cases of our formulation: nonsmoothed hinge loss, the Mahalanobis distance with a diagonal matrix, and dynamic screening for a certain value of $λ$.

7.4.1  Nonsmoothed Hinge Loss

In the previous experiments, we used the smoothed hinge loss function with $\gamma = 0.05$. However, the hinge loss function ($\gamma = 0$) is also widely used. Figure 10 shows the screening result of the PGB spherical rule for the segment data. Here, the loss function of RTLM is the hinge loss function, and the other settings are the same as those of the experiments in the main text. The results show that PGB achieved a high screening rate and that the CPU time substantially improved.

Figure 10:

Performance evaluation of PGB for the hinge loss setting. The heat map on the left shows the dynamic screening rate with the vertical axis showing the number of iterative cycles for optimization divided by 10 at which screening is performed. (Top right) Rate of regularization path screening. (Bottom right) Ratio of CPU time compared with the naive optimization.


7.4.2  Learning with Higher-Dimensional Data Using Diagonal Matrix

Here we evaluate the screening performance when the matrix $M$ is confined to being a diagonal matrix. Based on the same setting as section 7.3, comparison with the ActiveSet method is shown in Table 4. We used RRPB and RRPB+PGB, both of which largely reduced the CPU time. Attempts to process the Gisette data set, which has the largest dimension, 5,000, with the active set method were unsuccessful and the method did not terminate even after 250,000 s.

Table 4:
Total Time (Seconds) of the Regularization Path for Diagonal $M$.
ActiveSet 2485.5 7005.8 3149.8 –
ActiveSet+RRPB 326.7 593.4 632.2 133,870.0
ActiveSet+RRPB+PGB 336.6 562.4 628.2 127,123.8
#dimension 256 500 2000 5000
#samples 7291 2000 62 6000
#triplet 656,200 720,400 38,696 1,215,225
$k$ 10 20 $∞$ 15
$λmax$ 1.0e+7 2.0e+14 5.0e+7 4.5e+8
$λmin$ 1.9e+3 4.7e+11 7.0e+3 2.1e+3

Notes: The results in bold indicate the fastest method. The Gisette data set did not produce results by ActiveSet because of the time limitation.

7.4.3  Dynamic Screening for Fixed $λ$

Here, we evaluate the performance of dynamic screening for a fixed $\lambda$. For $\lambda$, we used $\lambda_{\min}$ in Table 2 for which the screening rate was relatively low in our results thus far (e.g., see Figure 6a). Figure 11 compares the computational time of the naive approach without screening and with the dynamic screening shown in algorithm 1. The plots in Figure 11a show that dynamic screening accelerates the learning process. The plots in Figure 11b show the performance of the active set strategy, indicating that the combination of dynamic screening and the active set strategy is effective for further acceleration.

Figure 11:

Evaluation of the computational time for dynamic screening. The computational time required for dynamic screening (a) without the active set and (b) with the active set. “Baseline” indicates the results obtained for the naive method without screening and the active set strategy. The bars show the total time and the proportion of the optimization algorithm and the screening calculation.


7.5  Effect of Number of Triplets on Prediction Accuracy

Finally, we examine the relation between the number of triplets contained in $\mathcal{T}$ and the prediction accuracy of the classification. We employed the nearest-neighbor (NN) classifier to measure the prediction performance of the learned metric. The data set was randomly divided into training data (60%), validation data (20%), and test data (20%). The regularization parameter $\lambda$ changed from $10^5$ to 0.1 and was chosen by minimizing the validation error. The experiment was performed 10 times by randomly partitioning the data set in different ways.

The results are shown in Figure 12, which summarizes the CPU time and test error rate for different settings of the number of triplets. The horizontal axes in all four plots, a to d, represent the number of neighbors $k$ used to define the original triplet set $\mathcal{T}$ as described at the beginning of section 7. Figure 12a shows the CPU time to calculate the entire regularization path with and without screening. Here “Without Screening” indicates the ActiveSet approach, and “With Screening” indicates the ActiveSet+RRPB approach. These results show that the learning time increases as $k$ increases, and safe triplet screening shows larger decreases in the CPU time for larger values of $k$. Figures 12b to 12d show the test error rates calculated by 10 NN, 20 NN, and 30 NN classifiers, respectively. In Figure 12b, the 10 NN test error is minimized at $k = 6$, with screening requiring less than approximately 2,000 seconds, whereas the naive approach (Without Screening) can calculate only approximately $k = 4$ in the same computational time. In Figure 12c, the 20 NN test error is minimized at $k = 12$, with screening requiring approximately 4,000 seconds, whereas the naive approach can calculate only approximately $k = 8$. In Figure 12d, the 30 NN test error is minimized at $k = 15$, with screening requiring approximately 5,000 seconds, whereas the naive approach can calculate only approximately $k = 9$. These results indicate that the number of neighbors, $k$, significantly affects the prediction accuracy, and a sufficiently large $k$ is often necessary to achieve the best prediction performance.

Figure 12:

CPU time (seconds) and test error rate on the phishing data set.


We introduced safe triplet screening for large-margin metric learning. Three screening rules and six spherical bounds were derived, and the relations among them were analyzed. We further proposed a range-based extension for the regularization path calculation. Our screening technique for metric learning is particularly significant compared with other screening studies because of the large number of triplets and the semidefinite constraint. Our numerical experiments verified the effectiveness of safe triplet screening using several benchmark data sets.

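To derive the dual problem, we first rewrite the primal problem as
$\min_{M,t} \sum_{ijl} \ell(t_{ijl}) + \lambda R(M) \quad \mathrm{s.t.} \quad M \succeq O, \; t_{ijl} = \langle M, H_{ijl} \rangle,$
where $t$ is a $|\mathcal{T}|$-dimensional vector that contains all $t_{ijl}$ for $(i,j,l) \in \mathcal{T}$ and
$R(M) = \frac{1}{2} \|M\|_F^2.$
(A.1)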
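The Lagrange function is
$L(M,t,\alpha,\Gamma) := \sum_{ijl} \ell(t_{ijl}) + \lambda R(M) + \sum_{ijl} \alpha_{ijl} \left( t_{ijl} - \langle M, H_{ijl} \rangle \right) - \langle M, \Gamma \rangle,$
where $\alpha \in \mathbb{R}^{|\mathcal{T}|}$ and $\Gamma \in \mathbb{R}_+^{d \times d}$ are Lagrange multipliers. Let
$\ell^*(-\alpha_{ijl}) := \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} - \ell(t_{ijl}) \},$
(A.2)
$R^*(M_\lambda(\alpha,\Gamma)) := \sup_{M} \{ \langle M_\lambda(\alpha,\Gamma), M \rangle - R(M) \},$
(A.3)
be convex conjugate functions (Boyd & Vandenberghe, 2004) of $\ell$ and $R$, where
$M_\lambda(\alpha,\Gamma) := \frac{1}{\lambda} \left[ \sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma \right].$
(A.4)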
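Then the dual function is written as
$D_\lambda(\alpha,\Gamma) := \inf_{M,t} L(M,t,\alpha,\Gamma) = -\sum_{ijl} \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} - \ell(t_{ijl}) \} - \lambda \sup_{M} \{ \langle M_\lambda(\alpha,\Gamma), M \rangle - R(M) \} = -\sum_{ijl} \ell^*(-\alpha_{ijl}) - \lambda R^*(M_\lambda(\alpha,\Gamma)).$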
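From the Karush-Kuhn-Tucker (KKT) condition, we obtain
$\nabla_M L = \lambda \nabla R(M) - \lambda M_\lambda(\alpha,\Gamma) = O,$
(A.5a)
$\nabla_{t_{ijl}} L = \nabla \ell(t_{ijl}) + \alpha_{ijl} = 0,$
(A.5b)
$\Gamma \succeq O, \quad M \succeq O, \quad \langle M, H_{ijl} \rangle = t_{ijl}, \quad \langle M, \Gamma \rangle = 0,$
(A.5c)
where, in the case of hinge loss,
$\nabla \ell(x) = \begin{cases} 0, & x > 1, \\ -c, & x = 1, \\ -1, & x < 1, \end{cases}$
for any $c \in [0,1]$, and in the case of smoothed hinge loss,
$\nabla \ell(x) = \begin{cases} 0, & x > 1, \\ -\frac{1}{\gamma}(1-x), & 1-\gamma \le x \le 1, \\ -1, & x < 1-\gamma. \end{cases}$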
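From these two equations and equation A.5b, we see that $0 \le \alpha \le 1$. Substituting equation A.5b into equation A.2 and considering the above constraint, the conjugate of the loss function $\ell$ can be transformed into
$\ell^*(-\alpha_{ijl}) = \frac{\gamma}{2} \alpha_{ijl}^2 - \alpha_{ijl}.$
Note that this equation holds for the cases of both hinge loss (by setting $\gamma = 0$) and smoothed hinge loss ($\gamma > 0$). Substituting equation A.5a into A.3, the conjugate of the regularization term $R$ is written as
$R^*(M_\lambda(\alpha,\Gamma)) = R(M_\lambda(\alpha,\Gamma)) = \frac{1}{2} \|M_\lambda(\alpha,\Gamma)\|_F^2.$
Therefore, the dual problem is
$\max_{0 \le \alpha \le 1, \, \Gamma \succeq O} D_\lambda(\alpha,\Gamma) = -\sum_{ijl} \ell^*(-\alpha_{ijl}) - \frac{\lambda}{2} \|M_\lambda(\alpha,\Gamma)\|_F^2.$
(Dual1)
Because the second term, $\max_{\Gamma \succeq O} -\frac{1}{2} \|M_\lambda(\alpha,\Gamma)\|_F^2$, is equivalent to the projection onto a semidefinite cone (Boyd & Xiao, 2005; Malick, 2004), the above problem (Dual1) can be simplified as
$\max_{0 \le \alpha \le 1} D_\lambda(\alpha) := -\frac{\gamma}{2} \|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2} \|M_\lambda(\alpha)\|_F^2,$
(Dual2)
where
$M_\lambda(\alpha) := \frac{1}{\lambda} \left[ \sum_{ijl} \alpha_{ijl} H_{ijl} \right]_+.$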
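The map from a dual vector $\alpha$ to the primal matrix used in Dual2 can be sketched numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names `psd_projection` and `primal_from_dual` are hypothetical, and `H` is assumed to be an array of shape `(n_triplets, d, d)` holding the symmetric matrices $H_{ijl}$.

```python
import numpy as np

def psd_projection(A):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone, [A]_+."""
    A = (A + A.T) / 2.0                      # guard against round-off asymmetry
    w, V = np.linalg.eigh(A)                 # eigenvalues w, eigenvectors in columns of V
    return (V * np.maximum(w, 0.0)) @ V.T    # keep only the nonnegative eigenvalues

def primal_from_dual(alpha, H, lam):
    """M_lambda(alpha) = (1/lam) * [sum_ijl alpha_ijl H_ijl]_+ (assumed data layout)."""
    S = np.tensordot(alpha, H, axes=1)       # weighted sum of the H_ijl, shape (d, d)
    return psd_projection(S) / lam
```

Clipping the negative eigenvalues to zero is the standard closed form of this projection, which is why the multiplier $\Gamma$ no longer appears explicitly in Dual2.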
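For the optimal $M^\star$, each triplet in $\mathcal{T}$ can be categorized into the following three groups:
$\mathcal{L}^\star := \{ (i,j,l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star \rangle < 1-\gamma \}, \quad \mathcal{C}^\star := \{ (i,j,l) \in \mathcal{T} \mid 1-\gamma \le \langle H_{ijl}, M^\star \rangle \le 1 \}, \quad \mathcal{R}^\star := \{ (i,j,l) \in \mathcal{T} \mid \langle H_{ijl}, M^\star \rangle > 1 \}.$
(A.6)
Based on equations A.5b and A.5c, it becomes clear that $\alpha_{ijl}^\star = -\nabla \ell(\langle M^\star, H_{ijl} \rangle)$, by which the following rules are obtained:
$(i,j,l) \in \mathcal{L}^\star \Rightarrow \alpha_{ijl}^\star = 1, \quad (i,j,l) \in \mathcal{C}^\star \Rightarrow \alpha_{ijl}^\star \in [0,1], \quad (i,j,l) \in \mathcal{R}^\star \Rightarrow \alpha_{ijl}^\star = 0.$
(A.7)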
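The relation between the margin $\langle M^\star, H_{ijl} \rangle$, the three groups, and the dual variable can be illustrated with a short sketch. It assumes $\gamma > 0$ and uses the closed form of the smoothed hinge loss that is consistent with the gradient given above (quadratic on $[1-\gamma, 1]$, linear below); the function names are hypothetical and chosen only for this illustration.

```python
def smoothed_hinge(x, gamma=0.05):
    """Smoothed hinge loss whose derivative matches the piecewise gradient above (gamma > 0)."""
    if x > 1.0:
        return 0.0
    if x >= 1.0 - gamma:
        return (1.0 - x) ** 2 / (2.0 * gamma)
    return 1.0 - x - gamma / 2.0

def smoothed_hinge_grad(x, gamma=0.05):
    """Derivative of the smoothed hinge loss."""
    if x > 1.0:
        return 0.0
    if x >= 1.0 - gamma:
        return -(1.0 - x) / gamma
    return -1.0

def dual_variable(margin, gamma=0.05):
    """alpha = -grad loss(margin); always lies in [0, 1]."""
    return -smoothed_hinge_grad(margin, gamma)

def triplet_group(margin, gamma=0.05):
    """Return 'L', 'C', or 'R' following the partition in equation A.6."""
    if margin < 1.0 - gamma:
        return "L"
    if margin <= 1.0:
        return "C"
    return "R"
```

For example, a margin of 0.5 falls in group L with dual variable 1, and a margin of 2.0 falls in group R with dual variable 0, so neither triplet needs its dual variable to be optimized once its group is known.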
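The reduced-size problem can be represented by
$\min_{M,t} \sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell(t_{ijl}) + \sum_{(i,j,l) \in \hat{\mathcal{L}}} \left( 1 - \frac{\gamma}{2} - t_{ijl} \right) + \frac{\lambda}{2} \|M\|_F^2 \quad \mathrm{s.t.} \quad t_{ijl} = \langle M, H_{ijl} \rangle, \; (i,j,l) \in \mathcal{T}, \quad M \succeq O.$
Then the Lagrangian is
$\tilde{L}(M,t,\alpha,\Gamma) = \sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell(t_{ijl}) + \sum_{(i,j,l) \in \hat{\mathcal{L}}} \left( 1 - \frac{\gamma}{2} - t_{ijl} \right) + \frac{\lambda}{2} \|M\|_F^2 + \sum_{(i,j,l) \in \mathcal{T}} \alpha_{ijl} \left( t_{ijl} - \langle M, H_{ijl} \rangle \right) - \langle M, \Gamma \rangle.$
(B.1)
The dual function is written as
$\tilde{D}_\lambda(\alpha,\Gamma) := \inf_{M,t} \tilde{L}(M,t,\alpha,\Gamma) = -\sum_{(i,j,l) \in \tilde{\mathcal{T}}} \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} - \ell(t_{ijl}) \} - \sum_{(i,j,l) \in \hat{\mathcal{R}}} \sup_{t_{ijl}} \{ (-\alpha_{ijl}) t_{ijl} \} - \sum_{(i,j,l) \in \hat{\mathcal{L}}} \sup_{t_{ijl}} \{ (1-\alpha_{ijl}) t_{ijl} \} + \left( 1 - \frac{\gamma}{2} \right) |\hat{\mathcal{L}}| - \lambda \sup_{M} \{ \langle M_\lambda(\alpha,\Gamma), M \rangle - R(M) \},$
where $R(M)$ and $M_\lambda(\alpha,\Gamma)$ are defined by equations A.1 and A.4, respectively. Based on the second and third terms of the previous equation, we see
$\alpha_{ijl} = 0, \quad \forall (i,j,l) \in \hat{\mathcal{R}},$
(B.2)
$\alpha_{ijl} = 1, \quad \forall (i,j,l) \in \hat{\mathcal{L}},$
(B.3)
which prevent $\tilde{D}_\lambda$ from approaching $-\infty$. Then constraints B.2 and B.3 enable us to further transform the dual objective into
$\tilde{D}_\lambda(\alpha,\Gamma) = -\sum_{(i,j,l) \in \tilde{\mathcal{T}}} \ell^*(-\alpha_{ijl}) + \sum_{(i,j,l) \in \hat{\mathcal{L}}} \left( 1 - \frac{\gamma}{2} \right) - \frac{\lambda}{2} \|M_\lambda(\alpha,\Gamma)\|_F^2 = -\frac{\gamma}{2} \|\alpha_{\tilde{\mathcal{T}}}\|_2^2 + \alpha_{\tilde{\mathcal{T}}}^\top \mathbf{1} + \left( 1 - \frac{\gamma}{2} \right) |\hat{\mathcal{L}}| - \frac{\lambda}{2} \|M_\lambda(\alpha,\Gamma)\|_F^2 = -\frac{\gamma}{2} \|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2} \|M_\lambda(\alpha,\Gamma)\|_F^2.$
Thus, the dual problem is written as
$\max_{0 \le \alpha \le 1, \, \Gamma \succeq O} D_\lambda(\alpha,\Gamma) = -\frac{\gamma}{2} \|\alpha\|_2^2 + \alpha^\top \mathbf{1} - \frac{\lambda}{2} \|M_\lambda(\alpha,\Gamma)\|_F^2 \quad \mathrm{s.t.} \quad \alpha_{\hat{\mathcal{L}}} = \mathbf{1}, \; \alpha_{\hat{\mathcal{R}}} = \mathbf{0}.$
(B.4)
This is the same optimization problem as Dual1 except that $\alpha_{\hat{\mathcal{L}}}$ and $\alpha_{\hat{\mathcal{R}}}$ are fixed at their optimal values in Dual1. Hence, problems B.4 and Dual1 have the same optimal solution. Given the optimal dual variables $\alpha^\star$ and $\Gamma^\star$, the optimal primal $M^\star$ can be derived by
$M = \frac{1}{\lambda} \left[ \sum_{(i,j,l) \in \mathcal{T}} \alpha_{ijl} H_{ijl} + \Gamma \right],$
(B.5)
which is from $\nabla_M \tilde{L} = O$. Because equation B.5 is exactly the same transformation as equation 2.4, the same optimal primal $M^\star$ must be obtained.$\square$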

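The following theorem is a well-known optimality condition for the general convex optimization problem:

Theorem 12
(Optimality Condition of Convex Optimization, Bertsekas, 1999). In the minimization problem $\min_{x \in F} f(x)$, where the feasible region $F$ and the function $f(x)$ are convex, the necessary and sufficient condition that $x^\star$ is the optimal solution is
$\exists \nabla f(x^\star) \in \partial f(x^\star), \quad \nabla f(x^\star)^\top (x^\star - x) \le 0, \quad \forall x \in F,$
where $\partial f(x^\star)$ represents the set of subgradients at $x^\star$.
From theorem 12, the following holds for the optimal solution $M^\star$:
$\langle \nabla P_\lambda(M^\star), M^\star - M \rangle \le 0, \quad \forall M \succeq O.$
(C.1)
Let $\Xi_{ijl}(M)$ be the subgradient of the loss function $\ell(\langle M, H_{ijl} \rangle)$ at $M$. Then $\nabla P_\lambda(M)$ is written as
$\nabla P_\lambda(M) = \sum_{ijl} \Xi_{ijl}(M) + \lambda M.$
(C.2)
From the convexity of the (smoothed) hinge loss function $\ell(\langle M, H_{ijl} \rangle)$, we obtain
$\ell(\langle M^\star, H_{ijl} \rangle) \ge \ell(\langle M, H_{ijl} \rangle) + \langle \Xi_{ijl}(M), M^\star - M \rangle, \quad \ell(\langle M, H_{ijl} \rangle) \ge \ell(\langle M^\star, H_{ijl} \rangle) + \langle \Xi_{ijl}(M^\star), M - M^\star \rangle.$
Adding these two inequalities gives
$\langle \Xi_{ijl}(M^\star), M^\star - M \rangle \ge \langle \Xi_{ijl}(M), M^\star - M \rangle.$
(C.3)
Combining equations C.1 to C.3 results in
$\left\langle \sum_{ijl} \Xi_{ijl}(M) + \lambda M^\star, M^\star - M \right\rangle \le 0 \Leftrightarrow \langle \nabla P_\lambda(M) - \lambda M + \lambda M^\star, M^\star - M \rangle \le 0.$
By transforming this inequality based on completing the square, we obtain GB.$\square$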
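Let $Q^{\mathrm{GB}}$ be the center of the GB hypersphere and $r_{\mathrm{GB}}$ be the radius. The optimal solution exists in the following set:
$\{ X \mid \|X - Q^{\mathrm{GB}}\|_F^2 \le r_{\mathrm{GB}}^2, \; X \succeq O \}.$
(D.1)
By transforming the sphere of GB, we obtain
$\|X - Q^{\mathrm{GB}}\|_F^2 = \|X - (Q_+^{\mathrm{GB}} + Q_-^{\mathrm{GB}})\|_F^2 = \|X - Q_+^{\mathrm{GB}}\|_F^2 + 2 \langle X, -Q_-^{\mathrm{GB}} \rangle + 2 \langle Q_+^{\mathrm{GB}}, Q_-^{\mathrm{GB}} \rangle + \|Q_-^{\mathrm{GB}}\|_F^2.$
Because $X \succeq O$ and $-Q_-^{\mathrm{GB}} \succeq O$, we see $\langle X, -Q_-^{\mathrm{GB}} \rangle \ge 0$. Furthermore, using $\langle Q_+^{\mathrm{GB}}, Q_-^{\mathrm{GB}} \rangle = 0$, we obtain the following sphere:
$r_{\mathrm{GB}}^2 \ge \|X - Q^{\mathrm{GB}}\|_F^2 \ge \|X - Q_+^{\mathrm{GB}}\|_F^2 + \|Q_-^{\mathrm{GB}}\|_F^2. \quad \therefore \; \|X - Q_+^{\mathrm{GB}}\|_F^2 \le r_{\mathrm{GB}}^2 - \|Q_-^{\mathrm{GB}}\|_F^2.$
Letting $Q^{\mathrm{PGB}} := Q_+^{\mathrm{GB}}$ and $r_{\mathrm{PGB}}^2 := r_{\mathrm{GB}}^2 - \|Q_-^{\mathrm{GB}}\|_F^2$, PGB is obtained. Note that by considering $\langle X, -Q_-^{\mathrm{GB}} \rangle \ge 0$ instead of $X \succeq O$ in equation D.1, we can immediately see that GB with the linear constraint $\langle X, -Q_-^{\mathrm{GB}} \rangle \ge 0$ is tighter than PGB.$\square$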
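The construction of PGB from GB can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the GB center and squared radius are taken as $M - \frac{1}{2\lambda}\nabla P_\lambda(M)$ and $\|\nabla P_\lambda(M)\|_F^2 / (4\lambda^2)$ as used elsewhere in these appendices, and `grad` is assumed to be a symmetric (sub)gradient matrix supplied by the caller.

```python
import numpy as np

def gb_sphere(M, grad, lam):
    """Center and squared radius of GB from a reference M and its (sub)gradient."""
    center = M - grad / (2.0 * lam)
    radius_sq = np.sum(grad ** 2) / (4.0 * lam ** 2)
    return center, radius_sq

def pgb_sphere(center_gb, radius_sq_gb):
    """Split the GB center into PSD and NSD parts and shrink the radius accordingly."""
    w, V = np.linalg.eigh((center_gb + center_gb.T) / 2.0)
    q_plus = (V * np.maximum(w, 0.0)) @ V.T     # Q_+^GB, the new center
    q_minus = (V * np.minimum(w, 0.0)) @ V.T    # Q_-^GB, orthogonal to Q_+^GB
    radius_sq = radius_sq_gb - np.sum(q_minus ** 2)
    return q_plus, q_minus, radius_sq
```

Because both parts share the same eigenvectors and have disjoint nonzero eigenvalues, their inner product vanishes, which is the orthogonality used in the derivation.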
In general, a function $f(x)$ is an $m$-strongly convex function if $f(x) - \frac{m}{2} \|x\|_2^2$ is convex. Because the objective function $P_\lambda(M)$ is a $\lambda$-strongly convex function, we obtain
$P_\lambda(M) \ge P_\lambda(M^\star) + \langle \nabla P_\lambda(M^\star), M - M^\star \rangle + \frac{\lambda}{2} \|M - M^\star\|_F^2.$
From the optimality condition, equation C.1, the second term on the right-hand side is greater than or equal to 0, and from the weak duality, $P_\lambda(M^\star) \ge D_\lambda(\alpha,\Gamma)$. Therefore, we obtain theorem 4.$\square$
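The resulting bound, $\|M - M^\star\|_F^2 \le 2 (P_\lambda(M) - D_\lambda(\alpha,\Gamma)) / \lambda$, only needs the primal and dual objective values. The sketch below evaluates it with NumPy under several assumptions that are not fixed by the text: `H` is an array of shape `(n_triplets, d, d)`, the smoothed hinge loss has the closed form consistent with the gradient in appendix A (with $\gamma > 0$), and the dual value is computed in the Dual2 form, that is, with $\Gamma$ already optimized. All function names are hypothetical.

```python
import numpy as np

def smoothed_hinge(x, gamma):
    """Vectorized smoothed hinge loss (gamma > 0 assumed)."""
    x = np.asarray(x, dtype=float)
    quad = (1.0 - x) ** 2 / (2.0 * gamma)
    lin = 1.0 - x - gamma / 2.0
    return np.where(x > 1.0, 0.0, np.where(x >= 1.0 - gamma, quad, lin))

def primal_objective(M, H, lam, gamma):
    """P_lambda(M) = sum of losses at <M, H_ijl> plus (lam/2) ||M||_F^2."""
    margins = np.tensordot(H, M, axes=([1, 2], [0, 1]))
    return smoothed_hinge(margins, gamma).sum() + 0.5 * lam * np.sum(M ** 2)

def dual_objective(alpha, H, lam, gamma):
    """D_lambda(alpha) in the Dual2 form, using the PSD part of sum alpha_ijl H_ijl."""
    S = np.tensordot(alpha, H, axes=1)
    w = np.linalg.eigvalsh((S + S.T) / 2.0)
    m_norm_sq = np.sum(np.maximum(w, 0.0) ** 2) / lam ** 2   # ||M_lambda(alpha)||_F^2
    return -0.5 * gamma * alpha @ alpha + alpha.sum() - 0.5 * lam * m_norm_sq

def dgb_radius_sq(M, alpha, H, lam, gamma):
    """Squared DGB radius 2 * (duality gap) / lam for feasible M and alpha."""
    gap = primal_objective(M, H, lam, gamma) - dual_objective(alpha, H, lam, gamma)
    return 2.0 * gap / lam
```

By weak duality the gap is nonnegative for any $M \succeq O$ and $0 \le \alpha \le 1$, so the squared radius is well defined, and it shrinks to 0 as both reference solutions approach the optimum.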

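For the DGB, we show that if the primal and dual reference solutions satisfy equation 2.4, the radius can be $\sqrt{2}$ times smaller. We extend the dual-based screening of SVM (Zimmert et al., 2015) for RTLM.

Theorem 14
(CDGB). Let $\alpha$ and $\Gamma$ be the feasible solutions of the dual problem. Then the optimal solution of the primal problem $M^\star$ exists in the following hypersphere:
$\|M^\star - M_\lambda(\alpha,\Gamma)\|_F^2 \le G_{D_\lambda}(\alpha,\Gamma) / \lambda.$
Proof.
Let $G_{D_\lambda}(\alpha,\Gamma) := P_\lambda(M_\lambda(\alpha,\Gamma)) - D_\lambda(\alpha,\Gamma)$ be the duality gap as a function of the dual feasible solutions $\alpha$ and $\Gamma$. The following equation is the duality gap as a function of the primal feasible solution $M$ in which the dual solutions are optimized:
$G_{P_\lambda}(M) := \min_{0 \le \alpha \le 1, \, \Gamma \succeq O, \, M_\lambda(\alpha,\Gamma) = M} G_{D_\lambda}(\alpha,\Gamma) = P_\lambda(M) - \max_{0 \le \alpha \le 1, \, \Gamma \succeq O, \, M_\lambda(\alpha,\Gamma) = M} D_\lambda(\alpha,\Gamma).$
From the definition, we obtain
$G_{D_\lambda}(\alpha,\Gamma) \ge G_{P_\lambda}(M_\lambda(\alpha,\Gamma)).$
(F.1)
From the strong convexity of $G_{P_\lambda}$ shown in section F.1, the following holds for any $M_\lambda(\alpha,\Gamma)$ and $M^\star \succeq O$:
$G_{P_\lambda}(M_\lambda(\alpha,\Gamma)) \ge G_{P_\lambda}(M^\star) + \langle \nabla G_{P_\lambda}(M^\star), M_\lambda(\alpha,\Gamma) - M^\star \rangle + \lambda \|M_\lambda(\alpha,\Gamma) - M^\star\|_F^2.$
We assume that $M^\star$ is the optimal solution of the primal problem. Then, because $M^\star$ is also a solution to the convex optimization problem $\min_{M \succeq O} G_{P_\lambda}(M)$, it becomes clear that $\langle \nabla G_{P_\lambda}(M^\star), M_\lambda(\alpha,\Gamma) - M^\star \rangle \ge 0$ from theorem 12. Considering $G_{P_\lambda}(M^\star) = 0$ and $G_{D_\lambda}(\alpha,\Gamma) \ge G_{P_\lambda}(M_\lambda(\alpha,\Gamma))$, both of which are from the definition, we obtain
$G_{D_\lambda}(\alpha,\Gamma) \ge G_{P_\lambda}(M_\lambda(\alpha,\Gamma)) \ge \lambda \|M_\lambda(\alpha,\Gamma) - M^\star\|_F^2.$
Dividing by $\lambda$, CDGB is derived.$\square$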

We name this bound the constrained duality gap bound (CDGB), of which the radius converges to 0 at the optimal solution, because the CDGB also has a radius proportional to the square root of the duality gap. For primal-based optimizers, additional calculation is necessary for $Pλ(Mλ(α,Γ))$, whereas dual-based optimizers calculate this term in the optimization process.

F.1  Proof of Strong Convexity of $GPλ$

We first define an $m$-strongly convex function as follows:

Definition 1

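($m$-strongly Convex Function). When $f(x) - \frac{m}{2} \|x\|_2^2$ is a convex function, $f(x)$ is an $m$-strongly convex function.

According to definition 1, to show that $G_{P_\lambda}$ is strongly convex, we need to show that the term other than $\lambda \|M\|_F^2$ is convex:
$G_{P_\lambda}(M) = \underbrace{\sum_{ijl} \ell(\langle M, H_{ijl} \rangle)}_{\text{convex}} + \lambda \|M\|_F^2 + \underbrace{\min_{0 \le \alpha \le 1, \, \Gamma \succeq O, \, M_\lambda(\alpha,\Gamma) = M} \underbrace{\sum_{ijl} \ell^*(-\alpha_{ijl})}_{:= g(\alpha)}}_{:= f(M)}.$
Because the loss $\ell$ is convex, we need to show that $f(M)$ is convex. This can be shown as
$f(M) = \min_{0 \le \alpha \le 1, \, \Gamma \succeq O, \, \frac{1}{\lambda} \left[ \sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma \right] = M} g(\alpha) = \min_{0 \le \alpha \le 1, \, \frac{1}{\lambda} \sum_{ijl} \alpha_{ijl} H_{ijl} \preceq M} g(\alpha).$
Consider a point $M_2 = t M_0 + (1-t) M_1$ $(t \in [0,1])$, which internally divides two points $M_0$ and $M_1$. Let
$\alpha_i^\star := \arg\min_{0 \le \alpha \le 1, \, \frac{1}{\lambda} \sum_{ijl} \alpha_{ijl} H_{ijl} \preceq M_i} g(\alpha),$
which means that $\alpha_i^\star$ is the minimizer of this problem for a given $M_i$ $(i \in \{0,1,2\})$, and from the definition, we see $f(M_i) = g(\alpha_i^\star)$. Further, let $\alpha_2 = t \alpha_0^\star + (1-t) \alpha_1^\star$. Then, $0 \le \alpha_2 \le 1$ and $\frac{1}{\lambda} \sum_{ijl} \alpha_{2,ijl} H_{ijl} \preceq M_2$. Because $g$ is convex, which follows from the convexity of $\ell^*$, we have
$t f(M_0) + (1-t) f(M_1) = t g(\alpha_0^\star) + (1-t) g(\alpha_1^\star) \ge g(\underbrace{t \alpha_0^\star + (1-t) \alpha_1^\star}_{\alpha_2}) \ge g(\alpha_2^\star) = f(\underbrace{t M_0 + (1-t) M_1}_{M_2}).$
Hence, $f(M)$ is convex and $G_{P_\lambda}$ is a strongly convex function.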
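The optimality condition, theorem 12, in the dual problem, Dual1, for $\lambda_0, \lambda_1$ determines that
$\nabla_\alpha D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star)^\top (\alpha_1^\star - \alpha_0^\star) + \langle \nabla_\Gamma D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star), \Gamma_1^\star - \Gamma_0^\star \rangle \le 0, \quad \nabla_\alpha D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star)^\top (\alpha_0^\star - \alpha_1^\star) + \langle \nabla_\Gamma D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star), \Gamma_0^\star - \Gamma_1^\star \rangle \le 0.$
By adding these two inequalities, we obtain
$[\nabla_\alpha D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_\alpha D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star)]^\top (\alpha_1^\star - \alpha_0^\star) + \langle \nabla_\Gamma D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_\Gamma D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star), \Gamma_1^\star - \Gamma_0^\star \rangle \le 0.$
Next, we consider the following difference of gradients:
$\nabla_{\alpha_{ijl}} D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_{\alpha_{ijl}} D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star) = -\gamma (\alpha_{0ijl}^\star - \alpha_{1ijl}^\star) - \langle H_{ijl}, M_0^\star - M_1^\star \rangle, \quad \nabla_\Gamma D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) - \nabla_\Gamma D_{\lambda_1}(\alpha_1^\star,\Gamma_1^\star) = -(M_0^\star - M_1^\star).$
Defining $H_t^\star := \sum_{ijl} \alpha_{tijl}^\star H_{ijl}$, $M_t^\star$ is rewritten as $M_t^\star = \frac{1}{\lambda_t} [H_t^\star + \Gamma_t^\star]$. Then
$\gamma \|\alpha_1^\star - \alpha_0^\star\|_2^2 - \langle H_1^\star - H_0^\star, M_0^\star - M_1^\star \rangle - \langle M_0^\star - M_1^\star, \Gamma_1^\star - \Gamma_0^\star \rangle \le 0 \Leftrightarrow \gamma \|\alpha_1^\star - \alpha_0^\star\|_2^2 - \langle \lambda_1 M_1^\star - \lambda_0 M_0^\star, M_0^\star - M_1^\star \rangle \le 0 \Rightarrow -\langle \lambda_1 M_1^\star - \lambda_0 M_0^\star, M_0^\star - M_1^\star \rangle \le 0.$
Transformation of this inequality based on completing the square allows the RPB to be obtained.$\square$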
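Considering a hypersphere that expands the RPB radius by $\frac{\lambda_0 + \lambda_1}{2 \lambda_1} \varepsilon$ and replaces the RPB center with $\frac{\lambda_0 + \lambda_1}{2 \lambda_1} M_0$, we obtain
$\left\| M_1^\star - \frac{\lambda_0 + \lambda_1}{2 \lambda_1} M_0 \right\|_F \le \frac{|\lambda_0 - \lambda_1|}{2 \lambda_1} \|M_0^\star\|_F + \frac{\lambda_0 + \lambda_1}{2 \lambda_1} \varepsilon.$
Because $\varepsilon$ is defined by $\|M_0^\star - M_0\|_F \le \varepsilon$, this sphere covers any RPB created by $M_0^\star$, which satisfies $\|M_0^\star - M_0\|_F \le \varepsilon$ (see Figure 2d for a geometrical illustration). Using the reverse triangle inequality,
$\|M_0^\star\|_F - \|M_0\|_F \le \|M_0^\star - M_0\|_F \le \varepsilon,$
we obtain
$\left\| M_1^\star - \frac{\lambda_0 + \lambda_1}{2 \lambda_1} M_0 \right\|_F \le \frac{|\lambda_0 - \lambda_1|}{2 \lambda_1} \left( \|M_0\|_F + \varepsilon \right) + \frac{\lambda_0 + \lambda_1}{2 \lambda_1} \varepsilon.$
By rearranging this, RRPB is obtained.$\square$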
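When the dual variable is used as the subgradient of the (smoothed) hinge loss at the optimal solution $M_0^\star$ of $\lambda_0$ (from equation A.7, the optimal dual variable provides a valid subgradient), the gradient of the objective function in the case of $\lambda_1$ is written as
$\nabla P_{\lambda_1}(M_0^\star) = -H_0^\star + \lambda_1 M_0^\star,$
where
$H_0^\star := -\sum_{ijl} \nabla \ell(\langle M_0^\star, H_{ijl} \rangle) H_{ijl} = \sum_{ijl} \alpha_{0ijl}^\star H_{ijl}.$
Because $\lambda_0 M_0^\star = H_{0+}^\star$,
$\nabla P_{\lambda_1}(M_0^\star) = -(H_{0+}^\star + H_{0-}^\star) + \lambda_1 M_0^\star = (\lambda_1 - \lambda_0) M_0^\star - H_{0-}^\star.$
Then the center and radius of GB are
$Q^{\mathrm{GB}} = M_0^\star - \frac{1}{2 \lambda_1} \nabla P_{\lambda_1}(M_0^\star) = \frac{(\lambda_0 + \lambda_1) M_0^\star + H_{0-}^\star}{2 \lambda_1}, \quad r_{\mathrm{GB}}^2 = \frac{\|(\lambda_1 - \lambda_0) M_0^\star - H_{0-}^\star\|_F^2}{4 \lambda_1^2} = \frac{\|(\lambda_1 - \lambda_0) M_0^\star\|_F^2 - 2 (\lambda_1 - \lambda_0) \langle M_0^\star, H_{0-}^\star \rangle + \|H_{0-}^\star\|_F^2}{4 \lambda_1^2} = \frac{\|(\lambda_0 - \lambda_1) M_0^\star\|_F^2 + \|H_{0-}^\star\|_F^2}{4 \lambda_1^2}.$
Here, the last equation of $r_{\mathrm{GB}}^2$ uses the fact that $M_0^\star$ and $H_{0-}^\star$ are orthogonal. Using $Q^{\mathrm{GB}}$ and $r_{\mathrm{GB}}^2$, the center and radius of PGB are found to be
$Q^{\mathrm{PGB}} = Q_+^{\mathrm{GB}} = \frac{(\lambda_0 + \lambda_1) M_0^\star}{2 \lambda_1}, \quad Q_-^{\mathrm{GB}} = \frac{H_{0-}^\star}{2 \lambda_1}, \quad r_{\mathrm{PGB}}^2 = r_{\mathrm{GB}}^2 - \|Q_-^{\mathrm{GB}}\|_F^2 = \frac{\|(\lambda_0 - \lambda_1) M_0^\star\|_F^2}{4 \lambda_1^2}.$
Therefore, PGB coincides with RPB.$\square$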
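At the optimal solution $M_0^\star$, $\alpha_0^\star$, and $\Gamma_0^\star$ of $\lambda_0$, we obtain the following equation from $P_{\lambda_0}(M_0^\star) = D_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star)$ and $M_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) = M_0^\star$:
$\sum_{ijl} \ell(\langle M_0^\star, H_{ijl} \rangle) + \sum_{ijl} \ell^*(-\alpha_{0ijl}^\star) = -\lambda_0 \|M_0^\star\|_F^2.$
We also see $M_{\lambda_1}(\alpha_0^\star,\Gamma_0^\star) = \frac{\lambda_0}{\lambda_1} M_{\lambda_0}(\alpha_0^\star,\Gamma_0^\star) = \frac{\lambda_0}{\lambda_1} M_0^\star$. Using these results, the value of the duality gap for $\lambda_1$ is
$P_{\lambda_1}(M_0^\star) - D_{\lambda_1}(\alpha_0^\star,\Gamma_0^\star) = \frac{(\lambda_0 - \lambda_1)^2}{2 \lambda_1} \|M_0^\star\|_F^2.$
Therefore, the radius of DGB $r_{\mathrm{DGB}}$ and the radius of RPB $r_{\mathrm{RPB}}$ satisfy the following relationship:
$r_{\mathrm{DGB}}^2 = \frac{2 (P_{\lambda_1}(M_0^\star) - D_{\lambda_1}(\alpha_0^\star,\Gamma_0^\star))}{\lambda_1} = \frac{(\lambda_0 - \lambda_1)^2}{\lambda_1^2} \|M_0^\star\|_F^2 = 4 r_{\mathrm{RPB}}^2.$
Furthermore, the centers of these hyperspheres are
$Q^{\mathrm{DGB}} = M_0^\star, \quad Q^{\mathrm{RPB}} = \frac{\lambda_0 + \lambda_1}{2 \lambda_1} M_0^\star,$
and the distance between the centers is
$\|Q^{\mathrm{DGB}} - Q^{\mathrm{RPB}}\|_F = \frac{|\lambda_0 - \lambda_1|}{2 \lambda_1} \|M_0^\star\|_F = r_{\mathrm{RPB}}.$
Thus, the DGB includes the RPB as illustrated in Figure 2c.$\square$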
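The inclusion follows from the triangle inequality: any point within $r_{\mathrm{RPB}}$ of the RPB center is within $r_{\mathrm{RPB}} + r_{\mathrm{RPB}} = r_{\mathrm{DGB}}$ of the DGB center. The following small sketch checks the three quantities numerically; the function name is hypothetical, and only $\lambda_0$, $\lambda_1$, and the norm $\|M_0^\star\|_F$ are needed because both centers are scalar multiples of $M_0^\star$.

```python
def rpb_and_dgb(lam0, lam1, m0_norm):
    """Return (r_RPB, r_DGB, distance between centers) for an exact reference M0*."""
    r_rpb = abs(lam0 - lam1) / (2.0 * lam1) * m0_norm
    r_dgb = abs(lam0 - lam1) / lam1 * m0_norm
    # Q_DGB = M0* and Q_RPB = (lam0 + lam1) / (2 lam1) * M0*, so the distance
    # between the centers is |1 - (lam0 + lam1) / (2 lam1)| * ||M0*||_F.
    center_distance = abs(1.0 - (lam0 + lam1) / (2.0 * lam1)) * m0_norm
    return r_rpb, r_dgb, center_distance
```

With `lam0 = 2`, `lam1 = 1`, and `m0_norm = 3`, the center distance plus the RPB radius equals the DGB radius, so the RPB touches the DGB from inside.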
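The Lagrange function is defined as
$L(X,\alpha,\beta) := \langle X, H_{ijl} \rangle - \alpha \frac{1}{2} \left( r^2 - \|X - Q\|_F^2 \right) - \beta \langle P, X \rangle.$
From the KKT condition, we obtain
$\partial L / \partial X = H_{ijl} + \alpha (X - Q) - \beta P = O,$
(K.1a)
$\alpha \ge 0, \quad \beta \ge 0, \quad \|X - Q\|_F^2 \le r^2, \quad \langle P, X \rangle \ge 0,$
(K.1b)
$\alpha \left( r^2 - \|X - Q\|_F^2 \right) = 0, \quad \beta \langle P, X \rangle = 0.$
(K.1c)
If $\alpha = 0$, then $H_{ijl} = \beta P$ from equation K.1a, and the value of the objective function becomes $\langle X, H_{ijl} \rangle = \beta \langle X, P \rangle = 0$ from equation K.1c. Let us consider the case of $\alpha \ne 0$. From equation K.1c, it becomes clear that $\|X - Q\|_F^2 = r^2$. If $\beta = 0$, the linear constraint is not an active constraint (i.e., $\langle P, X \rangle > 0$ at the optimal); hence, it is the same as problem P1, which can be analytically solved. If this solution satisfies the linear constraint $\langle P, X \rangle \ge 0$, it becomes the optimal solution. Next, we consider the case of $\beta \ne 0$. From equations K.1a and K.1c, $\alpha$ and $\beta$ are obtained as
$\alpha = \pm \sqrt{\frac{\|P\|_F^2 \|H_{ijl}\|_F^2 - \langle P, H_{ijl} \rangle^2}{r^2 \|P\|_F^2 - \langle P, Q \rangle^2}}, \quad \beta = \frac{\langle P, H_{ijl} \rangle - \alpha \langle P, Q \rangle}{\|P\|_F^2}.$
Of the two candidate values of $\alpha$, the positive one gives the minimum value because of $\alpha \ge 0$ in equation K.1b.$\square$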

L.1  Generalized Form of GB, DGB, RPB, and RRPB

L.1.1  GB

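With $\Xi := \sum_{ijl} \Xi_{ijl}(M)$ as in equation C.2, the gradient of the primal objective is written as
$\nabla P_\lambda(M) = \Xi + \lambda M.$
Then, the squared norm of this gradient is
$\|\nabla P_\lambda(M)\|_F^2 = \|\Xi\|_F^2 + 2 \lambda \langle \Xi, M \rangle + \lambda^2 \|M\|_F^2.$
By substituting this into the center and the radius of GB, we obtain
$r_{\mathrm{GB}}^2 = \frac{1}{4 \lambda^2} \|\nabla P_\lambda(M)\|_F^2 = \frac{1}{4 \lambda^2} \left( \|\Xi\|_F^2 + 2 \lambda \langle \Xi, M \rangle + \lambda^2 \|M\|_F^2 \right) = \frac{1}{4} \|M\|_F^2 + \frac{1}{2 \lambda} \langle \Xi, M \rangle + \frac{1}{4 \lambda^2} \|\Xi\|_F^2, \quad Q^{\mathrm{GB}} = M - \frac{1}{2 \lambda} (\Xi + \lambda M) = \frac{1}{2} M - \frac{1}{2 \lambda} \Xi.$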

L.1.2  DGB

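The duality gap is written as
$\mathrm{gap} = \sum_{ijl} \left( \ell(\langle M, H_{ijl} \rangle) + \ell^*(-\alpha_{ijl}) \right) + \frac{\lambda}{2} \|M\|_F^2 + \frac{1}{2 \lambda} \left\| \sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma \right\|_F^2.$
Then the center and radius of DGB are
$r_{\mathrm{DGB}}^2 = \frac{2 \, \mathrm{gap}}{\lambda} = \frac{2}{\lambda} \left( \sum_{ijl} \left( \ell(\langle M, H_{ijl} \rangle) + \ell^*(-\alpha_{ijl}) \right) + \frac{\lambda}{2} \|M\|_F^2 + \frac{1}{2 \lambda} \left\| \sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma \right\|_F^2 \right) = \|M\|_F^2 + \frac{2}{\lambda} \sum_{ijl} \left( \ell(\langle M, H_{ijl} \rangle) + \ell^*(-\alpha_{ijl}) \right) + \frac{1}{\lambda^2} \left\| \sum_{ijl} \alpha_{ijl} H_{ijl} + \Gamma \right\|_F^2, \quad Q^{\mathrm{DGB}} = M.$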

L.1.3  RPB

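With respect to RPB, we regard $\lambda_1$ as the target $\lambda$ for which we consider the range. From the definition, we see
$Q^{\mathrm{RPB}} = \frac{\lambda_0 + \lambda}{2 \lambda} M_0^\star = \frac{1}{2} M_0^\star + \frac{\lambda_0}{2 \lambda} M_0^\star, \quad r_{\mathrm{RPB}} = \left\| \frac{\lambda_0 - \lambda}{2 \lambda} M_0^\star \right\|_F = \left\| -\frac{1}{2} M_0^\star + \frac{\lambda_0}{2 \lambda} M_0^\star \right\|_F.$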

L.1.4  RRPB

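Here again, we regard $\lambda_1$ as the target $\lambda$ for which we consider the range. First, we assume $\lambda \le \lambda_0$. Then we have
$Q^{\mathrm{RRPB}} = \frac{\lambda_0 + \lambda}{2 \lambda} M_0 = \frac{1}{2} M_0 + \frac{\lambda_0}{2 \lambda} M_0, \quad r_{\mathrm{RRPB}} = \frac{\lambda_0 - \lambda}{2 \lambda} \|M_0\|_F + \frac{\lambda_0}{\lambda} \varepsilon = -\frac{1}{2} \|M_0\|_F + \frac{1}{\lambda} \left( \frac{\lambda_0}{2} \|M_0\|_F + \lambda_0 \varepsilon \right).$
In the case of $\lambda \ge \lambda_0$, we have
$Q^{\mathrm{RRPB}} = \frac{1}{2} M_0 + \frac{\lambda_0}{2 \lambda} M_0, \quad r_{\mathrm{RRPB}} = \frac{\lambda - \lambda_0}{2 \lambda} \|M_0\|_F + \varepsilon = \varepsilon + \frac{1}{2} \|M_0\|_F - \frac{\lambda_0}{2 \lambda} \|M_0\|_F.$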

L.2  Proof of Theorem 10 (Range-Based Extension of RRPB)

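In RRPB, we replace $\lambda_1$ with $\lambda$ and assume $\lambda \le \lambda_0$. Then,
$Q^{\mathrm{RRPB}} = \frac{\lambda_0 + \lambda}{2 \lambda} M_0, \quad r_{\mathrm{RRPB}} = \frac{\lambda_0 - \lambda}{2 \lambda} \|M_0\|_F + \frac{\lambda_0}{\lambda} \varepsilon.$
From the spherical rule, equation 4.1, we obtain
$\frac{\lambda + \lambda_0}{2 \lambda} \langle H_{ijl}, M_0 \rangle - \left( \frac{\lambda_0 - \lambda}{2 \lambda} \|M_0\|_F + \frac{\lambda_0}{\lambda} \varepsilon \right) \|H_{ijl}\|_F > 1 \Leftrightarrow \underbrace{\left( \langle H_{ijl}, M_0 \rangle - 2 + \|H_{ijl}\|_F \|M_0\|_F \right)}_{> 0 \text{ is required}} \lambda > \lambda_0 \left( \underbrace{\|M_0\|_F \|H_{ijl}\|_F - \langle H_{ijl}, M_0 \rangle}_{\ge 0} + 2 \varepsilon \|H_{ijl}\|_F \right).$
The Cauchy-Schwarz inequality determines that the right-hand side is equal to or greater than 0; therefore, the left-hand side must be greater than 0:
$\therefore \; \lambda_0 \ge \lambda > \frac{\lambda_0 \left( \|M_0\|_F \|H_{ijl}\|_F - \langle H_{ijl}, M_0 \rangle + 2 \varepsilon \|H_{ijl}\|_F \right)}{\langle H_{ijl}, M_0 \rangle - 2 + \|H_{ijl}\|_F \|M_0\|_F}.$
In the case of $\lambda \ge \lambda_0$,
$Q^{\mathrm{RRPB}} = \frac{\lambda_0 + \lambda}{2 \lambda} M_0, \quad r_{\mathrm{RRPB}} = \frac{\lambda - \lambda_0}{2 \lambda} \|M_0\|_F + \varepsilon.$
From spherical rule 4.1,
$\frac{\lambda + \lambda_0}{2 \lambda} \langle H_{ijl}, M_0 \rangle - \left( \frac{\lambda - \lambda_0}{2 \lambda} \|M_0\|_F + \varepsilon \right) \|H_{ijl}\|_F > 1 \Leftrightarrow \left( \underbrace{\|H_{ijl}\|_F \|M_0\|_F - \langle H_{ijl}, M_0 \rangle}_{\ge 0} + 2 + 2 \varepsilon \|H_{ijl}\|_F \right) \lambda < \lambda_0 \left( \|M_0\|_F \|H_{ijl}\|_F + \langle H_{ijl}, M_0 \rangle \right).$
Similarly, the Cauchy-Schwarz inequality determines that the coefficient of $\lambda$ on the left-hand side is greater than 0:
$\therefore \; \lambda_0 \le \lambda < \frac{\lambda_0 \left( \|M_0\|_F \|H_{ijl}\|_F + \langle H_{ijl}, M_0 \rangle \right)}{\|H_{ijl}\|_F \|M_0\|_F - \langle H_{ijl}, M_0 \rangle + 2 + 2 \varepsilon \|H_{ijl}\|_F}.$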

$□$
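The two bounds derived in this proof can be combined into one interval of $\lambda$ and checked against a direct evaluation of the spherical rule. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the scalars `hm` $= \langle H_{ijl}, M_0 \rangle$, `nh` $= \|H_{ijl}\|_F$, `nm` $= \|M_0\|_F$, and `eps` $= \varepsilon$, which are assumed to be precomputed for each triplet.

```python
def rrpb_rule(lam, lam0, hm, nh, nm, eps):
    """Spherical rule <H, Q> - r * ||H||_F > 1 with the RRPB center and radius."""
    center_ip = (lam0 + lam) / (2.0 * lam) * hm
    if lam <= lam0:
        radius = (lam0 - lam) / (2.0 * lam) * nm + lam0 / lam * eps
    else:
        radius = (lam - lam0) / (2.0 * lam) * nm + eps
    return center_ip - radius * nh > 1.0

def rrpb_range(lam0, hm, nh, nm, eps):
    """Open interval (lo, hi) of lambda on which the rule holds, or None if empty."""
    # At lam = lam0 the rule reads hm - eps * nh > 1; if this fails, both
    # one-sided ranges in the proof are empty.
    if hm - eps * nh <= 1.0:
        return None
    lo = lam0 * (nm * nh - hm + 2.0 * eps * nh) / (hm - 2.0 + nh * nm)
    hi = lam0 * (nm * nh + hm) / (nh * nm - hm + 2.0 + 2.0 * eps * nh)
    return lo, hi
```

Once the interval has been computed from a reference solution at `lam0`, the triplet can be removed for every `lam` in `(lo, hi)` without evaluating the rule again, which is the saving that the range-based extension aims for.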

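The spherical rule is written as
$\langle H_{ijl}, Q \rangle - R \|H_{ijl}\|_F > 1 \Rightarrow (i,j,l) \in \mathcal{R}^\star.$
This inequality is equivalent to the following two inequalities:
$\left( \langle H_{ijl}, Q \rangle - 1 \right)^2 > R^2 \|H_{ijl}\|_F^2, \quad \langle H_{ijl}, Q \rangle > 1.$
By using equations 6.1 and 6.3, these inequalities can be transformed into
$\left( \left\langle H_{ijl}, A + B \frac{1}{\lambda} \right\rangle - 1 \right)^2 > \left( \sqrt{a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2}} + \sqrt{\frac{2 \varepsilon}{\lambda}} \right)^2 \|H_{ijl}\|_F^2,$
(M.1)
$\left\langle H_{ijl}, A + B \frac{1}{\lambda} \right\rangle > 1.$
(M.2)
Note that the definitions of $a$, $b$, $c$, $A$, and $B$ for each bound are shown in section L.1. Because inequality M.2 can be written as a linear inequality of $\lambda$, we can easily obtain the range of $\lambda$ that satisfies the inequality. On the other hand, inequality M.1 is equivalent to
$\left( \left\langle H_{ijl}, A + B \frac{1}{\lambda} \right\rangle - 1 \right)^2 > \left( a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2} + \frac{2 \varepsilon}{\lambda} + 2 \sqrt{a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2}} \sqrt{\frac{2 \varepsilon}{\lambda}} \right) \|H_{ijl}\|_F^2 \Leftrightarrow \left( \left\langle H_{ijl}, A + B \frac{1}{\lambda} \right\rangle - 1 \right)^2 - \left( a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2} + \frac{2 \varepsilon}{\lambda} \right) \|H_{ijl}\|_F^2 > 2 \sqrt{a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2}} \sqrt{\frac{2 \varepsilon}{\lambda}} \|H_{ijl}\|_F^2.$
The last inequality can be transformed into the following two inequalities:
$\left( \left( \left\langle H_{ijl}, A + B \frac{1}{\lambda} \right\rangle - 1 \right)^2 - \left( a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2} + \frac{2 \varepsilon}{\lambda} \right) \|H_{ijl}\|_F^2 \right)^2 > 4 \left( a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2} \right) \frac{2 \varepsilon}{\lambda} \|H_{ijl}\|_F^4,$
(M.3)
$\left( \left\langle H_{ijl}, A + B \frac{1}{\lambda} \right\rangle - 1 \right)^2 > \left( a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2} + \frac{2 \varepsilon}{\lambda} \right) \|H_{ijl}\|_F^2.$
(M.4)
After multiplying by $\lambda^2$, inequality M.4 is a quadratic inequality for which we can obtain the range of $\lambda$ that satisfies the inequality. Although inequality M.3 is a fourth-order inequality, the range of $\lambda$ can be calculated by using a fourth-order equation solver. Then we obtain the range of $\lambda$ as the intersection of the ranges derived from equations M.2 to M.4.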

N.1  Proof of Theorem 11

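Rearranging equation 6.4a, we obtain
$\beta_k = 2 \alpha x_k + (h_{ijl,k} - 2 \alpha q_k).$
When we assume $h_{ijl,k} - 2 \alpha q_k > 0$, we see
$h_{ijl,k} - 2 \alpha q_k > 0 \Rightarrow \beta_k = \underbrace{2 \alpha x_k}_{\ge 0} + \underbrace{(h_{ijl,k} - 2 \alpha q_k)}_{> 0} > 0 \Rightarrow x_k = 0.$
The last implication is derived from the complementary condition $\beta_k x_k = 0$. When we assume $h_{ijl,k} - 2 \alpha q_k \le 0$, we have
$h_{ijl,k} - 2 \alpha q_k \le 0 \Rightarrow 2 \alpha x_k - \beta_k = -(h_{ijl,k} - 2 \alpha q_k) \ge 0 \Rightarrow \beta_k = 0 \Rightarrow x_k = q_k - \frac{h_{ijl,k}}{2 \alpha}.$
The third relation, $\beta_k = 0$, is derived from $x_k \ge 0$, $\beta_k \ge 0$, and the complementary condition $\beta_k x_k = 0$, and in the last relation, the assumption $\alpha > 0$ is used. Using the above two derivations, we obtain equation 6.5. Further, from the complementary condition $\alpha (r^2 - \|x - q\|_2^2) = 0$, it is clear that $\|x - q\|_2^2 = r^2$ because of the assumption $\alpha > 0$.$\square$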

N.2  Proof of Theorem 12

Because we assume that $\alpha = 0$, we obtain
$\beta = h_{ijl}$
by using the KKT condition, equation 6.4a. Note that this implicitly indicates that $h_{ijl} \ge 0$ should be satisfied because of the nonnegativity of $\beta$. The complementary condition $x_k \beta_k = 0$ determines that
$x_k = 0 \quad \text{if} \quad h_{ijl,k} > 0.$
(N.1)
To satisfy all the KKT conditions, equation 6.4, we need to set the other $x_k$ in such a way that $\|x - q\|_2^2 \le r^2$ and $x \ge 0$ are satisfied. Note that the other conditions in equation 6.4 are satisfied for any $x$ because of the assumption $\alpha = 0$. By setting
$x_k = \max\{q_k, 0\} \quad \text{for} \quad h_{ijl,k} = 0,$
(N.2)
$\|x - q\|_2^2$ is minimized under the conditions $x \ge 0$ and equation N.1, and thus the condition $\|x - q\|_2^2 \le r^2$ should be satisfied when the optimal $\alpha$ is 0.

N.3  Analytical Procedure of Rule Evaluation for the Diagonal Case

We first verify the case $\alpha = 0$. If the solution given by equations 6.8 and 6.9 satisfies all the KKT conditions, equation 6.4, then the solution is optimal. Otherwise, we consider the case $\alpha > 0$. Let $S := \{k \mid x_k > 0\}$ be the support set of $x$, where $x_k$ is defined by equation 6.5. When $S$ is regarded as a function of $\alpha$, an element of $S$ can change only at a value of $\alpha$ that satisfies
$h_{ijl,k} - 2 \alpha q_k = 0$
for some $k \in [d]$. Let $\alpha_1 \le \alpha_2 \le \cdots \le \alpha_{d'}$ for $d' \le d$ be the sequence of those change points, which can be found by sorting
$\left\{ \frac{h_{ijl,k}}{2 q_k} \;\middle|\; \frac{h_{ijl,k}}{2 q_k} > 0, \; q_k \ne 0, \; k \in [d] \right\}.$
(N.3)
For notational convenience, we define $\alpha_0 := 0$. Based on the definition, $S$ is fixed for any $\alpha$ in an interval $(\alpha_k, \alpha_{k+1})$, to which we refer as $S_k$. This means that the support set of the optimal $x$ should be one of $S_k$ for $k = 1, \ldots, d'$. Algorithm 3 shows an analytical procedure for calculating the optimal $x$, which verifies the optimality of each one of $S_k$ after considering the case of $\alpha = 0$. For each iterative cycle in algorithm 3, $O(d)$ computation is required, and thus the solution can be found in $O(d^2)$ time.

This work was financially supported by grants from the Japanese Ministry of Education, Culture, Sports, Science and Technology awarded to I.T. (16H06538, 17H00758) and M.K. (16H06538, 17H04694); from the Japan Science and Technology Agency (JST) CREST awarded to I.T. (JPMJCR1302, JPMJCR1502) and PRESTO awarded to M.K. (JPMJPR15N2); from the Materials Research by Information Integration Initiative (MI2I) project of the Support Program for Starting Up Innovation Hub from the JST awarded to I.T. and M.K.; and from the RIKEN Center for Advanced Intelligence Project awarded to I.T.

Barzilai
,
J.
, &
Borwein
,
J. M.
(
1988
).
.
IMA Journal of Numerical Analysis
,
8
(
1
),
141
148
.
Bertsekas
,
D. P.
(
1999
).
Nonlinear programming
.
Belmont
:
Athena Scientific
.
Boyd
,
S.
, &
Vandenberghe
,
L.
(
2004
).
Convex optimization
.
Cambridge
:
Cambridge University Press
.
Boyd
,
S.
, &
Xiao
,
L.
(
2005
).
.
SIAM Journal on Matrix Analysis and Applications
,
27
(
2
),
532
546
.
Capitaine
,
H. L.
(
2016
).
Constraint selection in metric learning
.
arXiv:1612.04853
.
Chang
,
C.-C.
, &
Lin
,
C.-J.
(
2011
).
Libsvm: A library for support vector machines
.
ACM Transactions on Intelligent Systems and Technology
,
2
(
3
),
27
.
Chollet
,
F.
et al
, et al. (
2015
).
Keras
. https://github.com/keras-team/keras.
Davis
,
J. V.
,
Kulis
,
B.
,
Jain
,
P.
,
Sra
,
S.
, &
Dhillon
,
I. S.
(
2007
).
Information-theoretic metric learning
. In
Proceedings of the 24th International Conference on Machine Learning
(pp.
209
216
).
New York
:
ACM
.
Fercoq
,
O.
,
Gramfort
,
A.
, &
Salmon
,
J.
(
2015
).
Mind the duality gap: Safer rules for the lasso
. In
Proceedings of the 32nd International Conference on Machine Learning
, (pp.
333
342
).
Ghaoui
,
L. E.
,
Viallon
,
V.
, &
Rabbani
,
T.
(
2010
).
Safe feature elimination for the lasso and sparse supervised learning problems
.
arXiv:1009.4219
.
,
H.
,
Shibagaki
,
A.
,
Sakuma
,
J.
, &
Takeuchi
,
I.
(
2018
).
Efficiently monitoring small data modification effect for large-scale learning in changing environment
. In
Proceedings of the 32nd AAAI Conference on Artificial Intelligence
(pp.
1314
1321
).
Palo Alto, CA
:
AAAI Press
.
Hoffer
,
E.
, &
Ailon
,
N.
(
2015
).
Deep metric learning using triplet network
. In
Proceedings of the International Workshop on Similarity-Based Pattern Recognition
(pp.
84
92
).
Berlin
:
Springer
.
Jain
,
L.
,
Mason
,
B.
, &
Nowak
,
R.
(
2017
). Learning low-dimensional metrics. In
I.
Guyon
,
U. V.
Luxburg
,
S.
Bengio
,
H.
Wallach
,
R.
Fergus
,
S.
Vishwanathan
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
,
30
(pp.
4139
4147
).
Red Hook, NY
:
Curran
.
Jain
,
P.
,
Kulis
,
B.
,
Dhillon
,
I. S.
, &
Grauman
,
K.
(
2009
). Online metric learning and fast similarity search. In
D.
Koller
,
D.
Schuurmans
,
Y.
Bengio
, &
L.
Bottou
(Eds.),
Advances in neural information processing systems
,
21
(pp.
761
768
).
Cambridge, MA
:
MIT Press
.
Jamieson
,
K. G.
, &
Nowak
,
R. D.
(
2011
).
Low-dimensional embedding using adaptively selected ordinal data
. In
Proceedings of the 2011 49th Annual Allerton Conference on Communication, Control, and Computing
(pp.
1077
1084
).
Piscataway, NJ
:
IEEE
.
Kulis
,
B.
(
2013
).
Metric learning: A survey
.
Boston
:
Now Publishers
.
Law, M. T., Thome, N., & Cord, M. (2013). Quadruplet-wise image similarity learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 249–256). Piscataway, NJ: IEEE.
Lee, S., & Xing, E. P. (2014). Screening rules for overlapping group lasso. arXiv:1410.6880.
Lehoucq, R. B., & Sorensen, D. C. (1996). Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM Journal on Matrix Analysis and Applications, 17(4), 789–821.
Li, D., & Tian, Y. (2018). Survey and experimental study on metric learning methods. Neural Networks, 105, 447–462.
Liu, J., Zhao, Z., Wang, J., & Ye, J. (2014). Safe screening with variational inequalities and its application to lasso. In Proceedings of the International Conference on Machine Learning (pp. 289–297).
Malick, J. (2004). A dual approach to semidefinite least-squares problems. SIAM Journal on Matrix Analysis and Applications, 26(1), 272–284.
McFee, B., & Lanckriet, G. R. (2010). Metric learning to rank. In Proceedings of the 27th International Conference on Machine Learning (pp. 775–782). Madison, WI: Omnipress.
Nakagawa, K., Suzumura, S., Karasuyama, M., Tsuda, K., & Takeuchi, I. (2016). Safe pattern pruning: An efficient approach for predictive pattern mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1785–1794). New York: ACM.
Ndiaye, E., Fercoq, O., Gramfort, A., & Salmon, J. (2016). Gap safe screening rules for sparse-group lasso. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 388–396). Red Hook, NY: Curran.
Ogawa, K., Suzuki, Y., & Takeuchi, I. (2013). Safe screening of non-support vectors in pathwise SVM computation. In Proceedings of the 30th International Conference on Machine Learning (pp. 1382–1390).
Okumura, S., Suzuki, Y., & Takeuchi, I. (2015). Quick sensitivity analysis for incremental data modification and its application to leave-one-out CV in linear classification problems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 885–894). New York: ACM.
Perrot, M., & Habrard, A. (2015). Regressive virtual metric learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1810–1818). Red Hook, NY: Curran.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815–823). Piscataway, NJ: IEEE.
Schultz, M., & Joachims, T. (2004). Learning a distance metric from relative comparisons. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 41–48). Cambridge, MA: MIT Press.
Shen, C., Kim, J., Liu, F., Wang, L., & Van Den Hengel, A. (2014). Efficient dual approach to distance metric learning. IEEE Transactions on Neural Networks and Learning Systems, 25(2), 394–406.
Shi, Y., Bellet, A., & Sha, F. (2014). Sparse compositional metric learning. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
Shibagaki, A., Karasuyama, M., Hatano, K., & Takeuchi, I. (2016). Simultaneous safe screening of features and samples in doubly sparse modeling. In Proceedings of the 33rd International Conference on Machine Learning (pp. 1577–1586).
Shibagaki, A., Suzuki, Y., Karasuyama, M., & Takeuchi, I. (2015). Regularization path of cross-validation error lower bounds. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1675–1683). Red Hook, NY: Curran.
Takada, T., Hanada, H., Yamada, Y., Sakuma, J., & Takeuchi, I. (2016). Secure approximation guarantee for cryptographically private empirical risk minimization. In Proceedings of the 8th Asian Conference on Machine Learning (pp. 126–141).
Wang, J., Wonka, P., & Ye, J. (2014). Scaling SVM and least absolute deviations via exact data reduction. In Proceedings of the International Conference on Machine Learning (pp. 523–531).
Wang, J., Zhou, J., Wonka, P., & Ye, J. (2013). Lasso screening rules via dual polytope projection. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 1070–1078). Red Hook, NY: Curran.
Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, 207–244.
Xiang, Z. J., Wang, Y., & Ramadge, P. J. (2017). Screening tests for lasso problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5), 1008–1027.
Xing, E. P., Jordan, M. I., Russell, S. J., & Ng, A. Y. (2003). Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 521–528). Cambridge, MA: MIT Press.
Yang, H. (1993). Conjugate gradient methods for the Rayleigh quotient minimization of generalized eigenvalue problems. Computing, 51(1), 79–94.
Zhang, W., Hong, B., Liu, W., Ye, J., Cai, D., He, X., & Wang, J. (2016). Scaling up sparse support vector machines by simultaneous feature and sample reduction. arXiv:1607.06996.
Zhou, Q., & Zhao, Q. (2015). Safe subspace screening for nuclear norm regularized least squares problems. In Proceedings of the International Conference on Machine Learning (pp. 1103–1112).
Zimmert, J., de Witt, C. S., Kerg, G., & Kloft, M. (2015). Safe screening for support vector machines. In NIPS 2015 workshop on optimization in machine learning. Red Hook, NY: Curran.