## Abstract

Distance metric learning has been widely used to obtain the optimal distance function based on the given training data. We focus on a triplet-based loss function, which imposes a penalty such that a pair of instances in the same class is closer than a pair in different classes. However, the number of possible triplets can be quite large even for a small data set, and this considerably increases the computational cost for metric optimization. In this letter, we propose safe triplet screening that identifies triplets that can be safely removed from the optimization problem without losing the optimality. In comparison with existing safe screening studies, triplet screening is particularly significant because of the huge number of possible triplets and the semidefinite constraint in the optimization problem. We demonstrate and verify the effectiveness of our screening rules by using several benchmark data sets.

## 1 Introduction

Using an appropriate distance function is essential for various machine learning tasks. For example, the performance of a $k$-nearest neighbor ($k$-NN) classifier, one of the most standard classification methods, depends crucially on the distance between different input instances. The simple Euclidean distance is usually employed, but it is not necessarily optimal for a given data set and task. Thus, the adaptive optimization of the distance metric based on supervised information is expected to improve the performance of machine learning methods including $k$-NN.

However, the set of triplets $T$ is quite large even for a small data set. For example, in a two-class problem with 100 instances in each class, the number of possible triplets is 1,980,000. Because processing a huge number of triplets is computationally prohibitive, a small subset is often used in practice (Weinberger & Saul, 2009; Shi, Bellet, & Sha, 2014; Capitaine, 2016). Typically, a subset of triplets is selected by using the neighbors of each training instance. For $n$ training instances, Shi et al. (2014) selected only $30n$ triplets, and Weinberger and Saul (2009) selected at most $O(kn^2)$ triplets, where $k$ is a prespecified constant. However, the effect of these heuristic selections on the final accuracy is difficult to know beforehand. Jain, Mason, and Nowak (2017) theoretically analyzed a probabilistic generalization error bound for a random subsampling strategy of triplets. Their analysis revealed the sample complexity of metric learning, but the tightness of the bound is not clear, and they did not demonstrate its practical use for determining the required number of triplets. For ordinal data embedding, Jamieson and Nowak (2011) showed that $\Omega(dn \log n)$ triplets are required to determine the embedding, but the tightness of this bound is also not known. Further, the applicability of the analysis to metric learning was not clarified.
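The triplet count in the example above can be verified with a short calculation. The helper below is illustrative (not part of the paper's method): a triplet $(i,j,l)$ pairs an anchor $i$ with a same-class instance $j \ne i$ and a different-class instance $l$.

```python
def count_triplets(class_sizes):
    """Count triplets (i, j, l): j shares i's class (j != i), l is in a different class."""
    total = sum(class_sizes)
    count = 0
    for n_c in class_sizes:
        # n_c anchors, (n_c - 1) same-class partners, (total - n_c) impostors
        count += n_c * (n_c - 1) * (total - n_c)
    return count

# Two classes with 100 instances each, as in the text:
print(count_triplets([100, 100]))  # 1980000
```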

Our safe triplet screening enables the identification of triplets that can be safely removed from the optimization problem without losing the optimality of the resulting metric. This means that our approach can accelerate the optimization of time-consuming metric learning with the guarantee of optimality. Figure 1 shows a schematic illustration of safe triplet screening.

Our approach is inspired by the safe feature screening of Lasso (Ghaoui, Viallon, & Rabbani, 2010), in which unnecessary features are identified by the following procedure:

Step 1: Construct a bounded region in which the optimal dual solution is guaranteed to exist.

Step 2: Given the bound created by step 1, remove features that cannot be selected by Lasso.

This procedure is useful to mitigate the optimization difficulty of Lasso for high-dimensional problems; thus, many papers propose a variety of approaches to creating bounded regions, seeking a tighter bound that increases screening performance (Wang, Zhou, Wonka, & Ye, 2013; Liu, Zhao, Wang, & Ye, 2014; Fercoq, Gramfort, & Salmon, 2015; Xiang, Wang, & Ramadge, 2017). As another direction of research, the screening idea was applied to other learning methods, including non-support vector screening for support vector machines (Ogawa, Suzuki, & Takeuchi, 2013), subspace screening for nuclear norm regularization (Zhou & Zhao, 2015), and group screening for group Lasso (Ndiaye, Fercoq, Gramfort, & Salmon, 2016).

Based on the safe feature screening techniques, we build the procedure of our safe triplet screening as follows:

Step 1: Construct a bounded region in which the optimal solution $M^\star$ is guaranteed to exist.

Step 2: For each triplet $(i,j,l) \in T$, verify the possible loss function value under the condition created by step 1.

We show that as a result of step 2, we can reduce the size of the metric learning optimization problem, by which the computational cost of the optimization can be drastically reduced. Although a variety of extensions of safe screening have been studied in the machine learning community (Lee & Xing, 2014; Wang, Wonka, & Ye, 2014; Zimmert, de Witt, Kerg, & Kloft, 2015; Zhang et al., 2016; Ogawa et al., 2013; Okumura, Suzuki, & Takeuchi, 2015; Shibagaki, Karasuyama, Hatano, & Takeuchi, 2016; Shibagaki, Suzuki, Karasuyama, & Takeuchi, 2015; Nakagawa, Suzumura, Karasuyama, Tsuda, & Takeuchi, 2016; Takada, Hanada, Yamada, Sakuma, & Takeuchi, 2016; Hanada, Shibagaki, Sakuma, & Takeuchi, 2018), to the best of our knowledge, no studies have considered screening for metric learning. Compared with existing studies, our safe triplet screening is particularly significant due to the huge number of possible triplets and the semidefinite constraint. Our technical contributions are summarized as follows:
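The two-step procedure above can be sketched in code. This is a schematic outline only: it assumes the Cauchy-Schwarz-style spherical rule of section 4.1, a sphere $(Q, r)$ from step 1, and a hinge-type loss with smoothing parameter $\gamma$ whose zero region starts at margin 1 and whose linear region starts at $1 - \gamma$; all names are illustrative, not the paper's API.

```python
import numpy as np

def screen_triplets(H_list, Q, r, gamma):
    """Step 2 sketch: given a sphere (center Q, radius r) guaranteed to
    contain the optimal M, fix triplets whose loss region is certain."""
    keep, sure_zero, sure_linear = [], [], []
    for H in H_list:  # each H_ijl given as a d x d symmetric matrix
        center = np.sum(H * Q)                # <H_ijl, Q> at the sphere center
        dev = r * np.linalg.norm(H)           # max deviation over the sphere
        if center - dev > 1:
            sure_zero.append(H)               # loss surely 0: safely removable
        elif center + dev < 1 - gamma:
            sure_linear.append(H)             # loss surely linear: precomputable
        else:
            keep.append(H)                    # undetermined: keep in the problem
    return keep, sure_zero, sure_linear
```

Only the triplets in `keep` remain in the optimization; the other two groups correspond to the reduced-size problem of section 2.3.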

We derive six spherical regions in which the optimal $M^\star$ must lie and analyze their relationships.

We derive three types of screening rules, each of which employs a different approach to the semidefinite constraint.

We derive efficient rule evaluation for a special case when $M$ is a diagonal matrix.

We build an extension for the regularization path calculation.

We further demonstrate the effectiveness of our approach based on several benchmark data sets with a huge number of triplets.

This letter is organized as follows. In section 2, we define the optimization problem of large-margin metric learning. In section 3, we first derive six bounds containing the optimal $M^\star$ for the subsequent screening procedure. Section 4 derives the rules and constructs our safe triplet screening. The computational cost of the rule evaluation is analyzed in section 5. Extensions are discussed in section 6, in which an algorithm specifically designed for the regularization path calculation, and a special case, in which $M$ is a diagonal matrix, are considered. In section 7, we present the evaluation of our approach through numerical experiments. Section 8 concludes.

### 1.1 Notation

## 2 Preliminary

### 2.1 Triplet-Based Loss Function

### 2.2 Primal and Dual Formulation of Triplet-Based Distance Metric Learning

The nonlinear semidefinite programming problem of RTLM can be solved by gradient methods, including primal-based (Weinberger & Saul, 2009) and dual-based approaches (Shen, Kim, Liu, Wang, & Van Den Hengel, 2014). However, the amount of computation may be prohibitive because of the large number of triplets. The naive calculation of the objective function requires $O(d^2|T|)$ computations for both the primal and the dual cases.
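To make the $O(d^2|T|)$ cost concrete, a naive primal evaluation might look like the following sketch. The $\gamma$-smoothed hinge form used here is a common choice assumed for illustration (zero region for margin $\ge 1$, linear region below $1-\gamma$), not taken verbatim from the paper; the regularizer is a simple squared Frobenius norm.

```python
import numpy as np

def naive_primal(M, X, triplets, lam, gamma=0.1):
    """Naive objective sketch: each triplet costs O(d^2) via two quadratic
    forms, giving O(d^2 |T|) overall, plus Frobenius regularization."""
    total = 0.0
    for (i, j, l) in triplets:
        dl, dj = X[i] - X[l], X[i] - X[j]
        m = dl @ M @ dl - dj @ M @ dj         # margin <H_ijl, M>
        if m >= 1:
            continue                          # zero region
        elif m <= 1 - gamma:
            total += 1 - gamma / 2 - m        # linear region
        else:
            total += (1 - m) ** 2 / (2 * gamma)  # smoothed transition
    return total + lam * np.sum(M * M)
```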

### 2.3 Reduced-Size Optimization Problem

The loss term for $\hat{R}$ is removed because it does not produce any penalty at the optimal solution.

The loss term for $\hat{L}$ is fixed at the linear part of the loss function, by which the sum over triplets can be calculated beforehand (the last two terms).

## 3 Spherical Bound

As we will see, our safe triplet screening is derived by using a spherical region that contains the optimal $M^\star$. In this section, we show that six variants of the region are created by three different types of approaches. Note that the proofs of all the theorems appear in the appendixes.

### 3.1 Gradient Bound

We first introduce a hypersphere, which we name gradient bound (GB), because the center and radius of the hypersphere are represented by the subgradient of the objective function:

### 3.2 Projected Gradient Bound

Even when we substitute the optimal $M^\star$ into the reference solution $M$, the radius of the GB is not guaranteed to be 0. By projecting the center of the GB onto the feasible region (i.e., the semidefinite cone), another GB-based hypersphere can be derived whose radius converges to 0 at the optimum. We refer to this extension as the projected gradient bound (PGB); a schematic illustration is shown in Figure 2a. In Figure 2a, the center of the GB, $Q_{GB}$ (an abbreviation of $Q_{GB}(M)$), is projected onto the semidefinite cone, which becomes the center of the PGB, $Q_{GB}^+$. The sphere of the PGB can be written as

The proof is in appendix D. The PGB contains the projections onto the positive and the negative semidefinite cones in its center and radius, respectively. These projections require the eigenvalue decomposition of $M - \frac{1}{2\lambda} \nabla P_\lambda(M)$. This decomposition, however, needs to be performed only once to evaluate the screening rules of all the triplets. In the standard optimization procedures for RTLM, including Weinberger and Saul (2009), the eigenvalue decomposition of a $d \times d$ matrix is calculated in every iterative cycle; thus, the computational complexity is not increased by the PGB.
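The projection used to form the PGB center is the standard Euclidean projection of a symmetric matrix onto the positive semidefinite cone, which clips negative eigenvalues to zero. A minimal sketch:

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the positive semidefinite cone by
    zeroing out its negative eigenvalues (one O(d^3) eigen-decomposition)."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

A = np.array([[1.0,  0.0],
              [0.0, -2.0]])
print(project_psd(A))  # [[1. 0.] [0. 0.]]
```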

The following theorem shows a superior convergence property of PGB compared to GB:

There exists a subgradient $\nabla P_\lambda(M^\star)$ such that the radius of PGB is 0.

For the hinge loss, which is not differentiable at the kink, the optimal dual variables provide subgradients that set the radius equal to 0. This theorem is an immediate consequence of the proof in appendix I, which is the proof for the relation between PGB and the other bound derived in section 3.4.

From Figure 2a, we see that the half space $\langle -Q_{GB}^-, X \rangle \ge 0$, where $Q_{GB}^- = Q_{GB} - Q_{GB}^+$, can be used as a linear relaxation of the semidefinite constraint for the linear constraint rule in section 4.3. Interestingly, the GB with this linear constraint is tighter than the PGB. This is proved in appendix D, which gives the proof of the PGB.

### 3.3 Duality Gap Bound

In this section, we describe the duality gap bound (DGB) in which the radius is represented by the duality gap:

The proof is in appendix E. Because the radius is proportional to the square root of the duality gap, the DGB obviously converges to 0 at the optimal solution (see Figure 2b). The DGB, unlike the previous bounds, requires a dual feasible solution. This means that when a primal-based optimization algorithm is employed, we need to create a dual feasible solution from the primal feasible solution. A simple way to create a dual feasible solution is to substitute the current $M$ into $M^\star$ of equation 2.6. When a dual-based optimization algorithm is employed, a primal feasible solution can be created by equation 2.4.

For the DGB, we can derive a tighter bound, the constrained duality gap bound (CDGB), with an additional constraint. However, except for a special case (dynamic screening with a dual solver), additional transformation of the reference solution is necessary, which can deteriorate the duality gap. See appendix F for further details.

### 3.4 Regularization Path Bound

In Wang et al. (2014), a hypersphere is proposed specifically for the regularization path, in which the optimization problem must be solved for a sequence of $\lambda$s. Suppose that $\lambda_0$ has already been optimized and $\lambda_1$ must be optimized next. Then the same approach as Wang et al. (2014) is applicable to our RTLM, which derives a bound that depends on the optimal solution for $\lambda_0$ as a reference solution:

The proof is in appendix G. We refer to this bound as the regularization path bound (RPB).

The RPB requires the theoretically optimal solution $M_0^\star$, which is numerically unattainable. Furthermore, because the reference solution is fixed at $M_0^\star$, the RPB can be applied only once for a specific pair of $\lambda_0$ and $\lambda_1$, even if the optimal $M_0^\star$ is available. The other bounds can be applied multiple times during the optimization by regarding the current approximate solution as a reference solution.

### 3.5 Relaxed Regularization Path Bound

The proof is in appendix H. The intuition behind the RRPB is shown in Figure 2d, in which the approximation error for the center of the RPB is depicted. In the theorem, the RRPB also considers the error in the radius, although it is not illustrated in the figure for simplicity. To the best of our knowledge, this approach has not been introduced in other existing screening studies.

In practice, the required optimality level $\epsilon$ can be determined by the duality gap bound (DGB), as in equation 3.2.

### 3.6 Analytical Relation between Bounds

The following theorem describes the relation between PGB and RPB:

(Relation between PGB and RPB). Suppose that the optimal solution $M_0^\star$ for $\lambda_0$ is substituted into the reference solution $M$ of the PGB. Then there exists a subgradient $\nabla P_{\lambda_1}(M_0^\star)$ by which the PGB and RPB provide the same center and radius for $M_1^\star$.

The proof is presented in appendix I. The following theorem describes the relation between the DGB and RPB:

(Relation between DGB and RPB). Suppose that the optimal solutions $M_0^\star$, $\alpha_0^\star$, and $\Gamma_0^\star$ for $\lambda_0$ are substituted into the reference solutions $M$, $\alpha$, and $\Gamma$ of the DGB. Then the radii of the DGB and RPB for $\lambda_1$ satisfy $r_{DGB} = 2 r_{RPB}$, and the hypersphere of the RPB is included in the hypersphere of the DGB.

The proof is in appendix J. Figure 2c illustrates the relation between the DGB and RPB, which shows the theoretical advantage of the RPB for the regularization path setting.

Using the analytical results obtained thus far, we summarize the relative relations between the bounds as follows. First, we consider the case in which the reference solution is optimal for $\lambda_0$ in the regularization path calculation. We obviously see $r_{GB} \ge r_{PGB}$ from Figure 2a, and from theorems 8 and 9, we see ${\rm DGB} \supseteq {\rm PGB} = {\rm RPB} = {\rm RRPB}$. When the reference solution is an approximate solution in the regularization path calculation, we see only $r_{GB} \ge r_{PGB}$. For dynamic screening, in which the reference solution is always an approximate solution, we see $r_{GB} \ge r_{PGB}$, and we also see ${\rm RRPB} = {\rm DGB}$ when $\epsilon$ is determined by the DGB, as written in equation 3.2.

Other properties of the bounds are summarized in Table 1. Although DGB and RRPB (RPB + DGB) have the same properties, our empirical evaluation in section 7.2 shows that RRPB often outperforms DGB in the regularization path calculation. (Note that although CDGB also has the same properties as the above two methods, we omit it in the empirical evaluation because of its practical limitation, as we see in section 3.3.)

| | Radius Convergence | Dynamic Screening | Reference Solution | Exact Optimality of Reference |
|---|---|---|---|---|
| GB | Can be $>0$ | Applicable | Primal | Not necessary |
| PGB | $=0^a$ | Applicable | Primal | Not necessary |
| DGB | $=0$ | Applicable | Primal/dual | Not necessary |
| CDGB | $=0$ | Applicable | Primal/dual | Not necessary |
| RPB | NA | Not applicable | Primal | Necessary |
| RRPB (RPB + DGB) | $=0$ | Applicable | Primal/dual | Not necessary |


Note: Radius convergence indicates the radius when the reference solution is the optimal solution.

$^a$For the hinge loss ($\gamma = 0$) case, a subgradient must be selected appropriately to achieve this convergence.

## 4 Safe Rules for Triplets

Our safe triplet screening can reduce the number of triplets by identifying parts of $L^\star$ and $R^\star$ before solving the optimization problem, based on the following procedure:

Step 1: Identify the spherical region in which the optimal solution $M^\star$ lies, based on the current feasible solution, which we refer to as the reference solution.

Step 2: For each triplet $(i,j,l) \in T$, verify the possibility of $(i,j,l) \in L^\star$ or $(i,j,l) \in R^\star$ under the condition that $M^\star$ is in the region.

In section 3, we showed that there exists a variety of approaches to creating the spherical region for step 1. In this section, we describe the procedure of step 2 given the spherical region.

### 4.1 Spherical Rule

### 4.2 Spherical Rule with a Semidefinite Constraint

Based on the connection shown above, the rule evaluation, equation R2, with the semidefinite constraint is summarized as follows:

Select an arbitrary feasible solution $X_0 \in B_{PSD}$. If $\langle X_0, H_{ijl} \rangle \le 1$, we immediately see that the condition of equation R2 is not satisfied for the triplet $(i,j,l)$. Otherwise, go to the next step. Note that in this case, assumption 4.2 is confirmed because $\langle X_0, H_{ijl} \rangle > 1$.

Solve SDLS. If the optimal value satisfies $\|X^\star - Q\|_F^2 > r^2$, the triplet $(i,j,l)$ is guaranteed to be in $R^\star$.

Although the computation of $[Q + yH_{ijl}]_+$ requires an eigenvalue decomposition, this computational requirement can be alleviated when the center $Q$ of the hypersphere is positive semidefinite. By definition, $H_{ijl}$ has at most one negative eigenvalue, and then $Q + yH_{ijl}$ also has at most one negative eigenvalue. Let $\lambda_{\min}$ be the negative (minimum) eigenvalue of $Q + yH_{ijl}$, and $q_{\min}$ be the corresponding eigenvector. The projection can then be expressed as $[Q + yH_{ijl}]_+ = (Q + yH_{ijl}) - \lambda_{\min} q_{\min} q_{\min}^\top$. Computing the minimum eigenvalue and its eigenvector is much cheaper than a full eigenvalue decomposition (Lehoucq & Sorensen, 1996).
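The rank-one form of the projection can be checked numerically. The sketch below compares it against full eigenvalue clipping on a matrix constructed to have exactly one negative eigenvalue; in practice, the minimum eigenpair would come from an iterative solver (e.g., Lanczos) rather than a full decomposition.

```python
import numpy as np

def project_rank1(A):
    """PSD projection when A has at most one negative eigenvalue:
    [A]_+ = A - lambda_min * q_min q_min^T."""
    w, V = np.linalg.eigh(A)          # eigenvalues in ascending order
    lam_min, q_min = w[0], V[:, 0]
    if lam_min >= 0:
        return A                      # already positive semidefinite
    return A - lam_min * np.outer(q_min, q_min)

def project_full(A):
    """Reference projection: clip all negative eigenvalues to zero."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)) @ V.T

# Build a symmetric matrix with eigenvalues (-1, 0.5, 2): one negative.
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3))
_, V = np.linalg.eigh(S + S.T)                    # orthonormal eigenbasis
A = (V * np.array([-1.0, 0.5, 2.0])) @ V.T
assert np.allclose(project_rank1(A), project_full(A))
```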

### 4.3 Spherical Rule with Linear Constraint

A necessary condition for performing our screening is that the loss function has at least one linear region or zero region. For example, the logistic loss cannot be used for screening because it has neither a linear nor a zero region.

## 5 Computations

Algorithm 1 shows the detailed procedure of our safe screening with simple fixed-step-size gradient descent. (Note that any other optimization algorithm can be combined with our screening procedure.) In the algorithm, the screening rules are evaluated every $\mathit{freq}$ iterations of the gradient descent, using the current solution $M$ as the reference solution. As the quality of the approximate solution $M$ improves, more triplets can be removed from $T$. Thus, the quality of the initial solution affects the efficiency. In the case of the regularization path calculation, in which RTLM is solved for a sequence of $\lambda$s, a reasonable initial solution is the approximate solution for the previous $\lambda$. We discuss a further extension specific to the regularization path calculation in section 6.1.

Considering the computational cost of the screening procedure of algorithm 1, the rule evaluation (step 2) described in section 4 is often dominant, because the rule needs to be evaluated for each one of the triplets. The sphere, constructed in step 1, can be fixed during the screening procedure as long as the reference solution is fixed.

To evaluate the spherical rule, equation 4.1, given the center $Q$ and the radius $r$, the inner product $\langle H_{ijl}, Q \rangle$ and the norm $\|H_{ijl}\|_F$ need to be evaluated. The inner product $\langle H_{ijl}, Q \rangle$ can be calculated in $O(d^2)$ operations because it is expanded as a difference of quadratic forms: $\langle H_{ijl}, Q \rangle = (x_i - x_l)^\top Q (x_i - x_l) - (x_i - x_j)^\top Q (x_i - x_j)$. Further, we can reuse this term from the objective function $P_\lambda(M)$ calculation in the case of the DGB, RPB, and RRPB. The norm $\|H_{ijl}\|_F$ can be calculated in $O(d)$ operations, and it is constant throughout the optimization process. Thus, for the DGB, RPB, or RRPB, it is possible to reduce the additional computational cost of the spherical rule for $(i,j,l)$ to $O(1)$ by calculating $\|H_{ijl}\|_F$ beforehand. The computational cost of the spherical rule with the semidefinite constraint (see section 4.2) is that of the SDLS algorithm, which needs $O(d^3)$ operations for the eigenvalue decomposition in every iterative cycle and may therefore considerably increase the computational cost. The computational cost of the spherical rule with the linear constraint (see section 4.3) is $O(d^2)$.
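The quadratic-form identity above can be verified directly. The snippet below checks the $O(d^2)$ evaluation against the explicit $H_{ijl} = (x_i - x_l)(x_i - x_l)^\top - (x_i - x_j)(x_i - x_j)^\top$ on random data.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
xi, xj, xl = rng.standard_normal((3, d))
Q = rng.standard_normal((d, d))
Q = Q + Q.T                                # symmetric sphere center

# Explicit H_ijl, then the naive O(d^2) Frobenius inner product ...
H = np.outer(xi - xl, xi - xl) - np.outer(xi - xj, xi - xj)
lhs = np.sum(H * Q)

# ... versus the difference of two quadratic forms (no H materialized)
rhs = (xi - xl) @ Q @ (xi - xl) - (xi - xj) @ Q @ (xi - xj)
assert np.isclose(lhs, rhs)
```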

## 6 Extensions

### 6.1 Range-Based Extension of Triplet Screening

The screening rules presented in section 4 relate to the problem for a fixed $\lambda$. In this section, we regard a screening rule as a function of $\lambda$ to derive a range of $\lambda$s in which the screening rule is guaranteed to be satisfied. This is particularly useful for calculating the regularization path, for which we need to optimize the metric for a sequence of $\lambda$s. If a screening rule is satisfied for a triplet $(i,j,l)$ over a range $(\lambda_a, \lambda_b)$, we can fix the triplet $(i,j,l)$ in $\hat{L}$ or $\hat{R}$ as long as $\lambda$ is in $(\lambda_a, \lambda_b)$, without recomputing the screening rules.

#### 6.1.1 Deriving the Range

Refer to section L.2 for the proof. The computational procedure for range-based screening is shown in algorithm 2.

#### 6.1.2 Consideration for Range Extension with Other Bounds

As shown in equation 3.1, the RRPB is based on the optimality $\epsilon$ for the current $\lambda_0$ and does not depend on the optimality for $\lambda_1$, which is regarded as $\lambda$ in the general form of equations 6.1 and 6.2. Because of this property, the RRPB is particularly suitable for range-based screening among the spheres we have derived thus far. To calculate $\epsilon$ in equation 3.2 for the RRPB, the duality gap $P_{\lambda_0}(M_0) - D_{\lambda_0}(\alpha_0, \Gamma_0)$ is required. Instead of the original gap, we can use the reduced-size problems, $\tilde{P}_{\lambda_0}(M_0) - \tilde{D}_{\lambda_0}(\alpha_0, \Gamma_0)$, for efficient computation, where $\tilde{D}_{\lambda_0}$ is the dual objective in which $\alpha_i = 0$ for $i \in \hat{R}$ and $\alpha_i = 1$ for $i \in \hat{L}$ are fixed. Because the reduced-size problem shares exactly the same optimal solution with the original problem, this gap also provides a valid bound. As a result, we can avoid computing the sum over all triplets in $T$ (e.g., to calculate the loss term in the original primal) when evaluating a bound.

### 6.2 Screening with Diagonal Constraint

When the matrix $M$ is constrained to be diagonal, metric learning reduces to feature weighting, in which the Mahalanobis distance, equation 2.1, simply weights each feature without combining different dimensions. Although correlation between different dimensions is not considered, this simpler formulation is useful to avoid a large computational cost for high-dimensional data, mainly for the following two reasons:

The number of variables in the optimization decreases from $d^2$ to $d$.

The semidefinite constraint for a diagonal matrix reduces to a nonnegativity constraint on the diagonal elements.

Both properties are also beneficial for efficient screening rule evaluation; in particular, the second property makes the screening rule with the semidefinite constraint easier to evaluate.
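In particular, for a diagonal $M = \mathrm{diag}(w)$, the projection onto the feasible set drops from an $O(d^3)$ eigenvalue decomposition to elementwise clipping. A minimal illustration:

```python
import numpy as np

# For diagonal M = diag(w), M is PSD iff w >= 0 elementwise, so the
# projection onto the feasible set is a simple O(d) clipping.
w = np.array([0.3, -1.2, 0.0, 2.5])
w_proj = np.maximum(w, 0.0)
print(w_proj)  # [0.3 0.  0.  2.5]
```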

For $\alpha =0$, the following theorem is obtained.

Based on the theorems, the optimal solution of equation P4 can be calculated analytically. The details of the procedure are shown in section N.3; it requires $O(d^2)$ computations. Although this procedure obtains the solution in a fixed number of analytical steps, for larger values of $d$, iterative optimization algorithms can be faster. For example, we can apply the SDLS dual ascent to problem P4, in which each iterative step takes $O(d)$.

### 6.3 Applicability to More General Formulation

Note that pairwise-, triplet-, and quadruplet-based loss functions can be used simultaneously, and safe screening can be applied to remove any of those loss terms.

## 7 Experiment

| Data set | #dimension | #sample | #classes | $k$ | #triplet | $\lambda_{\max}$ | $\lambda_{\min}$ |
|---|---|---|---|---|---|---|---|
| Iris | 4 | 150 | 3 | $\infty$ | 546,668 | 1.3e+7 | 2.3e+1 |
| Wine | 13 | 178 | 3 | $\infty$ | 910,224 | 2.0e+7 | 5.1e+1 |
| Segment | 19 | 2,310 | 7 | 20 | 832,000 | 2.5e+6 | 4.2e+0 |
| Satimage | 36 | 4,435 | 6 | 15 | 898,200 | 1.0e+7 | 8.8e+0 |
| Phishing | 68 | 11,055 | 2 | 7 | 487,550 | 5.0e+3 | 2.0e-1 |
| SensIT Vehicle | 100 | 78,823 | 3 | 3 | 638,469 | 1.0e+4 | 2.9e+0 |
| a9a | 16$^a$ | 32,561 | 2 | 5 | 732,625 | 1.2e+5 | 3.1e+2 |
| Mnist | 32$^a$ | 60,000 | 10 | 5 | 1,350,025 | 7.0e+3 | 9.6e-1 |
| Cifar10 | 200$^a$ | 50,000 | 10 | 2 | 180,004 | 2.0e+3 | 3.3e+1 |
| Rcv1.multiclass | 200$^b$ | 15,564 | 53 | 3 | 126,018 | 3.0e+2 | 6.0e-4 |


Note: #triplet and $\lambda_{\min}$ are average values over the random subsampling trials.

$^a$The dimension was reduced by an autoencoder.

$^b$The dimension was reduced by PCA.

### 7.1 Comparing Rules

We first validate the screening performance (screening rate and CPU time) of each screening rule introduced in section 4 by using algorithm 2 without the range-based screening process. Here, the screening rate is defined as (#screened triplets)$\,/\,|\{(i,j,l) \in T \mid \langle H_{ijl}, \hat{M} \rangle > 1 \text{ or } \langle H_{ijl}, \hat{M} \rangle < 1 - \gamma\}|$, where $\hat{M}$ is the solution after convergence.
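As a concrete reading of this definition, the rate can be computed from the margins $\langle H_{ijl}, \hat{M} \rangle$ at the converged solution; the helper below is illustrative.

```python
import numpy as np

def screening_rate(n_screened, margins, gamma):
    """Fraction of screenable triplets (zero or linear loss region at the
    converged solution M_hat) that were actually screened."""
    screenable = np.sum((margins > 1) | (margins < 1 - gamma))
    return n_screened / screenable

margins = np.array([1.5, 0.2, 0.95, 2.0])     # <H_ijl, M_hat> for four triplets
print(screening_rate(3, margins, gamma=0.1))  # 3 screenable -> 1.0
```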

#### 7.1.1 GB-Based Rules

Here we use the GB and PGB as spheres, and we observe the effect of the semidefinite constraint in the rules. As a representative result, Figure 6a compares the performance of the rules by using segment data.

First, except for the GB, the rules maintain a high screening rate over the entire regularization path, as shown in the top left plot. Note that this rate is only for regularization path screening, meaning that dynamic screening can further increase the screening rate during the optimization, as discussed in section 7.1.2. The bottom left plot of the same figure shows that PGB and GB+Linear are the most efficient, achieving CPU times approximately 2 to 10 times faster than the naive optimization. The screening rate of the GB was severely reduced along the latter half of the regularization path. As illustrated in Figure 2a, the center of the GB can lie outside the semidefinite cone, by which the sphere of the GB contains a larger proportion of the region violating the constraint $M \succeq O$, compared with spheres whose centers lie inside the semidefinite cone. This causes performance deterioration particularly for smaller values of $\lambda$, because the minimum of the loss term is usually outside the semidefinite cone.

The screening rates of GB+Linear and GB+Semidefinite are slightly higher than that of the PGB (the plot on the right), which can be seen from their geometrical relation illustrated in Figure 2a. GB+Semidefinite achieved the highest screening rate, but the eigenvalue decompositions performed repeatedly in SDLS caused the CPU time to increase along the latter half of the path. Although PGB+Semidefinite is also tighter than PGB, the CPU time increased in the range $-\log_{10}(\lambda) \approx -4$ to $-3$. Because the center of PGB is positive semidefinite, only the minimum eigenvalue is required (see section 4.2), but it can still increase the CPU time.

Among the screening methods compared here, our empirical analysis suggests that the spherical rule with the PGB, in which the semidefinite constraint is implicitly incorporated through the projection, is the most cost-effective. Despite their higher screening rates, the approaches that explicitly incorporate the semidefinite (or relaxed linear) constraint into the rule did not substantially outperform the PGB in terms of CPU time. We observed the same tendency for the DGB: the screening rate did not change markedly even when the semidefinite constraint was explicitly considered.

#### 7.1.2 DGB-Based Rules

Next, by using the DGB, we compared the performance of the three rules presented in section 4. Figure 6b shows the results, which are similar to those obtained for the GB, shown in Figure 6a. The semidefinite and linear constraints slightly improve the rate. However, the large computational cost of screening with the semidefinite constraint increased the overall CPU time. Although the linear constraint is much easier to evaluate, its CPU time was almost the same as that of the plain DGB because the improvement in the screening rate was slight.

### 7.2 Comparing Bounds

Here we compare the screening performance (screening rate and CPU time) of each bound introduced in section 3 by using algorithm 2 without the range-based screening process. We do not use RPB because it needs the strictly optimal previous solution.

Based on the results in the previous section, we employed the spherical rule. The result obtained for the phishing data set is shown in Figure 7. The screening rate of the GB (top right) again decreased from the middle of the horizontal axis compared with the other spheres. The other spheres also have lower screening rates for small values of $\lambda$. As mentioned in section 6.1, the radii of GB, DGB, RPB, and RRPB have the form $r^2 = a + b \frac{1}{\lambda} + c \frac{1}{\lambda^2}$, meaning that if $\lambda \to 0$, then $r \to \infty$. In the case of the PGB, although the dependency on $\lambda$ cannot be written explicitly, the same tendency was observed. We see that the PGB and RRPB have similar results, as suggested by theorem 8, and the screening rate of the DGB is lower than that of the RRPB, as suggested by theorem 9. A comparison of the PGB and RRPB indicated that the former achieved a higher screening rate, but the latter is more efficient with less CPU time, as shown in the plot at the bottom right, because the PGB requires a matrix inner product calculation for each triplet. Bounds other than the GB are more than twice as fast as the naive calculation for most values of $\lambda$.

A comparison of the dynamic screening rate (the three plots on the left in Figure 7) of PGB and RRPB shows that the rate of PGB is higher. In terms of the regularization path screening (top right), RRPB and PGB have similar screening rates, but PGB has a higher dynamic screening rate. Along the latter half of the regularization path, the number of gradient descent iterations increases; consequently, the dynamic screening significantly affects the CPU time, and the PGB becomes faster despite the additional computation it requires to compute the inner product.

We further evaluate the performance of the range-based extension described in section 6.1. Figure 8 shows the rate of range-based screening for the segment data set. The figure shows that a wide range of $\lambda$ can be screened, particularly for small values of $\lambda$; although the range is narrower for large values of $\lambda$ than for small values, a high screening rate is observed as $\lambda$ approaches $\lambda_0$. A significant advantage of this approach is that for triplets screened over the specified range, we no longer need to evaluate the screening rule as long as $\lambda$ remains within that range.
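The bookkeeping that this enables can be sketched as follows (the per-triplet ranges below are hypothetical stand-ins for the certified intervals produced by the range-based rules):

```python
# Hypothetical certified ranges: triplet i with interval (lo, hi) satisfies
# its screening rule for every lambda in [lo, hi], so the rule need not be
# re-evaluated while lambda stays inside the interval.
certified = {0: (1e-2, 1e1), 2: (1e-1, 1e3)}

def must_evaluate(i, lam):
    lo, hi = certified.get(i, (float("inf"), float("-inf")))
    return not (lo <= lam <= hi)

lam = 1.0
to_check = [i for i in range(4) if must_evaluate(i, lam)]
print(to_check)  # [1, 3]: triplets 0 and 2 are skipped at lambda = 1.0
```

Only triplets without a certified interval covering the current $\lambda$ incur the per-triplet rule evaluation cost, which is where the savings along the path come from.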

The total CPU time for the regularization path is shown in Figure 9. In addition to the GB, PGB, DGB, and RRPB, we further evaluate the performance when the PGB and RRPB are used simultaneously. The use of two rules can improve the screening rate; however, additional computation is required to evaluate both rules. In the figure, the PGB+RRPB combination requires the least CPU time for four of the six data sets.

### 7.3 Evaluating the Practical Efficiency

We next considered a computationally more expensive setting to evaluate the effectiveness of the safe screening approach in a practical situation. To investigate the regularization path more precisely, we set a finer grid of regularization parameters defined by $\lambda_t = 0.99\,\lambda_{t-1}$. We also incorporated the well-known active set heuristic so that the experiments could be conducted on larger data sets. Note that because of these differences, the computational times shown here cannot be directly compared with the results in sections 7.1 and 7.2. The active set method uses as the active set only the subset of triplets whose loss is greater than 0. The gradient is calculated by using only the active set, and overall optimality is confirmed when the iteration converges. We employed the active set update strategy of Weinberger and Saul (2009), in which the active set is updated once every ten iterations.
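A minimal sketch of the fine $\lambda$ grid and the active set heuristic, with placeholder endpoints and randomly generated stand-in losses (this is not the authors' implementation):

```python
import numpy as np

# Fine regularization path lambda_t = 0.99 * lambda_{t-1}; the endpoints
# stand in for lambda_max and lambda_min.
lams = [1e7]
while lams[-1] * 0.99 >= 1e3:
    lams.append(lams[-1] * 0.99)

# Active set skeleton: keep only triplets with positive loss and refresh
# the set once every ten iterations, as in Weinberger and Saul (2009).
rng = np.random.default_rng(0)
loss = rng.uniform(-1.0, 1.0, size=1000)   # stand-in per-triplet losses
for it in range(50):
    if it % 10 == 0:
        active = np.flatnonzero(loss > 0)  # triplets currently incurring loss
    # ... a gradient step over `active` only would go here; after
    # convergence, optimality is verified against the full triplet set ...
```

Note that the active set is a heuristic, so a final pass over all triplets is still needed to confirm optimality, whereas safe screening certifies removals without such a check.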

Table 3 compares the CPU time for the entire regularization path. Based on the results in the previous section, we employed the RRPB and RRPB+PGB (evaluating the rules of both spheres) for triplet screening. Further, the range-based screening described in section 6.1 was also performed using the RRPB, for which we evaluated the range at the beginning of the optimization for each $\lambda$, as shown in algorithm 2. Our safe triplet screening accelerates the optimization process by up to 10 times compared with the simple active set method. The results for higher-dimensional data sets with a diagonal $M$ are presented in section 7.4.2.

Table 3: CPU Time for the Entire Regularization Path.

| Method\Data Set | phishing | SensIT | a9a | mnist | cifar10 | rcv |
|---|---|---|---|---|---|---|
| ActiveSet | 7,989.5 | 16,352.1 | 758.7 | 3,788.1 | 11,085.7 | 94,996.3 |
| ActiveSet+RRPB | **2,126.2** | 3,555.6 | **70.1** | **871.1** | 1,431.3 | 43,174.9 |
| ActiveSet+RRPB+PGB | 2,133.2 | **3,046.9** | 72.1 | 897.9 | **1,279.7** | **38,231.1** |

Note: Results in bold indicate the fastest method.

### 7.4 Empirical Evaluation of Three Special Cases

Here we evaluate three special cases of our formulation: the nonsmoothed hinge loss, the Mahalanobis distance with a diagonal matrix, and dynamic screening for a fixed value of $\lambda$.

#### 7.4.1 Nonsmoothed Hinge Loss

In the previous experiments, we used the smoothed hinge loss ($\gamma = 0.05$). However, the nonsmoothed hinge loss ($\gamma = 0$) is also widely used. Figure 10 shows the screening result of the PGB spherical rule for the segment data set. Here, the loss function of the RTLM is the hinge loss, and the other settings are the same as in the preceding experiments. The results show that the PGB achieved a high screening rate and that the CPU time improved substantially.

#### 7.4.2 Learning with Higher-Dimensional Data Using Diagonal Matrix

Here we evaluate the screening performance when the matrix $M$ is constrained to be diagonal. Under the same setting as section 7.3, a comparison with the ActiveSet method is shown in Table 4. We used the RRPB and RRPB+PGB, both of which largely reduced the CPU time. Attempts to process the Gisette data set, which has the largest dimensionality (5,000), with the active set method were unsuccessful: the method did not terminate even after 250,000 seconds.

Table 4: CPU Time Comparison for Higher-Dimensional Data Sets with a Diagonal $M$.

| Method\Data Set | USPS | Madelon | Colon-Cancer | Gisette |
|---|---|---|---|---|
| ActiveSet | 2,485.5 | 7,005.8 | 3,149.8 | – |
| ActiveSet+RRPB | **326.7** | 593.4 | 632.2 | 133,870.0 |
| ActiveSet+RRPB+PGB | 336.6 | **562.4** | **628.2** | **127,123.8** |
| #dimension | 256 | 500 | 2,000 | 5,000 |
| #samples | 7,291 | 2,000 | 62 | 6,000 |
| #triplet | 656,200 | 720,400 | 38,696 | 1,215,225 |
| $k$ | 10 | 20 | $\infty$ | 15 |
| $\lambda_{\max}$ | 1.0e+7 | 2.0e+14 | 5.0e+7 | 4.5e+8 |
| $\lambda_{\min}$ | 1.9e+3 | 4.7e+11 | 7.0e+3 | 2.1e+3 |

Notes: The results in bold indicate the fastest method. The Gisette data set did not produce results by ActiveSet because of the time limitation.

#### 7.4.3 Dynamic Screening for Fixed $\lambda $

Here, we evaluate the performance of dynamic screening for a fixed $\lambda$. We used the $\lambda_{\min}$ values in Table 2, for which the screening rate was relatively low in our results thus far (e.g., see Figure 6a). Figure 11 compares the computational time of the naive approach without screening and with the dynamic screening shown in algorithm 1. The plots in Figure 11a show that dynamic screening accelerates the learning process. The plots in Figure 11b show the performance of the active set strategy, indicating that combining dynamic screening with the active set strategy is effective for further acceleration.
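The interplay between optimization progress and dynamic screening can be sketched as follows (the scores, initial radius, and shrink factor are hypothetical; in algorithm 1 the radius comes from a gap-based bound such as the DGB):

```python
import numpy as np

# Toy dynamic screening loop in the spirit of algorithm 1: as the optimizer
# converges, the gap-based radius shrinks, so more triplets can be certified
# and removed from the problem.
rng = np.random.default_rng(1)
score = rng.normal(size=200)        # hypothetical per-triplet rule scores
remaining = np.arange(score.size)   # indices of triplets still in play
radius = 2.0                        # hypothetical initial sphere radius
for it in range(30):
    # ... one gradient step restricted to `remaining` would go here ...
    radius *= 0.9                   # the radius decreases with progress
    if it % 5 == 0:                 # re-evaluate the safe rule periodically
        inactive = score[remaining] - radius > 0  # certified zero loss
        remaining = remaining[~inactive]
```

Each subsequent gradient step then touches only `remaining`, which is why dynamic screening pays off most when many iterations are needed.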

### 7.5 Effect of Number of Triplets on Prediction Accuracy

Finally, we examine the relation between the number of triplets in $T$ and the prediction accuracy of the resulting classifier. We employed the nearest-neighbor (NN) classifier to measure the prediction performance of the learned metric. The data set was randomly divided into training data (60%), validation data (20%), and test data (20%). The regularization parameter $\lambda$ was varied from $10^5$ to $0.1$ and chosen by minimizing the validation error. The experiment was repeated 10 times with different random partitions of the data set.
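The evaluation protocol above can be sketched as follows (the grid spacing and validation errors are placeholders for the actual experimental values):

```python
import numpy as np

# Random 60/20/20 split and selection of lambda on a geometric grid from
# 1e5 down to 0.1 by minimizing the validation error.
rng = np.random.default_rng(2)
idx = rng.permutation(100)
train, val, test = idx[:60], idx[60:80], idx[80:]

lams = 10.0 ** np.arange(5.0, -1.5, -0.5)  # 1e5, ..., 0.1 (spacing assumed)
val_err = rng.uniform(size=lams.size)      # stand-in validation error rates
best_lam = lams[np.argmin(val_err)]
```

In the actual experiments, the validation error is the NN error of the metric learned at each $\lambda$, and the whole procedure is repeated over 10 random partitions.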

The results are shown in Figure 12, which summarizes the CPU time and test error rate for different numbers of triplets. The horizontal axes in all four plots, a to d, represent the number of neighbors $k$ used to define the original triplet set $T$, as described at the beginning of section 7. Figure 12a shows the CPU time to calculate the entire regularization path with and without screening. Here, “Without Screening” indicates the ActiveSet approach, and “With Screening” indicates the ActiveSet+RRPB approach. These results show that the learning time increases with $k$, and safe triplet screening yields larger reductions in CPU time for larger values of $k$. Figures 12b to 12d show the test error rates calculated by the 10-NN, 20-NN, and 30-NN classifiers, respectively. In Figure 12b, the 10-NN test error is minimized at $k=6$, with screening requiring less than approximately 2,000 seconds, whereas the naive approach (Without Screening) can calculate only up to approximately $k=4$ in the same computational time. In Figure 12c, the 20-NN test error is minimized at $k=12$, with screening requiring approximately 4,000 seconds, whereas the naive approach can calculate only up to approximately $k=8$. In Figure 12d, the 30-NN test error is minimized at $k=15$, with screening requiring approximately 5,000 seconds, whereas the naive approach can calculate only up to approximately $k=9$. These results indicate that the number of neighbors $k$ significantly affects the prediction accuracy, and a sufficiently large $k$ is often necessary to achieve the best prediction performance.

## 8 Conclusion

We introduced safe triplet screening for large-margin metric learning. Three screening rules and six spherical bounds were derived, and the relations among them were analyzed. We further proposed a range-based extension for the regularization path calculation. Our screening technique for metric learning is particularly significant compared with other screening studies because of the large number of triplets and the semidefinite constraint. Our numerical experiments verified the effectiveness of safe triplet screening using several benchmark data sets.

## Appendix A: Dual Formulation

## Appendix B: Proof of Lemma 1

## Appendix C: Proof of Theorem 2 (GB)

The following theorem is a well-known optimality condition for the general convex optimization problem:

*(Optimality Condition of Convex Optimization, Bertsekas, 1999).* In the minimization problem $\min_{x \in F} f(x)$, where the feasible region $F$ and the function $f(x)$ are convex, the necessary and sufficient condition for $x^\star$ to be the optimal solution is

From the optimality condition above, the following holds for the optimal solution $M^\star$:

## Appendix D: Proof of Theorem 3 (PGB)

## Appendix E: Proof of Theorem 5 (DGB)

This completes the proof of theorem 5. $\Box$

## Appendix F: Constrained Duality Gap Bound (CDGB)

For the DGB, we show that if the primal and dual reference solutions satisfy equation 2.4, the radius can be $\sqrt{2}$ times smaller. We extend the dual-based screening for the SVM (Zimmert et al., 2015) to the RTLM.

From the result above, considering $G_P^\lambda(M^\star) = 0$ and $G_D^\lambda(\alpha, \Gamma) \ge G_P^\lambda(M^\lambda(\alpha, \Gamma))$, both of which follow from the definitions, we obtain

We name this bound the constrained duality gap bound (CDGB); its radius converges to 0 at the optimal solution because, like the DGB, it is proportional to the square root of the duality gap. For primal-based optimizers, an additional calculation is necessary for $P_\lambda(M^\lambda(\alpha, \Gamma))$, whereas dual-based optimizers compute this term during the optimization process.

### F.1 Proof of Strong Convexity of $GP\lambda $

We first define an $m$-strongly convex function as follows:

($m$-Strongly Convex Function). When $f(x) - \frac{m}{2}\|x\|_2^2$ is a convex function, $f(x)$ is an $m$-strongly convex function.
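As a minimal illustration of this definition (an added example, not part of the original appendix), the regularizer term $\lambda \|M\|_F^2$ by itself is $2\lambda$-strongly convex:

```latex
% Subtracting (m/2)||M||_F^2 with m = 2\lambda from the regularizer leaves
% the zero function, which is convex, so the definition holds with m = 2\lambda.
\lambda \|M\|_F^2 - \frac{2\lambda}{2}\,\|M\|_F^2 = 0
```

Hence $G_P^\lambda$, which contains this regularizer, is $2\lambda$-strongly convex whenever its remaining terms are convex.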

By the definition above, to show that $G_P^\lambda$ is strongly convex, we need to show that the term other than $\lambda \|M\|_F^2$ is convex:

## Appendix G: Proof of Theorem 6 (RPB)

From the result above, in the dual problem Dual1 for $\lambda_0$ and $\lambda_1$, it follows that

## Appendix H: Proof of Theorem 7 (RRPB)

## Appendix I: Proof of Theorem 8 (Relationship between PGB and RPB)

## Appendix J: Proof of Theorem 9 (Relationship between DGB and RPB)

## Appendix K: Proof of Theorem 10

## Appendix L: Range-Based Extension

### L.1 Generalized Form of GB, DGB, RPB, and RRPB

#### L.1.1 GB

#### L.1.2 DGB

#### L.1.3 RPB

#### L.1.4 RRPB

### L.2 Proof of Theorem 11 (Range-Based Extension of RRPB)

$\Box$

## Appendix M: Calculation for Range-Based Extension Other Than RPB and RRPB

## Appendix N: Spherical Rule with Semidefinite Constraint for Diagonal Case

### N.1 Proof of Theorem 12

### N.2 Proof of Theorem 13

### N.3 Analytical Procedure of Rule Evaluation for the Diagonal Case

## Acknowledgments

This work was financially supported by grants from the Japanese Ministry of Education, Culture, Sports, Science and Technology awarded to I.T. (16H06538, 17H00758) and M.K. (16H06538, 17H04694); from the Japan Science and Technology Agency (JST) CREST awarded to I.T. (JPMJCR1302, JPMJCR1502) and PRESTO awarded to M.K. (JPMJPR15N2); from the Materials Research by Information Integration Initiative (MI2I) project of the Support Program for Starting Up Innovation Hub from the JST awarded to I.T. and M.K.; and from the RIKEN Center for Advanced Intelligence Project awarded to I.T.