## Abstract

Pairwise similarities and dissimilarities between data points are often easier to obtain than full labels in real-world classification problems. To make use of such pairwise information, an empirical risk minimization approach has been proposed, in which an unbiased estimator of the classification risk is computed from only pairwise similarities and unlabeled data. However, this approach has not yet been able to handle pairwise dissimilarities. Semisupervised clustering methods can incorporate both similarities and dissimilarities into their framework; however, they typically require strong geometrical assumptions on the data distribution, such as the manifold assumption, which can cause severe performance deterioration. In this letter, we derive an unbiased estimator of the classification risk based on similarities, dissimilarities, and unlabeled data all together. We theoretically establish an estimation error bound and experimentally demonstrate the practical usefulness of our empirical risk minimization method.

## 1  Introduction

In supervised classification, we need a vast amount of labeled data to train our classifiers. However, it is often not easy to obtain such labels due to high labeling costs (Chapelle, Schölkopf, & Zien, 2010), privacy concerns (Warner, 1965), and social bias (Nederhof, 1985). In real-world classification problems, pairwise similarities (i.e., pairs of samples in the same class) and pairwise dissimilarities (i.e., pairs of samples in different classes) are often collected more easily than full labels of data. For example, in protein function prediction, knowledge about similarities and dissimilarities can be obtained by experimental means as additional supervision (Klein, Kamvar, & Manning, 2002). In video object classification, knowledge of temporal relations can be used to generate pairwise labels in an algorithmic way; for example, an object staying in temporally adjacent frames must be the same, and two objects in the same frame must be different (Yan, Zhang, Yang, & Hauptmann, 2006; Zhang & Yan, 2007). To make use of such pairwise information, similar-unlabeled (SU) classification (Bao, Niu, & Sugiyama, 2018) has been proposed, where the classification risk is estimated in an unbiased fashion from only similar pairs and unlabeled data. Their method, however, handles only similar and unlabeled data, whereas dissimilar pairs may also be available in practice. In such a case, we can expect that using dissimilarities, in addition to similarities and unlabeled data, improves the classification accuracy.

Semisupervised clustering (Wagstaff, Cardie, Rogers, & Schrödl, 2001) is a framework that can incorporate both similar and dissimilar pairs, where must-link pairs (i.e., similar pairs) and cannot-link pairs (i.e., dissimilar pairs) are used to obtain meaningful clusters. The existing literature provides useful semisupervised clustering methods based on the ideas that (1) must/cannot-links are treated as constraints (Basu, Banerjee, & Mooney, 2002; Wagstaff et al., 2001; Li & Liu, 2009; Hu, Wang, Yu, & Hua, 2008), (2) clustering is performed with metrics learned by semisupervised metric learning (Xing, Jordan, Russell, & Ng, 2003; Bilenko, Basu, & Mooney, 2004; Weinberger & Saul, 2009; Davis, Kulis, Jain, Sra, & Dhillon, 2007; Niu, Dai, Yamada, & Sugiyama, 2012), and (3) missing links are predicted by matrix completion (Yi, Zhang, Jin, Qian, & Jain, 2013; Chiang, Hsieh, & Dhillon, 2015). However, the motivation of clustering, finding a meaningful cluster structure, differs from that of classification, finding a classifier that predicts labels for unseen data. Therefore, applying a semisupervised clustering method to classification does not necessarily give an appropriate solution. For example, most semisupervised clustering methods rely on geometrical or margin-based assumptions such as the cluster assumption and the manifold assumption (Basu, Davidson, & Wagstaff, 2008), and without such assumptions, they do not work well. In addition, the objective of semisupervised clustering is usually not the minimization of the classification risk, which may lead to suboptimal performance in terms of classification accuracy.

In contrast, discriminative training approaches have attracted increasing attention recently. One is contrastive representation learning (Chopra, Hadsell, & LeCun, 2005; Hadsell, Chopra, & LeCun, 2006; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Kiros et al., 2015; Sohn, 2016; Logeswaran & Lee, 2018; Peters et al., 2018; Oord, Li, & Vinyals, 2018; Hjelm et al., 2019; Arora, Khandeparkar, Khodak, Plevrakis, & Saunshi, 2019), which tries to obtain good data representations by bringing an anchor data point close to a given similar data point (a positive sample) and pushing it far from randomly sampled data points (negative samples). The resulting representations can be used for downstream classification. The other is the meta-classification approach (Hsu, Lv, Schlosser, Odom, & Kira, 2019; Wu et al., 2020), which performs maximum likelihood estimation of similar and dissimilar data points. The likelihood is modeled with the inner product between two logits, and the individual logit models are expected to perform well on the classification of single data points. While both approaches incorporate similar and dissimilar data points into their formulations, it is not theoretically clear whether these methods achieve good classification performance; indeed, their objective functions have not been directly connected to the classification risk.

In this letter, we propose a similar-dissimilar-unlabeled (SDU) classification method, in which pairwise similarities, pairwise dissimilarities, and unlabeled data all serve for unbiased estimation of the classification risk. Like the SU classification method (Bao et al., 2018), our method does not require any geometrical assumptions on the data distribution and enables us to minimize the classification risk via empirical risk minimization. As preparation for constructing our SDU classification, we first develop a dissimilar-unlabeled (DU) classification method and a similar-dissimilar (SD) classification method, which require only dissimilar and unlabeled data or only similar and dissimilar data, respectively. We also show that these methods can be regarded as special cases of a very general framework of classification from unlabeled data (Lu, Niu, Menon, & Sugiyama, 2019). Then we combine the three risks of SU, DU, and SD classification, in a manner similar to positive-negative-unlabeled classification (Sakai, du Plessis, Niu, & Sugiyama, 2017), and finally train a classifier based on empirical risk minimization. We further propose a strategy that reduces the computation cost of hyperparameter tuning by ignoring the SU risk and combining only the SD and DU risks for estimation of the classification risk. This strategy comes from our analysis of the estimation error bounds for the SU, DU, and SD classification methods: the bounds for the DU and SD classification methods tend to be tighter than that for the SU classification method. Finally, we experimentally demonstrate the practical usefulness of our method.

Our contributions can be summarized as follows:

• We develop DU and SD classification methods by extending the SU classification method and propose an SDU classification method as a general form of those methods (see section 3).

• We establish estimation error bounds for each method and confirm that unlabeled data help the estimation of the classification risk (see sections 4.1 and 4.3).

• From theoretical analysis, we find that estimation error bounds for the DU/SD classification methods tend to be tighter than that for the SU classification method and propose a strategy to reduce the computation cost in the SDU classification method (see section 4.2).

## 2  Preliminary

In this section, we introduce our problem setting and a generation model of pairwise similarities and dissimilarities and unlabeled data. Thereafter, we review the existing SU classification method.

### 2.1  Problem Setting

Let $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} = \{+1, -1\}$ be a $d$-dimensional example space and a binary label space, respectively. Suppose that each labeled example $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is generated independently from the joint probability distribution with density $p(x, y)$. For simplicity, let $\pi_+$ and $\pi_-$ be the class priors $p(y = +1)$ and $p(y = -1)$, which satisfy $\pi_+ + \pi_- = 1$, and let $p_+(x)$ and $p_-(x)$ be the class-conditional densities $p(x \mid y = +1)$ and $p(x \mid y = -1)$.

The standard goal of supervised binary classification is to obtain a classifier $f: \mathcal{X} \to \mathbb{R}$ that minimizes the classification risk defined by
$$R(f) := \mathbb{E}_{(X,Y) \sim p(x,y)}[\ell(f(X), Y)],$$
(2.1)

where $\mathbb{E}_{(X,Y) \sim p(x,y)}[\cdot]$ denotes the expectation with respect to $(X, Y)$ over the joint density $p(x, y)$ and $\ell: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ is a loss function.

### 2.2  Generation Model of Training Data

We formulate the data generation process of pairwise data and unlabeled data as follows. First, two examples $(x, y)$ and $(x', y')$ are drawn from $p(x, y)$ independently, $p(x, x', y, y') = p(x, y)\,p(x', y')$, which also implies $p(y, y') = p(y)\,p(y')$. After that, the pairwise information $\tau \in \{+1, -1\}$ is associated with $(x, x')$, where $\tau = +1$ if $y = y'$ and $\tau = -1$ if $y \ne y'$. We represent a pairwise similarity or dissimilarity by the triplet $(x, x', \tau)$. In addition, we suppose that each pairwise example is generated independently. Under these assumptions, we can describe the generation model for $n_{\mathrm{SD}}$ pairwise training data as
$$\mathcal{D}_{\mathrm{SD}} := \{(x_{\mathrm{SD},i}, x'_{\mathrm{SD},i}, \tau_i)\}_{i=1}^{n_{\mathrm{SD}}} \sim p(x, x', \tau),$$
(2.2)
where
$$p(x, x', \tau = +1) = p(\tau = +1)\,p(x, x' \mid \tau = +1) = \pi_{\mathrm{S}}\,p_{\mathrm{S}}(x, x'),$$
(2.3)
$$p(x, x', \tau = -1) = p(\tau = -1)\,p(x, x' \mid \tau = -1) = \pi_{\mathrm{D}}\,p_{\mathrm{D}}(x, x'),$$
(2.4)
$$\pi_{\mathrm{S}} := p(\tau = +1) = p(y = +1)p(y' = +1) + p(y = -1)p(y' = -1) = \pi_+^2 + \pi_-^2,$$
(2.5)
$$\pi_{\mathrm{D}} := p(\tau = -1) = p(y = +1)p(y' = -1) + p(y = -1)p(y' = +1) = 2\pi_+\pi_-,$$
(2.6)
$$\begin{aligned}
p_{\mathrm{S}}(x, x') &:= p(x, x' \mid \tau = +1) \\
&= \frac{1}{\pi_{\mathrm{S}}}\bigl(p(x, x', y = +1, y' = +1) + p(x, x', y = -1, y' = -1)\bigr) \\
&= \frac{1}{\pi_{\mathrm{S}}}\bigl(p(x, y = +1)p(x', y' = +1) + p(x, y = -1)p(x', y' = -1)\bigr) \\
&= \frac{\pi_+^2}{\pi_{\mathrm{S}}}p_+(x)p_+(x') + \frac{\pi_-^2}{\pi_{\mathrm{S}}}p_-(x)p_-(x'),
\end{aligned}$$
(2.7)
$$\begin{aligned}
p_{\mathrm{D}}(x, x') &:= p(x, x' \mid \tau = -1) \\
&= \frac{1}{\pi_{\mathrm{D}}}\bigl(p(x, x', y = +1, y' = -1) + p(x, x', y = -1, y' = +1)\bigr) \\
&= \frac{1}{\pi_{\mathrm{D}}}\bigl(p(x, y = +1)p(x', y' = -1) + p(x, y = -1)p(x', y' = +1)\bigr) \\
&= \frac{1}{2}p_+(x)p_-(x') + \frac{1}{2}p_-(x)p_+(x').
\end{aligned}$$
(2.8)
Similarly, we assume that $n_{\mathrm{U}}$ unlabeled examples are drawn from the marginal distribution of $x$ independently:
$$\mathcal{D}_{\mathrm{U}} := \{x_{\mathrm{U},i}\}_{i=1}^{n_{\mathrm{U}}} \sim p_{\mathrm{U}}(x),$$
(2.9)
where
$$p_{\mathrm{U}}(x) := \pi_+ p_+(x) + \pi_- p_-(x).$$
(2.10)
On the basis of the pairwise information $\tau$, we can divide the $n_{\mathrm{SD}}$ pairs in $\mathcal{D}_{\mathrm{SD}}$ into $n_{\mathrm{S}}$ similar pairs and $n_{\mathrm{D}}$ dissimilar pairs, where $n_{\mathrm{SD}} = n_{\mathrm{S}} + n_{\mathrm{D}}$:
$$\mathcal{D}_{\mathrm{S}} := \{(x_{\mathrm{S},i}, x'_{\mathrm{S},i})\}_{i=1}^{n_{\mathrm{S}}} = \{(x, x') \mid (x, x', \tau = +1) \in \mathcal{D}_{\mathrm{SD}}\},$$
(2.11)
$$\mathcal{D}_{\mathrm{D}} := \{(x_{\mathrm{D},i}, x'_{\mathrm{D},i})\}_{i=1}^{n_{\mathrm{D}}} = \{(x, x') \mid (x, x', \tau = -1) \in \mathcal{D}_{\mathrm{SD}}\}.$$
(2.12)
With this notation, we can treat pairwise similarities and dissimilarities as if they were drawn from the conditional distributions: $\mathcal{D}_{\mathrm{S}} \sim p_{\mathrm{S}}(x, x')$ and $\mathcal{D}_{\mathrm{D}} \sim p_{\mathrm{D}}(x, x')$.

### 2.3  SU Classification

In the seminal paper by Bao et al. (2018), the first method of SU classification was proposed, where the classification risk is estimated in an unbiased fashion from only similar pairs and unlabeled data as follows:

Proposition 1
(Theorem 1 in Bao et al., 2018). Suppose $\pi_+ \ne \frac{1}{2}$. The classification risk in equation 2.1 can be equivalently represented as
$$R_{\mathrm{SU}}(f) = \pi_{\mathrm{S}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{S}}(x,x')}\left[\frac{\widetilde{L}(f(X)) + \widetilde{L}(f(X'))}{2}\right] + \mathbb{E}_{X \sim p_{\mathrm{U}}(x)}[L(f(X), -1)],$$
(2.13)
where
$$L(z, t) := \frac{\pi_+}{\pi_+ - \pi_-}\ell(z, t) - \frac{\pi_-}{\pi_+ - \pi_-}\ell(z, -t),$$
(2.14)
$$\widetilde{L}(z) := L(z, +1) - L(z, -1).$$
(2.15)
We can train a classifier by minimizing $\widehat{R}_{\mathrm{SU}}$, that is, the empirical approximation of $R_{\mathrm{SU}}$, computed from $(\mathcal{D}_{\mathrm{S}}, \mathcal{D}_{\mathrm{U}})$:
$$\widehat{R}_{\mathrm{SU}}(f) := \frac{\pi_{\mathrm{S}}}{n_{\mathrm{S}}}\sum_{i=1}^{n_{\mathrm{S}}}\frac{\widetilde{L}(f(x_{\mathrm{S},i})) + \widetilde{L}(f(x'_{\mathrm{S},i}))}{2} + \frac{1}{n_{\mathrm{U}}}\sum_{i=1}^{n_{\mathrm{U}}}L(f(x_{\mathrm{U},i}), -1).$$
(2.16)
Note that the positive class proportion $\pi_+$ is needed to compute $\widehat{R}_{\mathrm{SU}}$. Bao et al. (2018) proposed an estimation procedure for it as well. Although pairwise similarities and unlabeled data are sufficient to solve a binary classification problem, we incorporate pairwise dissimilarities into their framework to further improve the classification performance.
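To make the estimator concrete, here is a minimal numerical sketch of equations 2.14 to 2.16 with the squared loss $\ell(z,t) = \frac14(tz-1)^2$ (introduced in section 3.3); the function and variable names are ours, not from Bao et al. (2018):

```python
import numpy as np

def ell_sq(z, t):
    """Squared loss: l(z, t) = (tz - 1)^2 / 4."""
    return 0.25 * (t * z - 1.0) ** 2

def corrected_loss(z, t, pi_p):
    """L(z, t) in equation 2.14, with class prior pi_p = pi_+."""
    pi_m = 1.0 - pi_p
    return (pi_p * ell_sq(z, t) - pi_m * ell_sq(z, -t)) / (pi_p - pi_m)

def corrected_loss_diff(z, pi_p):
    """L~(z) = L(z, +1) - L(z, -1) in equation 2.15."""
    return corrected_loss(z, +1, pi_p) - corrected_loss(z, -1, pi_p)

def su_empirical_risk(f_s, f_s_prime, f_u, pi_p):
    """Empirical SU risk (equation 2.16).

    f_s, f_s_prime: classifier outputs on the two points of each similar pair;
    f_u: classifier outputs on the unlabeled points.
    """
    pi_m = 1.0 - pi_p
    pi_s = pi_p ** 2 + pi_m ** 2
    pair_term = pi_s * np.mean(
        (corrected_loss_diff(f_s, pi_p) + corrected_loss_diff(f_s_prime, pi_p)) / 2
    )
    unlabeled_term = np.mean(corrected_loss(f_u, -1, pi_p))
    return pair_term + unlabeled_term
```

For the squared loss, $\widetilde{L}(z)$ reduces to $-z/(\pi_+ - \pi_-)$, so the similar-pair term is linear in the classifier outputs.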

## 3  Proposed Method

In this section, we propose an SDU classification method, where the classification risk is estimated from pairwise similarities and dissimilarities and unlabeled data. As the first step to construct our method, we extend the SU classification method to DU and SD classification methods.

### 3.1  DU and SD Classification

As with the SU classification method, the classification risk can be estimated from only dissimilar pairs and unlabeled data (DU), or from only similar and dissimilar pairs (SD), as follows.

Theorem 1.
Suppose $\pi_+ \ne \frac{1}{2}$. The classification risk in equation 2.1 can be equivalently represented as
$$R_{\mathrm{DU}}(f) = \pi_{\mathrm{D}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{D}}(x,x')}\left[-\frac{\widetilde{L}(f(X)) + \widetilde{L}(f(X'))}{2}\right] + \mathbb{E}_{X \sim p_{\mathrm{U}}(x)}[L(f(X), +1)],$$
(3.1)
$$R_{\mathrm{SD}}(f) = \pi_{\mathrm{S}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{S}}(x,x')}\left[\frac{L(f(X), +1) + L(f(X'), +1)}{2}\right] + \pi_{\mathrm{D}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{D}}(x,x')}\left[\frac{L(f(X), -1) + L(f(X'), -1)}{2}\right],$$
(3.2)
where $L(z, t)$ and $\widetilde{L}(z)$ are defined in equations 2.14 and 2.15, respectively.
These alternative forms of the classification risk give us empirical risk minimization methods with the empirical risks $\widehat{R}_{\mathrm{DU}}$ and $\widehat{R}_{\mathrm{SD}}$ defined as
$$\widehat{R}_{\mathrm{DU}}(f) = \frac{\pi_{\mathrm{D}}}{n_{\mathrm{D}}}\sum_{i=1}^{n_{\mathrm{D}}}\left(-\frac{\widetilde{L}(f(x_{\mathrm{D},i})) + \widetilde{L}(f(x'_{\mathrm{D},i}))}{2}\right) + \frac{1}{n_{\mathrm{U}}}\sum_{i=1}^{n_{\mathrm{U}}}L(f(x_{\mathrm{U},i}), +1),$$
(3.3)
$$\widehat{R}_{\mathrm{SD}}(f) = \frac{\pi_{\mathrm{S}}}{n_{\mathrm{S}}}\sum_{i=1}^{n_{\mathrm{S}}}\frac{L(f(x_{\mathrm{S},i}), +1) + L(f(x'_{\mathrm{S},i}), +1)}{2} + \frac{\pi_{\mathrm{D}}}{n_{\mathrm{D}}}\sum_{i=1}^{n_{\mathrm{D}}}\frac{L(f(x_{\mathrm{D},i}), -1) + L(f(x'_{\mathrm{D},i}), -1)}{2}.$$
(3.4)
Note that $\widehat{R}_{\mathrm{DU}}$ can be computed from $(\mathcal{D}_{\mathrm{D}}, \mathcal{D}_{\mathrm{U}})$ and $\widehat{R}_{\mathrm{SD}}$ can be computed from $(\mathcal{D}_{\mathrm{S}}, \mathcal{D}_{\mathrm{D}})$, respectively. We call the training method with $\widehat{R}_{\mathrm{DU}}$ the DU classification method and that with $\widehat{R}_{\mathrm{SD}}$ an SD classification method.
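The two estimators above can be sketched numerically in the same way as the SU risk; below is a minimal version of equations 3.3 and 3.4 with the squared loss and a fixed class prior (the constant `PI_P` and all names are ours):

```python
import numpy as np

PI_P = 0.7                      # assumed class prior pi_+
PI_M = 1.0 - PI_P
PI_S, PI_D = PI_P**2 + PI_M**2, 2.0 * PI_P * PI_M

def L(z, t):
    """Corrected loss L(z, t) (equation 2.14) with the squared loss."""
    ell = lambda z, t: 0.25 * (t * z - 1.0) ** 2
    return (PI_P * ell(z, t) - PI_M * ell(z, -t)) / (PI_P - PI_M)

def du_empirical_risk(f_d, f_d_prime, f_u):
    """Empirical DU risk (equation 3.3)."""
    L_tilde = lambda z: L(z, +1) - L(z, -1)
    pair_term = PI_D * np.mean(-(L_tilde(f_d) + L_tilde(f_d_prime)) / 2)
    return pair_term + np.mean(L(f_u, +1))

def sd_empirical_risk(f_s, f_s_prime, f_d, f_d_prime):
    """Empirical SD risk (equation 3.4)."""
    s_term = PI_S * np.mean((L(f_s, +1) + L(f_s_prime, +1)) / 2)
    d_term = PI_D * np.mean((L(f_d, -1) + L(f_d_prime, -1)) / 2)
    return s_term + d_term
```

As a sanity check, at the constant classifier $f \equiv 0$ both expressions evaluate exactly to the true risk $R(f) = \ell(0, \pm 1) = 0.25$, regardless of the data.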

#### 3.1.1  Interpretation of SD Risk

The SD risk is not only an equivalent expression of the classification risk; it can also be interpreted as a binary classification risk that aims to predict "similar" and "dissimilar" as labels. We can rewrite the SD risk as
$$\begin{aligned}
R_{\mathrm{SD}}(f) &= \pi_{\mathrm{S}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{S}}(x,x')}\left[\frac{L(f(X), +1) + L(f(X'), +1)}{2}\right] + \pi_{\mathrm{D}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{D}}(x,x')}\left[\frac{L(f(X), -1) + L(f(X'), -1)}{2}\right] \\
&= \iint p(x, x', \tau = +1)\,\frac{L(f(x), +1) + L(f(x'), +1)}{2}\,dx\,dx' + \iint p(x, x', \tau = -1)\,\frac{L(f(x), -1) + L(f(x'), -1)}{2}\,dx\,dx' \\
&= \iint \sum_{\tau \in \{+1, -1\}} p(x, x', \tau)\,\frac{L(f(x), \tau) + L(f(x'), \tau)}{2}\,dx\,dx' \\
&= \mathbb{E}_{(X,X',T) \sim p(x,x',\tau)}\left[\frac{L(f(X), T) + L(f(X'), T)}{2}\right].
\end{aligned}$$
To interpret this expression of the risk from a different perspective, we visualize the landscape of the loss $L$ in Figure 1, with $\ell$ set to several standard margin-based losses. As can be seen in the figure, $L(z, t)$ has a profile similar to $\ell(z, t)$ when $\pi_+ > \frac12$; otherwise, $L(z, t)$ is similar to $\ell(z, -t)$. This enables us to give another interpretation of SD classification: it is binary classification with loss function $L$, where a classifier $f$ takes input $X$ and predicts its associated pairwise label $T$. From this point of view, the relationship among SD, SU, and DU classification corresponds to that among positive-negative (PN), positive-unlabeled (PU), and negative-unlabeled (NU) classification (du Plessis, Niu, & Sugiyama, 2014, 2015). The main idea of the PU (resp. NU) classification method is to complement missing negative (resp. positive) information with unlabeled data. For example, the classification risk can be represented by positive and unlabeled data as follows:
$$\begin{aligned}
R(f) &= \mathbb{E}_{(X,Y) \sim p(x,y)}[\ell(f(X), Y)] \\
&= \pi_+\,\mathbb{E}_{X \sim p_+(x)}[\ell(f(X), +1)] + \pi_-\,\mathbb{E}_{X \sim p_-(x)}[\ell(f(X), -1)] \\
&= \pi_+\,\mathbb{E}_{X \sim p_+(x)}[\ell(f(X), +1)] + \underbrace{\mathbb{E}_{X \sim p_{\mathrm{U}}(x)}[\ell(f(X), -1)] - \pi_+\,\mathbb{E}_{X \sim p_+(x)}[\ell(f(X), -1)]}_{\text{since } \pi_- p_-(x) = p_{\mathrm{U}}(x) - \pi_+ p_+(x)} \\
&= \pi_+\,\mathbb{E}_{X \sim p_+(x)}[\ell(f(X), +1) - \ell(f(X), -1)] + \mathbb{E}_{X \sim p_{\mathrm{U}}(x)}[\ell(f(X), -1)].
\end{aligned}$$
(3.5)
This derivation is the same as that of theorem 1.
Figure 1:

Visualization of the loss $L$ defined in equation 2.14, with $\ell$ set to the squared, double hinge, and logistic losses. The details of these loss functions are described in section 3.3 (see Table 1). As shown in the graphs, $L(z, t)$ approaches $\ell(z, t)$ as $\pi_+$ gets larger, and $L(z, t)$ approaches $\ell(z, -t)$ as $\pi_+$ gets smaller.


Furthermore, when the loss $\ell$ is symmetric (Ghosh, Manwani, & Sastry, 2015; Charoenphakdee, Lee, & Sugiyama, 2019), that is, $\ell(z, +1) + \ell(z, -1) = K$ for some constant $K \in \mathbb{R}$, the following relationship holds:

Corollary 1.
Assume $\pi_+ \ne \frac12$. We define $Q_{\mathrm{SD}}$ by replacing $L$ in $R_{\mathrm{SD}}$ with $\ell$ as follows:
$$Q_{\mathrm{SD}}(f) := \mathbb{E}_{(X,X',T) \sim p(x,x',\tau)}\left[\frac{\ell(f(X), T) + \ell(f(X'), T)}{2}\right].$$
(3.6)
Suppose that $\ell$ is a symmetric loss. Then $R_{\mathrm{SD}}$ and $Q_{\mathrm{SD}}$ share the optimal solution:
$$\operatorname*{arg\,min}_{f \in \mathcal{F}} R_{\mathrm{SD}}(f) = \begin{cases}\operatorname*{arg\,min}_{f \in \mathcal{F}} Q_{\mathrm{SD}}(f) & \pi_+ > \frac12, \\[4pt] \operatorname*{arg\,max}_{f \in \mathcal{F}} Q_{\mathrm{SD}}(f) & \pi_+ < \frac12.\end{cases}$$
(3.7)
Proof.
We show that $L(z, t)$ is a linear function of $\ell(z, t)$:
$$L(z, t) = \frac{\pi_+}{\pi_+ - \pi_-}\ell(z, t) - \frac{\pi_-}{\pi_+ - \pi_-}\ell(z, -t) = \frac{\pi_+}{\pi_+ - \pi_-}\ell(z, t) - \frac{\pi_-}{\pi_+ - \pi_-}\bigl(K - \ell(z, t)\bigr) = \frac{1}{\pi_+ - \pi_-}\ell(z, t) - \frac{\pi_- K}{\pi_+ - \pi_-}.$$
By using the above relationship, we obtain
$$R_{\mathrm{SD}}(f) = \frac{1}{\pi_+ - \pi_-}Q_{\mathrm{SD}}(f) - \frac{\pi_- K}{\pi_+ - \pi_-},$$

which results in equation 3.7.

If we use a symmetric loss in the classification risk, corollary 1 gives us practical advantages. For instance, when writing program code for a training algorithm, we do not have to implement loss $L$ by ourselves. Instead, we can treat “similar” and “dissimilar” labels as positive and negative labels associated with each point in a pair and use any standard binary classification algorithm with a loss function $ℓ$.
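As an illustration of this recipe, the sigmoid loss $\ell(z, t) = 1/(1 + e^{tz})$ is one example of a symmetric loss (with $K = 1$); the sketch below, with names of our own choosing, checks the symmetry and evaluates the empirical counterpart of $Q_{\mathrm{SD}}$ in equation 3.6 by scoring each point of a pair with the pair's similar/dissimilar label:

```python
import numpy as np

def sigmoid_loss(z, t):
    """A symmetric margin-based loss: l(z, t) + l(z, -t) = 1 for all z."""
    return 1.0 / (1.0 + np.exp(t * z))

def q_sd_empirical(f, f_prime, tau):
    """Empirical version of Q_SD (equation 3.6): f, f_prime are classifier
    outputs on the two points of each pair, tau in {+1, -1} is the
    similar/dissimilar label of the pair."""
    return np.mean((sigmoid_loss(f, tau) + sigmoid_loss(f_prime, tau)) / 2)
```

By corollary 1, minimizing this objective recovers the minimizer of $R_{\mathrm{SD}}$ when $\pi_+ > \frac12$; when $\pi_+ < \frac12$, the sign of the learned classifier has to be flipped.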

Actually, the relationship in corollary 1 holds for $RSU$ and $RDU$ as well. The alternative objective functions $QSU$ and $QDU$ are defined as
$$Q_{\mathrm{SU}}(f) := \pi_{\mathrm{S}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{S}}(x,x')}\left[\frac{\widetilde{\ell}(f(X)) + \widetilde{\ell}(f(X'))}{2}\right] + \mathbb{E}_{X \sim p_{\mathrm{U}}(x)}[\ell(f(X), -1)],$$
(3.8)
$$Q_{\mathrm{DU}}(f) := \pi_{\mathrm{D}}\,\mathbb{E}_{(X,X') \sim p_{\mathrm{D}}(x,x')}\left[-\frac{\widetilde{\ell}(f(X)) + \widetilde{\ell}(f(X'))}{2}\right] + \mathbb{E}_{X \sim p_{\mathrm{U}}(x)}[\ell(f(X), +1)],$$
(3.9)
where
$$\widetilde{\ell}(z) := \ell(z, +1) - \ell(z, -1).$$
(3.10)
Similar to equation 3.7, we observe that $QSU$ and $QDU$ have the same optimizers as $RSU$ and $RDU$, respectively. Moreover, we can confirm that $QSU$ corresponds to the PU risk in equation 3.5 ($QDU$ corresponds to the NU risk as well), which gives us an intuitive interpretation of the SU (resp. DU) classification method as the PU (resp. NU) classification method.

#### 3.1.2  Relation to UU Classification

Each of SU, DU, and SD classification can be regarded as a special case of unlabeled-unlabeled (UU) classification (Lu et al., 2019), a very general framework in weakly supervised learning that enables us to train a classifier without any labeled data. In UU classification, we assume that two unlabeled training sets $\mathcal{D}_{\mathrm{tr}}$ and $\mathcal{D}'_{\mathrm{tr}}$ are available, drawn from two distinct marginal densities $p_{\mathrm{tr}}$ and $p'_{\mathrm{tr}}$, respectively:
$$\mathcal{D}_{\mathrm{tr}} := \{x_{\mathrm{tr},i}\}_{i=1}^{n_{\mathrm{tr}}} \sim p_{\mathrm{tr}}(x), \qquad \mathcal{D}'_{\mathrm{tr}} := \{x'_{\mathrm{tr},i}\}_{i=1}^{n'_{\mathrm{tr}}} \sim p'_{\mathrm{tr}}(x),$$
$$p_{\mathrm{tr}}(x) := \theta p_+(x) + (1 - \theta)p_-(x), \qquad p'_{\mathrm{tr}}(x) := \theta' p_+(x) + (1 - \theta')p_-(x),$$
where $\theta$ and $\theta'$ are constants satisfying $\theta, \theta' \in [0, 1]$ and $\theta \ne \theta'$. Then we can rewrite the classification risk with these densities as follows.
Proposition 2
(Theorem 4 in Lu et al., 2019). Assume that $\theta > \theta'$; otherwise, swap $p_{\mathrm{tr}}$ and $p'_{\mathrm{tr}}$ so that $\theta > \theta'$. Then the classification risk in equation 2.1 can be equivalently represented as
$$\mathbb{E}_{X \sim p_{\mathrm{tr}}(x)}\bigl[a\,\ell(f(X), +1) + b\,\ell(f(X), -1)\bigr] + \mathbb{E}_{X \sim p'_{\mathrm{tr}}(x)}\bigl[c\,\ell(f(X), -1) + d\,\ell(f(X), +1)\bigr],$$
where
$$a := \frac{(1 - \theta')\pi_+}{\theta - \theta'}, \quad b := -\frac{\theta'\pi_-}{\theta - \theta'}, \quad c := \frac{\theta\pi_-}{\theta - \theta'}, \quad d := -\frac{(1 - \theta)\pi_+}{\theta - \theta'}.$$
(3.11)
The risk expression in equation 3.11 enables us to train a classifier by minimizing the empirical risk computed from $(Dtr,Dtr')$. Now, we review the relationship between UU classification and SU, DU, SD classification. Since we assume that each example in a pair is drawn independently (see section 2.2), we can reduce the pairwise distributions into the pointwise distributions as
$$\tilde{p}_{\mathrm{S}}(x) := \int p_{\mathrm{S}}(x, x')\,dx' = \frac{\pi_+^2}{\pi_{\mathrm{S}}}p_+(x) + \frac{\pi_-^2}{\pi_{\mathrm{S}}}p_-(x),$$
(3.12)
$$\tilde{p}_{\mathrm{D}}(x) := \int p_{\mathrm{D}}(x, x')\,dx' = \frac{1}{2}p_+(x) + \frac{1}{2}p_-(x).$$
(3.13)
With this notation, each single point in $\mathcal{D}_{\mathrm{S}}$ and $\mathcal{D}_{\mathrm{D}}$ can be treated as if it were drawn from $\tilde{p}_{\mathrm{S}}(x)$ or $\tilde{p}_{\mathrm{D}}(x)$, respectively. Therefore, SU, DU, and SD classification correspond to special cases of UU classification:
$$(\theta, \theta') = \begin{cases}\left(\dfrac{\pi_+^2}{\pi_{\mathrm{S}}},\ \pi_+\right) & \text{(SU classification)}, \\[8pt] \left(\dfrac{1}{2},\ \pi_+\right) & \text{(DU classification)}, \\[8pt] \left(\dfrac{\pi_+^2}{\pi_{\mathrm{S}}},\ \dfrac{1}{2}\right) & \text{(SD classification)}.\end{cases}$$
(3.14)
Note that the condition $\theta \ne \theta'$ in UU classification corresponds to $\pi_+ \ne \frac12$ in SU, DU, and SD classification. If this condition is not satisfied, none of these problems can be solved because the unbiased risk estimators degenerate.

### 3.2  SDU Classification

Here, we propose an SDU classification method that incorporates pairwise similarities, pairwise dissimilarities, and unlabeled data all together into the empirical risk minimization framework. Our main idea is to combine the risks computed from SU, DU, and SD data, in a manner similar to positive-negative-unlabeled classification (Sakai et al., 2017), an unbiased risk estimation approach to semisupervised classification. Since each of $R_{\mathrm{SU}}(f)$, $R_{\mathrm{DU}}(f)$, and $R_{\mathrm{SD}}(f)$ is an equivalent expression of the true classification risk, the following convex combination of those risks is still equivalent to $R(f)$:
$$R^{\gamma}_{\mathrm{SDU}}(f) := \gamma_1 R_{\mathrm{SU}}(f) + \gamma_2 R_{\mathrm{DU}}(f) + \gamma_3 R_{\mathrm{SD}}(f),$$
(3.15)
where $\gamma = (\gamma_1, \gamma_2, \gamma_3)$ is a hyperparameter that satisfies $\gamma_1, \gamma_2, \gamma_3 \ge 0$ and $\gamma_1 + \gamma_2 + \gamma_3 = 1$. In section 4.2, based on theoretical analysis, we propose a strategy to reduce the tuning cost of $\gamma$ by fixing $\gamma_1 = 0$.

### 3.3  Practical Implementation

We investigate the objective function with a linear classifier $f(x) = w^\top \phi(x) + b$, where $w \in \mathbb{R}^k$ and $b \in \mathbb{R}$ are weights and $\phi: \mathbb{R}^d \to \mathbb{R}^k$ is a mapping function. The empirical risk minimization with $L_2$ regularization can then be described by
$$\min_{w}\ \widehat{R}^{\gamma}_{\mathrm{SDU}}(w) + \frac{\lambda}{2}\|w\|^2,$$
(3.16)
where $\widehat{R}^{\gamma}_{\mathrm{SDU}}$ is an empirical estimator of $R^{\gamma}_{\mathrm{SDU}}$ and $\lambda > 0$ is the $L_2$ regularization parameter. In the rest of this letter, we suppose that the loss function $\ell$ is a margin-based loss. As defined in Mohri, Rostamizadeh, and Talwalkar (2012), we call $\ell$ a margin-based loss function if there exists $\psi: \mathbb{R} \to \mathbb{R}_{\ge 0}$ such that $\ell(z, t) = \psi(tz)$. In general, the optimization problem in equation 3.16 is nonconvex even if $\ell$ is a convex margin-based loss. However, if we choose $\ell$ satisfying the following property, the optimization problem becomes convex.
Theorem 2.
Suppose that the loss function $\ell(z, t)$ is a convex margin-based loss, twice differentiable in $z$ almost everywhere (for every fixed $t \in \{\pm 1\}$), and satisfies the following condition:
$$\ell(z, +1) - \ell(z, -1) = -z.$$
(3.17)
Then the optimization problem in equation 3.16 is convex.

Several loss functions that satisfy the conditions in theorem 2 are shown in Table 1, borrowed from Patrini, Nielsen, Nock, and Carioni (2016) and Bao et al. (2018). Next, we consider the optimization problem with the squared loss and the double hinge loss, respectively.

Table 1:

Margin-Based Loss Functions That Satisfy the Conditions in Theorem 2.

| Loss Name | $\psi(tz)$ |
| --- | --- |
| Squared loss | $\frac{1}{4}(tz - 1)^2$ |
| Logistic loss | $\log(1 + \exp(-tz))$ |
| Double hinge loss | $\max\bigl(-tz,\ \max\bigl(0,\ \frac{1}{2} - \frac{1}{2}tz\bigr)\bigr)$ |
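The losses in Table 1 can be checked against condition 3.17 numerically; a quick sketch (the grid and names are ours):

```python
import numpy as np

# Each loss in Table 1 should satisfy l(z, +1) - l(z, -1) = -z (condition 3.17).
losses = {
    "squared": lambda z, t: 0.25 * (t * z - 1.0) ** 2,
    "logistic": lambda z, t: np.log1p(np.exp(-t * z)),
    "double hinge": lambda z, t: np.maximum(-t * z, np.maximum(0.0, 0.5 - 0.5 * t * z)),
}

z = np.linspace(-5.0, 5.0, 201)
condition_holds = {
    name: bool(np.allclose(ell(z, +1) - ell(z, -1), -z)) for name, ell in losses.items()
}
```

The standard hinge loss $\max(0, 1 - tz)$ fails the same check, which is why section 3.3.2 resorts to the double hinge loss.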

#### 3.3.1  Squared Loss

We consider the optimization problem in equation 3.16 with the squared loss defined by
$$\ell_{\mathrm{SQ}}(z, t) = \frac{1}{4}(tz - 1)^2.$$
(3.18)
For convenience, we denote the pointwise samples in $\mathcal{D}_{\mathrm{S}}$ and $\mathcal{D}_{\mathrm{D}}$ as
$$\widetilde{\mathcal{D}}_{\mathrm{S}} := \{\tilde{x}_{\mathrm{S},i}\}_{i=1}^{2n_{\mathrm{S}}} = \bigcup\{x_{\mathrm{S}}, x'_{\mathrm{S}} \mid (x_{\mathrm{S}}, x'_{\mathrm{S}}) \in \mathcal{D}_{\mathrm{S}}\},$$
(3.19)
$$\widetilde{\mathcal{D}}_{\mathrm{D}} := \{\tilde{x}_{\mathrm{D},i}\}_{i=1}^{2n_{\mathrm{D}}} = \bigcup\{x_{\mathrm{D}}, x'_{\mathrm{D}} \mid (x_{\mathrm{D}}, x'_{\mathrm{D}}) \in \mathcal{D}_{\mathrm{D}}\}.$$
(3.20)
Then the objective function $\widehat{R}^{\gamma}_{\mathrm{SDU}}(w) + \frac{\lambda}{2}\|w\|^2$ can be written as
$$\frac{1}{4}w^\top\left(\gamma_3\left(\frac{\pi_{\mathrm{S}}}{2n_{\mathrm{S}}}X_{\mathrm{S}}^\top X_{\mathrm{S}} + \frac{\pi_{\mathrm{D}}}{2n_{\mathrm{D}}}X_{\mathrm{D}}^\top X_{\mathrm{D}}\right) + \frac{\gamma_1 + \gamma_2}{n_{\mathrm{U}}}X_{\mathrm{U}}^\top X_{\mathrm{U}} + 2\lambda I\right)w + \frac{1}{\pi_+ - \pi_-}\left(-\frac{\pi_{\mathrm{S}}}{2n_{\mathrm{S}}}\left(\gamma_1 + \frac{\gamma_3}{2}\right)X_{\mathrm{S}}^\top\mathbf{1} + \frac{\pi_{\mathrm{D}}}{2n_{\mathrm{D}}}\left(\gamma_2 + \frac{\gamma_3}{2}\right)X_{\mathrm{D}}^\top\mathbf{1} + \frac{\gamma_1 - \gamma_2}{2n_{\mathrm{U}}}X_{\mathrm{U}}^\top\mathbf{1}\right)^\top w + \mathrm{const.},$$
(3.21)
where
$$X_{\mathrm{S}} := [\phi(\tilde{x}_{\mathrm{S},1}), \dots, \phi(\tilde{x}_{\mathrm{S},2n_{\mathrm{S}}})]^\top, \quad X_{\mathrm{D}} := [\phi(\tilde{x}_{\mathrm{D},1}), \dots, \phi(\tilde{x}_{\mathrm{D},2n_{\mathrm{D}}})]^\top, \quad X_{\mathrm{U}} := [\phi(x_{\mathrm{U},1}), \dots, \phi(x_{\mathrm{U},n_{\mathrm{U}}})]^\top.$$
We denote by $\mathbf{1}$ the vector whose elements are all ones and by $I$ the identity matrix. Since this function is a nondegenerate quadratic form in $w$, the solution of this minimization problem can be obtained analytically as
$$\hat{w} = \frac{1}{\pi_+ - \pi_-}\left(\gamma_3\left(\frac{\pi_{\mathrm{S}}}{2n_{\mathrm{S}}}X_{\mathrm{S}}^\top X_{\mathrm{S}} + \frac{\pi_{\mathrm{D}}}{2n_{\mathrm{D}}}X_{\mathrm{D}}^\top X_{\mathrm{D}}\right) + \frac{\gamma_1 + \gamma_2}{n_{\mathrm{U}}}X_{\mathrm{U}}^\top X_{\mathrm{U}} + 2\lambda I\right)^{-1}\left(\frac{\pi_{\mathrm{S}}}{n_{\mathrm{S}}}\left(\gamma_1 + \frac{\gamma_3}{2}\right)X_{\mathrm{S}}^\top\mathbf{1} - \frac{\pi_{\mathrm{D}}}{n_{\mathrm{D}}}\left(\gamma_2 + \frac{\gamma_3}{2}\right)X_{\mathrm{D}}^\top\mathbf{1} + \frac{\gamma_2 - \gamma_1}{n_{\mathrm{U}}}X_{\mathrm{U}}^\top\mathbf{1}\right).$$
(3.22)
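This closed form is a few lines of linear algebra; below is a sketch for the identity feature map $\phi(x) = x$ (the names and the verification helper are ours). The helper `sdu_squared_objective` evaluates $\widehat{R}^{\gamma}_{\mathrm{SDU}}(w) + \frac{\lambda}{2}\|w\|^2$ directly from the corrected losses of equations 2.14 and 2.15, which lets one check numerically that `sdu_closed_form` returns its minimizer:

```python
import numpy as np

def sdu_closed_form(XS, XD, XU, pi_p, gamma, lam):
    """Analytic minimizer (equation 3.22); XS, XD hold the pointwise
    similar/dissimilar samples (2nS x k and 2nD x k), XU is nU x k."""
    g1, g2, g3 = gamma
    pi_m = 1.0 - pi_p
    pi_s, pi_d = pi_p**2 + pi_m**2, 2.0 * pi_p * pi_m
    k = XS.shape[1]
    A = (g3 * (pi_s / len(XS) * XS.T @ XS + pi_d / len(XD) * XD.T @ XD)
         + (g1 + g2) / len(XU) * XU.T @ XU + 2.0 * lam * np.eye(k))
    b = (2.0 * pi_s / len(XS) * (g1 + g3 / 2) * XS.sum(axis=0)
         - 2.0 * pi_d / len(XD) * (g2 + g3 / 2) * XD.sum(axis=0)
         + (g2 - g1) / len(XU) * XU.sum(axis=0)) / (pi_p - pi_m)
    return np.linalg.solve(A, b)

def sdu_squared_objective(w, XS, XD, XU, pi_p, gamma, lam):
    """Regularized empirical SDU risk with the squared loss, assembled
    directly from the pointwise corrected losses."""
    g1, g2, g3 = gamma
    pi_m = 1.0 - pi_p
    pi_s, pi_d = pi_p**2 + pi_m**2, 2.0 * pi_p * pi_m
    ell = lambda z, t: 0.25 * (t * z - 1.0) ** 2
    L = lambda z, t: (pi_p * ell(z, t) - pi_m * ell(z, -t)) / (pi_p - pi_m)
    Lt = lambda z: L(z, +1) - L(z, -1)
    fS, fD, fU = XS @ w, XD @ w, XU @ w
    r_su = pi_s * np.mean(Lt(fS)) + np.mean(L(fU, -1))
    r_du = -pi_d * np.mean(Lt(fD)) + np.mean(L(fU, +1))
    r_sd = pi_s * np.mean(L(fS, +1)) + pi_d * np.mean(L(fD, -1))
    return g1 * r_su + g2 * r_du + g3 * r_sd + lam / 2 * w @ w
```

Since the objective is a strictly convex quadratic for $\lambda > 0$, the returned $\hat{w}$ beats any perturbed weight vector.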

#### 3.3.2  Double Hinge Loss

The standard hinge loss $\ell_{\mathrm{H}}(z, t) = \max(0, 1 - tz)$ does not satisfy the condition in equation 3.17. As an alternative, the double hinge loss $\ell_{\mathrm{DH}}(z, t) = \max\bigl(-tz, \max\bigl(0, \frac12 - \frac12 tz\bigr)\bigr)$ was proposed by du Plessis et al. (2015). The optimization problem in equation 3.16 with the double hinge loss can be solved by quadratic programming. The objective function $\widehat{R}^{\gamma}_{\mathrm{SDU}}(w) + \frac{\lambda}{2}\|w\|^2$ can be represented as
$$\begin{aligned}
&-\frac{\gamma_1\pi_{\mathrm{S}}}{2n_{\mathrm{S}}(\pi_+ - \pi_-)}\sum_{i=1}^{2n_{\mathrm{S}}}w^\top\phi(\tilde{x}_{\mathrm{S},i}) + \frac{\gamma_2\pi_{\mathrm{D}}}{2n_{\mathrm{D}}(\pi_+ - \pi_-)}\sum_{i=1}^{2n_{\mathrm{D}}}w^\top\phi(\tilde{x}_{\mathrm{D},i}) \\
&+ \frac{\gamma_3\pi_{\mathrm{S}}}{2n_{\mathrm{S}}(\pi_+ - \pi_-)}\sum_{i=1}^{2n_{\mathrm{S}}}\Bigl(\pi_+\ell_{\mathrm{DH}}(w^\top\phi(\tilde{x}_{\mathrm{S},i}), +1) - \pi_-\ell_{\mathrm{DH}}(w^\top\phi(\tilde{x}_{\mathrm{S},i}), -1)\Bigr) \\
&- \frac{\gamma_3\pi_{\mathrm{D}}}{2n_{\mathrm{D}}(\pi_+ - \pi_-)}\sum_{i=1}^{2n_{\mathrm{D}}}\Bigl(\pi_-\ell_{\mathrm{DH}}(w^\top\phi(\tilde{x}_{\mathrm{D},i}), +1) - \pi_+\ell_{\mathrm{DH}}(w^\top\phi(\tilde{x}_{\mathrm{D},i}), -1)\Bigr) \\
&+ \frac{1}{n_{\mathrm{U}}(\pi_+ - \pi_-)}\sum_{i=1}^{n_{\mathrm{U}}}\Bigl((\gamma_2\pi_+ - \gamma_1\pi_-)\ell_{\mathrm{DH}}(w^\top\phi(x_{\mathrm{U},i}), +1) + (\gamma_1\pi_+ - \gamma_2\pi_-)\ell_{\mathrm{DH}}(w^\top\phi(x_{\mathrm{U},i}), -1)\Bigr) + \frac{\lambda}{2}w^\top w.
\end{aligned}$$
(3.23)
Using slack variables $\xi = \{\xi_{\mathrm{S}}, \xi_{\mathrm{D}}, \xi_{\mathrm{U}}\}$ and $\eta = \{\eta_{\mathrm{S}}, \eta_{\mathrm{D}}, \eta_{\mathrm{U}}\}$, we can rewrite the optimization problem in equation 3.16 as
$$\begin{aligned}
\min_{w,\xi,\eta}\quad & -\frac{\gamma_1\pi_{\mathrm{S}}}{2n_{\mathrm{S}}(\pi_+ - \pi_-)}\mathbf{1}^\top X_{\mathrm{S}}w + \frac{\gamma_2\pi_{\mathrm{D}}}{2n_{\mathrm{D}}(\pi_+ - \pi_-)}\mathbf{1}^\top X_{\mathrm{D}}w + \frac{\gamma_3\pi_+\pi_{\mathrm{S}}}{2n_{\mathrm{S}}(\pi_+ - \pi_-)}\mathbf{1}^\top\xi_{\mathrm{S}} - \frac{\gamma_3\pi_-\pi_{\mathrm{S}}}{2n_{\mathrm{S}}(\pi_+ - \pi_-)}\mathbf{1}^\top\eta_{\mathrm{S}} \\
& - \frac{\gamma_3\pi_-\pi_{\mathrm{D}}}{2n_{\mathrm{D}}(\pi_+ - \pi_-)}\mathbf{1}^\top\xi_{\mathrm{D}} + \frac{\gamma_3\pi_+\pi_{\mathrm{D}}}{2n_{\mathrm{D}}(\pi_+ - \pi_-)}\mathbf{1}^\top\eta_{\mathrm{D}} + \frac{\gamma_2\pi_+ - \gamma_1\pi_-}{n_{\mathrm{U}}(\pi_+ - \pi_-)}\mathbf{1}^\top\xi_{\mathrm{U}} + \frac{\gamma_1\pi_+ - \gamma_2\pi_-}{n_{\mathrm{U}}(\pi_+ - \pi_-)}\mathbf{1}^\top\eta_{\mathrm{U}} + \frac{\lambda}{2}w^\top w \\
\text{s.t.}\quad & \xi_{\mathrm{S}} \ge 0, \quad \xi_{\mathrm{S}} \ge \tfrac12\mathbf{1} - \tfrac12 X_{\mathrm{S}}w, \quad \xi_{\mathrm{S}} \ge -X_{\mathrm{S}}w, \qquad \eta_{\mathrm{S}} \ge 0, \quad \eta_{\mathrm{S}} \ge \tfrac12\mathbf{1} + \tfrac12 X_{\mathrm{S}}w, \quad \eta_{\mathrm{S}} \ge X_{\mathrm{S}}w, \\
& \xi_{\mathrm{D}} \ge 0, \quad \xi_{\mathrm{D}} \ge \tfrac12\mathbf{1} - \tfrac12 X_{\mathrm{D}}w, \quad \xi_{\mathrm{D}} \ge -X_{\mathrm{D}}w, \qquad \eta_{\mathrm{D}} \ge 0, \quad \eta_{\mathrm{D}} \ge \tfrac12\mathbf{1} + \tfrac12 X_{\mathrm{D}}w, \quad \eta_{\mathrm{D}} \ge X_{\mathrm{D}}w, \\
& \xi_{\mathrm{U}} \ge 0, \quad \xi_{\mathrm{U}} \ge \tfrac12\mathbf{1} - \tfrac12 X_{\mathrm{U}}w, \quad \xi_{\mathrm{U}} \ge -X_{\mathrm{U}}w, \qquad \eta_{\mathrm{U}} \ge 0, \quad \eta_{\mathrm{U}} \ge \tfrac12\mathbf{1} + \tfrac12 X_{\mathrm{U}}w, \quad \eta_{\mathrm{U}} \ge X_{\mathrm{U}}w,
\end{aligned}$$
(3.24)
where $\ge$ for vectors indicates elementwise inequality.

### 3.4  Class Prior Estimation from Pairwise Data

Although the exact positive class proportion $\pi_+$ has to be known in advance to compute the empirical risk $\widehat{R}_{\mathrm{SDU}}$, it is often unknown in practice. Here, we show that $\pi_+$ can be estimated from the number of similar pairs $n_{\mathrm{S}}$ and the number of dissimilar pairs $n_{\mathrm{D}}$. The positive ratio $\pi_+$ in pointwise data and the similar ratio $\pi_{\mathrm{S}}$ in pairwise data have the following relationship:
$$\pi_+ = \begin{cases}\dfrac{1 + \sqrt{2\pi_{\mathrm{S}} - 1}}{2} & (\pi_+ \ge \frac12), \\[8pt] \dfrac{1 - \sqrt{2\pi_{\mathrm{S}} - 1}}{2} & (\text{otherwise}).\end{cases}$$
(3.25)
The above equality is derived by solving $\pi_{\mathrm{S}} = \pi_+^2 + (1 - \pi_+)^2$ for $\pi_+$. Since $\hat{\pi}_{\mathrm{S}} = n_{\mathrm{S}}/(n_{\mathrm{S}} + n_{\mathrm{D}})$ is an unbiased estimator of $\pi_{\mathrm{S}}$, $\pi_+$ can be estimated by plugging $\hat{\pi}_{\mathrm{S}}$ into equation 3.25.
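A sketch of this estimator, assuming it is known a priori which class is the majority (the names and the noise guard are ours):

```python
import math

def estimate_class_prior(n_s, n_d, positive_majority=True):
    """Estimate pi_+ from similar/dissimilar pair counts via equation 3.25."""
    pi_s_hat = n_s / (n_s + n_d)                       # unbiased estimate of pi_S
    root = math.sqrt(max(2.0 * pi_s_hat - 1.0, 0.0))   # clip sampling noise below pi_S = 1/2
    return 0.5 * (1.0 + root) if positive_majority else 0.5 * (1.0 - root)
```

For example, with $n_{\mathrm{S}} = 58$ and $n_{\mathrm{D}} = 42$ (so $\hat{\pi}_{\mathrm{S}} = 0.58$), the estimate is $\pi_+ = 0.7$.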

## 4  Theoretical Analysis

In this section, we analyze estimation error bounds for our methods. We first derive estimation error bounds for the SU, DU, and SD classification methods via Rademacher complexity. By comparing these bounds, we find a nontrivial relationship in the performances of these methods. It also gives a strategy to reduce the cost of hyperparameter tuning in the SDU classification method. Finally, we derive an estimation error bound for the SDU classification method.

### 4.1  Estimation Error Bounds for SU, DU, and SD Classification

We investigate estimation error bounds for the SU, DU, and SD classification methods. Let $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ be a function class of the specified model:

Definition 1
(Rademacher Complexity). Let $n$ be a positive integer, $Z_1, \dots, Z_n$ be independent and identically distributed (i.i.d.) random variables drawn from a probability distribution with density $\mu$, $\mathcal{H} = \{h: \mathcal{Z} \to \mathbb{R}\}$ be a class of measurable functions, and $\boldsymbol{\sigma} = (\sigma_1, \dots, \sigma_n)$ be Rademacher variables, that is, random variables taking $+1$ and $-1$ with even probabilities. Then the (expected) Rademacher complexity of $\mathcal{H}$ is defined as
$$\mathfrak{R}(\mathcal{H}; n, \mu) := \mathbb{E}_{Z_1, \dots, Z_n \sim \mu}\,\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{h \in \mathcal{H}}\frac{1}{n}\sum_{i=1}^n \sigma_i h(Z_i)\right].$$
(4.1)
For the function class $\mathcal{F}$ and any probability density $\mu$, we assume
$$\mathfrak{R}(\mathcal{F}; n, \mu) \le \frac{C_{\mathcal{F}}}{\sqrt{n}}.$$
(4.2)
This assumption holds for many models, such as the linear-in-parameter model class $\mathcal{F} = \{f(x) = w^\top\phi(x)\}$, as shown in Mohri et al. (2012). Partially based on Bao et al. (2018), we have estimation error bounds for the SU, DU, and SD classification methods as follows.
Theorem 3.
Let $R(f) = \mathbb{E}[\ell(f(x), y)]$ be the classification risk of a function $f$, $f^* \in \mathcal{F}$ be its minimizer, and $\hat{f}_{\mathrm{SU}}, \hat{f}_{\mathrm{DU}}, \hat{f}_{\mathrm{SD}}$ be the minimizers of the empirical SU, DU, and SD risks in $\mathcal{F}$, respectively. Assume that $\pi_+ \ne \frac12$, that the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and that all functions in the model class $\mathcal{F}$ are bounded, that is, there exists a constant $C_b$ such that $\|f\|_\infty \le C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{t \in \{\pm 1\}}\ell(C_b, t)$. For any $\delta > 0$, each of the following inequalities holds independently with probability at least $1 - \delta$:
$$R(\hat{f}_{\mathrm{SU}}) - R(f^*) \le C_{\mathcal{F},\ell,\delta}\left(\frac{2\pi_{\mathrm{S}}}{\sqrt{2n_{\mathrm{S}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}}\right),$$
(4.3)
$$R(\hat{f}_{\mathrm{DU}}) - R(f^*) \le C_{\mathcal{F},\ell,\delta}\left(\frac{2\pi_{\mathrm{D}}}{\sqrt{2n_{\mathrm{D}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}}\right),$$
(4.4)
$$R(\hat{f}_{\mathrm{SD}}) - R(f^*) \le C_{\mathcal{F},\ell,\delta}\left(\frac{\pi_{\mathrm{S}}}{\sqrt{2n_{\mathrm{S}}}} + \frac{\pi_{\mathrm{D}}}{\sqrt{2n_{\mathrm{D}}}}\right),$$
(4.5)
where
$$C_{\mathcal{F},\ell,\delta} = \frac{1}{|\pi_+ - \pi_-|}\left(4\rho C_{\mathcal{F}} + 2C_\ell\sqrt{2\log\frac{8}{\delta}}\right).$$
(4.6)

### 4.2  Comparison of SU, DU, and SD Bounds

Here, we compare the SU, DU, and SD classification methods from the perspective of their estimation error bounds. Under the generation process of similar and dissimilar pairs in equation 2.2, we have the following claim:

Theorem 4.

Suppose similar and dissimilar pairs follow the generation process in equation 2.2. We denote the right-hand sides of equations 4.3, 4.4, and 4.5 by $V_{\mathrm{SU}}$, $V_{\mathrm{DU}}$, and $V_{\mathrm{SD}}$, respectively. Then $V_{\mathrm{DU}} \le V_{\mathrm{SU}}$ and $V_{\mathrm{SD}} \le V_{\mathrm{SU}}$ hold with probability at least $1 - \exp(-c\,n_{\mathrm{SD}})$ for some constant $c > 0$.

Proof.
If $\frac{\pi_{\mathrm{S}}}{\sqrt{2n_{\mathrm{S}}}} > \frac{\pi_{\mathrm{D}}}{\sqrt{2n_{\mathrm{D}}}}$ holds, we have
$$\frac{V_{\mathrm{SU}} - C_{\mathcal{F},\ell,\delta}/\sqrt{n_{\mathrm{U}}}}{V_{\mathrm{DU}} - C_{\mathcal{F},\ell,\delta}/\sqrt{n_{\mathrm{U}}}} = \frac{\pi_{\mathrm{S}}/\sqrt{2n_{\mathrm{S}}}}{\pi_{\mathrm{D}}/\sqrt{2n_{\mathrm{D}}}} > 1$$
and
$$V_{\mathrm{SU}} - V_{\mathrm{SD}} = C_{\mathcal{F},\ell,\delta}\left(\frac{\pi_{\mathrm{S}}}{\sqrt{2n_{\mathrm{S}}}} - \frac{\pi_{\mathrm{D}}}{\sqrt{2n_{\mathrm{D}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}}\right) > \frac{C_{\mathcal{F},\ell,\delta}}{\sqrt{n_{\mathrm{U}}}} > 0.$$
These two inequalities imply $V_{\mathrm{DU}} < V_{\mathrm{SU}}$ and $V_{\mathrm{SD}} < V_{\mathrm{SU}}$. Since we assume the generation process in equation 2.2, the class of each pair (i.e., similar or dissimilar) follows a Bernoulli distribution. Therefore, the number of pairs in each class follows a binomial distribution, namely, $n_{\mathrm{D}} \sim \mathrm{Binomial}(n_{\mathrm{SD}}, \pi_{\mathrm{D}})$ and $n_{\mathrm{S}} = n_{\mathrm{SD}} - n_{\mathrm{D}}$. By using Chernoff's inequality (Okamoto, 1959), we have
$$P\left(\frac{\pi_{\mathrm{S}}}{\sqrt{2n_{\mathrm{S}}}} \le \frac{\pi_{\mathrm{D}}}{\sqrt{2n_{\mathrm{D}}}}\right) = P\left(n_{\mathrm{D}} \le \frac{n_{\mathrm{SD}}\pi_{\mathrm{D}}^2}{\pi_{\mathrm{S}}^2 + \pi_{\mathrm{D}}^2}\right) \le \exp\left(-\frac{n_{\mathrm{SD}}\pi_{\mathrm{D}}}{2(1 - \pi_{\mathrm{D}})}\left(1 - \frac{\pi_{\mathrm{D}}}{\pi_{\mathrm{S}}^2 + \pi_{\mathrm{D}}^2}\right)^2\right).$$
Finally, we obtain
$$P(V_{\mathrm{DU}} \le V_{\mathrm{SU}} \wedge V_{\mathrm{SD}} \le V_{\mathrm{SU}}) \ge 1 - \exp\left(-\frac{n_{\mathrm{SD}}\pi_{\mathrm{D}}}{2(1 - \pi_{\mathrm{D}})}\left(1 - \frac{\pi_{\mathrm{D}}}{\pi_{\mathrm{S}}^2 + \pi_{\mathrm{D}}^2}\right)^2\right).$$

#### 4.2.1  SDDU Classification for Efficient Hyperparameter Search

Theorem 4 states that $\max\{V_{\mathrm{SD}}, V_{\mathrm{DU}}\} \le V_{\mathrm{SU}}$ holds with high probability when $n_{\mathrm{SD}}$ is sufficiently large. It suggests that both the DU and SD classification methods are likely to outperform the SU classification method when pairwise similarities, pairwise dissimilarities, and unlabeled data are all given in advance. Inspired by this result, we propose a strategy to reduce the computation cost by fixing $\gamma_1 = 0$ in equation 3.15; that is, the classification risk is always estimated with the DU and SD risks. We call this method the SDDU classification method to distinguish it from the general SDU classification method. In sections 5.2 and 5.3, we experimentally demonstrate that the SDDU classification method performs at the same level as or better than the SDU classification method.

### 4.3  Estimation Error Bound for SDU Classification

We derive an estimation error bound for the SDU classification method. With the same technique as in theorem 3, we have the following bound:

Theorem 5.
Let $R(f) = \mathbb{E}[\ell(f(x), y)]$ be the classification risk of a function $f$, $f^* \in \mathcal{F}$ be its minimizer, and $\hat{f}_{\mathrm{SDU}} \in \mathcal{F}$ be a minimizer of the empirical risk $\widehat{R}^{\gamma}_{\mathrm{SDU}}$. Assume that $\pi_+ \ne \frac12$, that the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and that all functions in the model class $\mathcal{F}$ are bounded, that is, there exists a constant $C_b$ such that $\|f\|_\infty \le C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{t \in \{\pm 1\}}\ell(C_b, t)$. For any $\delta > 0$, with probability at least $1 - \delta$,
$$R(\hat{f}_{\mathrm{SDU}}) - R(f^*) \le C'_{\mathcal{F},\ell,\delta}\left((2\gamma_1 + \gamma_3)\frac{\pi_{\mathrm{S}}}{\sqrt{2n_{\mathrm{S}}}} + (2\gamma_2 + \gamma_3)\frac{\pi_{\mathrm{D}}}{\sqrt{2n_{\mathrm{D}}}} + \bigl(|\gamma_1\pi_- - \gamma_2\pi_+| + |\gamma_1\pi_+ - \gamma_2\pi_-|\bigr)\frac{1}{\sqrt{n_{\mathrm{U}}}}\right),$$
(4.7)
where
$$C'_{\mathcal{F},\ell,\delta} = \frac{1}{|\pi_+ - \pi_-|}\left(4\rho C_{\mathcal{F}} + 2C_\ell\sqrt{2\log\frac{12}{\delta}}\right).$$
(4.8)

Theorem 5 ensures that the estimation error of $\hat{f}_{\mathrm{SDU}}$ diminishes asymptotically, that is, $R(\hat{f}_{\mathrm{SDU}}) - R(f^*) \to 0$ as $n_{\mathrm{S}}, n_{\mathrm{D}}, n_{\mathrm{U}} \to \infty$. On the negative side, it should also be noted that $C'_{\mathcal{F},\ell,\delta}$ is inversely proportional to $|\pi_+ - \pi_-|$, which implies that the estimation error can increase as $\pi_+$ and $\pi_-$ approach each other.

## 5  Experiments

In this section, we experimentally investigate the behavior of the proposed methods on benchmark data sets. First, we compare the performances of the SU, DU, and SD classification methods to confirm that the SD and DU classification methods are likely to perform better than the SU classification method, as discussed in section 4.2. Second, we demonstrate that unlabeled data can improve the classification accuracy of the SDU classification method. Finally, we compare the performance of the SDU classification method and those of baseline methods.

We conducted experiments on 10 benchmark data sets obtained from the UCI Machine Learning Repository (Dua & Graff, 2017) and LIBSVM (Chang & Lin, 2011). To obtain pairwise training data, we first converted pointwise labeled data into pairs by coupling. Then we randomly subsampled similar and dissimilar pairs following the ratio of $\pi_{\mathrm{S}}$ to $\pi_{\mathrm{D}}$. To obtain unlabeled data, we randomly picked positive and negative data following the ratio of $\pi_+$ to $\pi_-$. The labeled data for testing were created in the same way as the unlabeled data, and the number of test data points was set to 500.

For the SDU classification method, as well as the SU, DU, and SD classification methods, a linear-in-input model $f(x) = w^\top x + b$ was used as the classifier. The weight of $L_2$ regularization was chosen from $\{10^{-1}, 10^{-4}, 10^{-7}\}$. Each of the coefficient parameters $(\gamma_1, \gamma_2, \gamma_3)$ was chosen from $\{0, \frac13, \frac23, 1\}$ subject to $\gamma_1 + \gamma_2 + \gamma_3 = 1$. All hyperparameters were tuned with 5-fold cross-validation on the empirical classification error computed from similarities and dissimilarities, that is, $\widehat{R}_{\mathrm{SD}}$ equipped with the zero-one loss. The squared loss was used for the experiments in sections 5.1 and 5.2, and the double hinge loss in section 5.3. We assumed that the true positive class proportion $\pi_+$ is known for computing the empirical risk.
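The pair-construction protocol described above can be sketched as follows (a hypothetical helper; the function and variable names are ours):

```python
import numpy as np

def make_pairs(x, y, n_pairs, rng):
    """Couple labeled points uniformly at random and attach the pairwise
    label tau = +1 (similar) or -1 (dissimilar), as in section 2.2."""
    i = rng.integers(0, len(x), size=n_pairs)
    j = rng.integers(0, len(x), size=n_pairs)
    tau = np.where(y[i] == y[j], 1, -1)
    return x[i], x[j], tau
```

The resulting similar and dissimilar pairs would then be subsampled to match the ratio of $\pi_{\mathrm{S}}$ to $\pi_{\mathrm{D}}$, as in the protocol above.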

### 5.1  Comparison of SU, DU, and SD Performances

We compared the performances of the SU, DU, and SD classification methods. We set the number of unlabeled training data to 500 and the number of pairwise training data to each of $\{50, 100, 200, 300, 400, 500\}$. Training and test data were generated while maintaining $\pi_+=0.7$. The misclassification rates for each method are plotted in Figure 2.

### 5.2  Performance Improvement with Unlabeled Data

We investigated the effect of unlabeled data in the SDU classification method. The number of pairwise data was fixed at $n_{SD}=50$. As in the previous experiment, training and test data were generated while maintaining $\pi_+=0.7$. Three methods, the SD, SDU, and SDDU classification methods, were evaluated in each setting. The misclassification rates for each method are plotted in Figure 3.

### 5.3  Benchmark Comparison of SDU and Existing Methods

We evaluated the performances of the SDU/SDDU classification methods and eight baseline methods on benchmark data sets. We set $n_U=500$ and $n_{SD}\in\{50,200\}$. In each trial, the misclassification rate was measured on 500 test examples. To see the influence of the class prior on our methods, we conducted experiments in a moderately imbalanced case ($\pi_+=0.7$) and a fairly imbalanced case ($\pi_+=0.9$). We report the results for each setup in Table 2. The details of the baseline methods are described below.

Table 2:

Mean Misclassification Rate and Standard Error on Different Benchmark Data Sets over 50 Trials.

(a) $n_{SD}=50$, $\pi_+=0.7$. SDU and SDDU are the proposed methods; the remaining columns are baselines.

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adult | 123 | 26.4 (1.12) | 23.6 (0.83) | 35.2 (1.03) | 35.0 (0.81) | 33.4 (1.05) | 30.6 (0.29) | 38.2 (0.64) | 38.3 (0.99) | 41.4 (0.87) | 26.0 (1.10) |
| banana | | 33.9 (0.73) | 33.5 (0.68) | 35.7 (0.72) | 47.1 (0.36) | 47.3 (0.35) | 41.3 (0.73) | 47.1 (0.36) | 43.7 (0.63) | 38.0 (0.98) | 33.0 (0.77) |
| codrna | | 20.1 (1.14) | 18.9 (1.17) | 24.6 (1.01) | 37.4 (0.50) | 38.5 (0.40) | 45.4 (0.99) | 37.4 (0.50) | 41.0 (0.66) | 37.1 (0.99) | 32.1 (1.20) |
| ijcnn1 | 22 | 33.0 (0.73) | 32.2 (0.61) | 36.5 (0.96) | 44.5 (0.63) | 45.3 (0.50) | 40.0 (0.79) | 44.9 (0.71) | 44.9 (0.60) | 42.1 (0.89) | 32.9 (1.13) |
| magic | 10 | 34.5 (0.83) | 34.2 (0.87) | 40.0 (0.82) | 47.6 (0.21) | 48.2 (0.18) | 47.3 (0.27) | 47.6 (0.21) | 42.2 (0.52) | 43.2 (0.74) | 29.9 (1.20) |
| phishing | 68 | 22.1 (1.21) | 21.4 (1.11) | 27.3 (1.30) | 37.4 (0.33) | 37.4 (0.31) | 31.9 (0.30) | 37.4 (0.34) | 33.8 (1.09) | 39.7 (0.81) | 17.8 (1.58) |
| phoneme | | 29.6 (0.78) | 29.2 (0.79) | 32.7 (0.98) | 32.2 (0.34) | 31.1 (0.45) | 33.5 (1.01) | 32.2 (0.34) | 35.5 (1.01) | 37.8 (0.92) | 32.1 (0.91) |
| spambase | 57 | 21.0 (1.40) | 20.4 (1.27) | 31.7 (1.71) | 36.3 (1.08) | 35.8 (1.06) | 29.6 (0.31) | 39.1 (1.10) | 37.1 (1.14) | 39.8 (0.92) | 14.3 (0.83) |
| w8a | 300 | 36.2 (1.26) | 33.0 (1.19) | 41.3 (0.80) | 30.7 (0.26) | 34.0 (0.67) | 36.0 (0.69) | 31.1 (0.29) | 31.7 (0.36) | 43.4 (0.75) | 39.3 (1.07) |
| waveform | 21 | 17.3 (0.98) | 15.8 (0.88) | 26.7 (1.48) | 48.5 (0.17) | 48.4 (0.18) | 46.7 (0.35) | 48.5 (0.17) | 30.6 (1.66) | 36.4 (1.43) | 12.2 (0.31) |
| # Outperform | | | | | | | | | | | |
(b) $n_{SD}=50$, $\pi_+=0.9$

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adult | 123 | 9.8 (0.43) | 9.7 (0.41) | 23.7 (0.66) | 22.0 (1.55) | 33.7 (1.42) | 11.5 (0.26) | 28.5 (1.46) | 41.7 (0.79) | 38.5 (0.94) | 14.6 (1.25) |
| banana | | 10.4 (0.23) | 10.4 (0.22) | 12.7 (0.63) | 45.5 (0.31) | 46.0 (0.30) | 31.7 (1.47) | 45.4 (0.31) | 39.3 (0.80) | 23.9 (1.97) | 10.7 (0.32) |
| codrna | | 6.9 (0.39) | 6.9 (0.45) | 15.8 (0.70) | 32.9 (1.12) | 36.6 (1.05) | 42.3 (1.93) | 32.8 (1.13) | 38.0 (1.54) | 33.1 (1.65) | 10.0 (0.22) |
| ijcnn1 | 22 | 10.0 (0.31) | 9.9 (0.31) | 14.3 (0.62) | 40.0 (1.00) | 41.0 (0.81) | 29.4 (1.34) | 39.1 (1.12) | 41.1 (0.73) | 36.7 (1.55) | 14.0 (1.32) |
| magic | 10 | 11.7 (0.33) | 11.6 (0.29) | 21.8 (0.85) | 36.4 (0.45) | 38.8 (0.38) | 45.6 (0.44) | 36.4 (0.45) | 29.5 (1.15) | 41.1 (0.99) | 14.5 (0.96) |
| phishing | 68 | 8.7 (0.38) | 8.5 (0.32) | 18.0 (0.85) | 24.7 (0.36) | 25.7 (0.38) | 13.6 (0.41) | 24.8 (0.36) | 33.1 (1.35) | 40.5 (0.94) | 9.2 (0.94) |
| phoneme | | 11.1 (0.27) | 11.3 (0.27) | 15.8 (0.67) | 40.4 (0.51) | 40.8 (0.35) | 26.7 (1.76) | 40.1 (0.56) | 36.2 (1.39) | 31.3 (1.62) | 11.7 (0.85) |
| spambase | 57 | 8.8 (0.23) | 8.5 (0.24) | 16.7 (0.53) | 18.6 (1.48) | 32.3 (1.12) | 11.6 (0.34) | 21.7 (1.46) | 33.2 (1.88) | 40.0 (0.98) | 8.1 (0.78) |
| w8a | 300 | 8.3 (0.50) | 8.3 (0.50) | 26.0 (0.56) | 11.2 (0.19) | 11.7 (0.22) | 18.3 (1.00) | 11.7 (0.26) | 14.6 (0.80) | 35.8 (1.23) | 27.4 (1.52) |
| waveform | 21 | 5.2 (0.24) | 5.1 (0.24) | 8.2 (0.73) | 48.6 (0.17) | 48.5 (0.18) | 47.6 (0.23) | 48.6 (0.17) | 44.1 (0.71) | 35.1 (1.52) | 5.4 (0.74) |
| # Outperform | | 10 | 10 | | | | | | | | |
(c) $n_{SD}=200$, $\pi_+=0.7$

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adult | 123 | 18.0 (0.46) | 17.8 (0.31) | 21.3 (0.54) | 36.7 (0.76) | 28.1 (0.90) | 30.7 (0.31) | 39.1 (0.64) | 33.0 (0.95) | 42.7 (0.79) | 19.7 (0.39) |
| banana | | 30.5 (0.42) | 31.0 (0.42) | 32.8 (0.52) | 47.5 (0.24) | 47.5 (0.24) | 33.5 (1.29) | 47.5 (0.24) | 42.3 (0.50) | 35.2 (0.77) | 32.0 (0.67) |
| codrna | | 10.8 (0.69) | 9.4 (0.41) | 18.1 (0.88) | 37.2 (0.51) | 40.5 (0.50) | 46.8 (0.72) | 37.4 (0.54) | 36.6 (1.33) | 27.6 (1.28) | 10.5 (0.79) |
| ijcnn1 | 22 | 23.3 (0.44) | 22.6 (0.40) | 28.4 (0.64) | 45.8 (0.29) | 46.9 (0.35) | 40.5 (0.78) | 46.4 (0.35) | 44.3 (0.66) | 43.1 (0.81) | 16.2 (0.64) |
| magic | 10 | 25.9 (0.65) | 25.6 (0.55) | 30.2 (0.71) | 48.0 (0.18) | 48.3 (0.17) | 47.4 (0.29) | 48.0 (0.18) | 40.3 (0.59) | 43.4 (0.77) | 22.1 (0.68) |
| phishing | 68 | 12.0 (0.42) | 12.0 (0.41) | 17.2 (0.87) | 37.4 (0.31) | 37.2 (0.30) | 31.6 (0.29) | 37.4 (0.31) | 22.8 (1.34) | 42.9 (0.67) | 7.0 (0.18) |
| phoneme | | 25.5 (0.49) | 25.5 (0.49) | 27.5 (0.67) | 32.2 (0.31) | 29.0 (0.60) | 28.0 (0.73) | 32.0 (0.37) | 32.9 (1.07) | 37.1 (0.90) | 25.5 (0.51) |
| spambase | 57 | 12.8 (0.27) | 12.3 (0.23) | 16.2 (0.62) | 38.2 (1.20) | 29.6 (0.56) | 29.4 (0.29) | 40.4 (1.14) | 34.8 (1.34) | 39.6 (1.05) | 9.5 (0.20) |
| w8a | 300 | 20.4 (0.80) | 18.8 (0.69) | 35.8 (0.71) | 30.7 (0.27) | 43.7 (0.61) | 32.6 (0.60) | 31.7 (0.33) | 33.2 (0.69) | 45.1 (0.57) | 27.6 (0.76) |
| waveform | 21 | 12.8 (0.29) | 12.6 (0.29) | 15.6 (0.59) | 48.5 (0.16) | 48.5 (0.14) | 46.9 (0.31) | 48.4 (0.16) | 17.5 (1.40) | 35.3 (1.26) | 10.7 (0.19) |
| # Outperform | | | | | | | | | | | |
(d) $n_{SD}=200$, $\pi_+=0.9$

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| adult | 123 | 8.4 (0.24) | 8.3 (0.24) | 11.2 (0.32) | 27.4 (1.41) | 43.6 (0.95) | 11.1 (0.27) | 27.8 (1.25) | 43.3 (0.76) | 39.3 (1.11) | 9.0 (0.21) |
| banana | | 10.2 (0.19) | 10.2 (0.19) | 10.5 (0.24) | 45.5 (0.30) | 46.6 (0.32) | 25.8 (1.77) | 45.5 (0.29) | 40.0 (0.81) | 24.1 (1.63) | 10.2 (0.19) |
| codrna | | 4.1 (0.18) | 4.0 (0.19) | 9.8 (0.39) | 32.2 (1.10) | 40.4 (0.79) | 40.7 (2.11) | 32.5 (1.17) | 38.1 (1.16) | 29.1 (1.53) | 7.6 (0.20) |
| ijcnn1 | 22 | 8.4 (0.20) | 8.3 (0.20) | 9.4 (0.24) | 40.4 (0.82) | 43.1 (0.63) | 27.7 (1.38) | 41.1 (0.90) | 41.2 (1.00) | 38.9 (1.26) | 7.7 (0.16) |
| magic | 10 | 10.3 (0.23) | 10.2 (0.22) | 16.8 (0.73) | 37.0 (0.33) | 41.4 (0.34) | 45.2 (0.43) | 37.0 (0.32) | 32.7 (1.50) | 38.6 (1.35) | 10.0 (0.19) |
| phishing | 68 | 6.3 (0.21) | 6.3 (0.22) | 8.9 (0.38) | 24.4 (0.26) | 27.6 (0.33) | 13.7 (0.38) | 24.5 (0.28) | 38.4 (1.22) | 40.8 (0.81) | 3.7 (0.13) |
| phoneme | | 10.3 (0.21) | 10.3 (0.19) | 12.4 (0.41) | 40.2 (0.54) | 40.5 (0.39) | 24.9 (1.70) | 40.3 (0.54) | 33.2 (1.53) | 34.1 (1.50) | 10.2 (0.19) |
| spambase | 57 | 7.5 (0.19) | 7.5 (0.19) | 8.1 (0.24) | 20.2 (1.33) | 40.4 (0.60) | 10.9 (0.25) | 22.9 (1.18) | 31.8 (1.31) | 40.6 (1.17) | 5.9 (0.15) |
| w8a | 300 | 6.0 (0.18) | 6.0 (0.18) | 17.2 (0.41) | 11.2 (0.21) | 18.1 (1.07) | 12.6 (0.67) | 11.7 (0.24) | 12.8 (0.48) | 38.8 (0.99) | 9.1 (0.30) |
| waveform | 21 | 4.5 (0.13) | 4.6 (0.14) | 5.2 (0.22) | 48.5 (0.17) | 48.5 (0.20) | 47.6 (0.22) | 48.5 (0.17) | 39.3 (1.13) | 34.7 (1.54) | 4.4 (0.14) |
| # Outperform | | | | | | | | | | | |

Note: Bold numbers indicate outperforming methods, chosen by a one-sided $t$-test at a significance level of 5%. # Dim. $=$ number of dimensions.

#### 5.3.1  SU Classification (SU)

The first baseline method is the SU classification method (Bao et al., 2018), where the classification risk is estimated from similar pairs and unlabeled data in an unbiased manner. This method is a special case of SDU classification in which the coefficient parameters are fixed at $(\gamma_1,\gamma_2,\gamma_3)=(1,0,0)$.

#### 5.3.2  K-Means Clustering (KM)

K-means clustering (MacQueen, 1967) is one of the most popular unsupervised methods. It was applied to the training data, ignoring all pairwise information, and labels of test data were predicted from the learned clusters.
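A minimal two-cluster sketch of this baseline, together with the sign-free evaluation described in section 5.3.9, can be written as below. This is our own plain Lloyd's-algorithm stand-in (the deterministic initialization from the first and last points is our simplification, not the paper's setup):

```python
import numpy as np

def two_means(x, n_iter=50):
    """Plain two-cluster Lloyd's algorithm, standing in for the KM baseline.
    Deterministic initialization from the first and last point (our choice)."""
    centers = np.stack([x[0], x[-1]]).astype(float)
    for _ in range(n_iter):
        # assign each point to its nearest center, then recompute centers
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = x[assign == k].mean(axis=0)
    return centers

def clustering_error(centers, x_test, y_test):
    """Clusters carry no positive/negative sign, so report min(r, 1 - r)."""
    d = np.linalg.norm(x_test[:, None, :] - centers[None, :, :], axis=2)
    pred = np.where(d.argmin(axis=1) == 1, 1, -1)
    r = float(np.mean(pred != y_test))
    return min(r, 1.0 - r)
```

Keeping test samples out of `two_means` mirrors the evaluation protocol for the k-means-based baselines described in section 5.3.9.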

#### 5.3.3  Constrained K-Means Clustering (CKM)

Constrained K-means clustering (Wagstaff et al., 2001) is a semisupervised clustering method based on K-means clustering, where pairwise similarities and dissimilarities are treated as must-links and cannot-links, respectively.

#### 5.3.4  Semisupervised Spectral Clustering (SSP)

Semisupervised spectral clustering was proposed in Chen and Feng (2012), where similar and dissimilar labels are propagated through an affinity matrix. We set $k=5$ for constructing the affinity matrix with a k-nearest-neighbor graph and $\sigma^2=1$ for the precision parameter used in the similarity measure.
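One standard way to build such an affinity matrix is a Gaussian similarity restricted to a symmetric k-nearest-neighbor graph. The sketch below is our own illustration of that construction, not necessarily the exact one used by Chen and Feng (2012):

```python
import numpy as np

def knn_affinity(x, k=5, sigma2=1.0):
    """Gaussian affinity restricted to a symmetric k-nearest-neighbor graph
    (our sketch of a typical affinity-matrix construction for spectral
    clustering; k = 5 and sigma2 = 1 match the settings in the text)."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    a = np.exp(-d2 / (2.0 * sigma2))
    np.fill_diagonal(a, 0.0)             # no self-links
    # keep only each point's k strongest links, then symmetrize
    weakest = np.argsort(-a, axis=1)[:, k:]
    np.put_along_axis(a, weakest, 0.0, axis=1)
    return np.maximum(a, a.T)
```

Label propagation for SSP then operates on this (symmetric, nonnegative) matrix.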

#### 5.3.5  Information-Theoretical Metric Learning (ITML)

Information-theoretical metric learning (Davis et al., 2007) is an algorithm that learns a matrix parameterizing the Mahalanobis distance on given data points. Similar and dissimilar pairs are used for regularizing the covariance matrix. To predict labels of test samples, k-means clustering was applied with the learned metric. We used the identity matrix as prior information, and the slack parameter $\gamma$ was set to 1.

Figure 2:

Average misclassification rate and standard error as a function of the number of similar and dissimilar pairs over 50 trials. Performances are shown for the SU (red), DU (green), and SD (blue) classification methods.

Figure 3:

Average misclassification rate and standard error as a function of the number of unlabeled samples over 100 trials. Performances are shown for the SD (red), SDU (green), and SDDU (blue) classification methods.

#### 5.3.6  Contrastive Learning (CRL)

Contrastive learning (Arora et al., 2019) is another framework for learning a useful representation by leveraging similarity information. We used a linear model $g(x)=Wx$ with $W\in\mathbb{R}^{d'\times d}$ as an embedding function from the input space to the representation space. In this experiment, we fixed $d'=10$ for all data sets. Each triplet used for training was created by concatenating a similar pair and an example randomly picked from the unlabeled data. With the learned representations, K-means clustering was applied in the same manner as in the KM method.
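The logistic contrastive loss of Arora et al. (2019), restricted to a single negative per triplet and a linear embedding, can be sketched as follows (a hedged illustration with our own names; it is not the exact training code of the CRL baseline):

```python
import numpy as np

def contrastive_loss(W, anchor, positive, negative):
    """Logistic contrastive loss over triplets for a linear embedding
    g(x) = Wx, where W is d' x d. Each triplet couples a similar pair
    (anchor, positive) with an unlabeled point (negative)."""
    za, zp, zn = anchor @ W.T, positive @ W.T, negative @ W.T
    # small when the anchor has larger inner product with the positive
    margin = np.sum(za * zn, axis=1) - np.sum(za * zp, axis=1)
    return float(np.mean(np.log1p(np.exp(margin))))
```

Minimizing this loss in $W$ pulls embedded similar pairs together relative to random unlabeled points; the learned $g$ is then handed to K-means as described above.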

#### 5.3.7  On the Value of Pairwise Constraints (OVPC)

A classification-based approach was proposed in Zhang and Yan (2007), where an auxiliary classifier is trained on feature vectors obtained from pairwise examples. The trained classifier is then converted into a function that can be applied to pointwise prediction. The weight of $L_2$ regularization was chosen from $\{10^{-1},10^{-4},10^{-7}\}$ by 5-fold cross-validation.

#### 5.3.8  Meta-Classification Likelihood (MCL)

A meta-learning approach was recently proposed by Hsu et al. (2019), where the objective is maximum likelihood estimation over similar and dissimilar labels. The conditional class probability was modeled by $p(y=1\mid x)=\{1+\exp(w^\top x+b)\}^{-1}$. A stochastic gradient descent algorithm was applied for optimization.
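The MCL likelihood follows from the observation that two points are similar exactly when their labels agree, so $P(\text{similar}\mid x,x')=p\,p'+(1-p)(1-p')$ with $p=p(y=1\mid x)$. A sketch of the resulting negative log-likelihood, using the probability model quoted above (function and variable names are ours):

```python
import numpy as np

def mcl_nll(w, b, x1, x2, same):
    """Negative log-likelihood of pairwise labels under the MCL model.

    same[i] is True for a similar pair (x1[i], x2[i]) and False for a
    dissimilar one; p(y=1|x) = 1 / (1 + exp(w^T x + b)) as in the text.
    """
    p1 = 1.0 / (1.0 + np.exp(x1 @ w + b))
    p2 = 1.0 / (1.0 + np.exp(x2 @ w + b))
    p_same = p1 * p2 + (1.0 - p1) * (1.0 - p2)   # labels agree
    ll = np.where(same, np.log(p_same), np.log(1.0 - p_same))
    return float(-np.mean(ll))
```

Stochastic gradient descent on this objective over mini-batches of pairs reproduces the training loop described above.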

#### 5.3.9  Setup for Clustering Algorithms

For the clustering methods, the number of clusters was set to two. To evaluate the accuracy of the k-means-based clustering methods (i.e., KM, CKM, and ITML), test samples were completely separated from training samples, and the labels of test samples were predicted based on the clusters obtained from the training samples only. For SSP, the clustering algorithm was applied to both training and test samples so that predictions could be made on the test samples. Since there is no explicit positive or negative assignment in clustering methods, their performances are evaluated by $\min(r,1-r)$, where $r$ is the misclassification rate.

### 5.4  Discussion

In section 4.2, we stated that the SD and DU classification methods are likely to outperform the SU classification method, based on a comparison of their estimation error bounds. As shown in Figure 2, the misclassification rates of the SU, DU, and SD classification methods are consistent with this statement.

Figure 3 indicates that more unlabeled data lead to better classification performance for the SDU and SDDU classification methods. We also found that the SDDU classification method not only reduces the computation cost of tuning the coefficient parameters but also often outperforms the SDU classification method. This might indicate the difficulty of tuning the coefficient parameters $(\gamma_1,\gamma_2,\gamma_3)$ by cross-validation with only similarities and dissimilarities.

Table 2 demonstrates that the SDU and SDDU classification methods perform better than, or comparably to, the other baselines in many scenarios. Specifically, we observed that the superiority of the proposed methods is most pronounced when the number of pairwise data is limited and the positive and negative class priors are fairly imbalanced (see Table 2b). The first property suggests that the advantage gained from unlabeled data becomes significant when the amount of pairwise supervision is relatively small. The second property is consistent with the theoretical analysis in section 4.3, which states that the estimation error of the proposed method can increase as the two class priors approach each other. Furthermore, we confirmed that our methods always benefit from an increased number of pairwise data, while most of the clustering-based methods do not.

## 6  Conclusion and Future Work

In this letter, we proposed a novel weakly supervised classification method, similar-dissimilar-unlabeled (SDU) classification, where the classification risk is computed from pairwise similarities and dissimilarities and unlabeled data. We derived the estimation error bound for the proposed method and confirmed convergence to the optimal solution. From the theoretical analysis, we developed a strategy to reduce the computation cost for tuning the hyperparameter. Through experiments on benchmark data sets, we demonstrated that our SDU classification method performs better than baseline methods.

We discuss three important directions for future work. First, further research on the multiclass classification scenario is required. Our formulation relies on the connection between classification of similarity and classification of binary class labels. Since both are classification problems with binary outcomes, the extension to the multiclass case is not straightforward unless additional information is available. Second, in the SDU classification method, the positive and negative class proportions must not be equal, that is, $\pi_+\neq\pi_-$. Even if they are not exactly equal, the estimation error can increase as $\pi_+\to\frac{1}{2}$, as mentioned in section 4.3. Our recent study (Bao, Shimada, Xu, Sato, & Sugiyama, 2020) partially overcomes this problem by ignoring the sign identification of a classifier; that is, a classifier is trained to either minimize or maximize the classification error, but we cannot know which is achieved without auxiliary information. Finally, the use of different types of pairwise supervision should be explored. Although this letter focused on binary representations of similarity and dissimilarity information, it would be more appealing if we could extend our method to handle other types of pairwise supervision, for example, confidence scores (Ishida, Niu, & Sugiyama, 2018) and triplet comparisons (Schroff, Kalenichenko, & Philbin, 2015; Cui, Charoenphakdee, Sato, & Sugiyama, 2020).

## Appendix: Proofs of Theorems

In this appendix, we give complete proofs of the theorems in sections 3 and 4.

### A.1  Proof of Theorem 1

We can express the joint density for pairwise unlabeled examples as
$p(x,x')=p(x,x',\tau=+1)+p(x,x',\tau=-1)=\pi_S\,p_S(x,x')+\pi_D\,p_D(x,x').$
(A.1)
This provides the following relationship in conditional expectations:
$\mathbb{E}_{(X,X')\sim p(x,x')}[\cdot]=\pi_S\,\mathbb{E}_{(X,X')\sim p_S(x,x')}[\cdot]+\pi_D\,\mathbb{E}_{(X,X')\sim p_D(x,x')}[\cdot].$
(A.2)
In addition, we can rewrite the conditional expectation over pointwise unlabeled data into that over pairwise unlabeled data. For a binary variable $t∈{+1,-1}$, we have
$\mathbb{E}_{X\sim p_U(x)}[L(f(X),t)]=\frac{1}{2}\mathbb{E}_{X\sim p_U(x)}[L(f(X),t)]+\frac{1}{2}\mathbb{E}_{X'\sim p_U(x)}[L(f(X'),t)]=\frac{1}{2}\mathbb{E}_{(X,X')\sim p(x,x')}[L(f(X),t)]+\frac{1}{2}\mathbb{E}_{(X,X')\sim p(x,x')}[L(f(X'),t)]=\mathbb{E}_{(X,X')\sim p(x,x')}\left[\frac{L(f(X),t)+L(f(X'),t)}{2}\right].$
(A.3)
With equations A.2 and A.3, we can transform the SU risk defined in equation 2.13 into the DU risk and the SD risk as follows:
$R_{SU}(f)=\pi_S\,\mathbb{E}_{(X,X')\sim p_S(x,x')}\left[\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]+\mathbb{E}_{X\sim p_U(x)}[L(f(X),-1)]=\pi_S\,\mathbb{E}_{(X,X')\sim p_S(x,x')}\left[\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]+\mathbb{E}_{(X,X')\sim p(x,x')}\left[\frac{L(f(X),-1)+L(f(X'),-1)}{2}\right]=\mathbb{E}_{(X,X')\sim p(x,x')}\left[\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]-\pi_D\,\mathbb{E}_{(X,X')\sim p_D(x,x')}\left[\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]+\mathbb{E}_{(X,X')\sim p(x,x')}\left[\frac{L(f(X),-1)+L(f(X'),-1)}{2}\right]=\pi_D\,\mathbb{E}_{(X,X')\sim p_D(x,x')}\left[-\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]+\mathbb{E}_{(X,X')\sim p(x,x')}\left[\frac{L(f(X),+1)+L(f(X'),+1)}{2}\right]=\pi_D\,\mathbb{E}_{(X,X')\sim p_D(x,x')}\left[-\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]+\mathbb{E}_{X\sim p_U(x)}[L(f(X),+1)]=R_{DU}(f),$
$R_{SU}(f)=\pi_S\,\mathbb{E}_{(X,X')\sim p_S(x,x')}\left[\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]+\mathbb{E}_{X\sim p_U(x)}[L(f(X),-1)]=\pi_S\,\mathbb{E}_{(X,X')\sim p_S(x,x')}\left[\frac{\tilde{L}(f(X))+\tilde{L}(f(X'))}{2}\right]+\pi_S\,\mathbb{E}_{(X,X')\sim p_S(x,x')}\left[\frac{L(f(X),-1)+L(f(X'),-1)}{2}\right]+\pi_D\,\mathbb{E}_{(X,X')\sim p_D(x,x')}\left[\frac{L(f(X),-1)+L(f(X'),-1)}{2}\right]=\pi_S\,\mathbb{E}_{(X,X')\sim p_S(x,x')}\left[\frac{L(f(X),+1)+L(f(X'),+1)}{2}\right]+\pi_D\,\mathbb{E}_{(X,X')\sim p_D(x,x')}\left[\frac{L(f(X),-1)+L(f(X'),-1)}{2}\right]=R_{SD}(f).$
As shown in proposition 1, $R_{SU}$ is an equivalent expression of the classification risk. Therefore, $R_{DU}$ and $R_{SD}$ are also equivalent expressions of the classification risk. $\Box$

### A.2  Proof of Theorem 2

We prove this theorem based on the positive semidefiniteness of the Hessian matrix, similarly to SU classification in Bao et al. (2018). Since $\ell$ is a twice-differentiable margin-based loss, there is a twice-differentiable function $\psi:\mathbb{R}\to\mathbb{R}_{\geq 0}$ such that $\ell(z,t)=\psi(tz)$. Our objective function, $J(w):=\hat{R}^{\gamma}_{SDU}(w)+\frac{\lambda}{2}\|w\|^2$, can be written as
$J(w)=\frac{\lambda}{2}w^\top w-\frac{\gamma_1\pi_S}{2n_S(\pi_+-\pi_-)}\sum_{i=1}^{2n_S}w^\top\phi(\tilde{x}_{S,i})+\frac{\gamma_2\pi_D}{2n_D(\pi_+-\pi_-)}\sum_{i=1}^{2n_D}w^\top\phi(\tilde{x}_{D,i})+\frac{\gamma_3\pi_S}{2n_S(\pi_+-\pi_-)}\sum_{i=1}^{2n_S}\left[\pi_+\ell(w^\top\phi(\tilde{x}_{S,i}),+1)-\pi_-\ell(w^\top\phi(\tilde{x}_{S,i}),-1)\right]-\frac{\gamma_3\pi_D}{2n_D(\pi_+-\pi_-)}\sum_{i=1}^{2n_D}\left[\pi_-\ell(w^\top\phi(\tilde{x}_{D,i}),+1)-\pi_+\ell(w^\top\phi(\tilde{x}_{D,i}),-1)\right]+\frac{1}{n_U(\pi_+-\pi_-)}\sum_{i=1}^{n_U}\left[(\gamma_2\pi_+-\gamma_1\pi_-)\,\ell(w^\top\phi(x_{U,i}),+1)+(\gamma_1\pi_+-\gamma_2\pi_-)\,\ell(w^\top\phi(x_{U,i}),-1)\right].$
(A.4)
The second-order derivative of $ℓ(z,t)$ with respect to $z$ can be computed as
$\frac{\partial^2\ell(z,t)}{\partial z^2}=\frac{\partial^2\psi(tz)}{\partial z^2}=t^2\frac{\partial^2\psi(\xi)}{\partial\xi^2}=\frac{\partial^2\psi(\xi)}{\partial\xi^2},$
(A.5)
where $\xi=tz$ is employed in the second equality and $t\in\{+1,-1\}$ in the last equality. The Hessian of $J(w)$ with respect to $w$ is
$H_J(w)=\lambda I+\frac{\partial^2\psi(\xi)}{\partial\xi^2}\left[\frac{\gamma_3}{2n_S}\sum_{i=1}^{2n_S}\phi(\tilde{x}_{S,i})\phi(\tilde{x}_{S,i})^\top+\frac{\gamma_3}{2n_D}\sum_{i=1}^{2n_D}\phi(\tilde{x}_{D,i})\phi(\tilde{x}_{D,i})^\top+\frac{\gamma_1+\gamma_2}{n_U}\sum_{i=1}^{n_U}\phi(x_{U,i})\phi(x_{U,i})^\top\right]\succeq 0,$
(A.6)
where $A\succeq 0$ means that a matrix $A$ is positive semidefinite. Positive semidefiniteness of $H_J(w)$ follows from $\frac{\partial^2\psi(\xi)}{\partial\xi^2}\geq 0$ (since $\ell$ is convex) and $\phi(\tilde{x})\phi(\tilde{x})^\top\succeq 0$. Therefore, $J(w)$ is convex with respect to $w$. $\Box$
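Theorem 2 can also be checked numerically. Below is our own sketch of the objective $J(w)$ of equation A.4, instantiated with the squared loss $\ell(z,t)=\frac{1}{4}(z-t)^2$ (which satisfies $\ell(z,+1)-\ell(z,-1)=-z$, so the linear terms appear exactly as written) and identity features $\phi(x)=x$; all variable names are ours:

```python
import numpy as np

def sdu_objective(w, xs, xd, xu, pi_plus, gam, lam=1e-3):
    """Regularized SDU objective J(w) of eq. A.4 with the squared loss
    ell(z, t) = (z - t)^2 / 4 and identity features phi(x) = x.
    xs/xd/xu hold the points from similar/dissimilar pairs and unlabeled data."""
    g1, g2, g3 = gam
    pim = 1.0 - pi_plus
    pis, pid = pi_plus ** 2 + pim ** 2, 2 * pi_plus * pim
    den = pi_plus - pim                  # requires pi_+ != pi_-
    ell = lambda z, t: 0.25 * (z - t) ** 2
    zs, zd, zu = xs @ w, xd @ w, xu @ w

    J = 0.5 * lam * w @ w
    J -= g1 * pis / (len(zs) * den) * zs.sum()
    J += g2 * pid / (len(zd) * den) * zd.sum()
    J += g3 * pis / (len(zs) * den) * (pi_plus * ell(zs, +1) - pim * ell(zs, -1)).sum()
    J -= g3 * pid / (len(zd) * den) * (pim * ell(zd, +1) - pi_plus * ell(zd, -1)).sum()
    J += ((g2 * pi_plus - g1 * pim) * ell(zu, +1)
          + (g1 * pi_plus - g2 * pim) * ell(zu, -1)).sum() / (len(zu) * den)
    return float(J)
```

On random data with $\pi_+=0.7$, midpoint convexity $J\!\left(\frac{a+b}{2}\right)\leq\frac{J(a)+J(b)}{2}$ holds for every pair of weight vectors we try, as theorem 2 predicts.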

### A.3  Proof of Theorem 3

We apply a technique similar to that used for the SU classification method to the DU and SD classification methods. Using the pointwise distributions defined in equations 3.12 and 3.13, we have the following lemma.

Lemma 1.
Assume that $\pi_+\neq\frac{1}{2}$. Given any function $f:\mathcal{X}\to\mathbb{R}$, define $R_{\tilde{S}U}$, $R_{\tilde{D}U}$, and $R_{\widetilde{SD}}$ by
$R_{\tilde{S}U}:=\pi_S\,\mathbb{E}_{X\sim\tilde{p}_S(x)}[\tilde{L}(f(X))]+\mathbb{E}_{X\sim p_U(x)}[L(f(X),-1)],$
(A.7)
$R_{\tilde{D}U}:=\pi_D\,\mathbb{E}_{X\sim\tilde{p}_D(x)}[-\tilde{L}(f(X))]+\mathbb{E}_{X\sim p_U(x)}[L(f(X),+1)],$
(A.8)
$R_{\widetilde{SD}}:=\pi_S\,\mathbb{E}_{X\sim\tilde{p}_S(x)}[L(f(X),+1)]+\pi_D\,\mathbb{E}_{X\sim\tilde{p}_D(x)}[L(f(X),-1)].$
(A.9)
Then, $R_{\tilde{S}U}$, $R_{\tilde{D}U}$, and $R_{\widetilde{SD}}$ are equivalent to $R_{SU}$, $R_{DU}$, and $R_{SD}$, respectively.
Here, empirical versions of the above risks are defined as
$\hat{R}_{\tilde{S}U}:=\frac{\pi_S}{2n_S}\sum_{i=1}^{2n_S}\tilde{L}(f(\tilde{x}_{S,i}))+\frac{1}{n_U}\sum_{i=1}^{n_U}L(f(x_{U,i}),-1),$
(A.10)
$\hat{R}_{\tilde{D}U}:=-\frac{\pi_D}{2n_D}\sum_{i=1}^{2n_D}\tilde{L}(f(\tilde{x}_{D,i}))+\frac{1}{n_U}\sum_{i=1}^{n_U}L(f(x_{U,i}),+1),$
(A.11)
$\hat{R}_{\widetilde{SD}}:=\frac{\pi_S}{2n_S}\sum_{i=1}^{2n_S}L(f(\tilde{x}_{S,i}),+1)+\frac{\pi_D}{2n_D}\sum_{i=1}^{2n_D}L(f(\tilde{x}_{D,i}),-1).$
(A.12)
Note that these empirical risks are also equivalent to $\hat{R}_{SU}$, $\hat{R}_{DU}$, and $\hat{R}_{SD}$. Now we introduce the uniform deviation bound, which is useful for deriving estimation error bounds. The proof can be found in textbooks such as Mohri et al. (2012).
Lemma 2.
Let $Z$ be a random variable drawn from a probability distribution with density $\mu$, $\mathcal{H}=\{h:\mathcal{Z}\to[0,M]\}$ $(M>0)$ be a class of measurable functions, and $\{z_i\}_{i=1}^{n}$ be i.i.d. samples drawn from the distribution with density $\mu$. Then, for any $\delta>0$, with probability at least $1-\delta$,
$\sup_{h\in\mathcal{H}}\left(\mathbb{E}_{Z\sim\mu}[h(Z)]-\frac{1}{n}\sum_{i=1}^{n}h(z_i)\right)\leq 2\mathfrak{R}(\mathcal{H};\mu,n)+M\sqrt{\frac{2\log\frac{2}{\delta}}{2n}}.$
(A.13)
We can derive the estimation error bound for the SU classification method as
$R(\hat{f}_{SU})-R(f^*)=R_{SU}(\hat{f}_{SU})-R_{SU}(f^*)\leq\left(R_{SU}(\hat{f}_{SU})-\hat{R}_{SU}(\hat{f}_{SU})\right)+\left(\hat{R}_{SU}(f^*)-R_{SU}(f^*)\right)\leq 2\sup_{f\in\mathcal{F}}\left|R_{SU}(f)-\hat{R}_{SU}(f)\right|=2\sup_{f\in\mathcal{F}}\left|R_{\tilde{S}U}(f)-\hat{R}_{\tilde{S}U}(f)\right|\leq 2\pi_S\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim\tilde{p}_S}[\tilde{L}(f(X))]-\frac{1}{2n_S}\sum_{i=1}^{2n_S}\tilde{L}(f(\tilde{x}_{S,i}))\right|+2\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p_U}[L(f(X),-1)]-\frac{1}{n_U}\sum_{i=1}^{n_U}L(f(x_{U,i}),-1)\right|.$
(A.14)
In the same way, for DU and SD, we have
$R(\hat{f}_{DU})-R(f^*)\leq 2\pi_D\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim\tilde{p}_D}[\tilde{L}(f(X))]-\frac{1}{2n_D}\sum_{i=1}^{2n_D}\tilde{L}(f(\tilde{x}_{D,i}))\right|+2\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p_U}[L(f(X),+1)]-\frac{1}{n_U}\sum_{i=1}^{n_U}L(f(x_{U,i}),+1)\right|,$
(A.15)
$R(\hat{f}_{SD})-R(f^*)\leq 2\pi_S\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim\tilde{p}_S}[L(f(X),+1)]-\frac{1}{2n_S}\sum_{i=1}^{2n_S}L(f(\tilde{x}_{S,i}),+1)\right|+2\pi_D\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim\tilde{p}_D}[L(f(X),-1)]-\frac{1}{2n_D}\sum_{i=1}^{2n_D}L(f(\tilde{x}_{D,i}),-1)\right|.$
(A.16)
To obtain upper bounds on the right-hand sides for each method, we derive uniform deviation bounds for $\tilde{L}(f(\cdot))$ and $L(f(\cdot),\pm 1)$ as follows.
Lemma 3.
Assume that $\pi_+\neq\frac{1}{2}$, that the loss function $\ell$ is a $\rho$-Lipschitz function with respect to the first argument ($0<\rho<\infty$), and that all functions in the model class $\mathcal{F}$ are bounded, that is, there exists a constant $C_b$ such that $\|f\|_\infty\leq C_b$ for any $f\in\mathcal{F}$. Let $C_\ell:=\sup_{t\in\{\pm 1\}}\ell(C_b,t)$ and let $\{x_i\}_{i=1}^{n}$ be i.i.d. samples drawn from a probability distribution with density $p$. For any $\delta>0$, each of the following inequalities holds with probability at least $1-\delta$:
$\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p}[\tilde{L}(f(X))]-\frac{1}{n}\sum_{i=1}^{n}\tilde{L}(f(x_i))\right|\leq\frac{4\rho C_{\mathcal{F}}+2C_\ell\sqrt{2\log\frac{4}{\delta}}}{|\pi_+-\pi_-|\sqrt{n}},$
(A.17)
$\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p}[L(f(X),+1)]-\frac{1}{n}\sum_{i=1}^{n}L(f(x_i),+1)\right|\leq\frac{2\rho C_{\mathcal{F}}+\frac{1}{2}C_\ell\sqrt{2\log\frac{4}{\delta}}}{|\pi_+-\pi_-|\sqrt{n}},$
(A.18)
$\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p}[L(f(X),-1)]-\frac{1}{n}\sum_{i=1}^{n}L(f(x_i),-1)\right|\leq\frac{2\rho C_{\mathcal{F}}+\frac{1}{2}C_\ell\sqrt{2\log\frac{4}{\delta}}}{|\pi_+-\pi_-|\sqrt{n}}.$
(A.19)
Proof.
$\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p}[\tilde{L}(f(X))]-\frac{1}{n}\sum_{i=1}^{n}\tilde{L}(f(x_i))\right|\leq\frac{1}{|\pi_+-\pi_-|}\underbrace{\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p}[\ell(f(X),+1)]-\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i),+1)\right|}_{\text{with probability at least }1-\delta/2}+\frac{1}{|\pi_+-\pi_-|}\underbrace{\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p}[\ell(f(X),-1)]-\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i),-1)\right|}_{\text{with probability at least }1-\delta/2}\leq\underbrace{\frac{1}{|\pi_+-\pi_-|}\left(4\mathfrak{R}(\ell\circ\mathcal{F};n,p)+2C_\ell\sqrt{\frac{2\log\frac{4}{\delta}}{n}}\right)}_{\text{with probability at least }1-\delta},$
(A.20)
where $\ell\circ\mathcal{F}$ denotes the class $\{\ell\circ f\mid f\in\mathcal{F}\}$. By applying Talagrand's lemma,
$\mathfrak{R}(\ell\circ\mathcal{F};n,p)\leq\rho\,\mathfrak{R}(\mathcal{F};n,p).$
(A.21)
With the assumption in equation 4.2, we obtain
$\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{X\sim p}[\tilde{L}(f(X))]-\frac{1}{n}\sum_{i=1}^{n}\tilde{L}(f(x_i))\right|\leq\frac{1}{|\pi_+-\pi_-|}\left(\frac{4\rho C_{\mathcal{F}}}{\sqrt{n}}+2C_\ell\sqrt{\frac{2\log\frac{4}{\delta}}{n}}\right)=\frac{4\rho C_{\mathcal{F}}+2C_\ell\sqrt{2\log\frac{4}{\delta}}}{|\pi_+-\pi_-|\sqrt{n}}.$
(A.22)
The bounds for $L(f(\cdot),\pm 1)$ can be proven similarly to that for $\tilde{L}(f(\cdot))$.

By combining lemma 3 and equations A.14 to A.16, we complete the proof of theorem 3. $\Box$

### A.4  Proof of Theorem 5

Let $R^{\gamma}_{SDU}(f):=\gamma_1 R_{SU}(f)+\gamma_2 R_{DU}(f)+\gamma_3 R_{SD}(f)$. We can rewrite this risk as follows:
$R^{\gamma}_{SDU}(f)=\frac{\pi_S}{\pi_+-\pi_-}\mathbb{E}_{X\sim\tilde{p}_S(x)}\left[(\gamma_1+\gamma_3\pi_+)\,\ell(f(X),+1)-(\gamma_1+\gamma_3\pi_-)\,\ell(f(X),-1)\right]+\frac{\pi_D}{\pi_+-\pi_-}\mathbb{E}_{X\sim\tilde{p}_D(x)}\left[-(\gamma_2+\gamma_3\pi_-)\,\ell(f(X),+1)+(\gamma_2+\gamma_3\pi_+)\,\ell(f(X),-1)\right]+\frac{1}{\pi_+-\pi_-}\mathbb{E}_{X\sim p_U(x)}\left[(\gamma_2\pi_+-\gamma_1\pi_-)\,\ell(f(X),+1)+(\gamma_1\pi_+-\gamma_2\pi_-)\,\ell(f(X),-1)\right].$
(A.23)
Applying the uniform deviation bounds to each of the S, D, and U terms as in theorem 3, theorem 5 can be proven. $\Box$

## Acknowledgments

H.B. was supported by JST ACT-I grant JPMJPR18UI. I.S. was supported by JST CREST grant JPMJCR17A1, Japan. M.S. was supported by JST CREST grant JPMJCR1403.

## References

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., & Saunshi, N. (2019). A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning.

Bao, H., Niu, G., & Sugiyama, M. (2018). Classification from pairwise similarity and unlabeled data. In Proceedings of the 35th International Conference on Machine Learning (p. 452).

Bao, H., Shimada, T., Xu, L., Sato, I., & Sugiyama, M. (2020). Similarity-based classification: Connecting similarity learning to binary classification. arXiv:2006.06207.

Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (p. 27).

Basu, S., Davidson, I., & Wagstaff, K. (2008). Constrained clustering: Advances in algorithms, theory, and applications. Boca Raton, FL: CRC Press.

Bilenko, M., Basu, S., & Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning (p. 839).

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, art. 27.

Chapelle, O., Schölkopf, B., & Zien, A. (2010). Semi-supervised learning. Cambridge, MA: MIT Press.

Charoenphakdee, N., Lee, J., & Sugiyama, M. (2019). On symmetric losses for learning from corrupted labels. In Proceedings of the 36th International Conference on Machine Learning (p. 961).

Chen, W., & Feng, G. (2012). Spectral clustering: A semi-supervised approach. Neurocomputing, 77, 229–242.

Chiang, K.-Y., Hsieh, C.-J., & Dhillon, I. S. (2015). Matrix completion with noisy side information. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 3447–3455). Red Hook, NY: Curran.

Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 539–546). Piscataway, NJ: IEEE.

Cui, Z., Charoenphakdee, N., Sato, I., & Sugiyama, M. (2020). Classification from triplet comparison data. Neural Computation, 32(3), 659–681.

Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning (pp. 209–216).

du Plessis, M. C., Niu, G., & Sugiyama, M. (2014). Analysis of learning from positive and unlabeled data. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 703–711). Red Hook, NY: Curran.

du Plessis, M. C., Niu, G., & Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on Machine Learning (pp. 1386–1394).

Dua, D., & Graff, C. (2017). UCI machine learning repository.

Ghosh, A., Manwani, N., & Sastry, P. (2015). Making risk minimization tolerant to label noise. Neurocomputing, 160, 93–107.

Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 1735–1742). Piscataway, NJ: IEEE.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., & Bengio, Y. (2019). Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations.

Hsu, Y.-C., Lv, Z., Schlosser, J., Odom, P., & Kira, Z. (2019). Multi-class classification without multi-class labels. In Proceedings of the International Conference on Learning Representations.

Hu, Y., Wang, J., Yu, N., & Hua, X.-S. (2008). Maximum margin clustering with pairwise constraints. In Proceedings of the Eighth IEEE International Conference on Data Mining (pp. 253–262). Piscataway, NJ: IEEE.

Ishida, T., Niu, G., & Sugiyama, M. (2018). Binary classification from positive-confidence data. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 5917–5928). Red Hook, NY: Curran.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 3294–3302). Red Hook, NY: Curran.

Klein, D., Kamvar, S. D., & Manning, C. D. (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the 19th International Conference on Machine Learning (pp. 307–314).

Li, Z., & Liu, J. (2009). Constrained clustering by spectral kernel learning. In Proceedings of the IEEE 12th International Conference on Computer Vision (pp. 421–427). Piscataway, NJ: IEEE.

Logeswaran, L., & Lee, H. (2018). An efficient framework for learning sentence representations. In Proceedings of the International Conference on Learning Representations.

Lu, N., Niu, G., Menon, A. K., & Sugiyama, M. (2019). On the minimal supervision for training any binary classifier from only unlabeled data. In Proceedings of the International Conference on Learning Representations.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations
. In
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability
(pp.
281
297
).
Berkeley
:
University of California Press
.
Mikolov
,
T.
,
Sutskever
,
I.
,
Chen
,
K.
,
,
G. S.
, &
Dean
,
J.
(
2013
). Distributed representations of words and phrases and their compositionality. In
C. J. C.
Burges
,
L.
Bottou
,
M.
Welling
,
Z.
Ghahramani
, &
K. Q.
Weinberger
(Eds.),
Advances in neural information processing systems
,
26
(pp.
3111
3119
).
Red Hook, NY
:
Curran
.
Mohri
,
M.
,
,
A.
,
Bach
,
F.
, &
Talwalkar
,
A.
(
2012
).
Foundations of machine learning
.
Cambridge, MA
:
MIT Press
.
Nederhof
,
A. J.
(
1985
).
Methods of coping with social desirability bias: A review
.
European Journal of Social Psychology
,
15
(
3
),
263
280
.
Niu
,
G.
,
Dai
,
B.
,
,
M.
, &
Sugiyama
,
M.
(
2012
).
Information-theoretic semisupervised metric learning via entropy regularization
. In Proceedings of the 29th International Conference on Machine Learning (pp.
89
96
).
Okamoto
,
M.
(
1959
).
Some inequalities relating to the partial sum of binomial probabilities
.
Annals of the Institute of Statistical Mathematics
,
10
(
1
),
29
35
.
Oord
,
A. v. d.
,
Li
,
Y.
, &
Vinyals
,
O.
(
2018
).
Representation learning with contrastive predictive coding
. arXiv:1807.03748.
Patrini
,
G.
,
Nielsen
,
F.
,
Nock
,
R.
, &
Carioni
,
M.
(
2016
).
Loss factorization, weakly supervised learning and label noise robustness
. In
Proceedings of the 33rd International Conference on Machine Learning
(pp.
708
717
).
Peters
,
M. E.
,
Neumann
,
M.
,
Iyyer
,
M.
,
Gardner
,
M.
,
Clark
,
C.
,
Lee
,
K.
, &
Zettlemoyer
,
L.
(
2018
).
Deep contextualized word representations
. In
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
(pp.
2227
2237
).
Stroudsburg, PA
:
ACL
.
Sakai
,
T.
,
du Plessis
,
M. C.
,
Niu
,
G.
, &
Sugiyama
,
M.
(
2017
).
Semi-supervised classification based on classification from positive and unlabeled data
. In Proceedings of the 34th International Conference on Machine Learning (pp.
2998
3006
).
Schroff
,
F.
,
Kalenichenko
,
D.
, &
Philbin
,
J.
(
2015
).
Facenet: A unified embedding for face recognition and clustering
. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(pp.
815
823
).
Piscataway, NJ
:
IEEE
.
Sohn
,
K.
(
2016
). Improved deep metric learning with multi-class n-pair loss objective. In
D.
Lee
,
M.
Sugiyama
,
U.
Luxburg
,
I.
Guyon
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
, 29 (pp.
1857
1865
).
Red Hook, NY
:
Curran
.
Wagstaff
,
K.
,
Cardie
,
C.
,
Rogers
,
S.
, &
Schrödl
,
S.
(
2001
).
Constrained K-means clustering with background knowledge
. In
Proceedings of the 18th International Conference on Machine Learning
(pp.
577
584
).
Warner
,
S. L.
(
1965
).
Randomized response: A survey technique for eliminating evasive answer bias
.
Journal of the American Statistical Association
,
60
,
63
69
.
Weinberger
,
K. Q.
, &
Saul
,
L. K.
(
2009
).
Distance metric learning for large margin nearest neighbor classification
.
Journal of Machine Learning Research
,
10
,
207
244
.
Wu
,
S.
,
Xia
,
X.
,
Liu
,
T.
,
Han
,
B.
,
Gong
,
M.
,
Wang
,
N.
, …
Niu
,
G.
(
2020
).
Class2Simi: A new perspective on learning with label noise
. arXiv:2006.07831.
Xing
,
E. P.
,
Jordan
,
M. I.
,
Russell
,
S. J.
, &
Ng
,
A. Y.
(
2003
). Distance metric learning with application to clustering with side-information. In
S.
Becker
,
S.
Thrun
, &
K.
Overmayer
(Eds.),
Advances in neural information processing systems
,
15
(pp.
521
528
).
Cambridge, MA
:
MIT Press
.
Yan
,
R.
,
Zhang
,
J.
,
Yang
,
J.
, &
Hauptmann
,
A. G.
(
2006
).
A discriminative learning framework with pairwise constraints for video object classification
.
IEEE Transactions on Pattern Analysis and Machine Intelligence
,
28
(
4
),
578
593
.
Yi
,
J.
,
Zhang
,
L.
,
Jin
,
R.
,
Qian
,
Q.
, &
Jain
,
A.
(
2013
).
Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion
. In
Proceedings of the 30th International Conference on Machine Learning
(pp.
1400
1408
).
Zhang
,
J.
, &
Yan
,
R.
(
2007
).
On the value of pairwise constraints in classification and consistency
. In
Proceedings of the 24th International Conference on Machine Learning
(pp.
1111
1118
).

## Author notes

T.S. is now with Preferred Networks, Japan.