## Abstract

Pairwise similarities and dissimilarities between data points are often obtained more easily than full labels of data in real-world classification problems. To make use of such pairwise information, an empirical risk minimization approach has been proposed, where an unbiased estimator of the classification risk is computed from only pairwise similarities and unlabeled data. However, this approach has not yet been able to handle pairwise dissimilarities. Semisupervised clustering methods can incorporate both similarities and dissimilarities into their framework; however, they typically require strong geometrical assumptions on the data distribution, such as the manifold assumption, which may cause severe performance deterioration. In this letter, we derive an unbiased estimator of the classification risk from pairwise similarities, pairwise dissimilarities, and unlabeled data. We theoretically establish an estimation error bound and experimentally demonstrate the practical usefulness of our empirical risk minimization method.

## 1 Introduction

In supervised classification, we need a vast amount of labeled data to train our classifiers. However, it is often not easy to obtain such labels due to high labeling costs (Chapelle, Schölkopf, & Zien, 2010), privacy concerns (Warner, 1965), and social bias (Nederhof, 1985). In real-world classification problems, pairwise similarities (i.e., pairs of samples in the same class) and pairwise dissimilarities (i.e., pairs of samples in different classes) are often collected more easily than full labels of data. For example, in protein function prediction, knowledge about similarities and dissimilarities can be obtained by experimental means as additional supervision (Klein, Kamvar, & Manning, 2002). In video object classification, knowledge of temporal relations can be used to generate pairwise labels in an algorithmic way; for example, an object staying in temporally adjacent frames must be the same, and two objects in the same frame must be different (Yan, Zhang, Yang, & Hauptmann, 2006; Zhang & Yan, 2007). To make use of such pairwise information, similar-unlabeled (SU) classification (Bao, Niu, & Sugiyama, 2018) has been proposed, where the classification risk is estimated in an unbiased fashion from only similar pairs and unlabeled data. Although SU classification handles only similar pairs and unlabeled data, dissimilar pairs may also be available in practice. In such a case, we can expect that the use of dissimilarities, in addition to similarities and unlabeled data, improves the classification accuracy.

Semisupervised clustering (Wagstaff, Cardie, Rogers, & Schrödl, 2001) incorporates both similar and dissimilar pairs into its framework, where must-link pairs (i.e., similar pairs) and cannot-link pairs (i.e., dissimilar pairs) are used to obtain meaningful clusters. The existing literature provides useful semisupervised clustering methods based on the ideas that (1) must/cannot-links are treated as constraints (Basu, Banerjee, & Mooney, 2002; Wagstaff et al., 2001; Li & Liu, 2009; Hu, Wang, Yu, & Hua, 2008), (2) clustering is performed with metrics learned by semisupervised metric learning (Xing, Jordan, Russell, & Ng, 2003; Bilenko, Basu, & Mooney, 2004; Weinberger & Saul, 2009; Davis, Kulis, Jain, Sra, & Dhillon, 2007; Niu, Dai, Yamada, & Sugiyama, 2012), and (3) missing links are predicted by matrix completion (Yi, Zhang, Jin, Qian, & Jain, 2013; Chiang, Hsieh, & Dhillon, 2015). However, the motivation of clustering, finding a meaningful cluster structure, is different from that of classification, finding a classifier that allows prediction of labels for unseen data. Therefore, applying a semisupervised clustering method to classification does not necessarily give an appropriate solution. For example, most semisupervised clustering methods rely on geometrical or margin-based assumptions such as the cluster assumption and the manifold assumption (Basu, Davidson, & Wagstaff, 2008), and without such assumptions, semisupervised clustering methods do not work well. In addition, the objective of semisupervised clustering is usually not the minimization of the classification risk, which may lead to suboptimal performance in terms of the classification accuracy.

In contrast, discriminative training approaches have attracted growing attention recently. One is contrastive representation learning (Chopra, Hadsell, & LeCun, 2005; Hadsell, Chopra, & LeCun, 2006; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Kiros et al., 2015; Sohn, 2016; Logeswaran & Lee, 2018; Peters et al., 2018; Oord, Li, & Vinyals, 2018; Hjelm et al., 2019; Arora, Khandeparkar, Khodak, Plevrakis, & Saunshi, 2019), which tries to obtain good data representations by bringing an anchor data point close to a given similar data point (a positive sample) and far from randomly sampled data points (negative samples). The resulting representations can be used for downstream classification. The other is the meta-classification approach (Hsu, Lv, Schlosser, Odom, & Kira, 2019; Wu et al., 2020), which performs maximum likelihood estimation of similar and dissimilar data points. The likelihood is modeled with the inner product between two logits, and the individual logit models are expected to perform well on the classification of single data points. While both approaches incorporate similar and dissimilar data points into their formulations, it is not theoretically clear whether these methods achieve good classification performance; indeed, their objective functions have not been directly connected to the classification risk.

In this letter, we propose a similar-dissimilar-unlabeled (SDU) classification method, where pairwise similarities, pairwise dissimilarities, and unlabeled data all serve for unbiased estimation of the classification risk. Like the SU classification method (Bao et al., 2018), our method does not require any geometrical assumptions on the data distribution and enables us to minimize the classification risk via empirical risk minimization. As preparation for constructing our SDU classification, we first develop a dissimilar-unlabeled (DU) classification method and a similar-dissimilar (SD) classification method, which require only dissimilar and unlabeled data or only similar and dissimilar data, respectively. We also show that these methods can be regarded as special cases of a very general framework of classification from unlabeled data (Lu, Niu, Menon, & Sugiyama, 2019). Then we combine the three risks in SU, DU, and SD classification, in a manner similar to positive-negative-unlabeled classification (Sakai, du Plessis, Niu, & Sugiyama, 2017), and finally train a classifier based on empirical risk minimization. We further propose a strategy to reduce the computation cost of hyperparameter tuning by ignoring the SU risk and combining only the SD and DU risks to estimate the classification risk. This strategy comes from an analysis of the estimation error bounds for the SU, DU, and SD classification methods: the bounds for the DU/SD classification methods tend to be tighter than the bound for the SU classification method. Finally, we experimentally demonstrate the practical usefulness of our method.

Our contributions can be summarized as follows:

- We develop DU and SD classification methods by extending the SU classification method and propose an SDU classification method as a general form of those methods (see section 3).

- We establish estimation error bounds for each method and confirm that unlabeled data help the estimation of the classification risk (see sections 4.1 and 4.3).

- From theoretical analysis, we find that the estimation error bounds for the DU/SD classification methods tend to be tighter than that for the SU classification method, and we propose a strategy to reduce the computation cost in the SDU classification method (see section 4.2).

## 2 Preliminary

In this section, we introduce our problem setting and a generation model of pairwise similarities, dissimilarities, and unlabeled data. Thereafter, we review the existing SU classification method.

### 2.1 Problem Setting

Let $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} = \{+1, -1\}$ be a $d$-dimensional example space and a binary label space, respectively. Suppose that each labeled example $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is generated independently from the joint probability distribution with density $p(x, y)$. For simplicity, let $\pi_+$ and $\pi_-$ be the class priors $p(y = +1)$ and $p(y = -1)$, which satisfy $\pi_+ + \pi_- = 1$, and let $p_+(x)$ and $p_-(x)$ be the class-conditional densities $p(x \mid y = +1)$ and $p(x \mid y = -1)$.

The goal of binary classification is to obtain a classifier $f: \mathcal{X} \to \mathbb{R}$ that minimizes the classification risk

$$R(f) = \mathbb{E}_{(X,Y) \sim p(x,y)}\left[\ell(f(X), Y)\right],$$

where $\mathbb{E}_{(X,Y) \sim p(x,y)}[\cdot]$ denotes the expected value with respect to $(X, Y)$ over the joint density $p(x, y)$, and $\ell: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ is a loss function.

### 2.2 Generation Model of Training Data

### 2.3 SU Classification

In the seminal paper by Bao et al. (2018), the first method of SU classification was proposed, where the classification risk is estimated in an unbiased fashion from only similar pairs and unlabeled data as follows:

**Proposition 1**

## 3 Proposed Method

In this section, we propose an SDU classification method, where the classification risk is estimated from pairwise similarities, dissimilarities, and unlabeled data. As the first step toward constructing our method, we extend the SU classification method to the DU and SD classification methods.

### 3.1 DU and SD Classification

As with the SU classification method, the classification risk can be estimated from only dissimilar pairs and unlabeled data (DU), or from similar and dissimilar pairs (SD), as follows.

**Theorem 1.**

#### 3.1.1 Interpretation of SD Risk

Furthermore, when the loss $\ell$ is symmetric (Ghosh, Manwani, & Sastry, 2015; Charoenphakdee, Lee, & Sugiyama, 2019), that is, $\ell(z, +1) + \ell(z, -1) = K$ for some $K \in \mathbb{R}$, the following relationship holds:

**Corollary 1.**

**Proof.**

which results in equation 3.7.

If we use a symmetric loss in the classification risk, corollary 1 gives us practical advantages. For instance, when writing program code for a training algorithm, we do not have to implement loss $L$ by ourselves. Instead, we can treat “similar” and “dissimilar” labels as positive and negative labels associated with each point in a pair and use any standard binary classification algorithm with a loss function $\u2113$.
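A minimal sketch of this reduction, assuming the pairs are given as NumPy arrays of shape `(n_pairs, 2, d)` (the array layout and the sigmoid loss, which is symmetric with $K = 1$, are illustrative choices, not the paper's exact setup):

```python
import numpy as np

def flatten_pairs(similar_pairs, dissimilar_pairs):
    """Turn pairwise data into pointwise data: every point of a similar
    pair gets the surrogate label +1 and every point of a dissimilar
    pair gets -1, so that any standard binary classification algorithm
    can be trained on (X, y) directly."""
    xs = similar_pairs.reshape(-1, similar_pairs.shape[-1])
    xd = dissimilar_pairs.reshape(-1, dissimilar_pairs.shape[-1])
    X = np.vstack([xs, xd])
    y = np.concatenate([np.ones(len(xs)), -np.ones(len(xd))])
    return X, y

def sigmoid_loss(z, y):
    """One example of a symmetric loss: loss(z, +1) + loss(z, -1) = 1."""
    return 1.0 / (1.0 + np.exp(y * z))
```

Any off-the-shelf classifier minimizing a symmetric loss on the flattened data then serves as the SD learner.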

#### 3.1.2 Relation to UU Classification

**Proposition 2**

### 3.2 SDU Classification

### 3.3 Practical Implementation

**Theorem 2.**

Several loss functions that satisfy the conditions in theorem 2 are shown in Table 1, borrowed from Patrini, Nielsen, Nock, and Carioni (2016) and Bao et al. (2018). Next, we consider the optimization problem with the squared loss and the double hinge loss, respectively.

| Loss Name | $\psi(tz)$ |
|---|---|
| Squared loss | $\frac{1}{4}(tz-1)^2$ |
| Logistic loss | $\log(1+\exp(-tz))$ |
| Double hinge loss | $\max(-tz,\,\max(0,\,\frac{1}{2}-\frac{1}{2}tz))$ |
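The margin losses in Table 1 can be implemented directly; a NumPy sketch, writing `m` for the margin $tz$:

```python
import numpy as np

def squared_loss(m):
    # psi(m) = (1/4)(m - 1)^2
    return 0.25 * (m - 1.0) ** 2

def logistic_loss(m):
    # psi(m) = log(1 + exp(-m)); log1p keeps small values accurate
    return np.log1p(np.exp(-m))

def double_hinge_loss(m):
    # psi(m) = max(-m, max(0, 1/2 - m/2))
    return np.maximum(-m, np.maximum(0.0, 0.5 - 0.5 * m))
```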

#### 3.3.1 Squared Loss

#### 3.3.2 Double Hinge Loss

### 3.4 Class Prior Estimation from Pairwise Data

## 4 Theoretical Analysis

In this section, we analyze estimation error bounds for our methods. We first derive estimation error bounds for the SU, DU, and SD classification methods via Rademacher complexity. By comparing these bounds, we find a nontrivial relationship in the performances of these methods. It also gives a strategy to reduce the cost of hyperparameter tuning in the SDU classification method. Finally, we derive an estimation error bound for the SDU classification method.

### 4.1 Estimation Error Bounds for SU, DU, and SD Classification

We investigate estimation error bounds for the SU, DU, and SD classification methods. Let $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ be a function class of the specified model:

**Definition 1**

**Theorem 3.**

### 4.2 Comparison of SU, DU, and SD Bounds

Here, we compare the SU, DU, and SD classification methods from the perspective of their estimation error bounds. Under the generation process of similar and dissimilar pairs in equation 2.2, we have the following claim:

**Theorem 4.**

Suppose similar and dissimilar pairs follow the generation process in equation 2.2. Denote the right-hand sides of equations 4.3 to 4.5 by $V_{SD}$, $V_{SU}$, and $V_{DU}$, respectively. Then $V_{DU} \leq V_{SU}$ and $V_{SD} \leq V_{SU}$ hold with probability at least $1 - \exp(-c\, n_{SD})$ for some constant $c > 0$.

**Proof.**

#### 4.2.1 SDDU Classification for Efficient Hyperparameter Search

Theorem 4 states that $\max\{V_{SD}, V_{DU}\} \leq V_{SU}$ holds with high probability when $n_{SD}$ is sufficiently large. It suggests that both the DU and SD classification methods are likely to outperform the SU classification method when pairwise similarities, dissimilarities, and unlabeled data are all given in advance. Inspired by this result, we propose a strategy to reduce the computation cost by fixing $\gamma_1 = 0$ in equation 3.15; that is, the classification risk is always estimated with the DU and SD risks. We call this method the *SDDU classification method* to distinguish it from the general SDU classification method. In sections 5.2 and 5.3, we experimentally demonstrate that the SDDU classification method performs at the same level as or better than the SDU classification method.
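The combination itself is a convex sum of the three empirical risks; a sketch with the risk values as plain numbers (we know from the SU special case that $\gamma_1$ weights the SU risk; the mapping of $\gamma_2$, $\gamma_3$ to the DU and SD risks is an assumption of this sketch):

```python
def sdu_risk(r_su, r_du, r_sd, gamma):
    """Convex combination of the SU, DU, and SD empirical risks, mirroring
    the structure of equation 3.15.  gamma = (g1, g2, g3) must sum to one;
    g1 weights the SU risk, and the assignment of g2/g3 to the DU and SD
    risks is assumed here for illustration."""
    g1, g2, g3 = gamma
    assert abs(g1 + g2 + g3 - 1.0) < 1e-9, "coefficients must sum to one"
    return g1 * r_su + g2 * r_du + g3 * r_sd

def sddu_risk(r_du, r_sd, g2):
    """SDDU classification: fix gamma_1 = 0 and combine only DU and SD."""
    return sdu_risk(0.0, r_du, r_sd, (0.0, g2, 1.0 - g2))
```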

### 4.3 Estimation Error Bound for SDU Classification

We derive an estimation error bound for the SDU classification method. With the same technique as in theorem 3, we have the following bound:

**Theorem 5.**

Theorem 5 ensures that the estimation error of $\hat{f}_{SDU}$ diminishes asymptotically, that is, $R(\hat{f}_{SDU}) - R(f^*) \to 0$ as $n_S, n_D, n_U \to \infty$. On the negative side, it should also be noted that $C_{\mathcal{F},\ell,\delta'}$ is inversely proportional to $|\pi_+ - \pi_-|$, which implies that the estimation error can increase as $\pi_+$ and $\pi_-$ approach each other.

## 5 Experiments

In this section, we experimentally investigate the behavior of the proposed methods on benchmark data sets. First, we compare the performances of the SU, DU, and SD classification methods to confirm that the SD and DU classification methods are likely to perform better than the SU classification method, as discussed in section 4.2. Second, we demonstrate that unlabeled data can improve the classification accuracy of the SDU classification method. Finally, we compare the performance of the SDU classification method and those of baseline methods.

We conducted experiments on 10 benchmark data sets obtained from the UCI Machine Learning Repository (Dua & Graff, 2017) and LIBSVM (Chang & Lin, 2011). To obtain pairwise training data, we first converted pointwise labeled data into pairs by coupling. Then we randomly subsampled similar and dissimilar pairs following the ratio of $\pi_S$ and $\pi_D$. To obtain unlabeled data, we randomly picked positive and negative data following the ratio of $\pi_+$ and $\pi_-$. The labeled data for testing were created in the same way as the unlabeled data, and the number of test data points was set to 500.
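The pair-construction step can be sketched as follows (the uniform coupling and the rejection-style subsampling are our reading of the protocol; `pi_s` stands for the similar-pair ratio $\pi_S$):

```python
import numpy as np

def make_pairs(X, y, n_pairs, pi_s, seed=0):
    """Couple pointwise labeled data (X, y) into pairs and subsample
    similar and dissimilar pairs following the ratio pi_s : (1 - pi_s).
    Returns arrays of shape (n_similar, 2, d) and (n_dissimilar, 2, d)."""
    rng = np.random.default_rng(seed)
    n_s = int(round(n_pairs * pi_s))
    n_d = n_pairs - n_s
    similar, dissimilar = [], []
    while len(similar) < n_s or len(dissimilar) < n_d:
        i, j = rng.choice(len(X), size=2, replace=False)
        if y[i] == y[j] and len(similar) < n_s:
            similar.append((X[i], X[j]))
        elif y[i] != y[j] and len(dissimilar) < n_d:
            dissimilar.append((X[i], X[j]))
    return np.array(similar), np.array(dissimilar)
```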

In the SDU classification method, including the SU, DU, and SD classification methods, a linear-in-input model $f(x) = w^\top x + b$ was used as the classifier. The weight of the $L_2$ regularization was chosen from $\{10^{-1}, 10^{-4}, 10^{-7}\}$. Each of the coefficient parameters $(\gamma_1, \gamma_2, \gamma_3)$ was chosen from $\{0, \frac{1}{3}, \frac{2}{3}, 1\}$ subject to $\gamma_1 + \gamma_2 + \gamma_3 = 1$. All hyperparameters were tuned with five-fold cross-validation on the empirical classification error computed from similarities and dissimilarities, that is, $\hat{R}_{SD}$ equipped with the zero-one loss. The squared loss was used for the experiments in sections 5.1 and 5.2, and the double hinge loss in section 5.3. We assumed that the true positive class proportion $\pi_+$ is known for computing the empirical risk.
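The coefficient grid in this hyperparameter search can be enumerated explicitly; a small sketch:

```python
from itertools import product

# All (gamma_1, gamma_2, gamma_3) on the grid {0, 1/3, 2/3, 1}^3 that
# satisfy gamma_1 + gamma_2 + gamma_3 = 1, crossed with the three
# L2-regularization weights used in the experiments.
gammas = [g for g in product((0.0, 1/3, 2/3, 1.0), repeat=3)
          if abs(sum(g) - 1.0) < 1e-9]
reg_weights = (1e-1, 1e-4, 1e-7)
grid = [(g, lam) for g in gammas for lam in reg_weights]
```

Only 10 coefficient triples survive the sum-to-one constraint, so the full grid has 30 candidates; fixing $\gamma_1 = 0$ (the SDDU strategy) shrinks the triples to 4.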

### 5.1 Comparison of SU, DU, and SD Performances

We compared the performances of the SU, DU, and SD classification methods. We set the number of unlabeled training data to 500 and the number of pairwise training data to each of $\{50, 100, 200, 300, 400, 500\}$. Training and test data were generated while maintaining $\pi_+ = 0.7$. The misclassification rates for each method are plotted in Figure 2.

### 5.2 Performance Improvement with Unlabeled Data

We investigated the effect of unlabeled data in the SDU classification method. The number of pairwise data was fixed at $n_{SD} = 50$. As in the previous experiment, training and test data were generated while maintaining $\pi_+ = 0.7$. Three methods, the SD, SDU, and SDDU classification methods, were evaluated in each setting. The misclassification rates for each method are plotted in Figure 3.

### 5.3 Benchmark Comparison of SDU and Existing Methods

We evaluated the performances of the SDU/SDDU classification methods and eight baseline methods on benchmark data sets. We set $n_U = 500$ and $n_{SD} \in \{50, 200\}$. In each trial, the misclassification rate was measured with 500 test examples. To see the influence of the class prior on our methods, we conducted experiments in a moderately imbalanced case ($\pi_+ = 0.7$) and a fairly imbalanced case ($\pi_+ = 0.9$). We report the results for each setup in Table 2. The details of the baseline methods are described below.

In each panel, SDU and SDDU are the proposed methods, and the remaining columns are baseline methods.

**(a)** $n_{SD} = 50$, $\pi_+ = 0.7$

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| adult | 123 | 26.4 (1.12) | 23.6 (0.83) | 35.2 (1.03) | 35.0 (0.81) | 33.4 (1.05) | 30.6 (0.29) | 38.2 (0.64) | 38.3 (0.99) | 41.4 (0.87) | 26.0 (1.10) |
| banana | 2 | 33.9 (0.73) | 33.5 (0.68) | 35.7 (0.72) | 47.1 (0.36) | 47.3 (0.35) | 41.3 (0.73) | 47.1 (0.36) | 43.7 (0.63) | 38.0 (0.98) | 33.0 (0.77) |
| codrna | 8 | 20.1 (1.14) | 18.9 (1.17) | 24.6 (1.01) | 37.4 (0.50) | 38.5 (0.40) | 45.4 (0.99) | 37.4 (0.50) | 41.0 (0.66) | 37.1 (0.99) | 32.1 (1.20) |
| ijcnn1 | 22 | 33.0 (0.73) | 32.2 (0.61) | 36.5 (0.96) | 44.5 (0.63) | 45.3 (0.50) | 40.0 (0.79) | 44.9 (0.71) | 44.9 (0.60) | 42.1 (0.89) | 32.9 (1.13) |
| magic | 10 | 34.5 (0.83) | 34.2 (0.87) | 40.0 (0.82) | 47.6 (0.21) | 48.2 (0.18) | 47.3 (0.27) | 47.6 (0.21) | 42.2 (0.52) | 43.2 (0.74) | 29.9 (1.20) |
| phishing | 68 | 22.1 (1.21) | 21.4 (1.11) | 27.3 (1.30) | 37.4 (0.33) | 37.4 (0.31) | 31.9 (0.30) | 37.4 (0.34) | 33.8 (1.09) | 39.7 (0.81) | 17.8 (1.58) |
| phoneme | 5 | 29.6 (0.78) | 29.2 (0.79) | 32.7 (0.98) | 32.2 (0.34) | 31.1 (0.45) | 33.5 (1.01) | 32.2 (0.34) | 35.5 (1.01) | 37.8 (0.92) | 32.1 (0.91) |
| spambase | 57 | 21.0 (1.40) | 20.4 (1.27) | 31.7 (1.71) | 36.3 (1.08) | 35.8 (1.06) | 29.6 (0.31) | 39.1 (1.10) | 37.1 (1.14) | 39.8 (0.92) | 14.3 (0.83) |
| w8a | 300 | 36.2 (1.26) | 33.0 (1.19) | 41.3 (0.80) | 30.7 (0.26) | 34.0 (0.67) | 36.0 (0.69) | 31.1 (0.29) | 31.7 (0.36) | 43.4 (0.75) | 39.3 (1.07) |
| waveform | 21 | 17.3 (0.98) | 15.8 (0.88) | 26.7 (1.48) | 48.5 (0.17) | 48.4 (0.18) | 46.7 (0.35) | 48.5 (0.17) | 30.6 (1.66) | 36.4 (1.43) | 12.2 (0.31) |
| # Outperform | | 4 | 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 6 |

**(b)** $n_{SD} = 50$, $\pi_+ = 0.9$

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| adult | 123 | 9.8 (0.43) | 9.7 (0.41) | 23.7 (0.66) | 22.0 (1.55) | 33.7 (1.42) | 11.5 (0.26) | 28.5 (1.46) | 41.7 (0.79) | 38.5 (0.94) | 14.6 (1.25) |
| banana | 2 | 10.4 (0.23) | 10.4 (0.22) | 12.7 (0.63) | 45.5 (0.31) | 46.0 (0.30) | 31.7 (1.47) | 45.4 (0.31) | 39.3 (0.80) | 23.9 (1.97) | 10.7 (0.32) |
| codrna | 8 | 6.9 (0.39) | 6.9 (0.45) | 15.8 (0.70) | 32.9 (1.12) | 36.6 (1.05) | 42.3 (1.93) | 32.8 (1.13) | 38.0 (1.54) | 33.1 (1.65) | 10.0 (0.22) |
| ijcnn1 | 22 | 10.0 (0.31) | 9.9 (0.31) | 14.3 (0.62) | 40.0 (1.00) | 41.0 (0.81) | 29.4 (1.34) | 39.1 (1.12) | 41.1 (0.73) | 36.7 (1.55) | 14.0 (1.32) |
| magic | 10 | 11.7 (0.33) | 11.6 (0.29) | 21.8 (0.85) | 36.4 (0.45) | 38.8 (0.38) | 45.6 (0.44) | 36.4 (0.45) | 29.5 (1.15) | 41.1 (0.99) | 14.5 (0.96) |
| phishing | 68 | 8.7 (0.38) | 8.5 (0.32) | 18.0 (0.85) | 24.7 (0.36) | 25.7 (0.38) | 13.6 (0.41) | 24.8 (0.36) | 33.1 (1.35) | 40.5 (0.94) | 9.2 (0.94) |
| phoneme | 5 | 11.1 (0.27) | 11.3 (0.27) | 15.8 (0.67) | 40.4 (0.51) | 40.8 (0.35) | 26.7 (1.76) | 40.1 (0.56) | 36.2 (1.39) | 31.3 (1.62) | 11.7 (0.85) |
| spambase | 57 | 8.8 (0.23) | 8.5 (0.24) | 16.7 (0.53) | 18.6 (1.48) | 32.3 (1.12) | 11.6 (0.34) | 21.7 (1.46) | 33.2 (1.88) | 40.0 (0.98) | 8.1 (0.78) |
| w8a | 300 | 8.3 (0.50) | 8.3 (0.50) | 26.0 (0.56) | 11.2 (0.19) | 11.7 (0.22) | 18.3 (1.00) | 11.7 (0.26) | 14.6 (0.80) | 35.8 (1.23) | 27.4 (1.52) |
| waveform | 21 | 5.2 (0.24) | 5.1 (0.24) | 8.2 (0.73) | 48.6 (0.17) | 48.5 (0.18) | 47.6 (0.23) | 48.6 (0.17) | 44.1 (0.71) | 35.1 (1.52) | 5.4 (0.74) |
| # Outperform | | 10 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |

**(c)** $n_{SD} = 200$, $\pi_+ = 0.7$

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| adult | 123 | 18.0 (0.46) | 17.8 (0.31) | 21.3 (0.54) | 36.7 (0.76) | 28.1 (0.90) | 30.7 (0.31) | 39.1 (0.64) | 33.0 (0.95) | 42.7 (0.79) | 19.7 (0.39) |
| banana | 2 | 30.5 (0.42) | 31.0 (0.42) | 32.8 (0.52) | 47.5 (0.24) | 47.5 (0.24) | 33.5 (1.29) | 47.5 (0.24) | 42.3 (0.50) | 35.2 (0.77) | 32.0 (0.67) |
| codrna | 8 | 10.8 (0.69) | 9.4 (0.41) | 18.1 (0.88) | 37.2 (0.51) | 40.5 (0.50) | 46.8 (0.72) | 37.4 (0.54) | 36.6 (1.33) | 27.6 (1.28) | 10.5 (0.79) |
| ijcnn1 | 22 | 23.3 (0.44) | 22.6 (0.40) | 28.4 (0.64) | 45.8 (0.29) | 46.9 (0.35) | 40.5 (0.78) | 46.4 (0.35) | 44.3 (0.66) | 43.1 (0.81) | 16.2 (0.64) |
| magic | 10 | 25.9 (0.65) | 25.6 (0.55) | 30.2 (0.71) | 48.0 (0.18) | 48.3 (0.17) | 47.4 (0.29) | 48.0 (0.18) | 40.3 (0.59) | 43.4 (0.77) | 22.1 (0.68) |
| phishing | 68 | 12.0 (0.42) | 12.0 (0.41) | 17.2 (0.87) | 37.4 (0.31) | 37.2 (0.30) | 31.6 (0.29) | 37.4 (0.31) | 22.8 (1.34) | 42.9 (0.67) | 7.0 (0.18) |
| phoneme | 5 | 25.5 (0.49) | 25.5 (0.49) | 27.5 (0.67) | 32.2 (0.31) | 29.0 (0.60) | 28.0 (0.73) | 32.0 (0.37) | 32.9 (1.07) | 37.1 (0.90) | 25.5 (0.51) |
| spambase | 57 | 12.8 (0.27) | 12.3 (0.23) | 16.2 (0.62) | 38.2 (1.20) | 29.6 (0.56) | 29.4 (0.29) | 40.4 (1.14) | 34.8 (1.34) | 39.6 (1.05) | 9.5 (0.20) |
| w8a | 300 | 20.4 (0.80) | 18.8 (0.69) | 35.8 (0.71) | 30.7 (0.27) | 43.7 (0.61) | 32.6 (0.60) | 31.7 (0.33) | 33.2 (0.69) | 45.1 (0.57) | 27.6 (0.76) |
| waveform | 21 | 12.8 (0.29) | 12.6 (0.29) | 15.6 (0.59) | 48.5 (0.16) | 48.5 (0.14) | 46.9 (0.31) | 48.4 (0.16) | 17.5 (1.40) | 35.3 (1.26) | 10.7 (0.19) |
| # Outperform | | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |

**(d)** $n_{SD} = 200$, $\pi_+ = 0.9$

| Data Set | # Dim. | SDU | SDDU | SU | KM | CKM | SSP | ITML | CRL | OVPC | MCL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| adult | 123 | 8.4 (0.24) | 8.3 (0.24) | 11.2 (0.32) | 27.4 (1.41) | 43.6 (0.95) | 11.1 (0.27) | 27.8 (1.25) | 43.3 (0.76) | 39.3 (1.11) | 9.0 (0.21) |
| banana | 2 | 10.2 (0.19) | 10.2 (0.19) | 10.5 (0.24) | 45.5 (0.30) | 46.6 (0.32) | 25.8 (1.77) | 45.5 (0.29) | 40.0 (0.81) | 24.1 (1.63) | 10.2 (0.19) |
| codrna | 8 | 4.1 (0.18) | 4.0 (0.19) | 9.8 (0.39) | 32.2 (1.10) | 40.4 (0.79) | 40.7 (2.11) | 32.5 (1.17) | 38.1 (1.16) | 29.1 (1.53) | 7.6 (0.20) |
| ijcnn1 | 22 | 8.4 (0.20) | 8.3 (0.20) | 9.4 (0.24) | 40.4 (0.82) | 43.1 (0.63) | 27.7 (1.38) | 41.1 (0.90) | 41.2 (1.00) | 38.9 (1.26) | 7.7 (0.16) |
| magic | 10 | 10.3 (0.23) | 10.2 (0.22) | 16.8 (0.73) | 37.0 (0.33) | 41.4 (0.34) | 45.2 (0.43) | 37.0 (0.32) | 32.7 (1.50) | 38.6 (1.35) | 10.0 (0.19) |
| phishing | 68 | 6.3 (0.21) | 6.3 (0.22) | 8.9 (0.38) | 24.4 (0.26) | 27.6 (0.33) | 13.7 (0.38) | 24.5 (0.28) | 38.4 (1.22) | 40.8 (0.81) | 3.7 (0.13) |
| phoneme | 5 | 10.3 (0.21) | 10.3 (0.19) | 12.4 (0.41) | 40.2 (0.54) | 40.5 (0.39) | 24.9 (1.70) | 40.3 (0.54) | 33.2 (1.53) | 34.1 (1.50) | 10.2 (0.19) |
| spambase | 57 | 7.5 (0.19) | 7.5 (0.19) | 8.1 (0.24) | 20.2 (1.33) | 40.4 (0.60) | 10.9 (0.25) | 22.9 (1.18) | 31.8 (1.31) | 40.6 (1.17) | 5.9 (0.15) |
| w8a | 300 | 6.0 (0.18) | 6.0 (0.18) | 17.2 (0.41) | 11.2 (0.21) | 18.1 (1.07) | 12.6 (0.67) | 11.7 (0.24) | 12.8 (0.48) | 38.8 (0.99) | 9.1 (0.30) |
| waveform | 21 | 4.5 (0.13) | 4.6 (0.14) | 5.2 (0.22) | 48.5 (0.17) | 48.5 (0.20) | 47.6 (0.22) | 48.5 (0.17) | 39.3 (1.13) | 34.7 (1.54) | 4.4 (0.14) |
| # Outperform | | 7 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |


Note: Bold numbers indicate outperforming methods, chosen by a one-sided $t$-test with a significance level of 5%. # Dim. = number of dimensions.

#### 5.3.1 SU Classification (SU)

The first baseline method is the SU classification method (Bao et al., 2018), where the classification risk is estimated from similar pairs and unlabeled data in an unbiased manner. This method is a special case of SDU classification where the coefficient parameters are fixed as $(\gamma_1, \gamma_2, \gamma_3) = (1, 0, 0)$.

#### 5.3.2 K-Means Clustering (KM)

K-means clustering (MacQueen, 1967) is one of the most popular unsupervised methods. We applied it to the training data, ignoring all pairwise information, and predicted the labels of test data with the learned clusters.
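A minimal version of this baseline with a plain Lloyd-style k-means (the farthest-point initialization is our choice for determinism; any standard implementation would do):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; all pairwise information is ignored."""
    # Deterministic farthest-point initialization.
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(dist.argmax())])
    centers = np.array(centers, dtype=float)
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then recompute means.
        assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        new = np.array([X[assign == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign

def predict_cluster(X_test, centers):
    """Predict test points via their nearest learned cluster center."""
    return ((X_test[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
```

Mapping cluster indices to the $\{+1, -1\}$ labels is arbitrary up to permutation, which is why clustering baselines can approach 50% error when clusters do not align with classes.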

#### 5.3.3 Constrained K-Means Clustering (CKM)

Constrained K-means clustering (Wagstaff et al., 2001) is a semisupervised clustering method based on K-means clustering, where pairwise similarities and dissimilarities are treated as must-links and cannot-links, respectively.

#### 5.3.4 Semisupervised Spectral Clustering (SSP)

Semisupervised spectral clustering was proposed in Chen and Feng (2012), where similar and dissimilar labels are propagated through an affinity matrix. We set $k=5$ for constructing the affinity matrix with a $k$-nearest-neighbor graph and $\sigma^2=1$ for the precision parameter used in the similarity measure.
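A sketch of the affinity-matrix construction with these parameters (the Gaussian weighting $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ and the symmetrization of the directed $k$-NN graph are assumptions, not details taken from the original code):

```python
import numpy as np

def knn_affinity(X, k=5, sigma2=1.0):
    """Symmetrized k-nearest-neighbor affinity matrix with Gaussian weights,
    a sketch of the graph that SSP propagates labels on (k = 5, sigma^2 = 1
    in the experiment)."""
    n = len(X)
    # Pairwise squared Euclidean distances.
    D2 = np.sum((X[:, None] - X[None]) ** 2, axis=2)
    A = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors, excluding the point itself (index 0 in argsort).
        nn = np.argsort(D2[i])[1:k + 1]
        A[i, nn] = np.exp(-D2[i, nn] / (2.0 * sigma2))
    # Symmetrize: keep an edge if either endpoint selected the other.
    return np.maximum(A, A.T)
```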

#### 5.3.5 Information-Theoretic Metric Learning (ITML)

Information-theoretic metric learning (Davis et al., 2007) is an algorithm that learns a matrix parameterizing the Mahalanobis distance on given data points. Similar and dissimilar pairs are used for regularizing the covariance matrix. For prediction on test samples, k-means clustering was applied with the obtained metric. We used the identity matrix as prior information, and the slack parameter $\gamma$ was set to 1.
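The prediction step can be sketched as follows: given a learned positive-definite matrix $M$ (ITML's output; any PSD matrix stands in for it here), k-means under the Mahalanobis distance $d_M(x, x')^2 = (x - x')^\top M (x - x')$ reduces to ordinary Euclidean k-means after a linear transform:

```python
import numpy as np

def mahalanobis_transform(X, M):
    """Map each row x to L^T x, where M = L L^T (Cholesky factorization).
    Euclidean distances in the transformed space equal Mahalanobis distances
    under M, so ordinary k-means on the output realizes k-means with the
    learned metric. ITML itself supplies M; this is only the application step."""
    L = np.linalg.cholesky(M)
    return X @ L
```

With $M$ set to the identity (the prior used in the experiment), this reduces to plain Euclidean k-means.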

#### 5.3.6 Contrastive Learning (CRL)

Contrastive learning (Arora et al., 2019) is another framework for learning a useful representation by leveraging similarity information. We used a linear model $g(x) = Wx$, where $W \in \mathbb{R}^{d' \times d}$, as an embedding function from the input space to the representation space. In this experiment, we fixed $d'$ to 10 for all data sets. Each triplet used for training was created by concatenating a similar pair and an example randomly picked from the unlabeled data. With the learned representations, k-means clustering was applied in the same manner as in the KM method.
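The training objective can be sketched as the logistic contrastive loss of Arora et al. (2019) on triplets $(x, x^+, x^-)$ under the linear embedding $g(x) = Wx$; the exact loss variant and optimizer used in the experiment are assumptions here:

```python
import numpy as np

def contrastive_loss(W, x_anchor, x_pos, x_neg):
    """Logistic contrastive loss on triplets: the anchor's inner product
    with its similar partner x_pos should exceed that with a randomly
    drawn unlabeled example x_neg, under the linear embedding g(x) = W x."""
    za, zp, zn = x_anchor @ W.T, x_pos @ W.T, x_neg @ W.T
    margin = np.sum(za * zp, axis=1) - np.sum(za * zn, axis=1)
    # log(1 + exp(-margin)), averaged over the mini-batch of triplets.
    return np.mean(np.log1p(np.exp(-margin)))
```

Minimizing this over $W$ (e.g., by gradient descent) yields the representation on which k-means is then run.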

#### 5.3.7 On the Value of Pairwise Constraints (OVPC)

A classification-based approach was proposed in Zhang and Yan (2007), where an auxiliary classifier is trained on the feature vectors obtained from pairwise examples. The trained classifier is converted into a function that can be applied to pointwise prediction. The weight of the $L_2$ regularization was chosen from $\{10^{-1}, 10^{-4}, 10^{-7}\}$ by five-fold cross-validation.
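The auxiliary-classifier step can be sketched as $L_2$-regularized logistic regression on pairwise feature vectors. The absolute-difference feature $|x - x'|$ is an assumption for illustration, as is the plain gradient-descent fit; the pointwise conversion of Zhang and Yan (2007) is not reproduced here:

```python
import numpy as np

def fit_pair_classifier(X1, X2, s, lam=1e-4, lr=0.05, n_iter=2000):
    """Train an auxiliary pair classifier predicting s (1 = similar,
    0 = dissimilar) from pairwise features. The |x - x'| feature and the
    gradient-descent solver are illustrative assumptions; lam mirrors the
    L2 weights {1e-1, 1e-4, 1e-7} tuned by five-fold CV in the text."""
    F = np.abs(X1 - X2)  # one feature vector per pair
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(n_iter):  # gradient descent on regularized logistic loss
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        g = p - s
        w -= lr * (F.T @ g / len(s) + lam * w)
        b -= lr * g.mean()
    return w, b
```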

#### 5.3.8 Meta-Classification Likelihood (MCL)

A meta-learning approach was recently proposed by Hsu et al. (2019), where the objective is maximum likelihood estimation over similar and dissimilar labels. The conditional class probability was modeled by $p(y=1 \mid x) = \{1 + \exp(w^\top x + b)\}^{-1}$. A stochastic gradient descent algorithm was applied for optimization.
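The likelihood of a pairwise label follows from marginalizing the class labels: a pair is similar iff both labels agree, so $q = p_1 p_2 + (1-p_1)(1-p_2)$ with $p_i = p(y=1 \mid x_i)$. A numpy sketch of the resulting negative log-likelihood (full-batch here, whereas the experiment uses SGD):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcl_loss(w, b, X1, X2, s):
    """Negative log-likelihood of pairwise labels s (1 = similar,
    0 = dissimilar) under the linear-logistic model of the text:
    p(y=1|x) = {1 + exp(w^T x + b)}^{-1}, i.e. sigmoid(-(w^T x + b))."""
    p1 = sigmoid(-(X1 @ w + b))
    p2 = sigmoid(-(X2 @ w + b))
    # Probability that the pair is similar: both positive or both negative.
    q = p1 * p2 + (1 - p1) * (1 - p2)
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(s * np.log(q + eps) + (1 - s) * np.log(1 - q + eps))
```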

#### 5.3.9 Setup for Clustering Algorithms

For the clustering methods, the number of clusters was set to two. To evaluate the accuracy of the k-means-based clustering methods (i.e., KM, CKM, and ITML), test samples were completely separated from training samples: the labels of test samples were predicted based on the clusters obtained from only the training samples. For SSP, the clustering algorithm was applied to both training and test samples so that predictions could be made for the test samples. Since there is no explicit positive or negative assignment in clustering methods, their performance is evaluated by $\min(r, 1-r)$, where $r$ is the misclassification rate.

### 5.4 Discussion

In section 5.3, we stated that the SD and DU classification methods are likely to outperform the SU classification method, which comes from the comparison of their estimation error bounds. As shown in Figure 2, we confirmed that the misclassification rates of the SU, DU, and SD classification methods are consistent with this statement.

Figure 3 indicates that more unlabeled data lead to better classification performance for the SDU and SDDU classification methods. We also found that the SDDU classification method not only reduces the computation cost for tuning the coefficient parameters but also often outperforms the SDU classification method. This might indicate the difficulty of tuning the coefficient parameters $(\gamma_1, \gamma_2, \gamma_3)$ by cross-validation with only similarities and dissimilarities.

Table 2 demonstrates that the SDU and SDDU classification methods perform better than, or comparably to, the other baselines in many scenarios. Specifically, we observed that the superiority of the proposed methods becomes pronounced when the number of pairwise data is limited and the positive and negative class priors are fairly imbalanced (see Table 2b). The first property suggests that the advantage gained from unlabeled data becomes significant when the amount of pairwise supervision is relatively small. The second is consistent with the theoretical analysis in section 4.3, which states that the estimation error of the proposed method can increase as the two class priors approach each other. Furthermore, we confirmed that our methods always benefit from an increased number of pairwise data, while most of the clustering-based methods do not.

## 6 Conclusion and Future Work

In this letter, we proposed a novel weakly supervised classification method, similar-dissimilar-unlabeled (SDU) classification, where the classification risk is computed from pairwise similarities and dissimilarities and unlabeled data. We derived the estimation error bound for the proposed method and confirmed convergence to the optimal solution. From the theoretical analysis, we developed a strategy to reduce the computation cost for tuning the hyperparameter. Through experiments on benchmark data sets, we demonstrated that our SDU classification method performs better than baseline methods.

We discuss three important directions for future work. First, further research in a multiclass classification scenario is required. Our formulation relies on the connection between classification of similarity and classification of binary class labels. Since both are classification problems with binary outcomes, the extension to the multiclass case is not straightforward unless additional information is available. Second, in the SDU classification method, the positive and negative class proportions must not be equal, that is, $\pi_+ \neq \pi_-$. Even if they are not exactly equal, the estimation error can increase as $\pi_+ \to \frac{1}{2}$, as mentioned in section 4.3. Our recent study (Bao, Shimada, Xu, Sato, & Sugiyama, 2020) partially overcomes this problem by ignoring the sign identification of a classifier; that is, a classifier is trained to either minimize or maximize the classification error, but we cannot know which is achieved without auxiliary information. Finally, the use of different types of pairwise supervision should be explored. Although this letter focused on binary representations of similarity and dissimilarity information, it would be more appealing if we could extend our method to handle other types of pairwise supervision, for example, confidence scores (Ishida, Niu, & Sugiyama, 2018) and triplet comparisons (Schroff, Kalenichenko, & Philbin, 2015; Cui, Charoenphakdee, Sato, & Sugiyama, 2020).

## Appendix: Proofs of Theorems

### A.1 Proof of Theorem 1

### A.2 Proof of Theorem 2

### A.3 Proof of Theorem 3

We apply a technique similar to that used for the SU classification method to the DU and SD classification methods. Using the pointwise distributions defined in equations 3.12 and 3.13, we have the following lemmas.

**Lemma 1.**

**Lemma 2.**

**Lemma 3.**

**Proof.**

### A.4 Proof of Theorem 5

## Acknowledgments

H.B. was supported by JST ACT-I grant JPMJPR18UI. I.S. was supported by JST CREST grant JPMJCR17A1, Japan. M.S. was supported by JST CREST grant JPMJCR1403.

## References

*Proceedings of the 36th International Conference on Machine Learning*

*Proceedings of the 35th International Conference on Machine Learning*

*Similarity-based classification: Connecting similarity learning to binary classification*

*Proceedings of 19th International Conference on Machine Learning*

*Constrained clustering: Advances in algorithms, theory, and applications*

*Proceedings of the 21st International Conference on Machine Learning*

*ACM Transactions on Intelligent Systems and Technology*

*Semi-supervised learning*

*Proceedings of the 36th International Conference on Machine Learning*

*Neurocomputing*

*Advances in neural information processing systems*, 28

*Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition*

*Neural Computation*

*Proceedings of the 24th International Conference on Machine Learning*

*Advances in neural information processing systems*

*Proceedings of the 32nd International Conference on Machine Learning*

*UCI machine learning repository*

*Neurocomputing*

*Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the Eighth IEEE International Conference on Data Mining*

*Advances in neural information processing systems*

*Advances in neural information processing systems*

*Proceedings of the 19th International Conference on Machine Learning*

*Proceedings of the IEEE 12th International Conference on Computer Vision*

*Proceedings of the International Conference on Learning Representations*

*International Conference on Learning Representations*

*Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*

*Advances in neural information processing systems*

*Foundations of machine learning*

*European Journal of Social Psychology*

*Proceedings of the 29th International Conference on Machine Learning*

*Annals of the Institute of Statistical Mathematics*

*Representation learning with contrastive predictive coding*

*Proceedings of the 33rd International Conference on Machine Learning*

*Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*

*Proceedings of the 34th International Conference on Machine Learning*

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*

*Advances in neural information processing systems*, 29

*Proceedings of the 18th International Conference on Machine Learning*

*Journal of the American Statistical Association*

*Journal of Machine Learning Research*

*Class2Simi: A new perspective on learning with label noise*

*Advances in neural information processing systems*

*IEEE Transactions on Pattern Analysis and Machine Intelligence*

*Proceedings of the 30th International Conference on Machine Learning*

*Proceedings of the 24th International Conference on Machine Learning*

## Author notes

T.S. is now with Preferred Networks, Japan.