## Abstract

Learning from triplet comparison data has been extensively studied in the context of metric learning, where we want to learn a distance metric between two instances, and ordinal embedding, where we want to learn an embedding in a Euclidean space of the given instances that preserve the comparison order as much as possible. Unlike fully labeled data, triplet comparison data can be collected in a more accurate and human-friendly way. Although learning from triplet comparison data has been considered in many applications, an important fundamental question of whether we can learn a classifier only from triplet comparison data without all the labels has remained unanswered. In this letter, we give a positive answer to this important question by proposing an unbiased estimator for the classification risk under the empirical risk minimization framework. Since the proposed method is based on the empirical risk minimization framework, it inherently has the advantage that any surrogate loss function and any model, including neural networks, can be easily applied. Furthermore, we theoretically establish an estimation error bound for the proposed empirical risk minimizer. Finally, we provide experimental results to show that our method empirically works well and outperforms various baseline methods.

## 1 Introduction

Recently, learning from comparison-feedback data has received increasing attention (Heim, 2016; Kleindessner, 2017). It is usually argued that humans perform better in the task of evaluating which instances are similar, rather than identifying each individual instance (Stewart, Brown, & Chater, 2005). It is also argued that humans can achieve much better and more reliable performance on assessing the similarity on a relative scale (“Instance A is more similar to instance B than to instance C”) rather than on an absolute scale (“The similarity score between A and B is 0.9, while the one between A and C is 0.4”) (Kleindessner, 2017). Collecting data in this manner has the advantage of avoiding the problem caused by individuals' different assessment scales. However, the collected absolute similarity scores may only provide information on a comparison level in some applications, such as sensor localization (Liu, Wu, & He, 2004). It was shown that keeping only the relative comparison information can help an algorithm be resilient against measurement errors and achieve high accuracy (Xiao, Li, & Luo, 2006).

In this letter, we focus on the problem of learning from triplet comparison data, a common form of comparison-feedback data. A triplet comparison $(x_a, x_b, x_c)$ contains the information that instance $x_a$ is more similar to $x_b$ than to $x_c$. As one example, search engine query logs can readily provide feedback in the form of triplet comparisons (Schultz & Joachims, 2004). Given a list of website links $\{A, B, C\}$ for a query, if links $A$ and $B$ are clicked and link $C$ is not clicked, we can formulate a triplet comparison as $(A, B, C)$. We can also collect unlabeled data sets first and collect triplet comparisons afterward, as with the instrument data set (Mojsilovic & Ukkonen, 2019) and the car data set (Kleindessner, 2017). In these cases, data are collected in a totally unlabeled way.
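The query-log example can be sketched in a few lines of Python. This is only an illustration: the function name `triplets_from_clicks` is hypothetical, and emitting both orderings of the two clicked links is an assumption made here, not something prescribed by the cited work.

```python
from itertools import permutations

def triplets_from_clicks(links, clicked):
    """Build (a, b, c) triplets where a and b were clicked and c was not.

    Assumption for illustration: every ordered pair of distinct clicked
    links is paired with every skipped link.
    """
    clicked_links = [link for link in links if link in clicked]
    skipped_links = [link for link in links if link not in clicked]
    triplets = []
    for a, b in permutations(clicked_links, 2):
        for c in skipped_links:
            triplets.append((a, b, c))
    return triplets

result = triplets_from_clicks(["A", "B", "C"], {"A", "B"})
print(result)
```

The triplet $(A, B, C)$ from the text is among the returned triplets; no labels for the links are needed at any point.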

Learning from triplet comparison data was initially studied in the context of metric learning (Schultz & Joachims, 2004), in which a consistent distance metric between two instances is assumed to be learned from data. The well-known triplet loss for face recognition was proposed in this line of research (Schroff, Kalenichenko, & Philbin, 2015; Yu, Liu, Gong, Ding, & Tao, 2018). When this loss function is used, an inductive mapping function can be efficiently learned from triplet comparison image data. At the same time, the problem of ordinal embedding has also been extensively studied (Agarwal et al., 2007; Van Der Maaten & Weinberger, 2012). It aims to learn an embedding of the given instances to the Euclidean space that preserves the order given by the data. Algorithms for large-scale ordinal embedding have been developed (Anderton & Aslam, 2019). In addition, many other problem settings have been considered for the situation of using only triplet comparison data, such as nearest-neighbor search (Haghiri, Ghoshdastidar, & von Luxburg, 2017), kernel function construction (Kleindessner & von Luxburg, 2017a) and outlier identification (Kleindessner & Von Luxburg, 2017b).

However, learning a binary classifier from triplet comparison data remained untouched until recently. A random forest construction algorithm (Haghiri, Garreau, & Luxburg, 2018) was proposed for both classification and regression. However, it first requires a labeled data set and needs to actively access a triplet comparison oracle many times. For passively collected triplet comparison data, a boosting-based algorithm (Perrot & von Luxburg, 2018) was recently proposed without accessing a triplet comparison oracle. However, a set of labeled data is still indispensable to initiating the training process. To the best of our knowledge, this letter is the first to tackle the problem of learning a classifier only from passively obtained triplet comparison data without accessing either a labeled data set or an oracle.

We show that we can learn a binary classifier from only passively obtained triplet comparison data. We achieve this goal by developing a novel method for learning a binary classifier in this setting with theoretical justification. We use the direct risk minimization framework given for the classification problem. We then show that the classification risk can be empirically estimated in an unbiased way given only triplet comparison data. Theoretically, we establish an estimation error bound for the proposed empirical risk minimizer, showing that learning from triplet comparison data is consistent. Our method also returns an inductive model, which distinguishes it from clustering and ordinal embedding, and which can be applied to unseen test data points. The test data consist of single instances instead of triplet comparisons since our primary goal is to perform a binary classification task on unseen data points.

In summary, for the problem of classification using only triplet comparison data, our contributions in this letter are three-fold:

- We propose an empirical risk minimization method for binary classification using only passively obtained triplet comparison data, which gives us an inductive classifier.
- We theoretically establish an estimation error bound for our method, showing that the learning is consistent.
- We experimentally demonstrate the practical usefulness of our method.

## 2 Related Work

Our problem setting of learning a binary classifier from passively obtained triplet comparison data can be considered a type of weakly supervised classification problem, where we do not have access to ground-truth labels (Zhou, 2017).

An approach based on constructing an unbiased risk estimator of the true classification risk from weakly supervised data has been explored in many problem settings; for example, positive-unlabeled classification (du Plessis, Niu, & Sugiyama, 2014; Niu, du Plessis, Sakai, Ma, & Sugiyama, 2016) and similarity-unlabeled classification (Bao, Niu, & Sugiyama, 2018) can be handled by the framework of learning from two sets of unlabeled data (Lu, Niu, Menon, & Sugiyama, 2019). Nevertheless, our problem setting is not a special case addressed by Lu et al. (2019) since we have only one set of triplet comparison data. We later show that we can formulate three different distributions, which is significantly different from the framework that Lu et al. (2019) used and can be considered as a case of learning from three sets of unlabeled data.

Moreover, our problem setting is also different from similarity-dissimilarity-unlabeled classification (Shimada, Bao, Sato, & Sugiyama, 2019) in the sense that we have no access to unlabeled data and similarity and dissimilarity pairs, only triplet comparison information. Furthermore, it is important to note that our problem setting is also different from preference learning (Fürnkranz & Hüllermeier, 2010) since we do not want to learn a ranking function but construct a binary classifier. Although we can first learn a ranking function and then decide a proper threshold to construct a binary classifier (Narasimhan & Agarwal, 2013), it is not straightforward to choose a proper threshold. Therefore, instead of this two-stage method, we focus on a method that can directly learn a binary classifier from triplet comparison data.

## 3 Learning a Classifier from Triplet Comparison Data

In this section, we first review the fully supervised classification setting. Then we introduce the problem setting and assumption for the data generation process of triplet comparison data. Finally, we describe the proposed method for training a binary classifier from only passively obtained triplet comparison data.

### 3.1 Preliminary

In the fully supervised classification setting, we are given both positive and negative training data collectively drawn from the joint density $p(x,y)$. However, in our case, we still want to train a binary classifier that minimizes the classification risk, although we do not have fully labeled data.

### 3.2 Generation Process of Triplet Comparison Data

Note that the ratio of $n_1 \triangleq |D_1|$ to $n_2 \triangleq |D_2|$ is fixed because we assume the three samples in a triplet are generated independently from $p(x,y)$; thus, the ratio $n_1/n_2$ is only dependent on the underlying class prior probabilities, which are fixed, unknown values.
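The claim that the split between the two data sets depends only on the class prior can be checked with a small simulation. The sketch below rests on an assumption that is not stated in the surviving text: a triplet is placed in $D_2$ only when $y_a = y_c \neq y_b$, and in $D_1$ otherwise. Under that assumed rule, the expected fraction of triplets in $D_2$ is $\pi_+(1-\pi_+)$. The function names are hypothetical.

```python
import random

def sample_label(prior_pos, rng):
    """Draw a label in {+1, -1} with P(y = +1) = prior_pos."""
    return 1 if rng.random() < prior_pos else -1

def count_split(prior_pos, n_triplets, seed=0):
    """Draw label triplets i.i.d. and count how many fall in each set.

    ASSUMED rule (illustrative only): D2 receives triplets with
    y_a == y_c != y_b; every other label pattern goes to D1.
    """
    rng = random.Random(seed)
    n1 = n2 = 0
    for _ in range(n_triplets):
        ya = sample_label(prior_pos, rng)
        yb = sample_label(prior_pos, rng)
        yc = sample_label(prior_pos, rng)
        if ya == yc and ya != yb:
            n2 += 1
        else:
            n1 += 1
    return n1, n2

for prior in (0.6, 0.7, 0.9):
    n1, n2 = count_split(prior, 50_000)
    # Compare the empirical D2 fraction with prior * (1 - prior).
    print(prior, n2 / (n1 + n2), prior * (1 - prior))
```

The features $x$ never enter the split, which is why only the class prior matters for the ratio.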

The two data sets can be considered to be generated from two underlying distributions, as indicated by the following lemma.

Detailed derivation is given in appendix A.

We denote the pointwise data collected from $D_1$ and $D_2$ by ignoring the triplet comparison relation as $D_{1,a} \triangleq \{x_{1,a}\}_{i=1}^{n_1}$, $D_{1,b} \triangleq \{x_{1,b}\}_{i=1}^{n_1}$, $D_{1,c} \triangleq \{x_{1,c}\}_{i=1}^{n_1}$, $D_{2,a} \triangleq \{x_{2,a}\}_{i=1}^{n_2}$, $D_{2,b} \triangleq \{x_{2,b}\}_{i=1}^{n_2}$, and $D_{2,c} \triangleq \{x_{2,c}\}_{i=1}^{n_2}$, the marginal densities of which can be expressed by the following theorem.

The proof is given in appendix B.

Theorem 2 indicates that from triplet comparison data, we can essentially obtain samples that can be drawn independently from three different distributions. We denote the three aggregated data sets as

### 3.3 Unbiased Risk Estimator for Triplet Comparison Data

theorem 2. Letting

Our goal is to solve equation 3.8 so that we can express $p_+(x)$ and $p_-(x)$ in terms of the three densities from which we have independent and identically distributed (i.i.d.) data samples. To this end, we can rewrite the classification risk, which we want to minimize, in terms of $\tilde{p}_1(x)$, $\tilde{p}_2(x)$, and $\tilde{p}_3(x)$. An answer to equation 3.8 is given by the following lemma.

Detailed derivation is given in appendix C.

As a result of lemma 3, we can express the classification risk using only triplet comparison data. Letting $\ell_+(x) \triangleq \ell(f(x), +1)$ and $\ell_-(x) \triangleq \ell(f(x), -1)$, we have the following theorem.

The proof is given in appendix D.

In this letter, we consider the common case in which $\pi_{\mathrm{test}} = \pi_+$, which means the test data set shares the same class prior as the training data set. However, even when $\pi_{\mathrm{test}} \neq \pi_+$, which means that class prior shift (Sugiyama, 2012) occurs, our method can still be used when $\pi_{\mathrm{test}}$ is known.

The process of obtaining the empirical risk minimizer of equation 3.10, $\hat{f} = \arg\min_{f} \hat{R}_{T,\ell}(f)$, is similar to other ERM-based learning approaches. As long as the risk representation that we want to minimize is continuous and differentiable with respect to the model parameters, as is the case for a linear-in-parameter model or neural networks, we can use powerful stochastic optimization algorithms (Kingma & Ba, 2014).

## 4 Estimation Error Bound

We assume that for any probability density $\mu$, the specified model $\mathcal{F}$ satisfies $\mathfrak{R}(\mathcal{F}) \le C_{\mathcal{F}}/\sqrt{n}$ for some constant $C_{\mathcal{F}} > 0$. Also, let $f^* \triangleq \arg\min_{f \in \mathcal{F}} R(f)$ be the true risk minimizer and $\hat{f} \triangleq \arg\min_{f \in \mathcal{F}} \hat{R}_{T,\ell}(f)$ the empirical risk minimizer.

The proof is given in appendix E.

Since $n$ appears in the denominator, it is obvious that when the class prior is fixed, the bound will get tighter as the amount of triplet comparison data increases. However, it is not clear how the bound will behave when we fix the amount of triplet comparison data and change the class prior. Thus in Figure 1, we show the behavior of the coefficient term $C_R/|ac-b^2|$ with respect to the class prior shared by the training and test data sets. From the illustration, we can capture the rough trend that the bound gets tighter when the class prior becomes further from 0.5. We will investigate this behavior in experiments.

## 5 On Class Prior

In the previous sections, the class prior $\pi_+$ was assumed to be known. In this simple case, we can directly use the proposed algorithm to separate test data as well as identify the correct classes. However, this assumption may not hold in many real-world applications. Two situations can be considered. In the worst case, no information about the class prior is given. Although we can still estimate the class prior from data and obtain a classifier that is able to separate data of different classes, we cannot identify the correct class without knowing which class has the higher class prior. A better situation is that we know which class has the higher class prior. By setting this class as the positive one, we can successfully train a classifier to identify the correct class. Thus, we assume that the positive class has the higher class prior, that is, $\pi_+ > \frac{1}{2}$.
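Why the side information $\pi_+ > \frac{1}{2}$ is needed can be illustrated numerically. The sketch below is not the estimator of this letter; it assumes, purely for illustration, that a triplet falls into the second data set with probability $q = \pi_+(1-\pi_+)$. Under that assumption, $\pi_+$ and $1-\pi_+$ produce the same $q$, so the counts alone cannot tell the two classes apart, and the larger root is chosen. The function name `estimate_prior` is hypothetical.

```python
import math

def estimate_prior(n1, n2):
    """Recover the class prior from the sizes of the two triplet sets.

    ASSUMPTION (illustrative): n2 / (n1 + n2) estimates q = pi * (1 - pi).
    Solving pi^2 - pi + q = 0 and taking the root above 1/2 gives pi.
    """
    q = n2 / (n1 + n2)
    # Clip at zero: sampling noise can push 1 - 4q slightly below zero.
    disc = max(0.0, 1.0 - 4.0 * q)
    return 0.5 * (1.0 + math.sqrt(disc))

# Priors 0.7 and 0.3 both give q = 0.21, so only the larger root is returned.
print(estimate_prior(79, 21))
```

Without knowing which class is the majority, the smaller root would be an equally valid answer, which matches the identification issue described above.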

### 5.1 Class Prior Estimation from Triplet Comparison Data

## 6 Experiments

In this section, we conducted experiments using real-world data sets to evaluate and investigate the performance of the proposed method for triplet classification.

### 6.1 Baseline Methods

#### 6.1.1 KMEANS

As a simple baseline, we used $k$-means clustering (MacQueen, 1967) with $k=2$ on all the data instances of triplets while ignoring all the relation information.

#### 6.1.2 ITML

Information-theoretic metric learning (Davis, Kulis, Jain, Sra, & Dhillon, 2007) is a metric learning method that requires pairwise relationships between data instances. From a triplet $(x_a, x_b, x_c)$, we constructed pairwise constraints as $(x_a, x_b)$ being similar and $(x_a, x_c)$ being dissimilar. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data. We used the identity matrix as prior knowledge and fixed the slack variable as $\gamma = 1$.
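The conversion from triplets to pairwise constraints described here can be sketched as follows. The function name `triplets_to_pairs` is hypothetical, and the sketch covers only the constraint construction, not the metric learning step itself.

```python
def triplets_to_pairs(triplets):
    """Split each triplet (xa, xb, xc) into one similar and one dissimilar pair."""
    similar, dissimilar = [], []
    for xa, xb, xc in triplets:
        similar.append((xa, xb))      # xa is closer to xb ...
        dissimilar.append((xa, xc))   # ... than to xc
    return similar, dissimilar

similar, dissimilar = triplets_to_pairs([("a1", "b1", "c1"), ("a2", "b2", "c2")])
print(similar)
print(dissimilar)
```

Each triplet yields exactly one constraint of each kind, so the two returned lists always have the same length as the input.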

#### 6.1.3 TL

Triplet loss (Schroff et al., 2015) is a loss function proposed in the context of deep metric learning, which can learn a metric directly from triplet comparison data. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data.
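For reference, the triplet loss on a single triplet of embedded points can be sketched as below, using squared Euclidean distances and a margin. The margin value of 0.2 is an illustrative default, not a setting reported in this letter, and the inputs stand for already-embedded vectors.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin (in squared distance).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)

# A well-ordered triplet incurs no loss; swapping b and c is penalized.
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [3.0, 0.0]))
print(triplet_loss([0.0, 0.0], [3.0, 0.0], [1.0, 0.0]))
```

For a triplet $(x_a, x_b, x_c)$, the anchor is the embedding of $x_a$, the positive that of $x_b$, and the negative that of $x_c$.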

#### 6.1.4 SERAPH

Semisupervised metric learning paradigm with hypersparsity (Niu, Dai, Yamada, & Sugiyama, 2014) is a metric learning method based on entropy regularization. We formulated the pairwise relationships in the same manner as with ITML. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data.

#### 6.1.5 SU

SU learning (Bao et al., 2018) is a method for learning a binary classifier from similarity and unlabeled data. We used the same method for estimating the class prior and considered the less similar sample in a triplet as unlabeled data.

### 6.2 Data Sets

#### 6.2.1 UCI Data Sets

We used six data sets from the UCI Machine Learning Repository (Asuncion & Newman, 2007). They are binary classification data sets, and we used the given labels to generate the triplet comparison data.

#### 6.2.2 Image Data Sets

We used three image data sets.

The MNIST (LeCun, Bottou, Bengio, & Haffner, 1998) data set consists of 70,000 examples associated with a label from 10 digits. Each data instance is a $28 \times 28$ gray-scale image; thus, the input dimension is 784. To form a binary classification problem, we treated even numbers as the positive class and odd numbers as the negative class. The data were standardized to have zero mean and unit variance.

The Fashion MNIST (Xiao, Rasul, & Vollgraf, 2017) data set consists of 70,000 examples associated with a label from 10 fashion item classes. Each data instance is a $28 \times 28$ gray-scale image; thus, the input dimension is 784. To form a binary classification problem, we treated five classes (T-shirt/top, Pullover, Dress, Coat, and Shirt) as the positive class since they all represent upper-body clothing. The data were standardized to have zero mean and unit variance.

The CIFAR-10 (Krizhevsky & Hinton, 2009) data set consists of 60,000 examples associated with a label from 10 classes. Each image is given in a $32 \times 32 \times 3$ format; thus, the input dimension is 3,072. To form a binary classification problem, we treated four classes (airplane, automobile, ship, and truck) as the positive class since they all represent artificial objects.
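The three binarization rules can be written down compactly. The sketch below works on class names (or digits for MNIST) rather than on any particular loader's integer encoding; the constant and function names are hypothetical.

```python
# Positive-class definitions taken from the data set descriptions above.
MNIST_POSITIVE = {0, 2, 4, 6, 8}  # even digits
FASHION_POSITIVE = {"T-shirt/top", "Pullover", "Dress", "Coat", "Shirt"}
CIFAR10_POSITIVE = {"airplane", "automobile", "ship", "truck"}

def binarize(label, positive_set):
    """Map an original class label to +1 (positive) or -1 (negative)."""
    return 1 if label in positive_set else -1

print([binarize(d, MNIST_POSITIVE) for d in range(10)])
print(binarize("Sneaker", FASHION_POSITIVE), binarize("Coat", FASHION_POSITIVE))
print(binarize("frog", CIFAR10_POSITIVE), binarize("ship", CIFAR10_POSITIVE))
```

With these groupings the positive class covers 5 of 10 classes for MNIST and Fashion MNIST and 4 of 10 for CIFAR-10.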

Although these data sets have labels, using the triplet comparison data composed of labeled data fulfills the purpose of experiments, which is to assess whether the proposed method can work properly. As mentioned in section 1, the proposed method can be applied to situations where we do not have access to the labels.

### 6.3 Proposed Method

For the proposed method, we used a fully connected neural network with one hidden layer and rectified linear units (ReLUs) (Nair & Hinton, 2010) for all the data sets except CIFAR-10. The width of the hidden layer was set to 100 throughout all experiments. Adam (Kingma & Ba, 2014) was used for optimization. The neural network architecture used for CIFAR-10 is specified in appendix F. Two surrogate losses were used, as indicated in Tables 1, 2, and 3.
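The tables label the two surrogate losses as Squared and Double Hinge, but their formulas do not appear in this section. A minimal sketch follows, assuming the definitions commonly used in the weakly supervised classification literature; the scaling constants in particular are an assumption, not taken from this letter. Here `z` is the classifier output $f(x)$ and `t` is a label in $\{+1, -1\}$.

```python
def squared_loss(z, t):
    """Assumed form: (1/4) * (t*z - 1)^2."""
    return 0.25 * (t * z - 1.0) ** 2

def double_hinge_loss(z, t):
    """Assumed form: max(-m, max(0, 1/2 - m/2)) with margin m = t*z."""
    m = t * z
    return max(-m, max(0.0, 0.5 - 0.5 * m))

# Both assumed losses satisfy loss(z, +1) - loss(z, -1) = -z.
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z,
          squared_loss(z, 1) - squared_loss(z, -1),
          double_hinge_loss(z, 1) - double_hinge_loss(z, -1))
```

The property checked in the loop, that the difference between the positive-label and negative-label losses is linear in the output, is the reason these two forms are popular in risk-rewriting approaches.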

| Data Set | Squared (proposed) | Double Hinge (proposed) | KMEANS | ITML | TL | SERAPH | SU |
|---|---|---|---|---|---|---|---|
| Adult | 65.54 (0.41) | 64.19 (0.61) | 71.94 (0.10) | 71.04 (1.00) | 61.48 (1.36) | 71.04 (1.00) | 75.88 (0.50) |
| Breast | 97.41 (0.28) | 96.90 (0.31) | 96.20 (0.34) | 95.84 (0.29) | 93.87 (0.78) | 96.72 (0.23) | 65.26 (0.76) |
| Diabetes | 70.71 (0.84) | 64.87 (0.74) | 66.69 (0.70) | 65.91 (0.69) | 64.38 (1.60) | 67.44 (0.78) | 34.42 (0.73) |
| Magic | 61.75 (1.00) | 71.91 (0.39) | 65.08 (0.17) | 64.79 (0.17) | 65.42 (0.22) | 64.96 (0.19) | 34.77 (0.19) |
| Phishing | 76.58 (0.30) | 74.95 (0.27) | 63.43 (0.50) | 63.75 (0.23) | 57.85 (0.92) | 63.42 (0.53) | 34.17 (0.22) |
| Spambase | 62.08 (1.87) | 64.66 (1.04) | 63.59 (0.24) | 63.24 (0.31) | 59.59 (1.57) | 63.28 (0.34) | 60.27 (0.30) |
| MNIST | 79.86 (0.35) | 80.78 (0.34) | 65.24 (0.25) | 0.00 (0.00) | 58.26 (1.24) | 0.00 (0.00) | 50.80 (0.03) |
| Fashion | 89.73 (0.33) | 91.62 (0.33) | 74.90 (1.00) | 0.00 (0.00) | 76.83 (1.31) | 0.00 (0.00) | 49.85 (0.08) |
| CIFAR10 | 76.39 (1.57) | 66.28 (2.51) | 64.17 (0.01) | 0.00 (0.00) | 60.17 (1.26) | 0.00 (0.00) | 59.50 (0.50) |


### 6.4 Results

The proposed method estimates the unknown class prior first. For baseline methods, performance is measured by the clustering accuracy $1-\min(r,1-r)$, where $r$ is the error rate. The results for different numbers of triplets are listed in Tables 1, 2, and 3. The best and equivalent methods are shown in bold based on a one-sided $t$-test with a significance level of $5\%$. Also, as shown in Figure 2, the performance of the proposed method with respect to the class prior and the size of the training data set followed the prediction of the theory in most cases.
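The clustering accuracy used for the baselines can be computed as in the short sketch below. The function name is hypothetical; labels and cluster assignments are assumed to be encoded with the same two values.

```python
def clustering_accuracy(y_true, y_pred):
    """Return 1 - min(r, 1 - r), where r is the error rate.

    Taking the minimum makes the score invariant to swapping the two
    cluster names, since a clustering carries no class identity.
    """
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    r = errors / len(y_true)
    return 1.0 - min(r, 1.0 - r)

# Three of four assignments disagree, yet flipping cluster names gives 0.75.
print(clustering_accuracy([1, 1, -1, -1], [-1, -1, 1, -1]))
```

By construction this score never drops below 0.5, which should be kept in mind when comparing it with the plain accuracy of a classifier.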

| Data Set | Squared (proposed) | Double Hinge (proposed) | KMEANS | ITML | TL | SERAPH | SU |
|---|---|---|---|---|---|---|---|
| Adult | 62.72 (0.57) | 59.74 (1.44) | 71.44 (0.60) | 71.79 (0.20) | 58.53 (1.17) | 70.54 (1.09) | 76.30 (0.04) |
| Breast | 96.90 (0.44) | 96.53 (0.35) | 96.28 (0.29) | 96.79 (0.24) | 89.67 (1.97) | 96.68 (0.27) | 64.12 (0.91) |
| Diabetes | 69.64 (0.68) | 67.08 (0.91) | 66.27 (0.65) | 64.87 (0.66) | 63.15 (1.56) | 67.44 (0.68) | 33.90 (0.67) |
| Magic | 63.86 (1.44) | 70.37 (0.36) | 64.86 (0.15) | 65.03 (0.13) | 66.36 (0.30) | 64.94 (0.14) | 34.83 (0.15) |
| Phishing | 75.52 (0.31) | 74.57 (0.37) | 63.08 (0.47) | 63.31 (0.41) | 56.37 (1.18) | 62.73 (0.76) | 33.89 (0.20) |
| Spambase | 61.18 (1.11) | 59.95 (1.38) | 63.55 (0.32) | 64.17 (0.31) | 59.35 (1.48) | 63.53 (0.35) | 58.96 (0.44) |
| MNIST | 74.23 (0.32) | 75.19 (0.50) | 64.74 (0.55) | 0.00 (0.00) | 56.07 (0.87) | 0.00 (0.00) | 50.87 (0.26) |
| Fashion | 83.83 (0.55) | 87.86 (0.66) | 75.40 (0.34) | 0.00 (0.00) | 76.66 (1.39) | 0.00 (0.00) | 49.88 (0.08) |
| CIFAR10 | 66.28 (1.77) | 62.63 (2.53) | 64.16 (0.01) | 0.00 (0.00) | 61.26 (1.13) | 0.00 (0.00) | 59.05 (0.65) |


| Data Set | Squared (proposed) | Double Hinge (proposed) | KMEANS | ITML | TL | SERAPH | SU |
|---|---|---|---|---|---|---|---|
| Adult | 58.12 (0.90) | 55.10 (1.00) | 70.54 (1.50) | 70.04 (1.17) | 58.28 (0.94) | 68.54 (1.67) | 75.27 (0.51) |
| Breast | 96.68 (0.32) | 96.50 (0.35) | 95.91 (0.34) | 96.24 (0.24) | 94.27 (0.68) | 96.64 (0.28) | 66.20 (0.80) |
| Diabetes | 69.25 (0.98) | 65.36 (0.89) | 64.97 (0.87) | 67.27 (0.72) | 63.47 (1.22) | 67.11 (0.82) | 35.23 (0.94) |
| Magic | 60.54 (1.88) | 68.56 (0.53) | 64.88 (0.13) | 65.15 (0.14) | 66.31 (0.42) | 64.97 (0.15) | 34.60 (0.34) |
| Phishing | 72.22 (0.62) | 72.11 (0.65) | 63.70 (0.26) | 63.71 (0.21) | 57.02 (1.41) | 63.17 (0.77) | 34.03 (0.32) |
| Spambase | 57.69 (1.68) | 55.74 (1.19) | 63.78 (0.34) | 63.04 (0.35) | 60.78 (1.63) | 63.74 (0.25) | 58.92 (0.43) |
| MNIST | 67.14 (0.67) | 70.96 (0.53) | 64.49 (1.00) | 0.00 (0.00) | 57.88 (1.43) | 0.00 (0.00) | 50.10 (0.62) |
| Fashion | 76.67 (0.40) | 83.74 (0.55) | 74.90 (1.00) | 0.00 (0.00) | 73.24 (1.80) | 0.00 (0.00) | 47.97 (0.76) |
| CIFAR10 | 63.14 (1.68) | 58.83 (2.16) | 64.16 (0.01) | 0.00 (0.00) | 61.23 (1.18) | 0.00 (0.00) | 58.65 (0.66) |


## 7 Conclusion

In this letter, we proposed a novel method for learning a classifier from only passively obtained triplet comparison data. We established an estimation error bound for the proposed method and confirmed that the estimation error decreases as the amount of triplet comparison data increases. We also empirically confirmed that the performance of the proposed method surpassed multiple baseline methods on various data sets. For future work, it would be interesting to investigate alternative methods that can handle a multiclass case.

## Appendix A: Proof of Lemma 1

## Appendix B: Proof of Theorem 2

## Appendix C: Proof of Lemma 3

## Appendix D: Proof of Theorem 4

## Appendix E: Proof of Theorem 5

Theorem 5 is proved. $\square$

## Appendix F: CNN Structure for CIFAR-10

The following structure is used:

- Convolution (3 in/32 out-channels, kernel size 3) with ReLU
- Convolution (32 in/32 out-channels, kernel size 3) with ReLU
- Max-pooling (kernel size 2, stride 2)
- Repeat twice:
  - Convolution (32 in/32 out-channels, kernel size 3) with ReLU
  - Convolution (32 in/32 out-channels, kernel size 3) with ReLU
  - Max-pooling (kernel size 2, stride 2)
- Fully connected (512 units) with ReLU
- Fully connected (1 unit)
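The structure above does not state the convolution stride or padding. As a rough check of the tensor shapes, the sketch below traces the spatial size through the three convolution-pooling stages, assuming stride 1 and padding 1 for every convolution; both values are assumptions made here, and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (stride and padding ASSUMED)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of max-pooling with kernel 2 and stride 2."""
    return (size - kernel) // stride + 1

def feature_shape(size=32, channels=32, stages=3):
    """Trace a square input through `stages` blocks of conv, conv, pool."""
    for _ in range(stages):
        size = conv_out(conv_out(size))  # two 3x3 convolutions
        size = pool_out(size)            # halve the spatial size
    return channels, size, size

c, h, w = feature_shape()
print((c, h, w), c * h * w)
```

Under these assumptions the feature map entering the first fully connected layer is flattened from 32 channels of size 4 by 4; a different padding choice would change that size.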

## Acknowledgments

Z.C. was supported by the IST-RA program, the University of Tokyo. N.C. was supported by a MEXT scholarship. I.S. was supported by JST CREST, grant JPMJCR17A1, Japan. M.S. was supported by the International Research Center for Neurointelligence (WPI-IRCN) at the University of Tokyo Institutes for Advanced Study. We thank Ikko Yamane and Han Bao for fruitful discussions on this work.