## Abstract

We study the problem of classification when only a dissimilarity function between objects is accessible. That is, data samples are represented not by feature vectors but in terms of their pairwise dissimilarities. We establish sufficient conditions for dissimilarity functions to allow building accurate classifiers. The theory immediately suggests a learning paradigm: construct an ensemble of simple classifiers, each depending on a pair of examples; then find a convex combination of them to achieve a large margin. We next develop a practical algorithm referred to as dissimilarity-based boosting (DBoost) for learning with dissimilarity functions under theoretical guidance. Experiments on a variety of databases demonstrate that the DBoost algorithm is promising for several dissimilarity measures widely used in practice.

## 1. Introduction

In classification problems, objects are often represented by feature vectors in a Euclidean space. The Euclidean feature space provides many more analytical tools for classification than other representations. However, such a representation requires the selection of features, which is usually difficult and domain dependent. For example, in the area of fingerprint analysis, it took scientists more than 100 years to discover useful features for fingerprint recognition (Maltoni, Maio, Jain, & Prabhakar, 2003). It is not clear even today what kinds of features have good discrimination ability for human face recognition, and existing feature extraction algorithms are not reliable or accurate (Zhao, Chellappa, Phillips, & Rosenfeld, 2003).

An alternative way is to describe the patterns using dissimilarity functions. Dissimilarity is a function that reflects the “distance” between two objects but with no restrictions on its mathematical properties. For some applications, such as image retrieval, dissimilarity representation has the advantage that it is more convenient to define such a measure than a set of meaningful features (Jacobs, Weinshall, & Gdalyahu, 2000; Jain & Zongker, 1997). A number of image dissimilarities have been proposed (Simard, LeCun, & Denker, 1993; Huttenlocher, Klanderman, & Rucklidge, 1993; Li, Chen, & Chi, 2002; Wang, Zhang, & Feng, 2005) and successfully used in real-world applications. Dissimilarity functions can also be defined on strings, graphs, and other structured objects (Gärtner, 2003; Saigo, Vert, Ueda, & Akutsu, 2004). This procedure thus provides a bridge between classical and structural approaches to pattern classification (Graepel, Herbrich, Bollmann-Sdorra, & Obermayer, 1999; Goldfarb, 1985).

The simplest method to classify objects in dissimilarity representations is the nearest neighbor (NN) rule. NN has an appealing asymptotic property that its error rate converges to the Bayes optimal risk (Hart & Cover, 1967). However, the rate of convergence of NN could be very slow (Fukunaga, 1990), and it is observed in practice that NN is sensitive to the choice of dissimilarity measures and noise (Breiman, Friedman, Olshen, & Stone, 1984).

In contrast to NN, several algorithms take global information into account as well. One type of method first embeds the data into a (possibly pseudo-) Euclidean space and then applies traditional Euclidean classification algorithms, with modifications adapted to the pseudo-Euclidean space if necessary (Graepel et al., 1999). Another type of method explicitly constructs feature representations of the objects via their (dis)similarities to a set of prototypes and then runs standard linear separator algorithms like support vector machines (Vapnik, 1998) in the new space (Balcan, Blum, & Vempala, 2004, 2006; Pekalska & Duin, 2002). All of these algorithms demonstrate superior performance to NN on a number of data sets.

In accordance with the progress made in algorithm development, a theoretical foundation of learning with dissimilarities is needed. In some cases, a theory of learning can be given in the form of sufficient conditions for efficient learning. The kernel theory is such an example. The theory states that a large margin is a sufficient condition for good kernels. If the data are well separated with a large margin in an implicit high-dimensional space induced by the kernel, then the kernel is good. That is, there exists a learning algorithm that can generate a classifier having a low generalization error with a small number of training examples. Also important in practice is that the large margin condition implies learning algorithms that are computationally efficient. The algorithms that involve only the inner product in the original input space can be kernelized by replacing the inner product by a positive semidefinite kernel.

Recently, Balcan and Blum (2006) developed a theory of learning with similarities in this way. They defined a notion of what it means for a pairwise function to be a good similarity function for a learning problem. They showed that their definition is sufficient to allow one to learn well and captures the standard notion of a good kernel with some degradation on learning parameters (see also Srebro, 2007; Balcan, Blum, & Srebro, 2008b, for details). This theory immediately suggests algorithms that use feature representation based on prototypes as described earlier and therefore provides a theoretical explanation of their good empirical performances.

In this letter, we develop a theory for learning with dissimilarity functions, in parallel to Balcan and Blum's results. We propose new sufficient conditions for a dissimilarity function that yields good learning guarantees. These sufficient conditions also suggest a computationally efficient algorithm, which is a boosting-type algorithm that combines an ensemble of simple classifiers of special forms. We then make the algorithm more suitable for practical use. An advantage of our theory and algorithm is that they are applicable to unbounded dissimilarity functions, while previous results deal with normalized similarity measures.

The letter is organized as follows. We describe our theory in section 2. In section 3, a practical algorithm, DBoost, is proposed for learning with dissimilarity functions as a consequence of the theory. We provide experimental evidence of the benefits of our algorithm in sections 4 and 5 and conclude in section 6.

## 2. Theory

In this section we describe our theory of learning with dissimilarity functions. We propose sufficient conditions that have good learning guarantees and imply computationally efficient algorithms. We begin with a simple yet intuitively reasonable sufficient condition and then generalize it to incorporate more sophisticated cases.

### 2.1. Notations.

By *dissimilarity*, we mean any nonnegative bivariate function *d*(*x, x*′), where *x, x*′ *∈ X*, and *X* is an instance space. The axioms of a metric—reflexivity, symmetry, and the triangle inequality—are not required of a dissimilarity function.

Labeled examples are represented by *z, z*′, *z*″,…, where *z* = (*x*, *y*), *x ∈ X*, and *y ∈* {−1, +1}. The examples are drawn randomly and either independently or conditionally independently from the underlying distribution *P* of the problem over *X* × {−1, +1}; which of the two sampling schemes is used is always clear from the context. *I* denotes the indicator function, and sign(*x*) = 1 if *x* > 0 and −1 otherwise.

### 2.2. Sufficient Conditions for Learning with Dissimilarity Functions.

We propose in this section sufficient conditions for a dissimilarity function that are useful for learning.

#### 2.2.1. Strong (ϵ, γ)-goodness.

We first give a notion of good dissimilarity functions, which is quite intuitive. This definition expresses that if most examples are more likely to be close to random examples *z*′ of the same class than to *z*″ of the opposite class, the dissimilarity function is good. More precisely, we use an accuracy parameter ϵ and a margin parameter γ to characterize the goodness of a dissimilarity function.

**Definition 1.** *A dissimilarity function d is said to be strongly (ϵ, γ)-good for a learning problem if at least a 1 − ϵ probability mass of examples z = (x, y) satisfies*

$$\Pr_{z',z''}\big[\,d(x,x') < d(x,x'') \mid y' = y,\; y'' = -y\,\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2}. \tag{2.1}$$

The notion of strongly (ϵ, γ)-good dissimilarity functions suggests a simple learning algorithm: draw pairs of examples of different labels, and vote according to which class the test example is more likely to be close to. This is summarized in the following theorem:

**Theorem 1.** *If d is a strongly (ϵ, γ)-good dissimilarity function, then with probability at least 1 − δ over the choice of n = (2/γ²) ln(1/(δθ)) pairs of examples ((x′_i, 1), (x″_i, −1)), i = 1, 2, …, n, the majority voting classifier*

$$f(x) = \operatorname{sign}\Big(\frac{1}{n}\sum_{i=1}^{n} \operatorname{sign}\big(d(x, x''_i) - d(x, x'_i)\big)\Big)$$

*has an error rate of at most ϵ + θ.*

**Proof.** Let *M* be the set of examples satisfying equation 2.1. For any fixed *z* = (*x, y*) *∈ M*,

$$\Pr_{z',z''}\big[\,d(x,x') < d(x,x'') \mid y' = y,\; y'' = -y\,\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2},$$

where the probability is over random examples *z*′ = (*x*′, *y*′) and *z*″ = (*x*″, *y*″). Thus, inequality 2.1 is equivalent to

$$E_{z',z''}\big[\,y \operatorname{sign}\big(d(x,x'') - d(x,x')\big) \mid y' = 1,\; y'' = -1\,\big] \;\ge\; \gamma.$$

The Chernoff bound then implies that

$$P_{S_n}\big[\,y f(x) \le 0\,\big] \;\le\; e^{-n\gamma^2/2},$$

where *P*_{Sn} denotes the probability over the choice of *n* pairs of training examples. Since the above inequality holds for every *z ∈ M*, we can take the expectation over all *z ∈ M*, which results in that the expected error rate on *M* is at most *e*^{−nγ²/2}, that is,

$$E_{z \in M}\big[P_{S_n}(y f(x) \le 0)\big] \;\le\; e^{-n\gamma^2/2}.$$

Note that the order of the two expectations can be exchanged. Thus, we have the same bound for the expected error rate on *M* over the random draw of the training pairs. Next, using the Markov inequality, we obtain that the probability that the error rate over the set *M* is larger than θ is at most *e*^{−nγ²/2}/θ for arbitrary θ > 0:

$$P_{S_n}\big[\Pr_{z \in M}(y f(x) \le 0) > \theta\big] \;\le\; e^{-n\gamma^2/2}/\theta.$$

Finally, setting δ = *e*^{−nγ²/2}/θ and adding the ϵ probability of examples *z* not in *M* completes the proof.

#### 2.2.2. (ϵ, γ, *B*)-goodness.

The strong (ϵ, γ)-goodness is intuitive, but it can be too restrictive even for very simple problems. Consider a one-dimensional problem in which the label of an example *x* is its sign, and the dissimilarity between *x* and *x*′ used here is *d*(*x, x*′) = ∣*x* − *x*′∣. This problem should be perfectly learned with this dissimilarity function. However, for a positive example *x* ≈ 1/8 or a negative example *x* ≈ −1/8, the probability in equation 2.1 is not one. In fact, it is not difficult to show that for any positive example *x ∈* [1/8, 1/4] or negative example *x ∈* [−1/4, −1/8], we have

$$\Pr_{z',z''}\big[\,d(x,x') < d(x,x'') \mid y' = y,\; y'' = -y\,\big] \;\le\; \frac{1}{2} + \frac{3}{8}.$$

That is, a 1/8 probability mass of examples does not have a margin larger than 3/4. Thus, the dissimilarity function is not strongly (1/8, γ)-good for any γ > 3/4.

Notice, however, that for any example (*x*, *y*) in the problem, when randomly choosing *x*′ (or *x*″) of the same (or opposite) class of *x*, if we use only the examples near the boundary, then we would have that *d*(*x, x*′) < *d*(*x, x*″) (*y*′ = *y*, *y*″ = −*y*) holds for all examples (*x*, *y*). To be concrete, if we draw *x*′ according to a new distribution $\tilde P'$, which is the uniform distribution on [1/8, 3/8], and draw *x*″ according to $\tilde P''$, which is the uniform distribution on [−3/8, −1/8], we can learn a zero-error classifier. Therefore, with respect to the new distributions $\tilde P'$ and $\tilde P''$, the dissimilarity function is perfect: it is strongly (0, 1)-good for the problem.

Generally, if we know that the dissimilarity function is strongly good with respect to distributions $\tilde P'$ and $\tilde P''$, we could reweight the data as if they were generated from $\tilde P'$ and $\tilde P''$ and learn the classifier in the same way as described in theorem 1.

In the following definition, a further step is made. We assume the existence of the new distributions, which are not necessarily known a priori. This definition therefore captures a broad class of dissimilarity functions. Later it will become clear that this assumption alone is sufficient to learn an accurate classifier.

**Definition 2.** *Denote by p(x ∣ y = 1) and p(x ∣ y = −1) the conditional pdfs of the learning problem. A dissimilarity function d is said to be (ϵ, γ, B)-good for the learning problem if there exist two pdfs $\tilde p'(x)$ and $\tilde p''(x)$ such that:*

1. *The weighting functions $w'(x) = \tilde p'(x)/p(x \mid y=1)$ and $w''(x) = \tilde p''(x)/p(x \mid y=-1)$ satisfy $w'(x')\,w''(x'') \le B$ for all x′, x″.*
2. *At least a 1 − ϵ probability mass of examples z = (x, y) satisfies*
$$\widetilde{\Pr}_{x' \sim \tilde p',\, x'' \sim \tilde p''}\big[\, y\,\big(d(x,x'') - d(x,x')\big) > 0 \,\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2}, \tag{2.2}$$
*where $\widetilde{\Pr}$ denotes probability with respect to $\tilde p'$ and $\tilde p''$.*

The next theorem says that (ϵ, γ, *B*)-goodness guarantees the existence of a low-error, large-margin classifier, which is a convex combination of the base classifiers:

**Theorem 2.** *If d is an (ϵ, γ, B)-good dissimilarity function, then with probability at least 1 − δ over the choice of n = 16B²/γ² ln(1/δ) pairs of examples ((x′_i, 1), (x″_i, −1)), i = 1, 2, …, n, there exists a convex combination classifier f(x) of n base classifiers h_i(x):*

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} h_i(x),$$

*where*

$$h_i(x) = \frac{w'(x'_i)\,w''(x''_i)}{B}\,\operatorname{sign}\big(d(x, x''_i) - d(x, x'_i)\big),$$

*such that the error rate of the combined classifier at margin γ/2B is at most ϵ + δ, that is*,

$$\Pr_{z}\Big[\, y f(x) \le \frac{\gamma}{2B} \,\Big] \;\le\; \epsilon + \delta.$$

**Proof.** Let *M* be the set of examples satisfying equation 2.2. For a fixed *z* = (*x, y*) *∈ M*,

$$\widetilde{E}_{x',x''}\big[\,y \operatorname{sign}\big(d(x,x'') - d(x,x')\big)\big] \;\ge\; \gamma.$$

Hence, equation 2.2 is equivalent to

$$E_{x',x''}\big[\,y\, w'(x')\, w''(x'')\, \operatorname{sign}\big(d(x,x'') - d(x,x')\big)\big] \;\ge\; \gamma,$$

where the expectation is now with respect to *p*(*x*′ ∣ *y* = 1) and *p*(*x*″ ∣ *y* = −1). Note that ∣*h_i*(*x*)∣ ≤ 1 and *E*[*y h_i*(*x*)] ≥ γ/*B*. The above inequality, together with Hoeffding's inequality, implies that

$$P_{S_n}\Big[\,y f(x) \le \frac{\gamma}{2B}\Big] \;\le\; e^{-n\gamma^2/(8B^2)}.$$

Let *n* = 16*B*²/γ² ln(1/δ). We have *e*^{−nγ²/(8B²)} = δ². Since δ² ≤ δ, taking the expectation of the previous inequality over all *z ∈ M* and then using the Markov inequality (with θ = δ) as in the proof of theorem 1, we complete the proof.

Theorem 2 suggests the following learning scheme given an (ϵ, γ, *B*)-good dissimilarity function *d*. First, draw a set *S*_{1} that contains pairs of examples ((*x*′_{i}, 1), (*x*″_{i}, −1)), *i* = 1, 2,…, *n*, and then construct *n base* classifiers

$$h_i(x) = \operatorname{sign}\big(d(x, x''_i) - d(x, x'_i)\big).$$

It is guaranteed that with probability 1 − δ, there exists a low-error and large-margin classifier, which is a convex combination of these *h*_{i}(*x*). Boosting would be natural for learning this large-margin voting classifier. Thus, one draws an additional set of examples *S*_{2}, uses boosting to learn the combination coefficients α_{i}, and obtains the final classifier,

$$H(x) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i h_i(x) \Big).$$

In order that the final classifier *H*(*x*) has an error rate at most ϵ + ϵ_{1} with probability at least 1 − 2δ, the size of the second training set *S*_{2} can be set according to the margin bound for convex combination classifiers (Schapire, Freund, Bartlett, & Lee, 1998; Wang, Sugiyama, Yang, Zhou, & Feng, 2008). The total number of examples needed to achieve such a learning guarantee follows by setting δ = ϵ_{1}.
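As a small self-contained sketch of this two-stage scheme (our own illustration, with a toy data set and the absolute difference as the dissimilarity; `adaboost` is a bare-bones AdaBoost over the fixed ensemble, not the authors' implementation):

```python
import math
import random

def base_classifiers(pairs, d):
    # h_i(x) = sign(d(x, x''_i) - d(x, x'_i)): +1 when x is closer to the positive prototype
    return [lambda x, xp=xp, xn=xn: 1 if d(x, xn) - d(x, xp) > 0 else -1
            for xp, xn in pairs]

def adaboost(H, S2, T):
    """Learn convex-combination coefficients alpha over a FIXED ensemble H
    from examples S2 = [(x, y)], y in {-1, +1}."""
    w = [1.0 / len(S2)] * len(S2)
    alpha = [0.0] * len(H)
    for _ in range(T):
        errs = [sum(wi for wi, (x, y) in zip(w, S2) if h(x) != y) for h in H]
        i = min(range(len(H)), key=errs.__getitem__)
        e = min(max(errs[i], 1e-10), 1 - 1e-10)
        a = 0.5 * math.log((1 - e) / e)
        alpha[i] += a
        w = [wi * math.exp(-a * y * H[i](x)) for wi, (x, y) in zip(w, S2)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alpha, H)) > 0 else -1

random.seed(1)
d = lambda a, b: abs(a - b)
pos = [1 + random.gauss(0, 0.4) for _ in range(40)]
neg = [-1 + random.gauss(0, 0.4) for _ in range(40)]
pairs = list(zip(pos[:20], neg[:20]))                          # S1: pairs -> base classifiers
S2 = [(x, 1) for x in pos[20:]] + [(x, -1) for x in neg[20:]]  # S2: boosting set
F = adaboost(base_classifiers(pairs, d), S2, T=30)
print(sum(F(x) == y for x, y in S2) / len(S2))                 # training accuracy on S2
```

The split into `pairs` (S₁) and `S2` mirrors the two sample sets in the scheme above; boosting only touches the dissimilarity through the fixed base classifiers.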

#### 2.2.3. Generalized (ϵ, γ, *B*)-Goodness.

We now generalize the notion of (ϵ, γ, *B*)-goodness. Recall in definition 2 that for an (ϵ, γ, *B*)-good dissimilarity function *d*, most examples *z* = (*x*, *y*) satisfy

$$\widetilde{\Pr}\big[\, y\,\big(d(x,x'') - d(x,x')\big) > 0 \,\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2},$$

where $\widetilde{\Pr}$ is the probability with respect to two (unknown) pdfs $\tilde p'$ and $\tilde p''$. A broader class of dissimilarity functions would be that most examples *z* = (*x*, *y*) satisfy

$$\widetilde{\Pr}\big[\, y\,\big(d(x,x'') - d(x,x') + v(x',x'')\big) > 0 \,\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2}$$

for some threshold *v*, which may depend on the example pair (*x*′, *x*″).

**Definition 3.** *Denote by p(x ∣ y = 1) and p(x ∣ y = −1) the conditional pdfs of the learning problem. A dissimilarity function d is said to be generalized (ϵ, γ, B)-good for the learning problem if there exist two pdfs $\tilde p'$, $\tilde p''$ and a threshold function v(x′, x″) such that:*

1. *The weighting functions $w'(x) = \tilde p'(x)/p(x \mid y=1)$ and $w''(x) = \tilde p''(x)/p(x \mid y=-1)$ satisfy $w'(x')\,w''(x'') \le B$ for all x′, x″.*
2. *At least a 1 − ϵ probability mass of examples z = (x, y) satisfies*
$$\widetilde{\Pr}\big[\, y\,\big(d(x,x'') - d(x,x') + v(x',x'')\big) > 0 \,\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2},$$
*where $\widetilde{\Pr}$ is the probability with respect to $\tilde p'$ and $\tilde p''$.*

The learning guarantee of the generalized (ϵ, γ, *B*)-good dissimilarity functions is the same as that of the (ϵ, γ, *B*)-good dissimilarities if the threshold is known.

**Theorem 3.** *Let d be a generalized (ϵ, γ, B)-good dissimilarity function. Assume that the threshold v(x′, x″) is known. Then with probability at least 1 − δ over the choice of n = 16B²/γ² ln(1/δ) pairs of examples (z′_i, z″_i) with labels y′_i = 1, y″_i = −1, there exists a convex combination classifier f(x) of n base classifiers h_i(x),*

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} h_i(x), \qquad h_i(x) = \frac{w'(x'_i)\,w''(x''_i)}{B}\,\operatorname{sign}\big(d(x, x''_i) - d(x, x'_i) + v(x'_i, x''_i)\big),$$

*such that the error rate of the combined classifier at margin γ/2B is at most ϵ + δ.*

The proof follows the proof of theorem 2 by replacing $\operatorname{sign}(d(x,x'') - d(x,x'))$ with $\operatorname{sign}(d(x,x'') - d(x,x') + v(x',x''))$.

Theorem 3 shows that a generalized (ϵ, γ, *B*)-good dissimilarity function guarantees efficient learning if the threshold is given. In practice, it is difficult to know the threshold a priori. However, one can learn the thresholds and the linear coefficients simultaneously in the boosting framework. We draw a set *S*_{1} of *n* pairs of examples (*x*′_{i}, *x*″_{i}), *i* = 1, 2,…, *n*, and a training set *S*_{2}. Then we use boosting to learn the thresholds *v*(*x*′_{i}, *x*″_{i}) and the coefficients α_{i} so that the combined classifier,

$$H(x) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i\, \operatorname{sign}\big(d(x, x''_i) - d(x, x'_i) + v(x'_i, x''_i)\big) \Big),$$

has low error and large margin on the training set *S*_{2}. We describe this algorithm in detail and make it more practical in section 3.

#### 2.2.4. (ϵ, γ, *B*, η)-Goodness.

We propose a further generalization of the previous goodness definitions by weakening the first condition in definition 3. Here we do not require a uniform upper bound on the weighting functions; we only need that the bound holds for most examples. This new definition of goodness can be applied to a much broader class of dissimilarity functions and contains all previous goodness notions as special cases.

**Definition 4.** *Denote by p(x ∣ y = 1) and p(x ∣ y = −1) the conditional pdfs of the learning problem. A dissimilarity function d is said to be (ϵ, γ, B, η)-good for the learning problem if there exist two pdfs $\tilde p'$, $\tilde p''$ and a threshold function v(x′, x″) such that:*

1. *The weighting functions satisfy $w'(x')\,w''(x'') \le B$ except on a set of pairs (x′, x″) of probability at most η under $\tilde p' \times \tilde p''$.*
2. *At least a 1 − ϵ probability mass of examples z = (x, y) satisfies the margin condition of definition 3.*

The learning guarantee of an (ϵ, γ, *B*, η)-good dissimilarity function is the same as that of an (ϵ, γ, *B*)-good dissimilarity function, up to a constant factor.

**Theorem 4.** *If d is an (ϵ, γ, B, η)-good dissimilarity function, then with probability at least 1 − δ over the choice of n = 16B²/γ² ln(2/δ) pairs of examples (z′_i, z″_i) with labels y′_i = 1, y″_i = −1, i = 1, 2, …, n, there exists a convex combination classifier f(x) of n base classifiers h_i(x), of the same form as in theorem 3, such that the error rate of the combined classifier at margin γ/2B is at most ϵ + δ, provided η ≤ δ/(2n) and the threshold is known.*

The proof is almost the same as that of theorem 3. The only difference is that, by our assumption, when choosing the *n* pairs of examples, the probability that the sample contains a "bad" pair—one whose weighting functions violate the bound, that is, *w*′(*x*′)*w*″(*x*″) > *B*—is at most δ/2.

### 2.3. Discussions of the Sufficient Conditions.

In this section we compare our results with existing theories on learning with (dis)similarity functions and study possible extensions of the proposed sufficient conditions.

#### 2.3.1. Comparison to Previous Theory on Learning with Similarity Functions.

We first point out that our results of dissimilarity functions can be easily extended to similarity functions. Let *s*(*x, x*′) denote a similarity function. Replacing *d*(*x, x*′) < *d*(*x, x*″) by *s*(*x, x*′) > *s*(*x, x*″) in all definitions and theorems gives the theory for similarity functions. Therefore, the theory is a unified framework for learning with similarity and dissimilarity functions.

In Balcan and Blum's theory, a similarity function *s* is said to be (ϵ, γ)-good for a (deterministic label) learning problem if there exists a weighting function *w*(*x*) ∈ [0, 1], such that at least a 1 − ϵ probability mass of examples *x* satisfies

$$E_{x'}\big[\, y(x)\, y(x')\, w(x')\, s(x, x') \,\big] \;\ge\; \gamma.$$

For a (ϵ, γ)-good similarity function, there is a simple learning approach. First, draw a set of examples {*x*_{1}, *x*_{2},…, *x*_{n}}. Then construct an *n*-dimensional feature vector of each object *x* as

$$\phi(x) = \big(s(x, x_1),\, s(x, x_2),\, \ldots,\, s(x, x_n)\big).$$

It can be shown that with high probability there is a large-margin, low-error linear separator in the *n*-dimensional feature space, so a linear SVM can be used to learn the classifier in the feature space with a new set of examples.

On the practical side, Balcan and Blum's theory implies an SVM-type algorithm, while ours suggests a boosting-type algorithm. On the theoretical side, their notion of good similarity functions and ours are different sufficient conditions for learning with (dis)similarity functions. That is, neither is a subset of the other, as described in the following proposition:

**Proposition 1.**

- *For every γ and B, there is a similarity function s(⋅, ⋅) and a learning problem P, such that s(⋅, ⋅) is (0, γ, B)-good for P in our sense, but not (0, γ/B)-good in Balcan and Blum's sense.*
- *For every γ, there is a similarity function s(⋅, ⋅) and a learning problem P, such that s(⋅, ⋅) is (0, γ)-good for P in Balcan and Blum's sense, but not (0, Bγ, B)-good in our sense for any B.*

Recently, Balcan, Blum, and Srebro (2008a) proposed an improved sufficient condition for learning with similarity functions: *s* is said to be a (ϵ, γ, τ)-good similarity function for a learning problem *P* if there is a random indicator variable *R* (depending on *x*′) such that at least a 1 − ϵ probability mass of examples *x* satisfies $E_{x'}\big[\, y(x)\, y(x')\, s(x, x') \mid R(x') \,\big] \ge \gamma$, and Pr (*R*) ≥ τ. Here *R* can be understood as a (stochastic) membership function on *x*′, indicating whether it is "important."

This new notion of good similarity function and our definition of goodness are still different sufficient conditions, as shown in the following proposition:

**Proposition 2.**

- *For every γ and B, there is a similarity function s(⋅, ⋅) and a learning problem P, such that s(⋅, ⋅) is (0, γ, B)-good for P in our sense but not (0, γ′, τ′)-good in the Balcan et al. improved sense for any γ′ and τ′ such that γ′τ′ ≥ γ/B.*
- *For every γ, there is a similarity function s(⋅, ⋅) and a learning problem P, such that s(⋅, ⋅) is (0, γ, τ)-good for P in the Balcan et al. improved sense, but not (0, Bγ′, B)-good in our sense for any B and any γ′ ≥ max(γ, τ).*

This is immediate from the previous proposition and the following relation between the improved (ϵ, γ, τ)-goodness and the (ϵ, γ)-goodness described in Balcan et al. (2008a): a (ϵ, γ, τ)-good similarity function is also a (ϵ, *γτ*)-good similarity function, and a (ϵ, γ′)-good similarity function is also a (ϵ, γ, τ)-good similarity function, where γ′ ≥ max(γ, τ).

To conclude the comparison, our (ϵ, γ, *B*)-goodness implies that the “order” of the (dis)similarity is important. That is, data from the same class are closer than those from different classes. But how much closer is not crucial. On the other hand, the (ϵ, γ)-goodness and its improvement of Balcan et al. imply that the average value of the similarity is important. From a practical viewpoint, if the user has some confidence that the data from the same class are more likely to be closer to each other, our result and the suggested boosting-type algorithm apply. This is especially suitable for the applications in which an appropriate scaling of the (dis)similarities is difficult or expensive. In case the (dis)similarities are well scaled so that there is a significant difference of the within-class and between-class (average) similarity, the SVM-type algorithm suggested by the (ϵ, γ)-goodness applies.

#### 2.3.2. Pseudo-Good Dissimilarity Functions.

We require in our definitions of good dissimilarity functions that most of the examples (at least a 1 − ϵ probability mass) are more likely to be close to a random example of the same class than to an example of the opposite class. Another natural but weaker notion would be the following: a dissimilarity function *d* is *pseudo-γ-good* for a learning problem if

$$\Pr_{z, z', z''}\big[\, d(x, x') < d(x, x'') \mid y' = y,\; y'' = -y \,\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2}, \tag{2.3}$$

where the probability is now also taken over the random example *z* = (*x*, *y*). That is, the closeness condition holds on average over *z* rather than for most individual examples.

Note that the pseudo-γ-goodness is a weaker notion than strong (ϵ, γ)-goodness, as described in the following proposition:

**Proposition 3.** *If a dissimilarity function is strongly (ϵ, γ)-good for a learning problem, then it is also pseudo-γ′-good, where γ′ = (1 − ϵ)γ − ϵ (if γ′ ≥ 0)*.

One might expect that the pseudo-goodness would also imply learnability, possibly in a weak sense. However, the next proposition shows that the majority voting scheme given in previous theorems in general does not guarantee any learnability, even in the weakest sense. This result means that our sufficient conditions for efficient learning with dissimilarity functions may not be weakened too much.

**Proposition 4.** *There exist a learning problem and a dissimilarity function d that is pseudo-γ-good for the problem, 0 < γ < 1/2, such that the majority voting classifier of theorem 1 has an error rate larger than 1/2.*

**Proof sketch.** For any example *z* = (*x*, *y*), when *n* → ∞, the law of large numbers gives that *f*(*x*) converges in probability to

$$\bar f(x) = \operatorname{sign}\Big(E_{z', z''}\big[\operatorname{sign}\big(d(x, x'') - d(x, x')\big) \mid y' = 1,\; y'' = -1\big]\Big).$$

Denote

$$q(z) = \Pr_{z', z''}\big[\, d(x, x') < d(x, x'') \mid y' = y,\; y'' = -y \,\big].$$

Note further that an error occurs, *y* $\bar f$(*x*) < 0, if *q*(*z*) < 1/2. For 0 < γ < 1/2, it is easy to construct a distribution such that the following two inequalities hold simultaneously:

$$E_{z}\big[q(z)\big] \;\ge\; \frac{1}{2} + \frac{\gamma}{2}, \qquad \Pr_{z}\Big[\, q(z) < \frac{1}{2} \,\Big] \;>\; \frac{1}{2}.$$

The first inequality is equivalent to equation 2.3, meaning that the dissimilarity function is pseudo-γ-good for the problem. The second inequality implies that the error rate of the voting classifier is larger than 1/2.

## 3. The DBoost Algorithm

In this section, we slightly modify the algorithm suggested by our theory to make it more suitable for practical use. The proposed algorithm is essentially a dissimilarity-based boosting and will be referred to as DBoost.

Recall the algorithm suggested by theorem 3. Given a generalized (ϵ, γ, *B*)-good dissimilarity function *d*, one needs to draw two sets of examples. One set contains *n* pairs of examples ((*x*′_{i}, 1), (*x*″_{i}, −1)), with which we construct the base classifiers

$$h_i(x) = \operatorname{sign}\big(d(x, x''_i) - d(x, x'_i) + v_i\big).$$

The other set of examples is used as training data for boosting to learn the thresholds *v*_{i} and the combination coefficients α_{i} so that the voting classifier has a low error and large margin on the training set.

In practice, however, users often have only one fixed set of examples. So in the DBoost algorithm, we try to make use of the data efficiently. Denote by *S* the set of data the user has. The DBoost algorithm first constructs the pairs (*x*′_{i}, 1), (*x*″_{i}, −1) by considering all possible pairs of examples with different labels in *S*. Then *S* also serves as the training set for boosting to learn the final large-margin convex-combination classifier.

The thresholds *v*_{i} and the coefficients α_{i} are learned by AdaBoost in a series of rounds. At the *i*th round, we have a distribution *D*_{i} over the training set *S*. The algorithm then searches for the example pair (*x*′_{i}, *x*″_{i}) and the threshold *v*_{i} so that the base classifier,

$$h_i(x) = \operatorname{sign}\big(d(x, x''_i) - d(x, x'_i) + v_i\big),$$

has the minimum training error on *S* with respect to the distribution *D*_{i}. The (unnormalized) coefficient α_{i} is determined by this training error, and the distribution *D*_{i} is also updated accordingly (see Figure 3 for details). After *T* rounds, DBoost outputs the final classifier,

$$H(x) = \operatorname{sign}\Big(\sum_{i=1}^{T} \alpha_i h_i(x)\Big).$$

One difficulty in the above procedure is that at each round, searching for the best example pair (*x*′_{i}, *x*″_{i}) over all possible pairs is computationally expensive. We solve this problem in DBoost by searching for the best pair not over the whole set of possible pairs, but over only a small number *M* of example pairs randomly selected at each round. The selection of these example pairs is according to the current distribution *D*_{i}. It is well known that as the round *i* increases, boosting makes the distribution *D*_{i} put larger weights on the examples that are harder to classify. Therefore, the selected example pairs tend to consist of data near the classification boundary, since these are the most difficult examples.

The DBoost algorithm is described in detail in Figures 2 and 3. Figure 2 shows the subroutine of searching for the pair (*x*′, *x*″) and the threshold *v* to construct the base classifier at a certain round of boosting. Figure 3 describes the boosting framework in DBoost, which is essentially the AdaBoost algorithm. Here we show the original version of AdaBoost given by Freund and Schapire (1996). An improved version, RealBoost (Schapire & Singer, 1999), can often achieve better performance. RealBoost folds the coefficient α_{t} into the base classifier *h*_{t}, and hence *h*_{t} outputs a real number. (For details, refer to Schapire & Singer, 1999.)
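For readers without access to the figures, the following is a compact sketch of the DBoost loop as described above (our own reconstruction, not the authors' reference code; the dissimilarity `d`, the toy data, and the simple threshold search are illustrative assumptions):

```python
import math
import random

def dboost(S, d, T=50, M=20, seed=0):
    """S: list of (x, y) with y in {-1, +1}; d: any dissimilarity function.
    Returns H(x) = sign(sum_t alpha_t * sign(d(x, x''_t) - d(x, x'_t) + v_t))."""
    rng = random.Random(seed)
    D = [1.0 / len(S)] * len(S)
    ensemble = []  # tuples (alpha, x_pos, x_neg, v)

    def pick(label):
        # Sample one example of the given label according to the current distribution D.
        idx = [i for i, (_, y) in enumerate(S) if y == label]
        return S[rng.choices(idx, weights=[D[i] for i in idx])[0]][0]

    for _ in range(T):
        best = None  # (weighted error, x_pos, x_neg, v)
        for _ in range(M):  # search only M randomly selected pairs per round
            xp, xn = pick(1), pick(-1)
            margins = [d(x, xn) - d(x, xp) for x, _ in S]
            for v in [0.0] + [-m for m in margins]:  # candidate thresholds
                err = sum(Di for Di, (_, y), m in zip(D, S, margins)
                          if (1 if m + v > 0 else -1) != y)
                if best is None or err < best[0]:
                    best = (err, xp, xn, v)
        err, xp, xn, v = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, xp, xn, v))
        D = [Di * math.exp(-alpha * y * (1 if d(x, xn) - d(x, xp) + v > 0 else -1))
             for Di, (x, y) in zip(D, S)]
        z = sum(D)
        D = [Di / z for Di in D]

    def H(x):
        s = sum(a * (1 if d(x, xn) - d(x, xp) + v > 0 else -1)
                for a, xp, xn, v in ensemble)
        return 1 if s > 0 else -1
    return H

rng = random.Random(1)
pos = [(rng.gauss(1, 0.5), rng.gauss(1, 0.5)) for _ in range(30)]
neg = [(rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)) for _ in range(30)]
S = [(x, 1) for x in pos] + [(x, -1) for x in neg]
H = dboost(S, math.dist, T=20, M=8)
print(sum(H(x) == y for x, y in S) / len(S))  # training accuracy
```

Note the base classifier here uses a per-pair threshold *v*, matching the generalized goodness of section 2.2.3; setting *v* = 0 recovers the simpler base classifiers of theorem 2.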

## 4. Experiments on Learning with Dissimilarity Functions

In this section, we perform experiments on algorithms that learn with dissimilarity functions. We compare DBoost to the nearest-neighbor rule and the algorithm based on Balcan and Blum's theory mentioned in section 1. The algorithms will be denoted by NN and LSVM for short, respectively. The details of the implementation are described below:

- **DBoost:** For DBoost, there are two parameters to be specified. One is *M* in Figure 2, that is, the number of example pairs forming the search space. As discussed earlier, *M* is introduced to reduce the computational cost. We found that the value of *M* has no significant effect on the performance unless *M* is too small. So in all the experiments, we simply set *M* = 100. The other parameter is *T* in Figure 3: the number of base classifiers generated in boosting. We set *T* = 1000 in all the experiments.
- **NN:** We use the one-nearest-neighbor (1NN) classifier.
- **LSVM:** This approach is based on Balcan and Blum's theory of learning with similarity functions. Given a similarity function *s*, we choose a set of prototypes {*p*_{1}, *p*_{2},…, *p*_{r}}. For each training example *x*, calculate the similarity of *x* to the prototypes, and then construct the vector (*s*(*x, p*_{1}), *s*(*x, p*_{2}),…, *s*(*x, p*_{r})). This vector is treated as the feature representation of *x*. The final step is to run a linear SVM on this *r*-dimensional space to obtain the classifier. As mentioned earlier, to have a learning guarantee for this algorithm, the similarity should be a normalized function, that is, ∣*s*∣ ≤ 1. So we need to transform the dissimilarity *d* to a normalized similarity *s*; we use a decreasing transform of *d* governed by a scale parameter σ. The value of the parameter σ is tuned by cross-validation on the training set. In our implementation, we randomly select 20% of the training examples as prototypes, and we run libsvm (Chang & Lin, 2001) to obtain the linear SVM classifier.
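For concreteness, here is a minimal sketch of the LSVM feature construction (ours; the letter's exact normalizing transform is not reproduced here, so we assume *s* = exp(−*d*/σ) as one common choice that guarantees 0 < *s* ≤ 1):

```python
import math
import random

def similarity_features(X, prototypes, d, sigma=1.0):
    """Represent each object by (s(x, p_1), ..., s(x, p_r)),
    with s(x, p) = exp(-d(x, p) / sigma) so that 0 < s <= 1."""
    return [[math.exp(-d(x, p) / sigma) for p in prototypes] for x in X]

rng = random.Random(0)
X = [(rng.gauss(1, 0.5), rng.gauss(1, 0.5)) for _ in range(10)] + \
    [(rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)) for _ in range(10)]
protos = rng.sample(X, 4)  # roughly 20% of the data as prototypes
feats = similarity_features(X, protos, math.dist)
print(len(feats), len(feats[0]))  # prints: 20 4
```

A linear SVM (e.g., libsvm, or scikit-learn's `LinearSVC`) is then trained on these *r*-dimensional vectors.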

In the first set of experiments, we study image classification. For some cases, it is more convenient to directly define dissimilarities between images than to construct meaningful features. Many dissimilarity measures of images have been proposed in the literature. We adopt in this experiment three measures: the tangent distance (Simard et al., 1993), the fuzzy image metric (Li et al., 2002), and the Euclidean distance. We perform the experiments on the USPS database, which consists of images of handwritten digits. The data set has been partitioned into a fixed training set and a test set, consisting of 7291 and 2007 examples, respectively.

Figure 4 shows the results of the three algorithms with the three dissimilarity measures, respectively. By using the tangent distance, NN achieves the best performance. In fact, the tangent distance is developed specifically for the handwritten-digit classification problem. It incorporates strong domain knowledge and is invariant to local within-class transformations (Simard et al., 1993). Therefore the tangent distance determines a good local topology for handwritten digit images. If two images have a very small tangent distance, they are highly likely to be in the same class.

The other two dissimilarities—the fuzzy image metric and the Euclidean distance—are more general-purpose measures. The information of the local distance alone is not enough for an accurate classification in this handwritten digit problem. So the performance of NN with these two dissimilarities is not excellent. On the other hand, DBoost works well in these cases. This result implies that the fuzzy image metric and the Euclidean distance are still good dissimilarity functions. In other words, they contain enough information to build accurate classifiers.

We then consider dissimilarities with discrete (qualitative) values. For instance, when people make a subjective evaluation of the similarity between images, only qualitative (discrete) values can be given. For example, *similar*, *average*, and *dissimilar* are possible values of a three-level qualitative dissimilarity. We evaluate the algorithms on such qualitative measures. To conduct the experiments, Euclidean distances are quantized to 3 to 15 levels, respectively. The results on the USPS data set are shown in Figure 5: even with the three-level measures, for which the information of local topology is mostly lost, DBoost still has a low error rate.

We also evaluate the performance of the algorithms against noisy data. We add to the USPS images gaussian white noise with different variances and feed the Euclidean distance to the algorithm as input. The results are depicted in Figure 6, showing that DBoost is the most robust to noise.

In the second set of experiments, we evaluate the algorithms on a variety of domains. We adopt 22 benchmark data sets from the UCI repository (Asuncion & Newman, 2007). The aim is to see whether the DBoost algorithm works well for dissimilarity functions that are widely used in practice. Here we consider two dissimilarity measures: *l*_{1} and *l*_{∞}. In the experiments, each data set is used in a fivefold cross-validation fashion. The data sets are described in Table 1.

| Data Set | Number of Classes | Number of Examples | Data Set | Number of Classes | Number of Examples |
|---|---|---|---|---|---|
| Balance | 3 | 625 | Letter | 26 | 20,000 |
| Breast | 2 | 699 | Liver | 2 | 345 |
| Cleveland | 2 | 297 | Monk1 | 2 | 556 |
| Diabetes | 2 | 768 | Monk2 | 2 | 601 |
| Echo | 2 | 106 | Monk3 | 2 | 554 |
| German | 2 | 1000 | Satimage | 6 | 6435 |
| Hayes | 3 | 160 | Vehicle | 4 | 846 |
| Hepatitis | 2 | 155 | Vowel | 11 | 990 |
| Image | 7 | 2310 | Wdbc | 2 | 569 |
| Ionosphere | 2 | 351 | Wine | 3 | 178 |
| Iris | 3 | 150 | Wpbc | 2 | 194 |


The results are listed in Table 2. For each data set, the algorithm that has the best performance, and those that are comparable to the best according to the t-test at the significance level 0.01, are marked in boldface. From these results, one can see that on the whole, DBoost has the best performance with these two dissimilarities, even though they are not good measures for the NN classifier.

**Table 2:** Test errors (mean ± standard deviation over the five folds) of NN, LSVM, and DBoost with the *l*_{1} and *l*_{∞} dissimilarities.

| Data Set | NN (*l*_{1}) | LSVM (*l*_{1}) | DBoost (*l*_{1}) | NN (*l*_{∞}) | LSVM (*l*_{∞}) | DBoost (*l*_{∞}) |
|---|---|---|---|---|---|---|
| Balance | 20.2 ± 1.4 | 7.8 ± 2.9 | 5.6 ± 2.3 | 24.0 ± 3.9 | 6.1 ± 1.7 | 5.3 ± 2.6 |
| Breast | 3.2 ± 1.7 | 2.5 ± 1.7 | 2.9 ± 1.8 | 5.0 ± 1.9 | 3.1 ± 2.1 | 3.4 ± 2.6 |
| Cleveland | 22.6 ± 5.4 | 17.8 ± 6.0 | 20.5 ± 5.5 | 26.2 ± 6.0 | 27.0 ± 4.3 | 24.9 ± 8.5 |
| Diabetes | 29.3 ± 1.7 | 23.8 ± 3.2 | 26.4 ± 2.8 | 29.0 ± 1.5 | 22.9 ± 2.8 | 27.2 ± 1.3 |
| Echo | 21.7 ± 7.3 | 16.9 ± 3.8 | 12.3 ± 2.7 | 22.6 ± 7.8 | 16.9 ± 3.8 | 17.9 ± 3.9 |
| German | 30.4 ± 2.9 | 26.7 ± 2.2 | 27.3 ± 2.2 | 33.1 ± 2.9 | 30.1 ± 1.8 | 29.5 ± 3.4 |
| Hayes | 27.5 ± 7.4 | 16.2 ± 3.4 | 18.7 ± 3.2 | 32.5 ± 6.8 | 26.2 ± 4.7 | 21.3 ± 4.6 |
| Hepatitis | 21.3 ± 9.3 | 18.4 ± 3.3 | 17.8 ± 9.6 | 19.1 ± 8.8 | 19.1 ± 6.6 | 19.1 ± 8.8 |
| Image | 2.1 ± 0.6 | 3.4 ± 1.0 | 1.6 ± 0.4 | 3.3 ± 1.0 | 3.9 ± 1.0 | 2.7 ± 0.6 |
| Ionosphere | 10.8 ± 2.8 | 6.8 ± 2.7 | 5.4 ± 2.9 | 13.1 ± 1.6 | 7.7 ± 2.9 | 5.7 ± 3.9 |
| Iris | 7.3 ± 4.3 | 4.0 ± 3.7 | 6.7 ± 3.4 | 4.7 ± 3.8 | 2.7 ± 2.8 | 4.0 ± 3.7 |
| Letter | 4.7 ± 0.3 | 4.2 ± 0.2 | 5.0 ± 0.1 | 8.8 ± 0.5 | 4.1 ± 0.4 | 6.3 ± 0.3 |
| Liver | 40.3 ± 6.7 | 31.6 ± 4.7 | 30.4 ± 6.4 | 41.4 ± 3.8 | 37.4 ± 4.0 | 31.2 ± 5.7 |
| Monk1 | 20.3 ± 4.6 | 8.6 ± 3.7 | 0.0 ± 0.0 | 20.1 ± 4.9 | 19.0 ± 2.8 | 1.8 ± 1.4 |
| Monk2 | 14.0 ± 3.4 | 24.1 ± 2.5 | 1.7 ± 1.8 | 14.0 ± 3.3 | 27.8 ± 3.9 | 4.5 ± 2.8 |
| Monk3 | 19.1 ± 9.1 | 3.6 ± 0.9 | 2.3 ± 0.8 | 24.5 ± 1.4 | 6.5 ± 2.0 | 4.1 ± 1.0 |
| Satimage | 9.4 ± 1.0 | 8.8 ± 0.9 | 8.0 ± 0.9 | 12.1 ± 1.6 | 9.6 ± 1.0 | 9.8 ± 0.8 |
| Vehicle | 30.9 ± 2.4 | 28.3 ± 0.9 | 27.0 ± 2.4 | 32.6 ± 4.9 | 28.8 ± 1.6 | 22.6 ± 2.1 |
| Vowel | 2.0 ± 1.2 | 8.5 ± 2.0 | 2.7 ± 1.5 | 2.6 ± 1.0 | 7.7 ± 2.4 | 3.6 ± 1.4 |
| Wdbc | 4.2 ± 1.4 | 3.5 ± 0.6 | 2.3 ± 0.8 | 6.1 ± 1.4 | 5.4 ± 1.4 | 5.8 ± 2.5 |
| Wine | 4.5 ± 1.5 | 2.2 ± 2.3 | 3.9 ± 3.2 | 6.2 ± 3.7 | 1.7 ± 1.5 | 2.8 ± 2.8 |
| Wpbc | 31.4 ± 3.8 | 23.7 ± 5.7 | 22.7 ± 4.1 | 37.1 ± 5.1 | 23.7 ± 5.7 | 26.3 ± 5.0 |

## 5. Experiments on Learning with Similarities

To learn with a similarity function *s*(*x*′, *x*″) rather than a dissimilarity, we need to change only the base classifier in Figure 2 accordingly. We run experiments in the similarity setting with a music classification task. The goal is to evaluate the algorithms with a practical similarity measure. The data consist of 50 Japanese pop songs and 50 Japanese traditional songs (used in a fivefold cross-validation fashion). Each song is represented in the MIDI format, which contains sequences of musical notes, cues, tones, volumes, and so on. We extract the main melody from each song and convert it to a string, where each character corresponds to a sixteenth note or a sixteenth rest.

The similarity measure we use for a pair of songs is the length of their longest common subsequence (LCS). The LCS of strings *s* and *t* is the longest (possibly nonconsecutive) sequence of characters that appears in both *s* and *t*. For example, given the strings *s* = *aabbcc* and *t* = *abca*, the LCS of *s* and *t* is *abc*. The LCS length is a natural measure of the similarity between two melodies, since similar songs tend to share longer subsequences.
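The LCS length can be computed by the standard dynamic program; a minimal sketch (the function name is ours):

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of strings s and t,
    via the standard O(|s| * |t|) dynamic program: dp[i][j] is the
    LCS length of the prefixes s[:i] and t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# The example from the text: the LCS of "aabbcc" and "abca" is "abc", of length 3.
```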

The LCS length, however, is not a positive semidefinite similarity. Consider, for example, the four strings {*a*, *b*, *ab*, *ba*}. It is easy to check that the LCS-based similarity matrix is

|  | *a* | *b* | *ab* | *ba* |
|---|---|---|---|---|
| *a* | 1 | 0 | 1 | 1 |
| *b* | 0 | 1 | 1 | 1 |
| *ab* | 1 | 1 | 2 | 1 |
| *ba* | 1 | 1 | 1 | 2 |

which has a negative eigenvalue (its determinant is −1, and a symmetric matrix with negative determinant must have an odd number of negative eigenvalues).
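The negative eigenvalue can be checked mechanically. The sketch below hardcodes the LCS similarity matrix for the four strings *a*, *b*, *ab*, *ba* and computes its determinant with exact rational arithmetic; the helper `det` is ours:

```python
from fractions import Fraction

# LCS-length similarity matrix for the strings a, b, ab, ba (in that order):
# e.g. LCS(a, b) = "" (length 0), LCS(ab, ba) = "a" or "b" (length 1),
# LCS(ab, ab) = "ab" (length 2).
S = [[1, 0, 1, 1],
     [0, 1, 1, 1],
     [1, 1, 2, 1],
     [1, 1, 1, 2]]

def det(matrix):
    """Exact determinant via Gaussian elimination over the rationals."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n, sign, d = len(a), 1, Fraction(1)
    for col in range(n):
        # Find a nonzero pivot in this column (swap rows if needed).
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign
        d *= a[col][col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
    return sign * d

# det(S) is -1, so S (being symmetric) has at least one negative eigenvalue.
```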

We evaluate four algorithms in this experiment. Besides NN, LSVM, and DBoost, we also consider an ordinary SVM with this nonpositive similarity kernel. We adopt a common approach, which adds a positive constant to the diagonal of the similarity (kernel) matrix to make it positive definite. Figure 7 shows the performance of the four algorithms. These results demonstrate that DBoost is promising for this simple and intuitive similarity measure.
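The diagonal-shift trick can be sketched as follows on the indefinite LCS matrix from the text. Positive definiteness is checked with Sylvester's criterion (all leading principal minors positive); the shift amount λ = 2 is an assumption for illustration, not a value from the paper:

```python
from fractions import Fraction

def det(matrix):
    """Exact determinant via Gaussian elimination over the rationals."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n, sign, d = len(a), 1, Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign
        d *= a[col][col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
    return sign * d

def is_positive_definite(m):
    """Sylvester's criterion: a symmetric matrix is positive definite
    iff every leading principal minor is positive."""
    return all(det([row[:k] for row in m[:k]]) > 0
               for k in range(1, len(m) + 1))

def shift_diagonal(m, lam):
    """Add lam to every diagonal entry: S -> S + lam * I,
    which shifts every eigenvalue up by lam."""
    return [[v + (lam if i == j else 0) for j, v in enumerate(row)]
            for i, row in enumerate(m)]

# The indefinite LCS similarity matrix from the text:
S = [[1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 2, 1], [1, 1, 1, 2]]
# S is not positive definite, but S + 2I is.
```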

## 6. Conclusion

In this work, we gave sufficient conditions for dissimilarity functions to allow accurate learning. We showed that if most examples are more likely to be close to a randomly selected example of the same class than to a random example of the other class, a simple algorithm can learn well with the dissimilarity measure. The random selection of the examples may follow arbitrary probability distributions satisfying a mild condition, so the sufficient condition captures a large class of dissimilarity functions. We also developed a more practical algorithm, named DBoost, under this theoretical guidance. DBoost learns a large-margin convex combination of a set of base classifiers, each of which depends only on the dissimilarities to a pair of examples. Experimental results demonstrate that DBoost performs well with several dissimilarity measures that are widely used in practice.

## Acknowledgments

We thank Masayuki Takeda for kindly providing the Japanese song data, and Kazuhito Hagio for preprocessing it. This work was supported by NSFC (60775005, 60635030) and the Global COE Program of the Tokyo Institute of Technology.

## Notes

In equation 2.1, γ/2 is used so that γ lies in [0, 1].

In , we hide the negligible logarithmic terms in the learning parameters.

For example: for some small δ.