Abstract

We propose a new formulation of multiple-instance learning (MIL), in which a unit of data consists of a set of instances called a bag. The goal is to find a good classifier of bags based on the similarity with a “shapelet” (or pattern), where the similarity of a bag with a shapelet is the maximum similarity between the shapelet and the instances in the bag. In previous work, some of the training instances have been chosen as shapelets with no theoretical justification. In our formulation, we use all possible, and thus infinitely many, shapelets, resulting in a richer class of classifiers. We show that the formulation is tractable, that is, it can be reduced through linear programming boosting (LPBoost) to difference of convex (DC) programs of finite (actually polynomial) size. Our theoretical result also gives justification to the heuristics of some previous work. The time complexity of the proposed algorithm highly depends on the size of the set of all instances in the training sample. To apply the algorithm to data containing a large number of instances, we also propose a heuristic variant that preserves the theoretical guarantee. Our empirical study demonstrates that our algorithm uniformly works for shapelet learning tasks on time-series classification and various MIL tasks with comparable accuracy to the existing methods. Moreover, we show that the proposed heuristics allow us to achieve results in reasonable computational time.

1  Introduction

Multiple-instance learning (MIL) is a fundamental framework of supervised learning with a wide range of applications, such as prediction of molecular activity and image classification. It has been extensively studied in both theoretical and applied work (Gärtner et al., 2002; Andrews, Tsochantaridis, & Hofmann, 2003; Sabato & Tishby, 2012; Zhang, He, Si, & Lawrence, 2013; Doran & Ray, 2014; Carbonneau, Cheplygina, Granger, & Gagnon, 2018), since the notion of MIL was first proposed by Dietterich, Lathrop, and Lozano-Pérez (1997).

A standard MIL setting is described as follows: A learner receives sets $B_1, B_2, \ldots, B_m$ called bags; each contains multiple instances. In the training phase, each bag is labeled, but instances are not labeled individually. The goal of the learner is to obtain a hypothesis that predicts the labels of unseen bags correctly.1 One of the most common hypotheses used in practice has the following form,
$$h_u(B) = \max_{x \in B} \langle u, \Phi(x) \rangle,$$
(1.1)
where $\Phi$ is a feature map and $u$ is a feature vector that we call a shapelet. In many applications, $u$ is interpreted as a particular pattern in the feature space and the inner product as the similarity between $\Phi(x)$ and $u$. Note that we use the term shapelet following the terminology of shapelet learning (SL), which is a framework for time-series classification, although it is often called a concept in the MIL literature. Intuitively, this hypothesis evaluates a given bag by the maximum similarity between the instances in the bag and the shapelet $u$. The multiple-instance support vector machine (MI-SVM), proposed by Andrews et al. (2003), is a widely used algorithm that employs this hypothesis class and learns $u$. It is well known that MIL algorithms using this hypothesis class perform well empirically on various multiple-instance data sets. Moreover, a generalization error bound for this hypothesis class is given by Sabato and Tishby (2012).
However, in some domains, such as image recognition and document classification, it is said that the hypothesis class 1.1 is not effective (see, e.g., Chen, Bi, & Wang, 2006). To employ MIL in such domains more effectively, Chen et al. (2006) extend the hypothesis to a convex combination of $h_u$,
$$g(B) = \sum_{u \in U} w_u \max_{x \in B} \langle u, \Phi(x) \rangle,$$
(1.2)
for some set $U$ of shapelets. In particular, Chen et al. (2006) consider $U_{\mathrm{train}} = \{\Phi(z) \mid z \in \bigcup_{i=1}^m B_i\}$, which is constructed from all instances in the training sample. They demonstrate that this hypothesis with the gaussian kernel performs well in image recognition. The generalization bound provided by Sabato and Tishby (2012) is applicable to a hypothesis class of the form 1.2 for a set $U$ of infinitely many shapelets $u$ with bounded norm. Therefore, the generalization bound also holds for $U_{\mathrm{train}}$. However, it has never been theoretically discussed why such a fixed set $U_{\mathrm{train}}$ using training instances effectively works in MIL tasks.
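To make the form of equation 1.2 concrete for a finite shapelet set such as $U_{\mathrm{train}}$, the following minimal sketch evaluates such a hypothesis with a gaussian kernel as the similarity measure; the function names and the choice of kernel are ours and are not part of the cited methods.

```python
import numpy as np

def gaussian_kernel(z, x, gamma=1.0):
    """K(z, x) = exp(-gamma * ||z - x||^2), one common choice of similarity."""
    return float(np.exp(-gamma * np.sum((np.asarray(z) - np.asarray(x)) ** 2)))

def finite_shapelet_hypothesis(bag, shapelets, weights, gamma=1.0):
    """Evaluate g(B) = sum_u w_u * max_{x in B} K(u, x) for a finite shapelet set,
    e.g., U_train = all instances appearing in the training bags."""
    return sum(w * max(gaussian_kernel(u, x, gamma) for x in bag)
               for u, w in zip(shapelets, weights))
```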

1.1  Our Contributions

In this letter, we propose an MIL formulation with the hypothesis class 1.2 for sets U of infinitely many shapelets.

The proposed learning framework is theoretically motivated and practically effective. We prove a generalization error bound based on the Rademacher complexity (Bartlett & Mendelson, 2003) and large margin theory. The result indicates that we can achieve a small generalization error by keeping a large margin for a large training sample.

The learning framework can be applied to various kinds of data and tasks because of our unified formulation. Existing shapelet-based methods are formulated for their target domains; more precisely, they are formulated using a fixed similarity measure (or distance), and their generalization ability is shown only empirically in their target domains. For example, Chen et al. (2006) and Sangnier, Gauthier, and Rakotomamonjy (2016) calculated the feature vectors based on the similarity between every instance using the gaussian kernel. In the time-series domain, shapelet-based methods (Ye & Keogh, 2009; Keogh & Rakthanmanon, 2013; Hills et al., 2014) usually use the Euclidean distance as a similarity measure. By contrast, our framework employs a kernel function as a similarity measure. Therefore, our learning framework can be uniformly applied whenever we can set a kernel function as a similarity measure suited to the target learning task—for example, the gaussian kernel (which behaves like the Euclidean distance) or the dynamic time warping (DTW) kernel (Shimodaira, Noma, Nakai, & Sagayama, 2001). Our framework can also be applied to non-real-valued sequence data (e.g., text and discrete signals) using a string kernel. Moreover, the generalization performance is guaranteed theoretically. The experimental results demonstrate that the approach uniformly works for SL and MIL tasks without introducing domain-specific parameters and heuristics, and it is comparable with the state-of-the-art shapelet-based methods.

We show that the formulation is tractable. The algorithm is based on linear programming boosting (LPBoost; Demiriz, Bennett, & Shawe-Taylor, 2002), which solves the soft margin optimization problem via a column generation approach. Although the weak learning problem in the boosting becomes an optimization problem over an infinite-dimensional space, we show that an analog of the representer theorem holds for it and allows us to reduce it to a nonconvex optimization problem (a difference of convex program) over a finite-dimensional space. While it is difficult to solve the subproblems exactly because of nonconvexity, it is possible to find good approximate solutions within reasonable time in many practical cases (Le Thi & Pham Dinh, 2018).

Remarkably, our theoretical result gives justification to the heuristics of choosing the shapelets from the training instances. Our representer theorem indicates that at the $t$th iteration of boosting, the optimal solution $u_t$ (i.e., shapelet) of the weak learning problem can be written as a linear combination of the feature maps of training instances, that is, $u_t = \sum_{z \in \bigcup_{i=1}^m B_i} \alpha_{t,z} \Phi(z)$. Thus, we obtain a final classifier of the following form:
$$g(B) = \sum_{t=1}^{T} w_t \max_{x \in B} \langle u_t, \Phi(x) \rangle = \sum_{t=1}^{T} w_t \max_{x \in B} \left\langle \sum_{z \in \bigcup_{i=1}^m B_i} \alpha_{t,z} \Phi(z), \Phi(x) \right\rangle.$$
Note that the hypothesis class used in the standard approach (Chen et al., 2006; Sangnier et al., 2016) corresponds to the special case where $u_t \in U_{\mathrm{train}} = \{\Phi(z) \mid z \in \bigcup_{i=1}^m B_i\}$. This observation would suggest that the standard approach of using $U_{\mathrm{train}}$ is reasonable.

1.2  Comparison to Related Work for MIL

There are many MIL algorithms with hypothesis classes that are different from equations 1.1 or 1.2 (e.g., Auer & Ortner, 2004; Gärtner et al., 2002; Andrews & Hofmann, 2004; Zhang, Platt, & Viola, 2006; Chen et al., 2006). These algorithms adopt bag-labeling hypotheses that differ from shapelet-based hypothesis classes (e.g., Zhang et al., 2006, used a noisy-OR-based hypothesis, and Gärtner et al., 2002, proposed a new kernel called a set kernel). Shapelet-based hypothesis classes have the practical advantage of being applicable to SL in the time-series domain (see section 1.3).

Sabato and Tishby (2012) proved generalization bounds of hypothesis classes for MIL including those of equations 1.1 and 1.2 with infinitely large sets $U$. The generalization bound we provide in this letter is incomparable to the bound provided by Sabato and Tishby. When some data-dependent parameter is regarded as a constant, our bound is slightly better in terms of the sample size $m$ by a factor of $O(\log m)$. They also proved the PAC learnability of the class 1.1 using the boosting approach under some technical assumptions. Their boosting approach differs from our work in that they assume that labels are consistent with some hypothesis of the form 1.1, while we consider arbitrary distributions over bags and labels.

1.3  Connection between MIL and Shapelet Learning for Time-Series Classification

Here we mention briefly that MIL with type 1.2 hypotheses is closely related to SL, a framework for time-series classification that has been extensively studied (Ye & Keogh, 2009; Keogh & Rakthanmanon, 2013; Hills, Lines, Baranauskas, Mapp, & Bagnall, 2014; Grabocka, Schilling, Wistuba, & Schmidt-Thieme, 2014) in parallel to MIL. SL is a notion of learning with a feature extraction method, defined by a finite set $M \subset \mathbb{R}^{\ell}$ of real-valued “short” sequences called shapelets. A similarity measure is given by $K : \mathbb{R}^{\ell} \times \mathbb{R}^{\ell} \to \mathbb{R}$ (not necessarily a Mercer kernel) in the following way. A time series $\tau = (\tau[1], \ldots, \tau[L]) \in \mathbb{R}^L$ can be identified with a bag $B_\tau = \{(\tau[j], \ldots, \tau[j+\ell-1]) \mid 1 \le j \le L - \ell + 1\}$ consisting of all sub-sequences of $\tau$ of length $\ell$. The feature of $\tau$ is the vector $(\max_{x \in B_\tau} K(z, x))_{z \in M}$ of a fixed dimension $|M|$, regardless of the length $L$ of the time series $\tau$. When we employ a linear classifier on top of the features, we obtain a hypothesis in the form
$$g(\tau) = \sum_{z \in M} w_z \max_{x \in B_\tau} K(z, x),$$
(1.3)
which is essentially the same form as equation 1.2, except that finding good shapelets $M$ is a part of the learning task, as well as finding a good weight vector $w$. This approach is one of the most successful ones for SL (Hills et al., 2014; Grabocka et al., 2014, 2015; Renard, Rifqi, Erray, & Detyniecki, 2015; Hou, Kwok, & Zurada, 2016), where a typical choice of $K$ is $K(z, x) = -\|z - x\|_2$. However, almost all existing methods choose the shapelets $M$ heuristically, with no theoretical guarantee on how good the choice of $M$ is.

Note also that in the SL framework, each $z \in M$ is called a shapelet, while in this letter, we assume that $K$ is a kernel $K(z, x) = \langle \Phi(z), \Phi(x) \rangle$ and any $u$ (not necessarily $\Phi(z)$ for some $z$) in the Hilbert space is called a shapelet.

Sangnier et al. (2016) proposed an MIL-based anomaly detection algorithm for time-series data. They showed an algorithm based on LPBoost and a generalization error bound based on the Rademacher complexity (Bartlett & Mendelson, 2003). Their hypothesis class is the same as that of Chen et al. (2006). However, they did not give a theoretical justification for using the finite set $U$ constructed from training instances (they mentioned it as future work). By contrast, we consider a hypothesis class based on infinitely many shapelets, and our representer theorem guarantees that our learning problem over the infinitely large set is still tractable. As a result, our study justifies the previous heuristics of their approach.

There is another work that treats shapelets not appearing in the training set. The learning time-series shapelets (LTS) algorithm (Grabocka et al., 2014) tries to solve a nonconvex optimization problem of learning effective shapelets in an infinitely large domain. However, there is no theoretical guarantee of its generalization error. In fact, our generalization error bound applies to their hypothesis class.

For SL tasks, many researchers focus on improving efficiency (Keogh & Rakthanmanon, 2013; Renard et al., 2015; Grabocka, Wistuba, & Schmidt-Thieme, 2015; Wistuba, Grabocka, & Schmidt-Thieme, 2015; Hou et al., 2016; Karlsson, Papapetrou, & Boström, 2016). However, these methods are specialized in the time-series domain, and the generalization performance has never been theoretically discussed.

Curiously, although MIL and SL share similar motivations and hypotheses, the relationship between them has not yet been pointed out. From the shapelet perspective in MIL, hypothesis 1.1 is regarded as a “single shapelet”–based hypothesis, and hypothesis 1.2 is regarded as a “multiple-shapelets”–based hypothesis. In this study, we refer to a linear combination of maximum similarities based on shapelets, such as equations 1.2 and 1.3, as shapelet-based classifiers.

2  Preliminaries

Let $\mathcal{X}$ be an instance space. A bag $B$ is a finite set of instances chosen from $\mathcal{X}$. The learner receives a sequence of labeled bags $S = ((B_1, y_1), \ldots, (B_m, y_m)) \in (2^{\mathcal{X}} \times \{-1, 1\})^m$ called a sample, where each labeled bag is independently drawn according to some unknown distribution $D$ over $2^{\mathcal{X}} \times \{-1, 1\}$. Let $P_S$ denote the set of all instances that appear in the sample $S$, that is, $P_S = \bigcup_{i=1}^m B_i$. Let $K$ be a kernel over $\mathcal{X}$, which is used to measure the similarity between instances, and let $\Phi : \mathcal{X} \to \mathbb{H}$ denote a feature map associated with the kernel $K$ for a Hilbert space $\mathbb{H}$, that is, $K(z, z') = \langle \Phi(z), \Phi(z') \rangle$ for instances $z, z' \in \mathcal{X}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product over $\mathbb{H}$. The norm induced by the inner product is denoted by $\|\cdot\|_{\mathbb{H}}$, defined as $\|u\|_{\mathbb{H}} = \sqrt{\langle u, u \rangle}$ for $u \in \mathbb{H}$.

For each $u \in \mathbb{H}$, which we call a shapelet, we define a shapelet-based classifier, denoted by $h_u$, as the function that maps a given bag $B$ to the maximum of the similarity scores between the shapelet $u$ and $\Phi(x)$ over all instances $x$ in $B$. More specifically,
$$h_u(B) = \max_{x \in B} \langle u, \Phi(x) \rangle.$$
For a set $U \subseteq \mathbb{H}$, we define the class of shapelet-based classifiers as
$$H_U = \{h_u \mid u \in U\},$$
and let $\mathrm{conv}(H_U)$ denote the set of convex combinations of shapelet-based classifiers in $H_U$. More precisely,
$$\mathrm{conv}(H_U) = \left\{ \int_{u \in U} w_u h_u \, du \;\middle|\; w_u \text{ is a density over } U \right\} = \left\{ \sum_{u \in U'} w_u h_u \;\middle|\; w_u \ge 0,\ \sum_{u \in U'} w_u = 1,\ U' \subseteq U \text{ is a finite support} \right\}.$$
(2.1)
The goal of the learner is to find a hypothesis $g \in \mathrm{conv}(H_U)$, so that its generalization error $\mathcal{E}_D(g) = \Pr_{(B,y) \sim D}[\mathrm{sign}(g(B)) \ne y]$ is small. Note that since the final hypothesis $\mathrm{sign} \circ g$ is invariant to any scaling of $g$, we assume without loss of generality that
$$U = \{u \in \mathbb{H} \mid \|u\|_{\mathbb{H}} \le 1\}.$$
Let $\mathcal{E}_\rho(g)$ denote the empirical margin loss of $g$ over $S$, that is, $\mathcal{E}_\rho(g) = |\{i \mid y_i g(B_i) < \rho\}| / m$.

3  Optimization Problem Formulation

In this letter, we formulate the problem as soft margin maximization with 1-norm regularization, which ensures a generalization bound for the final hypothesis (see, e.g., Demiriz et al., 2002). Specifically, the problem is formulated as a linear programming problem (over infinitely many variables) as follows:
$$
\begin{aligned}
\max_{\rho, w, \xi} \quad & \rho - \frac{1}{\nu m} \sum_{i=1}^m \xi_i \\
\text{sub. to} \quad & \int_{u \in U} y_i w_u h_u(B_i) \, du \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i \in [m], \\
& \int_{u \in U} w_u \, du = 1, \quad w_u \ge 0, \quad \rho \in \mathbb{R},
\end{aligned}
$$
(3.1)
where $\nu \in [0, 1]$ is a parameter. To avoid the integral over the Hilbert space, it is convenient to consider the dual form:
$$
\begin{aligned}
\min_{\gamma, d} \quad & \gamma \\
\text{sub. to} \quad & \sum_{i=1}^m y_i d_i h_u(B_i) \le \gamma, \quad \forall u \in U, \\
& 0 \le d_i \le 1/(\nu m), \quad i \in [m], \quad \sum_{i=1}^m d_i = 1, \quad \gamma \in \mathbb{R}.
\end{aligned}
$$
(3.2)
The dual problem is categorized as a semi-infinite program because it contains infinitely many constraints. Note that the duality gap is zero because problem 3.2 is linear and the optimum is finite (see theorem 2.2 of Shapiro, 2009). We employ column generation to solve the dual problem: solve equation 3.2 for a finite subset $U' \subseteq U$, find the $u$ whose corresponding constraint is maximally violated by the current solution (the column generation part), and repeat the procedure with $U' = U' \cup \{u\}$ until a certain stopping criterion is met. In particular, we use LPBoost (Demiriz et al., 2002), a well-known and practically fast algorithm of column generation. Since the solution $w$ is expected to be sparse due to the 1-norm regularization, the number of iterations is expected to be small.
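The column generation loop can be sketched as follows; this is a minimal illustration rather than the paper's implementation, and `solve_restricted_dual` and `weak_learn` are hypothetical placeholders for an LP solver applied to equation 3.2 restricted to the chosen columns and for the weak learner described next.

```python
def lpboost_column_generation(sample, weak_learn, solve_restricted_dual,
                              eps=1e-4, max_iter=200):
    """Sketch of LPBoost-style column generation for the dual problem 3.2.

    sample:  list of (bag, label) pairs with labels in {-1, +1}
    weak_learn(d): returns a hypothesis h (a bag classifier) approximately
        maximizing sum_i y_i d_i h(B_i), i.e., the most violated constraint
    solve_restricted_dual(hyps): solves 3.2 restricted to the columns `hyps`
        and returns (d, gamma, w), where w are the primal convex weights
    """
    m = len(sample)
    d = [1.0 / m] * m                  # initial distribution over bags
    gamma = float("-inf")
    hypotheses, w = [], []
    for _ in range(max_iter):
        h = weak_learn(d)              # column generation (weak learning)
        edge = sum(di * yi * h(Bi) for di, (Bi, yi) in zip(d, sample))
        if edge <= gamma + eps:        # no constraint is violated enough: stop
            break
        hypotheses.append(h)
        d, gamma, w = solve_restricted_dual(hypotheses)
    return hypotheses, w
```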
Following the boosting terminology, we refer to the column generation part as weak learning. In our case, weak learning is formulated as the following optimization problem:
$$\max_{u \in \mathbb{H}} \sum_{i=1}^m y_i d_i \max_{x \in B_i} \langle u, \Phi(x) \rangle \quad \text{sub. to} \quad \|u\|_{\mathbb{H}}^2 \le 1.$$
(3.3)
Thus, we need to design a weak learner for solving equation 3.3 for a given sample weighted by $d$. However, it seems to be impossible to solve it directly because we have access to $U$ only through the associated kernel. Fortunately, we prove a version of the representer theorem given below, which makes equation 3.3 tractable.
Theorem 1

(Representer Theorem). The solution $u^*$ of equation 3.3 can be written as $u^* = \sum_{z \in P_S} \alpha_z \Phi(z)$ for some real numbers $\alpha_z$.

Our theorem can be derived from a nontrivial application of the standard representer theorem (see, e.g., Mohri, Rostamizadeh, & Talwalkar, 2012). Intuitively, we prove the theorem by decomposing the optimization problem 3.3 into a number of subproblems, so that the standard representer theorem can be applied to each of the subproblems. The details of the proof are given in appendix A.

This result gives justification to the simple heuristics in the standard approach: choosing the shapelets based on the training instances. More precisely, the hypothesis class used in the standard approach (Chen et al., 2006; Sangnier et al., 2016) corresponds to the special case where $u \in U_{\mathrm{train}} = \{\Phi(z) \mid z \in P_S\}$. Thus, our representer theorem would suggest that the standard approach of using $U_{\mathrm{train}}$ is reasonable.

Theorem 1 says that the weak learning problem can be rewritten in the following tractable form:

  • OP1. Weak Learning Problem
    $$\min_{\alpha} \; -\sum_{i=1}^m d_i y_i \max_{x \in B_i} \sum_{z \in P_S} \alpha_z K(z, x) \quad \text{sub. to} \quad \sum_{z \in P_S} \sum_{v \in P_S} \alpha_z \alpha_v K(z, v) \le 1.$$

Unlike the primal solution $w$, the dual solution $\alpha$ is not expected to be sparse. To obtain a more interpretable hypothesis, we propose another formulation of weak learning where 1-norm regularization is imposed on $\alpha$, so that a sparse solution $\alpha$ will be obtained. In other words, instead of $U$, we consider the feasible set $\hat{U} = \{\sum_{z \in P_S} \alpha_z \Phi(z) : \|\alpha\|_1 \le 1\}$, where $\|\alpha\|_1$ is the 1-norm of $\alpha$.

  • OP2. Sparse Weak Learning Problem
    $$\min_{\alpha} \; -\sum_{i=1}^m d_i y_i \max_{x \in B_i} \sum_{z \in P_S} \alpha_z K(z, x) \quad \text{sub. to} \quad \|\alpha\|_1 \le 1.$$

Note that when running LPBoost with a weak learner for OP 2, we obtain a final hypothesis that has the same form of generalization bound as the one stated in theorem 2 for a final hypothesis obtained with a weak learner for OP 1. To see this, consider a feasible space $\hat{U}_\Lambda = \{\sum_{z \in P_S} \alpha_z \Phi(z) : \|\alpha\|_1 \le \Lambda\}$ for a sufficiently small $\Lambda > 0$, so that $\hat{U}_\Lambda \subseteq U$. Then, since $H_{\hat{U}_\Lambda} \subseteq H_U$, a generalization bound for $H_U$ also applies to $H_{\hat{U}_\Lambda}$. On the other hand, since the final hypothesis $\mathrm{sign} \circ g$ for $g \in \mathrm{conv}(H_{\hat{U}_\Lambda})$ is invariant to the scaling factor $\Lambda$, the generalization ability is independent of $\Lambda$.


4  Algorithms

In this section, we present the pseudocode of LPBoost in algorithm 1 for completeness. Moreover, we describe our algorithms for the weak learners. For simplicity, we denote by $k_x \in \mathbb{R}^{P_S}$ the vector given by $k_{x,z} = K(z, x)$ for every $z \in P_S$. The objective function of OP 1 (and OP 2) is rewritten as
$$\sum_{i : y_i = -1} d_i \max_{x \in B_i} k_x^{\top} \alpha - \sum_{i : y_i = 1} d_i \max_{x \in B_i} k_x^{\top} \alpha,$$
which can be seen as a difference $F - G$ of two convex functions $F$ and $G$ of $\alpha$. Therefore, the weak learning problems are DC programs, and thus we can use the DC algorithm (Tao & Souad, 1988; Yu & Joachims, 2009) to find an $\varepsilon$-approximation of a local optimum. We employ a standard DC algorithm. That is, for each iteration $t$, we linearize the concave term $-G$ by replacing $G$ with $\nabla G(\alpha_t)^{\top} \alpha$ at the current solution $\alpha_t$, which is $\sum_{i : y_i = 1} d_i k_{x_i^*}^{\top} \alpha$ with $x_i^* = \arg\max_{x \in B_i} k_x^{\top} \alpha_t$ in our case, and then update the solution to $\alpha_{t+1}$ by solving the resultant convex optimization problem $\mathrm{OP}'_t$.
In addition, the problems $\mathrm{OP}'_t$ for OP 1 and OP 2 can be reformulated as a second-order cone programming (SOCP) problem and an LP problem, respectively, and thus both problems can be solved efficiently. To this end, we introduce new variables $\lambda_i$ for all negative bags $B_i$ with $y_i = -1$, which represent the factors $\max_{x \in B_i} k_x^{\top} \alpha$. Then we obtain the following problem equivalent to $\mathrm{OP}'_t$ for OP 1:
$$
\begin{aligned}
\min_{\alpha, \lambda} \quad & \sum_{i : y_i = -1} d_i \lambda_i - \sum_{i : y_i = 1} d_i \max_{x \in B_i} k_x^{\top} \alpha \\
\text{sub. to} \quad & k_x^{\top} \alpha \le \lambda_i \quad (i : y_i = -1,\ x \in B_i), \\
& \sum_{z \in P_S} \sum_{v \in P_S} \alpha_z \alpha_v K(z, v) \le 1.
\end{aligned}
$$
(4.1)
It is well known that this is an SOCP problem. Moreover, it is clear that OPt' for OP 2 can be formulated as an LP problem. We describe the algorithm for OP 1 in algorithm 2.
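As an illustration of the DC iteration just described, the following sketch alternates between fixing the maximizers $x_i^*$ of the positive bags (the linearization step) and solving the resulting convex subproblem; `solve_convexified_subproblem` is a hypothetical placeholder for the SOCP (OP 1) or LP (OP 2) solver and is not the paper's algorithm 2 itself.

```python
import numpy as np

def dc_weak_learner(K_bags, y, d, solve_convexified_subproblem,
                    max_iter=50, tol=1e-6):
    """Sketch of the DC procedure for the weak learning problem (OP 1 / OP 2).

    K_bags[i]: (|B_i| x |P_S|) matrix whose rows are the vectors k_x for x in B_i
    y[i]:      bag label in {-1, +1};  d[i]: current LPBoost weight of bag i
    solve_convexified_subproblem(grad): minimizes
        sum_{i: y_i=-1} d_i max_{x in B_i} k_x^T alpha  -  grad^T alpha
    over the feasible set (an SOCP for OP 1, an LP for OP 2) and returns alpha.
    """
    n = K_bags[0].shape[1]
    alpha = np.zeros(n)      # could instead use the one-hot initialization of section 6.4.2
    prev_obj = np.inf
    for _ in range(max_iter):
        # Linearization step: gradient of the concave part at the current alpha,
        # i.e., the sum over positive bags of d_i * k_{x_i^*} with x_i^* the maximizer.
        grad = np.zeros(n)
        for Ki, yi, di in zip(K_bags, y, d):
            if yi == 1:
                grad += di * Ki[int(np.argmax(Ki @ alpha))]
        alpha = solve_convexified_subproblem(grad)
        # DC objective F(alpha) - G(alpha) for the stopping test.
        obj = sum(di * np.max(Ki @ alpha) * (1.0 if yi == -1 else -1.0)
                  for Ki, yi, di in zip(K_bags, y, d))
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return alpha
```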

One might be concerned that the kernel matrix could become large when a sample consists of a large number of bags and instances. However, note that the kernel matrix of $K(z, x)$, which is used in algorithm 2, needs to be computed only once at the beginning of algorithm 1, not at every iteration.

As a result, our learning algorithm outputs a classifier,
$$g(B) = \mathrm{sign}\left( \sum_{t=1}^{T} w_t \max_{x \in B} \sum_{z \in P_S} \alpha_{t,z} K(z, x) \right),$$
where $w_t$ and $\alpha_t$ are obtained in the training phase. Therefore, the computational cost of predicting the label of $B$ is $O(T |P_S| |B|)$ in the worst case, when all elements $\alpha_{t,z}$ are nonzero. However, when we employ our sparse formulation OP 2, which allows us to find a sparse $\alpha$, the computational cost is expected to be much smaller than in the worst case.
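As a concrete illustration of this prediction rule, the following sketch evaluates the final classifier for one bag; the function and argument names are ours, and `kernel` stands for any kernel $K$ chosen as the similarity measure.

```python
import numpy as np

def predict(bag, prototypes, alphas, weights, kernel):
    """Evaluate g(B) = sign( sum_t w_t max_{x in B} sum_z alpha_{t,z} K(z, x) ).

    bag:        list of instance vectors (the bag B)
    prototypes: list of the instances z in P_S (or the reduced set of section 6.4.1)
    alphas:     one weight vector alpha_t over the prototypes per boosting round
    weights:    LPBoost weights w_t
    kernel:     similarity function K(z, x)
    """
    # K[z_index, x_index] = K(z, x) for all prototypes z and instances x in the bag.
    K = np.array([[kernel(z, x) for x in bag] for z in prototypes])
    score = 0.0
    for w_t, alpha_t in zip(weights, alphas):
        score += w_t * float(np.max(np.asarray(alpha_t) @ K))   # max over x in B
    return np.sign(score)
```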

5  Generalization Bound of the Hypothesis Class

In this section, we provide a generalization bound of hypothesis classes conv(HU) for various U and K.

Let $\Phi(P_S) = \{\Phi(z) \mid z \in P_S\}$ and $\Phi_{\mathrm{diff}}(P_S) = \{\Phi(z) - \Phi(z') \mid z, z' \in P_S,\ z \ne z'\}$. By viewing each element $v \in \Phi_{\mathrm{diff}}(P_S)$ as a hyperplane $\{u \mid \langle v, u \rangle = 0\}$, we can naturally define a partition of the Hilbert space $\mathbb{H}$ by the set of all hyperplanes $v \in \Phi_{\mathrm{diff}}(P_S)$. Let $\mathcal{I}$ be the set of all cells of the partition, that is, $\mathcal{I} = \{I \mid I = \bigcap_{v \in V} \{u \mid \langle v, u \rangle > 0\},\ I \ne \emptyset,\ V \subseteq \Phi_{\mathrm{diff}}(P_S),\ v \in V \text{ or } -v \in V \text{ for all } v \in \Phi_{\mathrm{diff}}(P_S)\}$. Each cell $I \in \mathcal{I}$ is a polyhedron defined by a minimal set $V_I \subseteq \Phi_{\mathrm{diff}}(P_S)$ that satisfies $I = \bigcap_{v \in V_I} \{u \mid \langle u, v \rangle > 0\}$. Let
$$\mu^* = \min_{I \in \mathcal{I}} \max_{u \in I \cap U} \min_{v \in V_I} |\langle u, v \rangle|.$$
Let $d^*_{\Phi,S}$ be the VC dimension of the set of linear classifiers over the finite set $\Phi_{\mathrm{diff}}(P_S)$, given by $F_U = \{f : v \mapsto \mathrm{sign}(\langle u, v \rangle) \mid u \in U\}$.

Then we have the following generalization bound on the hypothesis class of equation 1.2:

Theorem 2.
Let $\Phi : \mathcal{X} \to \mathbb{H}$. Suppose that for any $z \in \mathcal{X}$, $\|\Phi(z)\|_{\mathbb{H}} \le R$. Then, for any $\rho > 0$, with high probability the following holds for any $g \in \mathrm{conv}(H_U)$ with $U \subseteq \{u \in \mathbb{H} \mid \|u\|_{\mathbb{H}} \le 1\}$:
$$\mathcal{E}_D(g) \le \mathcal{E}_\rho(g) + O\!\left( \frac{R \sqrt{d^*_{\Phi,S} \log |P_S|}}{\rho \sqrt{m}} \right),$$
(5.2)
where (i) for any $\Phi$, $d^*_{\Phi,S} = O((R/\mu^*)^2)$; (ii) if $\mathcal{X} \subseteq \mathbb{R}^{\ell}$ and $\Phi$ is the identity mapping (i.e., the associated kernel is the linear kernel), or (iii) if $\mathcal{X} \subseteq \mathbb{R}^{\ell}$ and $\Phi$ satisfies the condition that $\langle \Phi(z), \Phi(x) \rangle$ is monotone decreasing with respect to $\|z - x\|_2$ (e.g., the mapping defined by the gaussian kernel) and $U = \{\Phi(z) \mid z \in \mathbb{R}^{\ell}, \|\Phi(z)\|_{\mathbb{H}} \le 1\}$, then $d^*_{\Phi,S} = O(\min((R/\mu^*)^2, \ell))$.

We show the proof in appendix B.

5.1  Comparison with the Existing Bounds

A similar generalization bound can be derived from a known bound of the Rademacher complexity of HU (see theorem 20 of Sabato & Tishby, 2012) and a generalization bound of conv(H) for any hypothesis class H (see corollary 6.1 of Mohri et al., 2012):
$$\mathcal{E}_D(g) \le \mathcal{E}_\rho(g) + O\!\left( \frac{\sqrt{\log\left(\sum_{i=1}^m |B_i|\right) \log m}}{\rho \sqrt{m}} \right).$$
Note that Sabato and Tishby (2012) fixed $R = 1$. For simplicity, we omit some constants of theorem 20 of Sabato and Tishby (2012). Note that $|P_S| \le \sum_{i=1}^m |B_i|$ by definition. The bound above is incomparable to theorem 2 in general, as ours uses the parameter $d^*_{\Phi,S}$ and the other has the extra $\log\left(\sum_{i=1}^m |B_i|\right) \log(m)$ term. However, our bound is better in terms of the sample size $m$ by a factor of $O(\log m)$ when the other parameters are regarded as constants.

6  SL by MIL

6.1  Time-Series Classification with Shapelets

In the following, we introduce a framework for the time-series classification problem based on shapelets (i.e., the SL problem). As mentioned in the previous section, a time series $\tau = (\tau[1], \ldots, \tau[L]) \in \mathbb{R}^L$ can be identified with a bag $B_\tau = \{(\tau[j], \ldots, \tau[j+\ell-1]) \mid 1 \le j \le L - \ell + 1\}$ that consists of all subsequences of $\tau$ of length $\ell$. The learner receives a labeled sample $S = ((B_{\tau_1}, y_1), \ldots, (B_{\tau_m}, y_m)) \in (2^{\mathbb{R}^{\ell}} \times \{-1, 1\})^m$, where each labeled bag (i.e., labeled time series) is independently drawn according to some unknown distribution $D$ over a finite support of $2^{\mathbb{R}^{\ell}} \times \{-1, +1\}$. The goal of the learner is to predict the labels of unseen time series correctly. In this way, the SL problem can be viewed as an MIL problem, and thus we can apply our algorithms and theory.
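The identification of a time series with a bag of its subsequences is straightforward to implement; here is a minimal sketch (the function name is ours).

```python
def time_series_to_bag(tau, ell):
    """Identify a time series tau = (tau[1], ..., tau[L]) with the bag of all
    its length-ell subsequences, as described above."""
    L = len(tau)
    return [tuple(tau[j:j + ell]) for j in range(L - ell + 1)]

# For example, a length-5 series with ell = 3 yields a bag of 3 subsequences:
# time_series_to_bag([1, 2, 3, 4, 5], 3) == [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```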

Note that for time-series classification, various similarity measures can be represented by a kernel—for example, the gaussian kernel (which behaves like the Euclidean distance) and the dynamic time warping (DTW) kernel. Moreover, our framework can be applied generally to non-real-valued sequence data (e.g., text and discrete signals) using a string kernel.

6.2  Our Theory and Algorithms for SL

By theorem 2, we can immediately obtain the generalization bound of our hypothesis class in SL as follows:

Corollary 1.
Consider a time-series sample $S$ of size $m$ and length $L$. For any fixed $\ell < L$, the following generalization error bound holds for all $g \in \mathrm{conv}(H_U)$ in which the length of the shapelets is $\ell$:
$$\mathcal{E}_D(g) \le \mathcal{E}_\rho(g) + O\!\left( \frac{R \sqrt{d^*_{\Phi,S} \log(m(L - \ell + 1))}}{\rho \sqrt{m}} \right).$$

To the best of our knowledge, this is the first result on the generalization performance of SL.

Theorem 1 gives justification to the heuristics that choose the shapelets from the instances appearing in the training sample (i.e., the sub-sequences for SL tasks). Moreover, several methods using a linear combination of shapelet-based classifiers (e.g., Hills et al., 2014; Grabocka et al., 2014) are supported by corollary 1.

For time-series classification problems, shapelet-based classification has an advantage in interpretability or visibility over other time-series classification methods (see, e.g., Ye & Keogh, 2009). Although we use a nonlinear kernel function, we can observe important sub-sequences that contribute to effective shapelets by solving OP 2 because of the sparsity (see also the experimental results). Moreover, for unseen time-series data, we can observe the types of sub-sequences that contribute to the predicted class by observing the maximizer $x \in B$.

6.3  Learning Shapelets of Different Lengths

For time-series classification, many existing methods take advantage of using shapelets of various lengths. Below, we show that our formulation can be easily applied to this case.

A time series $\tau = (\tau[1], \ldots, \tau[L]) \in \mathbb{R}^L$ can be identified with a bag $B_\tau = \{(\tau[j], \ldots, \tau[j+\ell-1]) \mid 1 \le j \le L - \ell + 1,\ \ell \in Q\}$ that consists of all sub-sequences of $\tau$ of the lengths $\ell \in Q \subseteq \{1, \ldots, L\}$. That is, this is also a special case of MIL in which a bag contains instances of different dimensions.

There is a simple way to apply our learning algorithm to this case. We employ a kernel $K(z, x)$ that supports pairs of instances $z$ and $x$ of different dimensions. Fortunately, such kernels have been studied well in the time-series domain. For example, the DTW kernel and the global alignment kernel (Cuturi, 2011) are well-known kernels that support time series of different lengths. However, the size of the kernel matrix of $K(z, x)$ becomes $\left(m \sum_{\ell \in Q} (L - \ell + 1)\right)^2$. In practice, this requires a high memory cost for large time-series data. Moreover, in general, such kernels require a higher computational cost than standard kernels.

We introduce a practical way to learn shapelets of different lengths based on heuristics. We decompose the original weak learning problem over the data spaces of different dimensions into weak learning problems over each dimension. For example, we consider solving the following problem instead of the weak learning problem OP 1,
$$\min_{\ell \in Q} \min_{\alpha} \; -\sum_{i=1}^m d_i y_i \max_{x \in B_i^{\ell}} \sum_{z \in P_S^{\ell}} \alpha_z K(z, x), \quad \text{sub. to} \quad \sum_{z \in P_S^{\ell}} \sum_{v \in P_S^{\ell}} \alpha_z \alpha_v K(z, v) \le 1,$$
where $B_i^{\ell}$ denotes the $\ell$-dimensional instances (i.e., length-$\ell$ sub-sequences) in $B_i$, and $P_S^{\ell}$ denotes $\bigcup_{i=1}^m B_i^{\ell}$. The total size of the kernel matrices becomes $\sum_{\ell \in Q} \left(m(L - \ell + 1)\right)^2$, and thus this method does not require such a large kernel matrix. Moreover, in this way, we do not need to use a kernel that supports instances of different dimensions. Note that even with this heuristic, the obtained final hypothesis retains the theoretical generalization performance, because the hypothesis class is still represented in the form of equation 2.1. In our experiment, we use the latter method, giving weight to memory efficiency.
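A minimal sketch of this per-length decomposition is given below; `weak_learn_fixed_length` stands for any single-length weak learner (for example, the DC procedure sketched in section 4), and the function names are ours.

```python
def weak_learn_multiple_lengths(bags_by_length, y, d, weak_learn_fixed_length):
    """Solve one weak learning problem per subsequence length and keep the best.

    bags_by_length[ell][i]: bag of the length-ell subsequences of time series i
    weak_learn_fixed_length(bags, y, d): returns (hypothesis, edge), where the
        edge is sum_i y_i d_i h(B_i) of the returned hypothesis h
    """
    best_h, best_edge = None, float("-inf")
    for ell, bags in bags_by_length.items():
        h, edge = weak_learn_fixed_length(bags, y, d)
        if edge > best_edge:           # the largest edge over all lengths wins
            best_h, best_edge = h, edge
    return best_h
```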

6.4  Heuristics for Computational Efficiency

For the practical applications, we introduce some heuristics for improving efficiency in our algorithm.

6.4.1  Reduction of PS

Especially for time-series data, the size $|P_S|$ often becomes large because $|P_S| = O(mL)$. Therefore, constructing a kernel matrix of size $|P_S| \times |P_S|$ has a high computational cost for time-series data. For example, when we consider sub-sequences as instances for time-series classification, we have a large computational cost because of the number of sub-sequences of the training data (e.g., approximately $10^6$ when the sample size is 1000 and the length of each time series is 1000, which results in a similarity matrix of size $10^{12}$). However, in most cases, many sub-sequences in time-series data are similar to each other. Therefore, we use only representative instances $\hat{P}_S$ instead of the set of all instances $P_S$. In this letter, we use k-means clustering to reduce the size of $P_S$. Note that our heuristic approach is still supported by our theoretical generalization error bound. This is because the hypothesis set $H_{U'}$ with the reduced shapelets $U'$ is a subset of $H_U$, and the Rademacher complexity of $H_{U'}$ is at most the Rademacher complexity of $H_U$. Thus, theorem 2 holds for the hypothesis class considering the set $H_U$ of all possible shapelets $U$, and thus it also holds for the hypothesis class using the set $H_{U'}$ of reduced shapelets $U'$. Although this approach may decrease the training classification accuracy in practice, it drastically decreases the computational cost for a large data set.
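The reduction step can be sketched as follows; this variant clusters all instances at once with scikit-learn's k-means (the paper runs k-means with respect to each class), and the function name and parameters are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_instances(instances, k=100, seed=0):
    """Replace the full instance set P_S by k cluster centroids (section 6.4.1).

    instances: (n x dim) array of all instances appearing in the training bags
    Returns the representative set of k instances as a (k x dim) array.
    """
    X = np.asarray(instances, dtype=float)
    k = min(k, len(X))                       # cannot have more clusters than points
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_
```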

6.4.2  Initialization in Weak Learning Problem

The DC program may converge slowly to a local optimum depending on the initial solution. In algorithm 2, we fix the initial $\alpha_0$ as follows. More precisely, we initially solve
$$\alpha_0 = \arg\max_{\alpha} \sum_{i=1}^m d_i y_i \max_{x \in B_i} \sum_{z \in P_S} \alpha_z K(z, x), \quad \text{sub. to} \quad \alpha \text{ is a one-hot vector.}$$
(6.1)
That is, we choose the most discriminative shapelet from $P_S$ as the initial point of $u$ for the given $d$. We expect that this speeds up the convergence of the loop of line 3 and that the obtained classifier is better than those of the methods that choose effective shapelets from the sub-sequences.
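Because a one-hot $\alpha$ selects a single training instance, equation 6.1 amounts to scanning $P_S$ for the instance with the largest weighted edge; a minimal sketch (function and argument names are ours) follows.

```python
import numpy as np

def initialize_alpha(K_bags, y, d):
    """One-hot initialization of equation 6.1.

    K_bags[i]: (|B_i| x |P_S|) matrix of K(z, x) values for bag B_i
    Returns the one-hot vector alpha_0 selecting the z in P_S whose classifier
    h_z(B) = max_{x in B} K(z, x) has the largest edge sum_i d_i y_i h_z(B_i).
    """
    n = K_bags[0].shape[1]
    edges = np.zeros(n)
    for Ki, yi, di in zip(K_bags, y, d):
        edges += di * yi * Ki.max(axis=0)    # max over x in B_i, for every z
    alpha0 = np.zeros(n)
    alpha0[int(np.argmax(edges))] = 1.0
    return alpha0
```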

7  Experiments

In this section, we show some experimental results implying that our algorithm performs comparably to the existing shapelet-based classifiers for both SL and MIL tasks.

7.1  Results for Time-Series Data

We use several binary labeled data sets2 from the UCR data sets (Chen et al., 2015), which are often used as benchmark data sets for time-series classification methods. We used the weak learning problem OP 2 because interpretability of the obtained classifier is required in shapelet-based time-series classification.

We compare the following three shapelet-based approaches:

  • Shapelet transform (ST) provided by Bagnall, Lines, Bostrom, Large, and Keogh (2017)

  • Learning time-series shapelets (LTS) provided by Grabocka et al. (2014)

  • Our algorithm using shapelets of different lengths (which we will refer to as Ours)

We used the implementation of ST provided by Löning et al. (2019) and the implementation of LTS provided by Tavenard, Faouzi, and Vandewiele (2017). The classification rule of shapelet transform has the form
$$g(B) = f\!\left( \max_{x \in B} -\|z_1 - x\|, \ldots, \max_{x \in B} -\|z_k - x\| \right),$$
where $f$ is a user-defined classification function (the implementation employs a decision forest) and $z_1, \ldots, z_k \in P_S$ (in the time-series domain, each $z_j$ is called a shapelet). The shapelets are chosen from the training sub-sequences in some complicated way before learning $f$. The classification rule of learning time-series shapelets has the form
$$g(B) = \sum_{j=1}^{k} w_j \max_{x \in B} -\|z_j - x\|,$$
where $w_j \in \mathbb{R}$ and $z_j \in \mathbb{R}^{\ell}$ are learned parameters, and the number of desired shapelets $k$ is a hyperparameter.

Below we describe the detailed conditions of the experiment. For ST, we set the shapelet lengths to $\{2, \ldots, L/2\}$, where $L$ is the length of each time series in the data set. ST also requires a time limit for searching shapelets, and we set it to 5 hours for each data set. For LTS, we used the hyperparameter sets (e.g., regularization parameter, number of shapelets) that the authors recommended on their website,3 and we found an optimal hyperparameter by 3-fold cross-validation for each data set. For our algorithm, we implemented a weak learning algorithm that supports shapelets of different lengths (see section 6.3). In this experiment, we consider the case in which each bag contains the sub-sequences of lengths $\{0.05, 0.1, 0.15, \ldots, 0.5\} \times L$. We used the gaussian kernel $K(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$ and chose $1/\sigma^2$ from $\{0.01, 0.05, 0.1, \ldots, 50\}$. We chose $\nu$ from $\{0.1, 0.2, 0.3, 0.4\}$. We use 100-means clustering with respect to each class to reduce $P_S$. The only parameters we need to tune are $\nu$ and $\sigma$, and we tuned them via the procedure given in appendix B.1. As an LP solver for WeakLearn and LPBoost, we used the CPLEX software. Like Ours, LTS employs k-means clustering to set the initial shapelets in the optimization algorithm. Therefore, we report the average accuracies for LTS and Ours, taking into account the randomness of k-means clustering.

The classification accuracy results are shown in Table 1. We can see that our algorithm achieves performance comparable to that of ST and LTS. We conducted the Wilcoxon signed-rank test between Ours and the others. The p-value of the Wilcoxon signed-rank test for Ours and ST is 0.1247, and the p-value for Ours and LTS is 0.6219. The p-values are higher than 0.05, and thus we cannot reject the hypothesis that there is no significant difference between the medians of the accuracies. We can say that our MIL algorithm works well for time-series classification tasks without using domain-specific knowledge.

Table 1:
Classification Accuracies for Time-Series Data Sets.
Data Set  ST  LTS  Ours
BeetleFly 0.8 0.765 0.835 
BirdChicken 0.9 0.93 0.935 
Coffee 0.964 1 0.964 
Computers 0.704 0.619 0.623 
DistalPhalanxOutlineCorrect 0.757 0.714 0.802 
Earthquakes 0.741 0.748 0.728 
ECG200 0.85 0.835 0.872 
ECGFiveDays 0.999 0.961 1 
FordA 0.856 0.914 0.89 
FordB 0.74 0.9 0.786 
GunPoint 0.987 0.971 0.987 
Ham 0.762 0.782 0.698 
HandOutlines 0.919 0.892 0.87 
Herring 0.594 0.652 0.588 
ItalyPowerDemand 0.947 0.951 0.943 
Lightning2 0.639 0.695 0.779 
MiddlePhalanxOutlineCorrect 0.794 0.579 0.632 
MoteStrain 0.927 0.849 0.845 
PhalangesOutlinesCorrect 0.773 0.633 0.792 
ProximalPhalanxOutlineCorrect 0.869 0.742 0.844 
ShapeletSim 0.994 0.989 1 
SonyAIBORobotSurface1 0.932 0.903 0.841 
SonyAIBORobotSurface2 0.922 0.895 0.887 
Strawberry 0.941 0.844 0.947 
ToeSegmentation1 0.956 0.947 0.906 
ToeSegmentation2 0.792 0.886 0.823 
TwoLeadECG 0.995 0.981 0.949 
Wafer 1 0.993 0.991 
Wine 0.741 0.487 0.72 
WormsTwoClass 0.831 0.752 0.608 
Yoga 0.847 0.69 0.804 

Note: The best accuracies are highlighted in bold.

To compare the computation time of these methods, we selected the data sets for which the three methods achieved similar performance. The experiments were performed on an Intel Xeon Gold 6154 (36-core CPU) with 192 GB of memory. Table 2 compares the running times of training. Note again that for ST, we set the running time limit for finding good shapelets to 5 hours. This running time limit is a hyperparameter of the code, and it is difficult to estimate it before the experiments. LTS worked efficiently compared with ST and Ours. However, LTS seems to achieve lower accuracy than ST and Ours. Table 3 shows the testing time of the methods. LTS also worked efficiently, simply because it finds a fixed number of effective shapelets (a hyperparameter). ST and Ours may find a large number of shapelets, and this increases the computation time of prediction. For the Wafer data set, ST and Ours required a large computation time compared with LTS.

Table 2:
Training Time (Sec.) for Several Time-Series Data Sets.
Data Set  Number of Training Data  Length  ST  LTS  Ours
Earthquakes 322 512 18,889.8 250.5 1339.2
GunPoint 50 150 18,016.2 22.3 36.9
ItalyPowerDemand 67 24 18,000.8 11.5 8.6
ShapeletSim 20 180 18,011.6 30.4 32.8
Wafer 1000 152 18,900.8 91.5 431.7
Table 3:
Testing Time (Sec.) for Several Time-Series Data Sets.
Data Set  Number of Test Data  Length  ST  LTS  Ours
Earthquakes 139 512 389.7 2.75 11.55
GunPoint 150 150 48.0 1.1 3.9
ItalyPowerDemand 1029 24 3.3 0.5 10.7
ShapeletSim 180 180 104.0 1.8 1.1
Wafer 6164 152 5688.2 4.3 173.1

We cannot fairly compare the efficiency of these methods because the implementation environments (e.g., programming languages) are different. However, we can say that the proposed method achieved high classification accuracy with reasonable running time for training and prediction.

7.1.1  Interpretability of Our Method

We would like to show the interpretability of our method. We use the CBF data set, which contains three classes (cylinder, bell, and funnel) of time series. We chose it because its discriminative patterns are known to be clear, and thus we can easily ascertain whether the obtained hypothesis captures effective shapelets. For simplicity, we obtain a binary classification model for each class by preparing a one-versus-others training set. We used Ours with a fixed shapelet length $\ell = 25$. We now introduce two types of visualization approaches to interpret a learned model.

One is the visualization of the characteristic sub-sequences of an input time series. When we predict the label of a time series $B$, we calculate a maximizer $x^*$ in $B$ for each $h_u$, that is, $x^* = \arg\max_{x \in B} \langle u, \Phi(x) \rangle$. For image recognition tasks, the maximizers are commonly used to observe the subimages that characterize the class of the input image (e.g., Chen et al., 2006). In time-series classification tasks, the maximizers can also be used to observe characteristic sub-sequences. Figure 1 is an example of a visualization of maximizers. Each value in the legend indicates $w_u \max_{x \in B} \langle u, \Phi(x) \rangle$. That is, sub-sequences with positive values contribute to the positive class, and sub-sequences with negative values contribute to the negative class. Such visualization provides the sub-sequences that characterize the class of the input time series. For the cylinder class, although both positive and negative patterns match almost the same sub-sequence, the positive pattern is stronger than the negative one, and thus the hypothesis can correctly discriminate the time series. For the bell and funnel classes, we can observe that the highlighted sub-sequences clearly indicate the discriminative patterns.
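Computing these maximizers from a learned model is a small extension of the prediction rule of section 4; the following sketch (with our own function names) collects, for each shapelet, the maximizing subsequence and its signed contribution used for the highlighting in Figure 1.

```python
import numpy as np

def maximizer_subsequences(bag, prototypes, alphas, weights, kernel):
    """For each learned shapelet h_t, return the subsequence x* in the bag that
    maximizes <u_t, Phi(x)>, together with its signed contribution
    w_t * max_{x in B} <u_t, Phi(x)>."""
    K = np.array([[kernel(z, x) for x in bag] for z in prototypes])
    highlights = []
    for w_t, alpha_t in zip(weights, alphas):
        scores = np.asarray(alpha_t) @ K          # <u_t, Phi(x)> for every x in B
        j = int(np.argmax(scores))
        highlights.append((bag[j], w_t * float(scores[j])))
    return highlights
```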

Figure 1:

Examples of the visualization of maximizers for CBF time-series data. Black lines are the original time series. We highlight each sub-sequence that maximizes the similarity with some shapelet in a classifier. Sub-sequences with positive values (red) contribute to the positive class, and sub-sequences with negative values (blue) contribute to the negative class.


The other is the visualization of a final hypothesis $g(B) = \sum_{j=1}^{T} w_j h_j(B)$, where $h_j(B) = \max_{x \in B} \sum_{z_j \in \hat{P}_S} \alpha_{j,z_j} K(z_j, x)$ ($\hat{P}_S$ is the set of representative sub-sequences obtained by k-means clustering). Figure 2 is an example of the visualization of a final hypothesis obtained by our algorithm. The colored lines are all the $z_j$s in $g$ for which both $w_j$ and $\alpha_{j,z_j}$ were nonzero. Each legend value shows the product of $w_j$ and $\alpha_{j,z_j}$ corresponding to $z_j$. That is, positive values of the colored lines indicate the contribution rate for the positive class, and negative values indicate the contribution rate for the negative class. Note that because it is difficult to visualize the shapelets over the Hilbert space associated with the gaussian kernel, we plotted each of them so as to match the original time series based on the Euclidean distance. Unlike previous visualization analyses (see, e.g., Ye & Keogh, 2009), our visualization does not exactly interpret the final hypothesis because of the nonlinear feature map. However, we can deduce that the colored lines represent “important patterns,” which make significant contributions to classification.

Figure 2:

Examples of the visualization of shapelets for CBF time-series data. The colored lines show important patterns of the obtained classifier. Positive values on the colored lines (red to yellow) indicate the contribution rate for the positive class, and negative values (blue to purple) indicate the contribution rate for the negative class.


7.2  Results for Multiple-Instance Data

We selected as baselines the MIL algorithms mi-SVM and MI-SVM (Andrews et al., 2003) and MILES (Chen, Bi, & Wang, 2006). mi-SVM and MI-SVM are classic MIL methods that still perform favorably compared with state-of-the-art methods on standard multiple-instance data (see, e.g., Doran, 2015). The details of the data sets are shown in Table 4.

Table 4:
Details of MIL Data Sets.
Data Set  Sample Size  Number of Instances  Dimension
MUSK1 92 476 166 
MUSK2 102 6598 166 
elephant 200 1391 230 
fox 200 1320 230 
tiger 200 1220 230 
mi-SVM and MI-SVM find a single but optimized shapelet $u$, which is not limited to the instances in the training sample. The classifiers obtained by these algorithms are formulated as
$$g(B) = \max_{x \in B} \langle u, \Phi(x) \rangle = \max_{x \in B} \sum_{z \in P_S} \alpha_z K(z, x).$$
(7.1)
MILES finds multiple shapelets, but they are limited to the instances in the training sample. The classifier of MILES is formulated as follows:
$$g(B) = \sum_{z \in P_S} w_z \max_{x \in B} K(z, x).$$
(7.2)

We used the implementation provided by Doran4 for mi-SVM and MI-SVM. We combined the gaussian kernel with mi-SVM and MI-SVM. The parameter $C$ was chosen from $\{1, 10, 100, 1000, 10000\}$. For our method and MILES,5 we chose $\nu$ from $\{0.5, 0.3, 0.2, 0.15, 0.1\}$, and we used only the gaussian kernel. Furthermore, we chose $\sigma$ from $\{0.005, 0.01, 0.05, 0.1, 0.5, 1.0\}$. We use 100-means clustering with respect to each class to reduce $P_S$. To avoid the randomness of k-means, we ran the training 30 times and selected the model that achieved the best training accuracy. For efficiency, we employed the weak learning problem OP 2. For all these algorithms, we estimated an optimal parameter set via 5-fold cross-validation. We used well-known multiple-instance data sets, as shown on the left-hand side of Table 5. The accuracies are the results of 10 runs of 5-fold cross-validation.

Table 5:
Classification Accuracies for MIL Data Sets.
Data Set mi-SVM MI-SVM MILES Ours 
MUSK1 0.834±0.084 0.820±0.081 0.865±0.068 0.844±0.076 
MUSK2 0.749±0.082 0.840±0.074 0.871±0.072 0.879±0.067 
elephant 0.785±0.070 0.823±0.056 0.796±0.068 0.828±0.061 
fox 0.618±0.069 0.578±0.075 0.675±0.071 0.646±0.063 
tiger 0.752±0.078 0.815±0.055 0.827±0.057 0.817±0.058 

Note: The best accuracies are highlighted in bold.

The results are shown in Table 5. MILES and Ours achieve significantly better performance than mi- and MI-SVM. Ours achieves comparable performance to MILES. Table 6 shows the training accuracies of MILES and Ours. It can be seen that Ours achieves higher training accuracy. This result is theoretically reasonable because our hypothesis class is richer than that of MILES. However, this means that Ours has a higher overfitting risk than does MILES.

Table 6:
Training Accuracies for MIL Data Sets.
Data Set  MILES  Ours
MUSK1 0.987 0.985 
MUSK2 0.980 0.993 
elephant 0.963 0.993 
fox 0.987 0.995 
tiger 0.973 0.993 

Table 7 shows the training time of the five methods. It is clear that MILES and Ours are more efficient than mi- and MI-SVM. The main reason is that mi- and MI-SVM solve quadratic programming (QP) problems, while MILES and Ours solve LP problems. On average, MILES worked more efficiently than Ours. However, for MUSK2, which has a large number of instances, Ours worked more efficiently than MILES.

Table 7:
Training Time (Sec.) for MIL Data Sets.
Data Set  mi-SVM  MI-SVM  MILES  Ours
MUSK1 29.6 28.1 0.584 5.57 
MUSK2 3760.1 3530.0 103.1 80.5 
elephant 240.6 130.3 5.84 8.30 
fox 201.9 139.2 5.4 26.4 
tiger 158.5 118.0 4.6 9.8 

The testing time of each algorithm is shown in Table 8. We can see that Ours is comparable to the other algorithms.

Table 8:
Testing Time (Sec.) for MIL Data Sets.
Data Set  mi-SVM  MI-SVM  MILES  Ours
MUSK1 0.010 0.004 0.011 0.045 
MUSK2 0.577 0.063 0.129 0.083 
elephant 0.053 0.015 0.067 0.115 
fox 0.078 0.025 0.118 0.145 
tiger 0.059 0.012 0.065 0.118 

8  Conclusion and Future Work

We proposed a new MIL formulation that provides a richer class of final classifiers based on infinitely many shapelets. We derived a tractable formulation over infinitely many shapelets with theoretical support and provided an algorithm based on LPBoost and the DC (difference of convex) algorithm. Our result gives theoretical justification for some existing shapelet-based classifiers (e.g., Chen et al., 2006; Hills et al., 2014). The experimental results demonstrate that our approach uniformly works for SL and MIL tasks without introducing domain-specific parameters and heuristics, and it is comparable with the baselines of shapelet-based classifiers.

Especially for time-series classification, the number of instances usually becomes large. Although we took a heuristic approach in the experiment, we think it is not an essential solution for improving efficiency. We preliminarily implemented OP 1 with orthogonal random features (Yu et al., 2016), which can approximate the gaussian kernel accurately. This allows us to solve the primal problem of OP 1 directly and to avoid constructing a large kernel matrix. The implementation vastly improved the efficiency; however, it did not achieve accuracy as high as the solutions of OP 2 with the heuristics. For SL tasks, there are many successful efficient methods using heuristics specialized to the time-series domain (Keogh & Rakthanmanon, 2013; Renard et al., 2015; Grabocka et al., 2015; Wistuba et al., 2015; Hou et al., 2016; Karlsson et al., 2016). We will explore ways to improve efficiency for SL tasks.

We would also like to improve the generalization error bound. The generalization error bound that we provided in this letter is incomparable to the existing bound, and we would like to show a tighter bound than the existing one. Since we think this requires a more complex analysis, we reserve it for future work. Our heuristics might reduce the model complexity (i.e., the risk of overfitting); however, we do not yet know theoretically how much the complexity can be reduced by our heuristics. To apply our method to various domains, we would like to explore general techniques for reducing the overfitting risk of our method.

Appendix A:  Proof of Theorem 1

First, we give a definition for convenience.

Definition 1
(The set $\Theta$ of mappings from a bag to an instance). Given a sample $S = (B_1, \ldots, B_m)$. For any $u \in U$, let $\theta_{u,\Phi} : \{B_1, \ldots, B_m\} \to \mathcal{X}$ be the mapping defined by
$$\theta_{u,\Phi}(B_i) := \arg\max_{x \in B_i} \langle u, \Phi(x) \rangle,$$
and we define the set of all $\theta_{u,\Phi}$ for $S$ as $\Theta_{S,\Phi} = \{\theta_{u,\Phi} \mid u \in U\}$. For the sake of brevity, $\theta_{u,\Phi}$ and $\Theta_{S,\Phi}$ will be abbreviated as $\theta_u$ and $\Theta$, respectively.

Following is the proof of theorem 1.

Proof.
We can rewrite the optimization problem 3.3 by using $\theta \in \Theta$ as follows:
$$\max_{\theta \in \Theta} \max_{u \in \mathbb{H} : \theta_u = \theta} \sum_{i=1}^m y_i d_i \langle u, \Phi(\theta(B_i)) \rangle \quad \text{sub. to} \quad \|u\|_{\mathbb{H}}^2 \le 1.$$
(A.1)
Thus, if we fix $\theta \in \Theta$, we have a subproblem. Since the constraint $\theta = \theta_u$ can be written as $|P_S|$ linear constraints (i.e., sub. to $\langle u, \Phi(x) \rangle \le \langle u, \Phi(\theta(B_i)) \rangle$ for $i \in [m]$ and $x \in B_i$), each subproblem is equivalent to a convex optimization. Indeed, each subproblem can be written as the equivalent unconstrained minimization (by neglecting constants in the objective),
$$\min_{u \in \mathbb{H}} \; \beta \|u\|_{\mathbb{H}}^2 - \sum_{i=1}^m \sum_{x \in B_i} \eta_{i,x} \big( \langle u, \Phi(\theta(B_i)) \rangle - \langle u, \Phi(x) \rangle \big) - \sum_{i=1}^m y_i d_i \langle u, \Phi(\theta(B_i)) \rangle,$$
where $\beta$ and $\eta_{i,x}$ ($i \in [m]$, $x \in B_i$) are the corresponding positive constants. Now for each subproblem, we can apply the standard representer theorem argument (see, e.g., Mohri et al., 2012). Let $\mathbb{H}_1$ be the subspace $\{u \in \mathbb{H} \mid u = \sum_{z \in P_S} \alpha_z \Phi(z),\ \alpha_z \in \mathbb{R}\}$. We denote by $u_1$ the orthogonal projection of $u$ onto $\mathbb{H}_1$, so that any $u \in \mathbb{H}$ has the decomposition $u = u_1 + u_{\perp}$. Since $u_{\perp}$ is orthogonal to $\mathbb{H}_1$, $\|u\|_{\mathbb{H}}^2 = \|u_1\|_{\mathbb{H}}^2 + \|u_{\perp}\|_{\mathbb{H}}^2 \ge \|u_1\|_{\mathbb{H}}^2$. On the other hand, $\langle u, \Phi(z) \rangle = \langle u_1, \Phi(z) \rangle$. Therefore, the optimal solution of each subproblem has to be contained in $\mathbb{H}_1$. This implies that the optimal solution, which is the maximum over all solutions of the subproblems, is contained in $\mathbb{H}_1$ as well.

Appendix B:  Proof of Theorem 2

We use $\theta$ and $\Theta$ of definition 1.

Definition 2

(The Rademacher and the Gaussian complexity; Bartlett & Mendelson, 2003). Given a sample $S = (x_1, \ldots, x_m) \in \mathcal{X}^m$, the empirical Rademacher complexity $R_S(H)$ of a class $H \subseteq \{h : \mathcal{X} \to \mathbb{R}\}$ with regard to $S$ is defined as $R_S(H) = \frac{1}{m} \mathbb{E}_{\sigma}\big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i)\big]$, where $\sigma \in \{-1, 1\}^m$ and each $\sigma_i$ is an independent uniform random variable in $\{-1, 1\}$. The empirical gaussian complexity $G_S(H)$ of $H$ with regard to $S$ is defined similarly, but each $\sigma_i$ is drawn independently from the standard normal distribution.

The following bounds are well known:

Lemma 1

(Lemma 9 of Bartlett & Mendelson, 2003). $R_S(H) = O(G_S(H))$.

Lemma 2
(Corollary 6.1 of Mohri et al., 2012). For fixed $\rho, \delta > 0$, the following bound holds with probability at least $1 - \delta$: for all $f \in \mathrm{conv}(H)$,
$$\mathcal{E}_D(f) \le \mathcal{E}_\rho(f) + \frac{2}{\rho} R_S(H) + 3\sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Deriving a generalization bound based on the Rademacher or the gaussian complexity is quite standard in the statistical learning theory literature and is applicable to our classes of interest as well. However, a standard analysis provides suboptimal bounds.

Lemma 3.
Suppose that for any $z \in \mathcal{X}$, $\|\Phi(z)\|_{\mathbb{H}} \le R$. Then the empirical gaussian complexity of $H_U$ with respect to $S$ for $U \subseteq \{u \mid \|u\|_{\mathbb{H}} \le 1\}$ is bounded as follows:
$$G_S(H_U) \le \frac{R\sqrt{(\sqrt{2} - 1) + 2\ln|\Theta|}}{\sqrt{m}}.$$
Proof.
Since $U$ can be partitioned into $\bigcup_{\theta \in \Theta} \{u \in U \mid \theta_u = \theta\}$,
$$
\begin{aligned}
G_S(H_U) &= \frac{1}{m}\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sup_{u \in U : \theta_u = \theta}\sum_{i=1}^m \sigma_i \langle u, \Phi(\theta(B_i))\rangle\right] = \frac{1}{m}\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sup_{u \in U : \theta_u = \theta}\left\langle u, \sum_{i=1}^m \sigma_i \Phi(\theta(B_i))\right\rangle\right] \\
&\le \frac{1}{m}\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sup_{u \in U}\left\langle u, \sum_{i=1}^m \sigma_i \Phi(\theta(B_i))\right\rangle\right] \le \frac{1}{m}\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\left\|\sum_{i=1}^m \sigma_i \Phi(\theta(B_i))\right\|_{\mathbb{H}}\right] \\
&= \frac{1}{m}\mathbb{E}_{\sigma}\left[\sqrt{\sup_{\theta \in \Theta}\left\|\sum_{i=1}^m \sigma_i \Phi(\theta(B_i))\right\|_{\mathbb{H}}^2}\right] \le \frac{1}{m}\sqrt{\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\left\|\sum_{i=1}^m \sigma_i \Phi(\theta(B_i))\right\|_{\mathbb{H}}^2\right]}.
\end{aligned}
$$
(B.1)
The first inequality is derived from the relaxation of $u$, the second inequality is due to the Cauchy-Schwarz inequality and the fact that $\|u\|_{\mathbb{H}} \le 1$, and the last inequality is due to Jensen's inequality. We denote by $K(\theta)$ the kernel matrix such that $K_{ij}(\theta) = \langle \Phi(\theta(B_i)), \Phi(\theta(B_j)) \rangle$. Then we have
$$\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\left\|\sum_{i=1}^m \sigma_i \Phi(\theta(B_i))\right\|_{\mathbb{H}}^2\right] = \mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^m \sigma_i \sigma_j K_{ij}(\theta)\right].$$
(B.2)
We now derive an upper bound of the right-hand side as follows.
For any $c > 0$,
$$\exp\left(c\,\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right]\right) \le \mathbb{E}_{\sigma}\left[\exp\left(c\sup_{\theta \in \Theta}\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right)\right] = \mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\exp\left(c\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right)\right] \le \sum_{\theta \in \Theta}\mathbb{E}_{\sigma}\left[\exp\left(c\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right)\right].$$
The first inequality is due to Jensen's inequality, and the second inequality is due to the fact that the supremum is bounded by the sum. By using the symmetry property of $K(\theta)$, we have $\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta) = \sigma^{\top} K(\theta) \sigma$, which is rewritten as
$$\sigma^{\top} K(\theta) \sigma = (V^{\top}\sigma)^{\top} \begin{pmatrix} \lambda_1(\theta) & & 0 \\ & \ddots & \\ 0 & & \lambda_m(\theta) \end{pmatrix} V^{\top}\sigma,$$
where $\lambda_1(\theta) \ge \cdots \ge \lambda_m(\theta) \ge 0$ are the eigenvalues of $K(\theta)$ and $V = (v_1, \ldots, v_m)$ is the orthonormal matrix such that $v_i$ is the eigenvector corresponding to the eigenvalue $\lambda_i$. By the reproductive property of the gaussian distribution, $V^{\top}\sigma$ obeys the same gaussian distribution as well, so
$$
\begin{aligned}
\sum_{\theta \in \Theta}\mathbb{E}_{\sigma}\left[\exp\left(c\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right)\right] &= \sum_{\theta \in \Theta}\mathbb{E}_{\sigma}\left[\exp\left(c\,\sigma^{\top}K(\theta)\sigma\right)\right] = \sum_{\theta \in \Theta}\mathbb{E}_{\sigma}\left[\exp\left(c\sum_{k=1}^m \lambda_k(\theta)(v_k^{\top}\sigma)^2\right)\right] \\
&= \sum_{\theta \in \Theta}\prod_{k=1}^m \mathbb{E}_{\sigma'_k}\left[\exp\left(c\,\lambda_k(\theta)\sigma_k'^2\right)\right] \quad (\text{replacing } \sigma'_k = v_k^{\top}\sigma) \\
&= \sum_{\theta \in \Theta}\prod_{k=1}^m \int_{-\infty}^{\infty}\exp\left(c\,\lambda_k(\theta)\sigma^2\right)\frac{\exp(-\sigma^2)}{\sqrt{2\pi}}\,d\sigma = \sum_{\theta \in \Theta}\prod_{k=1}^m \int_{-\infty}^{\infty}\frac{\exp(-(1 - c\lambda_k(\theta))\sigma^2)}{\sqrt{2\pi}}\,d\sigma.
\end{aligned}
$$
Now we replace $\sigma$ by $\sigma' = \sqrt{1 - c\lambda_k(\theta)}\,\sigma$. Since $d\sigma' = \sqrt{1 - c\lambda_k(\theta)}\,d\sigma$, we have
$$\int_{-\infty}^{\infty}\frac{\exp(-(1 - c\lambda_k(\theta))\sigma^2)}{\sqrt{2\pi}}\,d\sigma = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\frac{\exp(-\sigma'^2)}{\sqrt{1 - c\lambda_k(\theta)}}\,d\sigma' = \frac{1}{\sqrt{1 - c\lambda_k(\theta)}}.$$
Now, applying the inequality $\frac{1}{\sqrt{1-x}} \le 1 + 2(\sqrt{2} - 1)x$ for $0 \le x \le \frac{1}{2}$, the bound becomes
$$\exp\left(c\,\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right]\right) \le \sum_{\theta \in \Theta}\prod_{k=1}^m \left(1 + 2(\sqrt{2} - 1)c\lambda_k(\theta)\right).$$
(B.3)
Further, taking the logarithm, dividing both sides by $c$, letting $c = \frac{1}{2\max_k \lambda_k(\theta)} = 1/(2\lambda_1(\theta))$, fixing $\theta = \theta^*$ such that $\theta^*$ maximizes equation B.3, and applying $\ln(1 + x) \le x$, we get
$$\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right] \le (\sqrt{2} - 1)\sum_{k=1}^m \lambda_k(\theta^*) + 2\lambda_1(\theta^*)\ln|\Theta| = (\sqrt{2} - 1)\,\mathrm{tr}(K(\theta^*)) + 2\lambda_1(\theta^*)\ln|\Theta| \le (\sqrt{2} - 1)mR^2 + 2mR^2\ln|\Theta|,$$
(B.4)
where the last inequality holds since $\lambda_1(\theta^*) = \|K(\theta^*)\|_2 \le m \max_{i,j} K_{ij}(\theta^*) \le mR^2$. By equations B.1 and B.4, we have
$$G_S(H_U) \le \frac{1}{m}\sqrt{\mathbb{E}_{\sigma}\left[\sup_{\theta \in \Theta}\sum_{i,j=1}^m \sigma_i\sigma_j K_{ij}(\theta)\right]} \le \frac{R\sqrt{(\sqrt{2} - 1) + 2\ln|\Theta|}}{\sqrt{m}}.$$

Thus, it suffices to bound the size $|\Theta|$. The basic idea to obtain our bound is the following geometric analysis. Fix any $i \in [m]$ and consider the points $\{\Phi(x) \mid x \in B_i\}$. Then we define equivalence classes of $u$ such that $\theta_u(B_i)$ is the same within each class, which defines a Voronoi diagram for the points $\{\Phi(x) \mid x \in B_i\}$. Note here that the similarity is measured by the inner product, not a distance. More precisely, let $\{V_i(x) \mid x \in B_i\}$ be the Voronoi diagram, with each region defined as $V_i(x) = \{u \in \mathbb{H} \mid \theta_u(B_i) = x\}$. Let us consider the set of intersections $\bigcap_{i \in [m]} V_i(x_i)$ for all combinations of $(x_1, \ldots, x_m) \in B_1 \times \cdots \times B_m$. The key observation is that each nonempty intersection corresponds to a mapping $\theta_u \in \Theta$. Thus, we obtain $|\Theta| = (\text{the number of nonempty intersections } \bigcap_{i \in [m]} V_i(x_i))$. In other words, the size of $\Theta$ is exactly the number of cells defined by the intersections of the $m$ Voronoi diagrams $V_1, \ldots, V_m$. From now on, we will derive the upper bound based on this observation.

Lemma 4.
$|\Theta| = O(|P_S|^{2 d^*_{\Phi,S}})$.
Proof.

We will reduce the problem of counting the intersections of the Voronoi diagrams to that of counting the possible labelings of some set by hyperplanes. Note that for each pair of neighboring Voronoi regions, the border is a part of a hyperplane, since the closeness is defined in terms of the inner product. Therefore, by simply extending each border to a hyperplane, we obtain intersections of half-spaces defined by the extended hyperplanes. Note that the number of these intersections gives an upper bound on the number of intersections of the Voronoi diagrams. More precisely, we draw a hyperplane for each pair of points in $\Phi(P_S)$ so that each point on the hyperplane has the same inner product with the two points. Note that for each pair $\Phi(z), \Phi(z') \in \Phi(P_S)$, the normal vector of the hyperplane is given as $\Phi(z) - \Phi(z')$ (by fixing the sign arbitrarily). Thus, the set of hyperplanes obtained by this procedure is exactly $\Phi_{\mathrm{diff}}(P_S)$. The size of $\Phi_{\mathrm{diff}}(P_S)$ is $\binom{|P_S|}{2}$, which is at most $|P_S|^2$. Now, we consider a dual space by viewing each hyperplane as a point and each point in $U$ as a hyperplane. Points $u$ (hyperplanes in the dual) in an intersection give the same labeling on the points in the dual domain. Therefore, the number of intersections in the original domain is the same as the number of possible labelings on $\Phi_{\mathrm{diff}}(P_S)$ by hyperplanes in $U$. By the classical Sauer's lemma and the VC dimension of hyperplanes (see, e.g., theorem 5.5 in Schölkopf & Smola, 2002), this number is at most $O((|P_S|^2)^{d^*_{\Phi,S}})$.

Theorem 3.

  • For any $\Phi$, $|\Theta| = O(|P_S|^{8(R/\mu^*)^2})$.

  • If $\mathcal{X} \subseteq \mathbb{R}^{\ell}$ and $\Phi$ is the identity mapping over $P_S$, then $|\Theta| = O(|P_S|^{\min\{8(R/\mu^*)^2, 2\ell\}})$.

  • If $\mathcal{X} \subseteq \mathbb{R}^{\ell}$ and $\Phi$ satisfies that $\langle \Phi(z), \Phi(x) \rangle$ is monotone decreasing with respect to $\|z - x\|_2$ (e.g., the mapping defined by the gaussian kernel) and $U = \{\Phi(z) \mid z \in \mathcal{X} \subseteq \mathbb{R}^{\ell},\ \|\Phi(z)\|_{\mathbb{H}} \le 1\}$, then $|\Theta| = O(|P_S|^{\min\{8(R/\mu^*)^2, 2\ell\}})$.

Proof.

(i) We follow the argument in lemma 9. For the set of classifiers $F=\{f:\Phi_{\mathrm{diff}}(P_S)\to\{-1,1\}\mid f(v)=\mathrm{sign}(\langle u,v\rangle),\ \|u\|_H\le 1,\ \min_{v\in\Phi_{\mathrm{diff}}(P_S)}|\langle u,v\rangle|=\mu\}$, its VC dimension is known to be at most $R^{2}/\mu^{2}$ for $\Phi_{\mathrm{diff}}(P_S)\subseteq\{v\mid\|v\|_H\le 2R\}$ (see, e.g., Schölkopf & Smola, 2002). By the definition of $\mu^{*}$, for each intersection formed by the hyperplanes, there always exists a point $u$ whose inner product with each hyperplane is at least $\mu^{*}$ in absolute value. Therefore, the number of intersections is bounded by the number of possible labelings in the dual space by $U''=\{u\in H\mid\|u\|_H\le 1,\ \min_{v\in\Phi_{\mathrm{diff}}(P_S)}|\langle u,v\rangle|=\mu^{*}\}$. Thus, we obtain that $d^{*}_{\Phi,S}$ is at most $8(R/\mu^{*})^{2}$, and by lemma 9, we complete the proof of case i.

(ii) In this case, the Hilbert space $H$ is contained in $\mathbb{R}^{\ell}$. Then, by the fact that the VC dimension $d^{*}_{\Phi,S}$ is at most $\ell$ and lemma 9, the statement holds.

(iii) If $\langle\Phi(z),\Phi(x)\rangle$ is monotone decreasing with respect to $\|z-x\|_2$, then the following holds:
\[
\arg\max_{x\in X}\langle\Phi(z),\Phi(x)\rangle=\arg\min_{x\in X}\|z-x\|_2.
\]
Therefore, $\max_{u:\|u\|_H=1}\langle u,\Phi(x)\rangle=\|\Phi(x)\|_H$, attained at $u=\Phi(x)/\|\Phi(x)\|_H$. This indicates that the number of Voronoi cells defined by $V(x)=\{z\in\mathbb{R}^{\ell}\mid x=\arg\max_{x'\in B}\langle z,x'\rangle\}$ coincides with that defined by $\hat{V}(x)=\{\Phi(z)\in H\mid x=\arg\max_{x'\in B}\langle\Phi(z),\Phi(x')\rangle\}$. Then, following the same argument as in the linear kernel case, we get the statement.
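Case iii can also be checked numerically: with the gaussian kernel, the instance maximizing the kernel similarity to $z$ is exactly the instance closest to $z$ in Euclidean distance. The snippet below is a small illustration of this fact with arbitrary random data and an arbitrary bandwidth, not an implementation of our algorithm.

```python
import numpy as np

# With k(z, x) = exp(-||z - x||^2 / (2 * sigma2)), the kernel value is a
# monotone decreasing function of ||z - x||, so the argmax over instances x
# coincides with the nearest neighbor of z.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))   # instances of a bag
sigma2 = 0.7                   # arbitrary bandwidth; any positive value works here

for _ in range(100):
    z = rng.normal(size=3)
    sq_dists = np.sum((X - z) ** 2, axis=1)
    kernel_vals = np.exp(-sq_dists / (2.0 * sigma2))
    assert np.argmax(kernel_vals) == np.argmin(sq_dists)
print("argmax of kernel similarity = nearest neighbor in all trials")
```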

Now we are ready to prove theorem 2.

Proof of Theorem 2.

By using lemmas 6 and 7, we obtain the generalization bound in terms of the gaussian complexity of $H$. Then, applying lemma 8 and theorem 10 completes the proof.

B.1  Hyperparameter Tuning for Time-Series Classification

In the experiment for time-series classification, we roughly tuned $\nu$ and the parameter $\sigma^2$ of the gaussian kernel. As mentioned before, the computation time becomes large when learning from very large time-series data. The main computational cost lies in iteratively solving the weak learning problems with an LP (or QP) solver, and the number of constraints of optimization problem 5.1 depends on the total number of instances in the negative bags. Therefore, in the hyperparameter-tuning phase, we terminate each weak learning problem as soon as we obtain the solution of optimization problem 6.1. Using this rough weak learner, we tuned $\nu$ and $\sigma^2$ through a grid search via three runs of 3-fold cross-validation.
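A minimal sketch of this tuning loop is given below, assuming a placeholder routine `fit_and_score` that trains the classifier with the rough weak learner (optimization problem 6.1) and returns validation accuracy; the routine name and the candidate grids are hypothetical, not part of any library or of this letter.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

NU_GRID = [0.05, 0.1, 0.2, 0.3]   # hypothetical candidates for nu
SIGMA2_GRID = [0.1, 1.0, 10.0]    # hypothetical candidates for the gaussian-kernel sigma^2

def fit_and_score(nu, sigma2, train_bags, train_y, valid_bags, valid_y):
    # Placeholder: train with the rough weak learner and return validation accuracy.
    raise NotImplementedError

def tune(bags, labels, n_repeats=3, n_splits=3, seed=0):
    """Grid search over (nu, sigma^2) with three runs of 3-fold cross-validation."""
    labels = np.asarray(labels)
    best_params, best_score = None, -np.inf
    for nu, sigma2 in itertools.product(NU_GRID, SIGMA2_GRID):
        scores = []
        for r in range(n_repeats):
            cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
            for tr, va in cv.split(np.zeros(len(labels)), labels):
                scores.append(fit_and_score(nu, sigma2,
                                            [bags[i] for i in tr], labels[tr],
                                            [bags[i] for i in va], labels[va]))
        if np.mean(scores) > best_score:
            best_params, best_score = (nu, sigma2), float(np.mean(scores))
    return best_params, best_score
```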

Notes

1. Although there are settings where instance-label prediction is also considered, we focus only on bag-label prediction in this letter.

2. Note that our method is applicable to multiclass classification tasks through standard extensions (e.g., Platt, Cristianini, & Shawe-Taylor, 2000).

5. MILES uses the 1-norm SVM to obtain a final classifier. We implemented the 1-norm SVM by using the formulation of Warmuth, Glocer, and Rätsch (2008).

Acknowledgments

This work was supported by JST CREST (grant JPMJCR15K5) and JSPS KAKENHI (grant JP18K18001). In the experiments, we used the computer resources offered under the category of General Projects by the Research Institute for Information Technology, Kyushu University.

References

Andrews, S., & Hofmann, T. (2004). Multiple instance learning via disjunctive programming boosting. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 65–72). Cambridge, MA: MIT Press.

Andrews, S., Tsochantaridis, I., & Hofmann, T. (2003). Support vector machines for multiple-instance learning. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 577–584). Cambridge, MA: MIT Press.

Auer, P., & Ortner, R. (2004). A boosting approach to multiple instance learning. In Lecture Notes in Computer Science: Vol. 3201. Proceedings of the European Conference on Machine Learning (pp. 63–74). Berlin: Springer.

Bagnall, A., Lines, J., Bostrom, A., Large, J., & Keogh, E. (2017). The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3), 606–660.

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.

Carbonneau, M.-A., Cheplygina, V., Granger, E., & Gagnon, G. (2018). Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77, 329–353.

Chen, Y., Bi, J., & Wang, J. Z. (2006). MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 1931–1947.

Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., & Batista, G. (2015). The UCR time series classification archive. www.cs.ucr.edu/~eamonn/timeseries_data/.

Cuturi, M. (2011). Fast global alignment kernels. In Proceedings of the International Conference on Machine Learning (pp. 929–936). New York: ACM.

Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1–3), 225–254.

Dietterich, T. G., Lathrop, R. H., & Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1–2), 31–71.

Doran, G. (2015). Multiple instance learning from distributions. PhD diss., Case Western Reserve University.

Doran, G., & Ray, S. (2014). A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Machine Learning, 97(1–2), 79–102.

Gärtner, T., Flach, P. A., Kowalczyk, A., & Smola, A. J. (2002). Multi-instance kernels. In Proceedings of the International Conference on Machine Learning (pp. 179–186). New York: ACM.

Grabocka, J., Schilling, N., Wistuba, M., & Schmidt-Thieme, L. (2014). Learning time-series shapelets. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 392–401). New York: ACM.

Grabocka, J., Wistuba, M., & Schmidt-Thieme, L. (2015). Scalable discovery of time-series shapelets. CoRR, abs/1503.03238.

Hills, J., Lines, J., Baranauskas, E., Mapp, J., & Bagnall, A. (2014). Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4), 851–881.

Hou, L., Kwok, J. T., & Zurada, J. M. (2016). Efficient learning of timeseries shapelets. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 1209–1215). Palo Alto, CA: AAAI Press.

Karlsson, I., Papapetrou, P., & Boström, H. (2016). Generalized random shapelet forests. Data Mining and Knowledge Discovery, 30(5), 1053–1085.

Keogh, E. J., & Rakthanmanon, T. (2013). Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the International Conference on Data Mining (pp. 668–676). Philadelphia: Society for Industrial and Applied Mathematics.

Le Thi, H. A., & Pham Dinh, T. (2018). DC programming and DCA: Thirty years of developments. Mathematical Programming, 169(1), 5–68.

Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J., & Király, F. J. (2019). sktime: A unified interface for machine learning with time series. arXiv:1909.07872.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of machine learning. Cambridge, MA: MIT Press.

Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 547–553). Cambridge, MA: MIT Press.

Renard, X., Rifqi, M., Erray, W., & Detyniecki, M. (2015). Random-shapelet: An algorithm for fast shapelet discovery. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (pp. 1–10). Piscataway, NJ: IEEE.

Sabato, S., & Tishby, N. (2012). Multi-instance learning with any hypothesis class. Journal of Machine Learning Research, 13(1), 2999–3039.

Sangnier, M., Gauthier, J., & Rakotomamonjy, A. (2016). Early and reliable event detection using proximity space representation. In Proceedings of the International Conference on Machine Learning (pp. 2310–2319).

Schölkopf, B., & Smola, A. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Shapiro, A. (2009). Semi-infinite programming, duality, discretization and optimality conditions. Optimization, 58(2), 133–161.

Shimodaira, H., Noma, K.-i., Nakai, M., & Sagayama, S. (2001). Dynamic time-alignment kernel in support vector machine. In Proceedings of the International Conference on Neural Information Processing Systems (pp. 921–928). Cambridge, MA: MIT Press.

Tao, P. D., & Souad, E. B. (1988). Duality in D.C. (difference of convex functions) optimization. Subgradient methods. In K.-H. Hoffmann, J. Zowe, J.-B. Hiriart-Urruty, & C. Lemarechal (Eds.), Trends in mathematical optimization (pp. 277–293). Berlin: Springer.

Tavenard, R., Faouzi, J., & Vandewiele, G. (2017). tslearn: A machine learning toolkit dedicated to time-series data. https://github.com/rtavenar/tslearn.

Warmuth, M., Glocer, K., & Rätsch, G. (2008). Boosting algorithms for maximizing the soft margin. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 1585–1592). Cambridge, MA: MIT Press.

Wistuba, M., Grabocka, J., & Schmidt-Thieme, L. (2015). Ultra-fast shapelets for time series classification. CoRR, abs/1503.05018.

Ye, L., & Keogh, E. (2009). Time series shapelets: A new primitive for data mining. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 947–956). New York: ACM.

Yu, C.-N. J., & Joachims, T. (2009). Learning structural SVMs with latent variables. In Proceedings of the International Conference on Machine Learning (pp. 1169–1176). Omnipress.

Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann-Rice, D. N., & Kumar, S. (2016). Orthogonal random features. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 1975–1983). Red Hook, NY: Curran.

Zhang, C., Platt, J. C., & Viola, P. A. (2006). Multiple instance boosting for object detection. In Y. Weiss, B. Schölkopf, & J. C. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 1417–1424). Cambridge, MA: MIT Press.

Zhang, D., He, J., Si, L., & Lawrence, R. (2013). MILEAGE: Multiple instance learning with global embedding. In Proceedings of the International Conference on Machine Learning (pp. 82–90). Omnipress.